Question
C++ (ASSUME THAT YOU HAVE 10,000 TEXT FILES FOR SEARCHING) In this project, you will design a very simple yet effective document retrieval system that responds to single-word queries. The system will allow users to enter a single word and will then return the list of documents containing this word.
Consider a scenario where there are approximately 10,000 text files, and you would like to develop a program that enables users to search for a specific keyword among all documents. Users are interested in the documents containing this keyword.
A simple first attempt at solving this problem would be a program that reads a keyword from the user, then scans all files and lists the ones containing this keyword. This approach is the most expensive in terms of efficiency and time, because every query rescans every file.
Instead, a more efficient method of document retrieval based on a user's single-word query is achieved by a technique called indexing. Document indexing, in its simplest form, refers to a means of organizing and storing documents for later retrieval based on the words they contain.
Task: Design and implement a system that indexes the documents by their content words, using a linked list as the data structure. While reading the text documents:
Get all words, where a word is a string of alpha characters terminated by a non-alpha character (white space is not alpha). The alpha characters are defined to be [a-z]. Therefore, the character sequence apple+78&^+orange yields the words apple and orange.
Lowercase all words.
Filter out all the words that are in the stop words list, such as a, an, the.
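The three reading steps above could be combined into one pass over the text, as in the minimal sketch below. The name `extractWords` is hypothetical. It also assumes that uppercase letters A-Z count as alpha and are lowercased before being stored, since the task asks for lowercasing; a strict reading of [a-z] would instead treat uppercase letters as word terminators.

```cpp
#include <set>
#include <string>
#include <vector>

// Splits `text` into lowercase words made only of the letters a-z and
// drops every word found in `stopWords`.
// Assumption: A-Z is accepted as alpha and folded to lowercase.
std::vector<std::string> extractWords(const std::string& text,
                                      const std::set<std::string>& stopWords) {
    std::vector<std::string> words;
    std::string current;

    // Emits the word collected so far unless it is empty or a stop word.
    auto flush = [&]() {
        if (!current.empty() && stopWords.count(current) == 0) {
            words.push_back(current);
        }
        current.clear();
    };

    for (char c : text) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c - 'A' + 'a');  // lowercase
        }
        if (c >= 'a' && c <= 'z') {
            current += c;
        } else {
            flush();  // any non-alpha character terminates the word
        }
    }
    flush();  // the text may end in the middle of a word
    return words;
}
```

Under these assumptions, calling `extractWords("apple+78&^+orange", {})` gives the two words apple and orange, matching the example in the task.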
Limitations and Assumptions
The collection of documents is closed (the content and number of documents are fixed and will never change).
Each document is stored in a single text file. Hence, if there are 10,000 documents, then there are 10,000 text files. (The collection of documents is provided online.)
Since the number of distinct words across all documents cannot easily be estimated, one possible solution is to use a linked list to maintain the list of words. In addition, a list of files must be kept for each word; since the number of documents (or files) containing a word cannot be estimated either, a linked list is again suggested.
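One way to picture the suggested structure is a linked list of words in which each word node owns its own linked list of file names. The sketch below is an assumption about how this might look, not a prescribed design: the names `FileNode`, `WordNode`, `LinkedIndex`, `add`, and `lookup` are all hypothetical. It further assumes that files are indexed one at a time, so a repeated (word, file) pair can only match the head of that word's file list.

```cpp
#include <string>
#include <vector>

// One entry in a word's list of files.
struct FileNode {
    std::string fileName;
    FileNode* next;
};

// One entry in the list of distinct words.
struct WordNode {
    std::string word;
    FileNode* files;  // head of this word's file list
    WordNode* next;
};

class LinkedIndex {
public:
    LinkedIndex() = default;
    LinkedIndex(const LinkedIndex&) = delete;             // nodes are owned,
    LinkedIndex& operator=(const LinkedIndex&) = delete;  // so forbid copies

    ~LinkedIndex() {
        while (head_) {
            while (head_->files) {  // free this word's file list
                FileNode* f = head_->files;
                head_->files = f->next;
                delete f;
            }
            WordNode* w = head_;
            head_ = w->next;
            delete w;
        }
    }

    // Records that `word` occurs in `fileName`.
    void add(const std::string& word, const std::string& fileName) {
        WordNode* w = findNode(word);
        if (!w) {
            w = new WordNode{word, nullptr, head_};  // push new word at front
            head_ = w;
        }
        // Assumption: files are processed one at a time, so a duplicate
        // of the current file can only be at the head of the list.
        if (w->files && w->files->fileName == fileName) return;
        w->files = new FileNode{fileName, w->files};
    }

    // Returns the files containing `word`, most recently added first.
    std::vector<std::string> lookup(const std::string& word) const {
        std::vector<std::string> result;
        const WordNode* w = findNode(word);
        for (const FileNode* f = w ? w->files : nullptr; f; f = f->next) {
            result.push_back(f->fileName);
        }
        return result;
    }

private:
    // Linear scan over the word list; returns nullptr if the word is absent.
    WordNode* findNode(const std::string& word) const {
        for (WordNode* w = head_; w; w = w->next) {
            if (w->word == word) return w;
        }
        return nullptr;
    }

    WordNode* head_ = nullptr;
};
```

Because the collection is closed, the index only needs to be built once; a query is then a single `lookup` call. Finding a word is still linear in the number of distinct words, which is a property of the plain linked list that the task suggests.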