Question
C++ programming
In this project, you will design a very simple yet effective document retrieval system that responds to single-word queries. The system will allow users to enter a single word and will then return the list of documents containing that word.
Consider a scenario where there are approximately 10,000 text files, and you would like to develop a program that enables users to search for a specific keyword among all documents. Users are interested in the documents containing this keyword.
A simple first attempt at solving this problem would be a program that reads a keyword from the user, then scans all files and lists those containing the keyword. This approach is the most expensive in terms of efficiency and time, because every query rereads the entire collection.
A more efficient method of document retrieval for single-word queries is achieved by a technique called indexing. Document indexing, in its simplest form, refers to a means of organizing and storing documents for later retrieval based on the words they contain.
Task: Design and implement a system that indexes the documents by their content words, using a linked list as the data structure. While reading the text documents:
Get all words, where a word is a string of alpha characters terminated by a non-alpha character (white space is not alpha). The alpha characters are defined to be [a-z]. Therefore, the character sequence apple+78&^+orange yields the words apple and orange.
Lowercase all words,
Filter out all the words that are in the stop words list, such as a, an, the. ("Stop words" usually refers to the most common words in a language.) Why do you think this filtering is done? (Read: https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) The list of stop words is provided on webonline.
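The three reading steps above could look roughly like the sketch below. It rests on assumptions: the names `tokenize` and `removeStopWords` are hypothetical, characters are lowercased before the [a-z] test so that "Pease" yields "pease", and the stop words are held in a `std::unordered_set` purely for convenience (the assignment does not prescribe a container for them).

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Splits text into words made only of the letters a-z.
// Assumption: characters are lowercased first, so "Pease" yields "pease".
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char raw : text) {
        char c = static_cast<char>(std::tolower(static_cast<unsigned char>(raw)));
        if (c >= 'a' && c <= 'z') {
            current += c;               // still inside a word
        } else if (!current.empty()) {
            words.push_back(current);   // a non-alpha character ends the word
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);  // word at end of input
    return words;
}

// Drops every token that appears in the stop-word set.
std::vector<std::string> removeStopWords(
        const std::vector<std::string>& words,
        const std::unordered_set<std::string>& stopWords) {
    std::vector<std::string> kept;
    for (const std::string& w : words) {
        if (stopWords.count(w) == 0) kept.push_back(w);
    }
    return kept;
}
```

Note that under this word definition an apostrophe ends a word, so a stop-list entry such as aren't would reach the filter as the two tokens aren and t rather than as a single word.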
Limitations and Assumptions
1. The collection of documents is closed (the content and number of documents are fixed and will never change).
2. Each document is stored in a single text file. Hence, if there are 10,000 documents, then there are 10,000 text files. (The collection of documents is provided on webonline.)
Figure 1 and Figure 2 summarize the aim of this project.
Figure 1
Figure 2: Documents are Indexed by their word contents.
Since we cannot easily estimate the number of words across all documents, one possible solution is to use a linked list to maintain the list of words. In addition, a list of files needs to be kept for each word; since the number of documents (or files) per word cannot be estimated in advance either, a linked list is again suggested.
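One way to picture the suggested structure is a linked list of words in which every word node owns its own linked list of file names. The sketch below is a minimal illustration, not a prescribed design: `FileNode`, `WordNode`, `InvertedIndex`, `add`, and `lookup` are all hypothetical names, and new nodes are simply pushed at the front of each list.

```cpp
#include <string>
#include <vector>

// One entry in a word's list of documents.
struct FileNode {
    std::string name;
    FileNode* next;
};

// One entry in the list of indexed words.
struct WordNode {
    std::string word;
    FileNode* files;
    WordNode* next;
};

class InvertedIndex {
public:
    InvertedIndex() : head_(nullptr) {}
    InvertedIndex(const InvertedIndex&) = delete;
    InvertedIndex& operator=(const InvertedIndex&) = delete;

    // Frees every word node and the file list hanging off it.
    ~InvertedIndex() {
        while (head_) {
            WordNode* w = head_;
            head_ = head_->next;
            while (w->files) {
                FileNode* f = w->files;
                w->files = f->next;
                delete f;
            }
            delete w;
        }
    }

    // Records that `word` occurs in `file`; duplicates are ignored.
    void add(const std::string& word, const std::string& file) {
        WordNode* w = find(word);
        if (!w) {
            w = new WordNode{word, nullptr, head_};  // push new word at front
            head_ = w;
        }
        for (FileNode* f = w->files; f; f = f->next) {
            if (f->name == file) return;  // file already listed for this word
        }
        w->files = new FileNode{file, w->files};
    }

    // Returns the documents containing `word` (empty if the word is unknown).
    std::vector<std::string> lookup(const std::string& word) const {
        std::vector<std::string> result;
        if (const WordNode* w = find(word)) {
            for (const FileNode* f = w->files; f; f = f->next) {
                result.push_back(f->name);
            }
        }
        return result;
    }

private:
    // Linear walk over the word list; returns nullptr if the word is absent.
    WordNode* find(const std::string& word) const {
        for (WordNode* w = head_; w; w = w->next) {
            if (w->word == word) return w;
        }
        return nullptr;
    }

    WordNode* head_;
};
```

A query then becomes a single call such as `index.lookup("hot")`, which walks the word list once instead of rereading every file. Because nodes are pushed at the front, file names come back in reverse order of insertion; keeping the lists sorted would be a possible refinement.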
stopwords File
a about above after again against all am an and any are aren't as at be because been before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you you'd you'll you're you've your yours yourself yourselves
_________________
-Docs-
Document  Text
1         Pease porridge hot, pease porridge cold,
2         Pease porridge in the pot,
3         Nine days old.
4         Some like it hot, some like it cold,
5         Some like it in the pot,
6         Nine days old.
Example text; each line is one document.