Question

C++ programming

In this project, you will design a very simple yet effective document retrieval system that responds to single-word queries. The system lets users enter a single word and then returns the list of documents containing that word.

Consider a scenario where there are approximately 10,000 text files, and you would like to develop a program that enables users to search for a specific keyword among all documents. Users are interested in the documents containing this keyword.

A simple first attempt at solving this problem would be a program that reads a keyword from the user, then scans all files and lists those containing the keyword. This approach is the most expensive in terms of efficiency and time, because every query rescans the entire collection.
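
The naive approach described above could look roughly like the following. This is only a sketch for comparison, not part of the assignment: the function names `containsWord` and `naiveSearch` are hypothetical, and whole-word matching on lowercase [a-z] characters is an assumption borrowed from the word definition given later in the task.

```cpp
#include <cctype>
#include <fstream>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying containsWord without a file
#include <string>
#include <vector>

// Returns true if `keyword` (lowercase letters only) occurs as a whole word in the stream.
bool containsWord(std::istream& in, const std::string& keyword) {
    if (keyword.empty()) return false;
    std::string current;
    char ch;
    while (in.get(ch)) {
        char lower = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        if (lower >= 'a' && lower <= 'z') {
            current += lower;                      // still inside a word
        } else {
            if (current == keyword) return true;   // a non-alpha character ends the word
            current.clear();
        }
    }
    return current == keyword;   // the stream may end right after the word
}

// Hypothetical naive search: every query opens and reads every file again.
std::vector<std::string> naiveSearch(const std::vector<std::string>& fileNames,
                                     const std::string& keyword) {
    std::vector<std::string> matches;
    for (const std::string& name : fileNames) {
        std::ifstream file(name);
        if (file && containsWord(file, keyword)) matches.push_back(name);
    }
    return matches;
}
```

With 10,000 files, each call to `naiveSearch` performs 10,000 file reads, which is the cost that indexing is meant to avoid.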

A more efficient method of document retrieval for single-word queries is achieved by a technique called indexing. Document indexing, in its simplest form, refers to a means of organizing and storing documents for later retrieval based on the words they contain.

Task: Design and implement a system that indexes the documents by their content words, using a linked list as the data structure. While reading the text documents:

Get all words, where a word is a string of alpha characters terminated by a non-alpha character (white space is not alpha). The alpha characters are defined to be [a-z]. Therefore, the character sequence apple+78&^+orange yields the words apple and orange.

Lowercase all words.

Filter out all words that are in the stop words list, such as a, an, the. ("Stop words" usually refers to the most common words in a language.) Why do you think this filtering is done? (Read: https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) The list of stop words is provided on webonline.
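
The three reading steps above can be sketched as a small tokenizer. This is a minimal illustration rather than a reference solution: the function name `extractWords` and the use of `std::unordered_set` for the stop words are assumptions. It also assumes that characters are lowercased before the [a-z] test, so that uppercase letters count as part of a word.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: splits text into lowercase [a-z] words and drops stop words.
// Assumption: characters are lowercased before the [a-z] test, so "Apple" yields "apple".
std::vector<std::string> extractWords(const std::string& text,
                                      const std::unordered_set<std::string>& stopWords) {
    std::vector<std::string> words;
    std::string current;
    // Stores the word collected so far unless it is empty or a stop word.
    auto flush = [&]() {
        if (!current.empty() && stopWords.count(current) == 0) {
            words.push_back(current);
        }
        current.clear();
    };
    for (unsigned char ch : text) {
        char lower = static_cast<char>(std::tolower(ch));
        if (lower >= 'a' && lower <= 'z') {
            current += lower;   // still inside a word
        } else {
            flush();            // any non-alpha character ends the word
        }
    }
    flush();                    // the text may end in the middle of a word
    return words;
}
```

For the input apple+78&^+orange this returns the two words apple and orange. Repeated words are returned as many times as they occur; removing duplicates is left to the index.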

Limitations and Assumptions

1. The collection of documents is closed (the content and number of documents are fixed and will never change).

2. Each document is stored in a single text file. Hence, if there are 10,000 documents, then there are 10,000 text files. (The collection of documents is provided on webonline.)

Figure 1 and Figure 2 summarize the aim of this project.

Figure 1

Figure 2: Documents are Indexed by their word contents.

Since we cannot easily estimate the number of words across all documents, one possible solution is to use a linked list to maintain the list of words. In addition, a list of files needs to be kept for each word; since the number of documents (or files) per word cannot be estimated in advance either, a linked list is again suggested.
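
One way to picture the suggested structure is a linked list of words in which every word node owns its own linked list of file names. The sketch below is one possible layout under that reading. The names `FileNode`, `WordNode`, and `LinkedIndex` are hypothetical, and inserting new nodes at the front of each list is a simplification chosen for brevity.

```cpp
#include <string>
#include <vector>

// Hypothetical node types; the names are illustrative, not from the assignment.
struct FileNode {
    std::string fileName;
    FileNode* next;
};

struct WordNode {
    std::string word;
    FileNode* files;   // head of this word's file list
    WordNode* next;
};

class LinkedIndex {
public:
    LinkedIndex() = default;
    LinkedIndex(const LinkedIndex&) = delete;            // the index owns raw nodes,
    LinkedIndex& operator=(const LinkedIndex&) = delete; // so copying is disabled
    ~LinkedIndex() {
        while (head_ != nullptr) {
            WordNode* w = head_;
            head_ = w->next;
            while (w->files != nullptr) {
                FileNode* f = w->files;
                w->files = f->next;
                delete f;
            }
            delete w;
        }
    }

    // Records that `word` occurs in `fileName`; duplicates are ignored.
    void add(const std::string& word, const std::string& fileName) {
        WordNode* w = findWord(word);
        if (w == nullptr) {
            w = new WordNode{word, nullptr, head_};   // push the new word at the front
            head_ = w;
        }
        for (FileNode* f = w->files; f != nullptr; f = f->next) {
            if (f->fileName == fileName) return;
        }
        w->files = new FileNode{fileName, w->files};  // push the new file at the front
    }

    // Returns the files containing `word` (empty if the word is not indexed).
    std::vector<std::string> lookup(const std::string& word) const {
        std::vector<std::string> result;
        const WordNode* w = findWord(word);
        for (const FileNode* f = (w ? w->files : nullptr); f != nullptr; f = f->next) {
            result.push_back(f->fileName);
        }
        return result;
    }

private:
    // Linear scan over the word list.
    WordNode* findWord(const std::string& word) const {
        for (WordNode* w = head_; w != nullptr; w = w->next) {
            if (w->word == word) return w;
        }
        return nullptr;
    }
    WordNode* head_ = nullptr;
};
```

Because new nodes are pushed at the front, `lookup` returns the most recently added file first. Finding a word is linear in the number of distinct words; keeping the word list sorted would not change that for a plain linked list, but it would make listing the index in alphabetical order straightforward.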

stopwords File

a about above after again against all am an and any are aren't as at be because been before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you you'd you'll you're you've your yours yourself yourselves
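
A list like the one above can be read into a set once at startup. The following is a small sketch under the assumption that the file is whitespace-separated; the function name `loadStopWords` is hypothetical. Note that entries containing an apostrophe, such as aren't, are stored verbatim here, so they would never match words produced under the [a-z] rule (which would split them at the apostrophe). How to reconcile the two is a design decision this sketch leaves open.

```cpp
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the loader without a file
#include <string>
#include <unordered_set>

// Hypothetical loader: reads whitespace-separated stop words from any input stream
// (for example a std::ifstream opened on the stop words file).
std::unordered_set<std::string> loadStopWords(std::istream& in) {
    std::unordered_set<std::string> stopWords;
    std::string word;
    while (in >> word) {
        stopWords.insert(word);   // entries are kept verbatim, apostrophes included
    }
    return stopWords;
}
```

Membership can then be checked with `stopWords.count(word) != 0` while reading the documents.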

_________________

-Docs-

Document  Text
1         Pease porridge hot, pease porridge cold,
2         Pease porridge in the pot,
3         Nine days old.
4         Some like it hot, some like it cold,
5         Some like it in the pot,
6         Nine days old.

Example text; each line is one document.
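
To see what an index over this example text might contain, the sketch below indexes the six lines and answers single-word queries. It is an illustration only: it numbers the lines 1 to 6 by position, uses `std::forward_list` (the standard library's singly linked list) instead of hand-written nodes, and skips stop word filtering to stay short. The names `indexDocument`, `query`, and `buildExampleIndex` are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <forward_list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory index: a singly linked list of (word, list of document numbers).
using Postings = std::forward_list<int>;
using Index = std::forward_list<std::pair<std::string, Postings>>;

// Adds every lowercase [a-z] word of `text` to the index under document number `doc`.
// Stop word filtering is omitted here to keep the sketch short.
void indexDocument(Index& index, int doc, const std::string& text) {
    std::string word;
    // The extra space guarantees that the last word is terminated.
    for (unsigned char ch : text + " ") {
        char lower = static_cast<char>(std::tolower(ch));
        if (lower >= 'a' && lower <= 'z') {
            word += lower;
            continue;
        }
        if (word.empty()) continue;
        Postings* postings = nullptr;
        for (auto& entry : index) {
            if (entry.first == word) { postings = &entry.second; break; }
        }
        if (postings == nullptr) {
            index.emplace_front(word, Postings{});
            postings = &index.front().second;
        }
        // Documents arrive in increasing order, so the newest number sits at the front.
        if (postings->empty() || postings->front() != doc) postings->push_front(doc);
        word.clear();
    }
}

// Returns the document numbers containing `word`, in increasing order.
std::vector<int> query(const Index& index, const std::string& word) {
    for (const auto& entry : index) {
        if (entry.first == word) {
            std::vector<int> docs(entry.second.begin(), entry.second.end());
            return std::vector<int>(docs.rbegin(), docs.rend());
        }
    }
    return {};
}

// Builds the index for the six example lines, numbered 1 to 6 by position.
Index buildExampleIndex() {
    const std::vector<std::string> docs = {
        "Pease porridge hot, pease porridge cold,",
        "Pease porridge in the pot,",
        "Nine days old.",
        "Some like it hot, some like it cold,",
        "Some like it in the pot,",
        "Nine days old."};
    Index index;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        indexDocument(index, static_cast<int>(i) + 1, docs[i]);
    }
    return index;
}
```

With this sketch, `query(buildExampleIndex(), "hot")` yields documents 1 and 4, and a query for pot yields documents 2 and 5. Since stop words are not filtered here, words such as in, the, and it are indexed as well; a full solution would drop them before insertion.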
