
Question

In this project, you will design a very simple yet effective document retrieval system in C++ that responds to single-word queries. The system will let users enter a single word and will then return the list of documents containing that word.

Consider a scenario where there are approximately 10,000 text files, and you would like to develop a program that enables users to search for a specific keyword across all documents. Users are interested in the documents containing this keyword.

A simple first attempt at solving this problem would be a program that reads a keyword from the user, then scans all files and lists the files containing that keyword. This is the most expensive approach in terms of time, because every query rereads the entire collection.
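To make the cost of the scanning approach concrete, here is a minimal sketch of it. The function names `containsWord` and `naiveSearch` are hypothetical, and the sketch assumes the keyword is already lowercase and that a word is a maximal run of ASCII letters.

```cpp
#include <cassert>
#include <cctype>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns true if the stream contains `keyword` as a whole word.
// Assumption: `keyword` is lowercase; input letters are lowercased on the fly.
bool containsWord(std::istream& in, const std::string& keyword) {
    std::string word;
    char ch;
    while (in.get(ch)) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (std::isalpha(c)) {
            word += static_cast<char>(std::tolower(c));
        } else {
            if (word == keyword) return true;
            word.clear();  // a non-letter ends the current word
        }
    }
    return word == keyword;  // the last word may end at end-of-file
}

// Naive search: opens and scans every file again for each query.
std::vector<std::string> naiveSearch(const std::vector<std::string>& paths,
                                     const std::string& keyword) {
    std::vector<std::string> hits;
    for (const std::string& path : paths) {
        std::ifstream file(path);
        if (file && containsWord(file, keyword)) hits.push_back(path);
    }
    return hits;
}
```

Each call to `naiveSearch` touches every file, so with 10,000 documents a single query costs 10,000 file reads.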

A more efficient way to retrieve documents for a user's single-word query is a technique called indexing. Document indexing, in its simplest form, is a means of organizing and storing documents for later retrieval based on the words they contain.
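One possible shape for such an index is sketched below: each word maps to a singly linked list of the IDs of the documents that contain it. The class name `InvertedIndex`, the integer document IDs, and the use of `std::map` with `std::forward_list` are assumptions made for illustration; a hand-written linked list would serve equally well.

```cpp
#include <cassert>
#include <forward_list>
#include <map>
#include <string>
#include <vector>

// Hypothetical index type: each word maps to a linked list of document IDs.
class InvertedIndex {
public:
    // Records that `word` occurs in document `docId`.
    // Assumption: documents are indexed one at a time, so a repeated
    // occurrence within the same document finds its ID at the list front.
    void add(const std::string& word, int docId) {
        std::forward_list<int>& postings = index_[word];
        if (postings.empty() || postings.front() != docId) {
            postings.push_front(docId);
        }
    }

    // Returns the IDs of documents containing `word`, most recently
    // indexed first, or an empty vector if the word was never indexed.
    std::vector<int> lookup(const std::string& word) const {
        auto it = index_.find(word);
        if (it == index_.end()) return {};
        return std::vector<int>(it->second.begin(), it->second.end());
    }

private:
    std::map<std::string, std::forward_list<int>> index_;
};
```

Once the index is built, a query is a single dictionary lookup followed by a walk over one list, rather than a scan of every file.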

Task: Design and implement a system that indexes the documents by their content words, using a linked list as the data structure. While reading the text documents:

Get all words, where a word is a string of alpha characters terminated by a non-alpha character (white space is not alpha). The alpha characters are defined to be [a-z]. Therefore, the character sequence apple+78&^+orange yields the words apple and orange.

Lowercase all words,

Filter out all the words that are in the stop words list, such as a, an, the. ("Stop words" usually refers to the most common words in a language.) Why do you think this filtering is done? (Read: https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) The list of stop words is provided on webonline.
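The three reading steps above can be sketched as a single pass over the text. The function name `tokenize` is hypothetical, and the sketch assumes that uppercase ASCII letters also count as alpha and are lowercased before the stop-word check; the stop-word set shown in the usage is only the three examples from the text, not the provided list.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Splits `text` into lowercase words and drops those found in `stopWords`.
// Assumption: A-Z count as alpha and are mapped to a-z; every other
// character (digits, punctuation, white space) ends the current word.
std::vector<std::string> tokenize(const std::string& text,
                                  const std::set<std::string>& stopWords) {
    std::vector<std::string> words;
    std::string current;
    // Emits the word collected so far unless it is empty or a stop word.
    auto flush = [&]() {
        if (!current.empty() && stopWords.count(current) == 0) {
            words.push_back(current);
        }
        current.clear();
    };
    for (char ch : text) {
        if (ch >= 'a' && ch <= 'z') {
            current += ch;
        } else if (ch >= 'A' && ch <= 'Z') {
            current += static_cast<char>(ch - 'A' + 'a');  // lowercase it
        } else {
            flush();
        }
    }
    flush();  // the text may end in the middle of a word
    return words;
}
```

With the stop words a, an, and the, calling `tokenize("apple+78&^+orange", stopWords)` returns the two words apple and orange, matching the example in the task description.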

Limitations and Assumptions

1. The collection of documents is closed (the content and number of documents are fixed and will never change).

2. Each document is stored in a single text file. Hence, if there are 10,000 documents, then there are 10,000 text files. (The collection of documents is given, but I can't upload it here.)
