Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Your task is to implement a program in invertedIndex.c that reads data from a given collection of pages in collection.txt and generates an inverted index

Your task is to implement a program in invertedIndex.c that reads data from a given collection of pages in collection.txt and generates an "inverted index" that provides a sorted list (set) of URLs for every word in a given collection of pages.
One possible (and recommended) implementation of an inverted index is a binary search tree where each node contains a word and a corresponding file list, which is a linked list. Each node of the file list contains the filename of a file that contains the word. The inverted index tree must be ordered alphabetically (ascending) by word, and each file list must be sorted alphabetically (ascending) by filename.
The full inverted index will contain many inverted index nodes arranged in a binary search tree (ordered alphabetically by word), and each inverted index node will contain a file list. For example:
In this diagram, each of the blue linked lists contains the URLs of the pages which contain the corresponding word. Note that this is only a suggestion for how to implement this part. As long as your implementation works, and doesn't exceed the time limit of the autotests, you can receive full marks.
Before inserting words into your inverted index, you must normalise them by converting all letters to lowercase and removing the following punctuation marks from the end of the words:
.(dot)
,(comma)
: (colon)
; (semicolon)
?(question mark)
*(asterisk)
If there are multiple of the above punctuation marks at the end of a word, you should remove all of them. Punctuation marks at the start of a word or in the middle of a word should not be removed. Other punctuation marks should not be removed. If a word becomes empty after removing punctuation marks, then it should not be inserted into the inverted index.
Here are some examples of normalisation:
Word Normalised word
Data data
BSTs bsts
algorithms. algorithms
Why? why
graphs*. graphs
.NET .net
unsw.edu.au. unsw.edu.au
Sydney! sydney!
.,!.,:; .,!
new...........s new...........s
*(empty word)
Here is an example of how the program is run:
./invertedIndex
Your program should print an inverted index to a file named invertedIndex.txt. Each line of the output should begin with a word from the inverted index followed by a space-separated list of filenames. Lines should be ordered alphabetically (ascending) by the initial word, and each list of files should be ordered alphabetically (ascending) by filename.
Here is an example to demonstrate the expected format:
design url2 url25 url31 url61
mars url101 url25 url31
vegetation url31 url61
On each line, a word and all the URLs should be separated by one or more spaces. The testing program will ignore additional spaces.
When we say "alphabetical order", we mean as determined by strcmp.
All files required by the program (i.e., the collection file and all the files listed in the collection file) are valid readable/openable text files, and will be in the current working directory when the program is executed.
Words are at most 1000 characters in length.
image text in transcribed

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

T Sql Fundamentals

Authors: Itzik Ben Gan

4th Edition

0138102104, 978-0138102104

More Books

Students also viewed these Databases questions