Answered step by step
Verified Expert Solution
Question
1 Approved Answer
Part 2 - Inverted Index Your task is to implement a program in invertedIndex.c that reads data from a given collection of pages in collection.txt
Part Inverted Index
Your task is to implement a program in invertedIndex.c that reads data from a given collection of pages in collection.txt and generates an "inverted index" that provides a sorted list set of URLs for every word in a given collection of pages.
One possible and recommended implementation of an inverted index is a binary search tree where each node contains a word and a corresponding file list, which is a linked list. Each node of the file list contains the filename of a file that contains the word. The inverted index tree must be ordered alphabetically ascending by word, and each file list must be sorted alphabetically ascending by filename.
The full inverted index will contain many inverted index nodes arranged in a binary search tree ordered alphabetically by word and each inverted index node will contain a file list. For example:
In this diagram, each of the blue linked lists contains the URLs of the pages which contain the corresponding word. Note that this is only a suggestion for how to implement this part. As long as your implementation works, and doesn't exceed the time limit of the autotests, you can receive full marks.
Before inserting words into your inverted index, you must normalise them by converting all letters to lowercase and removing the following punctuation marks from the end of the words:
dot
comma
: colon
; semicolon
question mark
asterisk
If there are multiple of the above punctuation marks at the end of a word, you should remove all of them. Punctuation marks at the start of a word or in the middle of a word should not be removed. Other punctuation marks should not be removed. If a word becomes empty after removing punctuation marks, then it should not be inserted into the inverted index.
Here are some examples of normalisation:
Word Normalised word
Data data
BSTs bsts
algorithms. algorithms
Why? why
graphs graphs
NET net
unsw.edu.au unsw.edu.au
Sydney! sydney!
:;
new...........s new...........s
empty word
Here is an example of how the program is run:
invertedIndex
Your program should print an inverted index to a file named invertedIndex.txt Each line of the output should begin with a word from the inverted index followed by a spaceseparated list of filenames. Lines should be ordered alphabetically ascending by the initial word, and each list of files should be ordered alphabetically ascending by filename.
Here is an example to demonstrate the expected format:
design url url url url
mars url url url
vegetation url url
On each line, a word and all the URLs should be separated by one or more spaces. The testing program will ignore additional spaces.
AssumptionsConstraints
Relevant assumptions from previous parts apply.
When we say "alphabetical order", we mean as determined by strcmp
All files required by the program ie the collection file and all the files listed in the collection file are valid readableopenable text files, and will be in the current working directory when the program is executed.
Words are at most characters in length.
program in c
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started