Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Sep 23, 2024

CSCI-2270 Assignment 7 Word Analysis HW 2 Converted to use a Hash Table with Separate Chaining Objectives Use a hashing function Store data in a

CSCI-2270

Assignment 7

Word Analysis

HW 2 Converted to use a Hash Table with Separate Chaining

Objectives

Use a hashing function

Store data in a chained hash table

Search for data in a hash table

Background

The background details are the same as Homework 2 Array Doubling -- but with one difference. This time you will implement a hash table using separate chaining with linked lists, and store the words in your hash tables rather than vectors.

What your program needs to do

Use the text files from HW 7, NOT HW 2. Download these files from moodle:

HW7-HungerGames_edit.txt

HW7-stopWords.txt

HashTable.hpp

Your program will be able to calculate the following information on any text file:

The top n words (excluding stopwords; n is also a command-line argument) and the number of times each word was found

The total number of unique words in the file (excluding stopwords)

The total number of words in the file (excluding stopwords)

Total number of collisions in a hash table.

Example:

Your program takes four command-line arguments: (1) the number of most common words to print out, (2) the name of the text file to process, (3) the name of the file containing the stopwords, and (4) and the size of your hash table to store the unique words in.

Running your program using:

./a.out 15 HW7-HungerGames_edit.txt HW7-stopWords.txt 53

would return the 15 most common words (excluding stopwords) in the file HW7-HungerGames_edit.txt, use a hash table of size 53, and should produce the following results:

682 - is

492 - peeta

479 - its

431 - im

427 - can

414 - says

379 - him

368 - when

367 - no

356 - are

324 - then

323 - back

319 - was

313 - just

306 - into

Number of collisions: 7628

Unique non-stop words: 7681

Total non-stop words: 59045

Program Specifications

The following are requirements for your program:

Your Hash Table must be implemented in a class based on the provided HashTable.hpp.

You must implement a driver program that utilizes your HashTable class. This driver program will contain your main function.

Include a comment block at the top of the .cpp files with your name, assignment number, date and course instructor.

Make sure your code is commented enough to describe what it is doing.

Driver Specifications:

You will create two instances of your HashTable for use in your driver function.

HashTable stopWordsTable: store/look up stop words

HashTable uniqueWordsTable: store/look up unique non-stop words.

When executing your driver program:

Read in the number of most common words to process from the first command-line argument.

Read in the name of the text file to process from the second command-line argument.

Create a hash table stopWordsTable of size STOPWORD_TABLE_SIZE to store the stop words in. Populate the hash table with the stopwords read in from the third command-line argument, using the getStopWords function.

Create a second hash table uniqueWordsTable of size specified by the fourth command-line argument. This hash table will store and count the words from the text file which are not stop-words.

Store the unique words (excluding stop words) found in the file in the uniqueWordsTable hash table.

Check if the word is a stopword first, and if it is, then ignore that word.

Check if the word is in the table or not (hint: use isInTable method).

If it is, add one to the count (hint: use the incrementCount method).

If it is not present, add it to the table. Count the number of collisions. (hint: use the addWord method; collision counting should be done in this method as well)

Before your program ends, be sure to free all dynamically allocated memory via the class destructor.

Output the top n most frequent words, number of collisions, number of unique non-stop words, and total non-stop words using the specified output (see bottom of document for code, and above for example).

Dealing with Stop Words in your driver:

Write a function named getStopWords that takes the name of the stopwords file and a hash table to store the stopwords, fills the hash table with the stopwords, and returns void. Read in the file for a list of the top 50 most common words to ignore.

Stopwords should be saved into stopWordsTable (one of two hash tables you should create in your driver).

The file will have one word per line. We will test with files having different words in it!

This is similar to the getStopWords function from Homework 2, but uses a hash table instead of a vector!

Write a function called isStopWord that checks the stopWordsTable hash table, and returns True if the word is in the table or False if it isnt.

Hash Table Specifications: You will implement a hash table data structure for storing wordItems, using the HashTable class definition we have provided. You will create two objects using this class: one to for storing the stopwords and checking if a word is a stopword, and one to store and count the unique words (excluding stop words).

getHash method:

Implement a hashing function in HashTables member function getHash that will minimize collisions. There are many ways to do this, but for this assignment, we will use a hashing function known as DJB2. Pseudocode for this function is as follows:

unsigned int getHash(string word)

unsigned int hash = 5381

for each character c in word:

hash = hash*33 + c

hash = hash % hashTableSize

return hash

A detailed explanation of why this hash function is suitable for hashing strings can be found on the web. It is easy to see, however, why an overly simple hash function, such as summing up the ASCII values of the characters in a string, may not minimize collisions. A simple sum of ASCII values would result in a collision between all words made up of the same characters (cat and act, for example). The DJB2 hash avoids this by applying the multiplication step as it iterates through the characters in the string. However, due to repeated multiplication, it is possible that the hash value will undergo integer overflow. That is the reason we need to use an unsigned int. Feel free to experiment with different hash functions and hash table sizes and see what happens to the number of collisions, but use the DJB2 hash for your submission.

addWord method:

Use the hashing function (getHash) to determine where in the hash table the word should be stored.

If the list at that spot is empty: dynamically allocate a new wordItem struct, make this new struct the head of the linked list, and add 1 to the number of unique words.

If the list already has items in it: this is a collision! Dynamically allocate a new struct and add it at the head of the linked list. Increment the number of collisions.

This method of using linked lists to deal with collisions is called Separate Chaining with Linked Lists.

isInTable method:

Implement a member function isInTable that takes a string as an argument and returns true if the string is already stored in the hash table, or returns false otherwise.

incrementCount method:

Implement a member function incrementCount that increments the count of a word already stored in the hash table by 1.

addWord method:

Write a member function named addWord that creates a new wordItem struct and adds it to the hash table at the appropriate location.

searchTable method:

Write a member function named searchTable that takes a string as an argument and returns a pointer to the wordItem struct that stores the string, or nullptr if the string is not currently stored in the hash table.

getTotalNumWords method:

Write a member function that sums up the count of each word for each word in the hash table.

getter methods:

Implement getter methods for numCollisions and numItems.

printTopN method:

Write a member function named printTopN that takes the value of N as an argument and determines and prints the top N words in the array. Hint: Declare an array of pointers of size n (static declaration), and use the insertIntoSortedArray algorithm from Assignment 1 to fill this array with words with the largest counts.

Format your output the following way: when you output the top n words in the file, the output needs to be in order, with the most frequent word printed first (see code below and reference the example above).

Program Output Specifications:

Your driver needs to print its analysis of the text in the following format (see example at beginning of document):

Count - Word

Number of collisions:

Unique non-stop words:

Total non-stop words:

Generate the output with these commands:

In printTopN:

cout << wordItem.count << << wordItem.word << endl;

In main:

cout << # << endl;

cout << Number of collisons: << uniqueWordsTable.getNumCollisions() << endl;

cout << #<

cout << Unique non-stop words: << uniqueWordsTable.getNumItems() << endl;

cout << # << endl;

cout << Total non-stop words: << uniqueWordsTable.getTotalNumWords() << endl;

Note: To avoid confusion with moodle build errors, you can use the same options moodle does on your machine to get the same errors:

g++ -std=c++11 -Wall -Werror HashTable.cpp Driver.cpp

Submitting Your Code:

Log into Moodle and go to the Homework 7 link. It is set up in the quiz format. Follow the instructions on each question to submit all or parts of each assignment question. You can check your solution to each question by clicking on the Check button. Note that you can only submit your homework once.

Note: there is no late period on assignments! If you miss the deadline or do not do well, you can sign up for an optional grading interview to get up to half the points missed back. There is also an optional extra credit assignment at the end of the semester you can use to replace one of your homework scores.

---------------------------------

HashTable.hpp

----------------------------------

#ifndef HW_7_HASH_TABLE #define HW_7_HASH_TABLE

#include

// struct to store word + count combinations struct wordItem { std::string word; int count; wordItem* next; };

/* class HashTable for storing words. * You will create two hash tables in your driver: * - one for storing stop words * - one for storing unique non-stop words * Why? We can load all the stopwords into the first table. * Then, we can quickly check that first hash table to see if * a word is a stopword before adding it to the second. */ class HashTable { public: HashTable(int hashTableSize); ~HashTable(); void addWord(std::string word); bool isInTable(std::string word); void incrementCount(std::string word); void printTopN(int n); int getNumCollisions(); int getNumItems(); int getTotalNumWords();

private: /* member functions */ unsigned int getHash(std::string word); wordItem* searchTable(std::string word);

/* instance variables */ wordItem** hashTable; int hashTableSize; int numItems; int numCollisions; };

/* size to make your stopwords hash table */ const int STOPWORD_TABLE_SIZE = 50;

/* Required functions for use in main. * you are free to also define your own helper functions * for your driver in your .cpp files if you wish. * These are the same functions from hw2, but use a * hashtable instead of a vector. */

/* load stopwords into the stopwords hash table */ void getStopWords(char *ignoreWordFileName, HashTable &stopWordsTable); /* check table to see if a word is a stopword or not */ bool isStopWord(std::string word, HashTable &stopWordsTable);

#endif

-------------------------------------------------------------------

Stopwords.txt

------------------------------------------------------------------

the be to of and a in that have i it for not on with he as you do at this but his by from they we say her she or an will my one all would there their what so up out if about who get which go me

-------------------------------------------------------

The other text file is to long to post. Its Hunger Games 1 in a text file.