Question

Zipf's Law is a curious observation about the statistical distribution of words in text: the frequency of any word is inversely proportional to its rank in the frequency table. Frequency is the number of times a word appears in the text. When words are ranked by frequency, the most frequent word is given rank 1, the next most frequent word rank 2, and so on. Ties are handled by assigning every tied word the middle value of the range of ranks the group occupies. For example, in a two-way tie for most frequent word, each word is given the rank 1.5.
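The tie rule can be sketched as a small helper. This is only an illustration of the rule as stated above; the function name `averageRank` and its parameters are assumptions, not part of the assignment.

```cpp
// Average ("middle") rank shared by a group of tied words.
// firstRank is the 1-based rank at which the group starts; groupSize is
// the number of words tied at that frequency (assumed to be >= 1).
double averageRank(int firstRank, int groupSize) {
    int lastRank = firstRank + groupSize - 1;   // last rank the group occupies
    return (firstRank + lastRank) / 2.0;        // midpoint of the range
}
```

For instance, two words tied for first place occupy ranks 1 and 2, so `averageRank(1, 2)` gives 1.5.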

Write a C++ program to investigate the validity of Zipf's Law. Read in words from a text file, counting their frequency of occurrence. Print out the words sorted by frequency (highest frequency first). Within each frequency group, sort the words alphabetically. Then print a table that lists word rank, frequency, and rank * frequency to determine whether Zipf's Law holds true. According to Zipf's Law, the product of rank and frequency should be (roughly) constant.

Test your program on this sample text file:

This is a pile of nonsense, of no particular importance other than that of demonstrating how this program is supposed to work. Tra la la la la, la la la la.

Implementation Details

The input text file consists of a stream of ASCII characters that you must parse into words. For this assignment, restrict words to strings of letters, possibly containing embedded single quotes (contractions). Words are separated from the surrounding text by anything that is not a letter. Make your program case-insensitive by converting all words to lower case. (Hint: the strtok() routine is useful for parsing text into words.)
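One possible way to apply these parsing rules is a hand-written scanner instead of strtok(), since the separator set is "anything that is not a letter". The sketch below is an assumption about how the rules could be read: `nextWord` and its parameters are hypothetical, and a single quote is kept only when it has a letter on both sides.

```cpp
#include <cctype>
#include <cstddef>

// Copies the next word found in text at or after *pos into out, lowercased.
// Returns false when no further word exists. outCap must be at least 1;
// characters that do not fit in out are dropped.
bool nextWord(const char* text, std::size_t* pos, char* out, std::size_t outCap) {
    std::size_t i = *pos;
    // Skip everything that is not a letter.
    while (text[i] != '\0' && !std::isalpha(static_cast<unsigned char>(text[i]))) ++i;
    if (text[i] == '\0') { *pos = i; return false; }

    std::size_t n = 0;
    while (text[i] != '\0') {
        unsigned char c = static_cast<unsigned char>(text[i]);
        bool letter = std::isalpha(c) != 0;
        // We are inside a word, so the previous character was a letter;
        // the quote is embedded only if a letter also follows it.
        bool embeddedQuote = (c == '\'') &&
            std::isalpha(static_cast<unsigned char>(text[i + 1])) != 0;
        if (!letter && !embeddedQuote) break;
        if (n + 1 < outCap) out[n++] = static_cast<char>(std::tolower(c));
        ++i;
    }
    out[n] = '\0';
    *pos = i;   // resume scanning here on the next call
    return true;
}
```

Calling it in a loop over a buffer holding the file contents yields one lowercase word per call; leading or trailing quotes, digits, and punctuation act as separators.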

When a great deal of searching is required, a hash table is an appropriate data structure. Hash table performance is O(1) on average for find, insert, and delete operations, allowing you to insert new words and increment word frequencies very efficiently. Using a hash table, write an application to test Zipf's Law.

Implementation notes

Store the words and their frequencies in a hash table that you implement with a hash table class. (Implement your own hash table. Do not use any STL functionality.) Include a constructor, a destructor, and member functions for inserting, deleting, and finding items in the hash table. (I recognize that there are no deletions in this assignment. Humor me.) The hash function is up to you, but make it a good one for strings. Use open addressing with linear probing to resolve collisions.
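A minimal sketch of such a class is shown below, assuming C++11. Everything in it is illustrative: the names `WordTable`, `Slot`, and `kMaxLen`, the fixed-size key buffer, and the choice of a djb2-style string hash (h = h * 33 + c) are assumptions rather than requirements of the assignment. Deleted slots are marked with a tombstone state so that linear probing can still step past them. Growth and rehashing are left out here.

```cpp
#include <cstddef>
#include <cstring>

const std::size_t kMaxLen = 31;   // assumption: longer words are truncated

struct Slot {
    enum State { EMPTY, USED, DELETED };   // DELETED is a tombstone
    State state;
    char word[kMaxLen + 1];
    int count;
};

class WordTable {
public:
    // capacity must be greater than zero (ideally prime).
    explicit WordTable(std::size_t capacity)
        : cap_(capacity), used_(0), slots_(new Slot[capacity]) {
        for (std::size_t i = 0; i < cap_; ++i) {
            slots_[i].state = Slot::EMPTY;
            slots_[i].count = 0;
        }
    }
    ~WordTable() { delete[] slots_; }
    WordTable(const WordTable&) = delete;
    WordTable& operator=(const WordTable&) = delete;

    // Linear probing: stop at the first EMPTY slot, step over tombstones.
    Slot* find(const char* word) {
        std::size_t i = hash(word) % cap_;
        for (std::size_t n = 0; n < cap_; ++n, i = (i + 1) % cap_) {
            Slot& s = slots_[i];
            if (s.state == Slot::EMPTY) return nullptr;
            if (s.state == Slot::USED &&
                std::strncmp(s.word, word, kMaxLen) == 0) return &s;
        }
        return nullptr;
    }

    // Counts one occurrence of word; returns false only if the table is full.
    bool insert(const char* word) {
        if (Slot* hit = find(word)) { ++hit->count; return true; }
        std::size_t i = hash(word) % cap_;
        for (std::size_t n = 0; n < cap_; ++n, i = (i + 1) % cap_) {
            Slot& s = slots_[i];
            if (s.state != Slot::USED) {          // EMPTY or tombstone
                std::strncpy(s.word, word, kMaxLen);
                s.word[kMaxLen] = '\0';
                s.state = Slot::USED;
                s.count = 1;
                ++used_;
                return true;
            }
        }
        return false;
    }

    bool remove(const char* word) {
        Slot* s = find(word);
        if (s == nullptr) return false;
        s->state = Slot::DELETED;
        --used_;
        return true;
    }

    std::size_t size() const { return used_; }

private:
    // djb2-style hash over at most kMaxLen characters.
    static std::size_t hash(const char* s) {
        std::size_t h = 5381;
        for (std::size_t n = 0; s[n] != '\0' && n < kMaxLen; ++n)
            h = h * 33 + static_cast<unsigned char>(s[n]);
        return h;
    }

    std::size_t cap_;
    std::size_t used_;
    Slot* slots_;
};
```

Calling `insert` once per parsed word either creates an entry with a count of 1 or increments the existing count, which is all the counting phase needs.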

Start with a dynamically-allocated hash table of about 1K entries (remember, the hash table size should generally be prime). Whenever the hash table gets over 75% full, rehash the contents into a new hash table, approximately twice as large as the previous one. Print a message to the screen, notifying the user when rehashing occurs.
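The sizing rules above could be expressed with a few helpers. This is a sketch under assumptions: the names `isPrime`, `nextPrime`, `needsRehash`, and `grownCapacity` are hypothetical, trial division is used only because table sizes are small, and "approximately twice as large" is interpreted as the first prime at or above double the old capacity.

```cpp
#include <cstddef>

// Trial division; adequate for table-sized numbers.
bool isPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime that is greater than or equal to n.
std::size_t nextPrime(std::size_t n) {
    while (!isPrime(n)) ++n;
    return n;
}

// True when used / capacity exceeds 75%, using integer arithmetic only.
bool needsRehash(std::size_t used, std::size_t capacity) {
    return used * 4 > capacity * 3;
}

// Capacity for the replacement table: roughly double, rounded up to a prime.
std::size_t grownCapacity(std::size_t capacity) {
    return nextPrime(capacity * 2);
}
```

With these, `nextPrime(1024)` gives a starting size of 1031. The rehash itself would allocate a table of `grownCapacity(old)` slots, reinsert every used entry with its count, free the old array, and print the notification message.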

Open addressing makes it easy to sort the hash table after counting the word frequencies. To sort the table, use the standard library routine qsort(). Words should be sorted first by frequency, and then alphabetically within frequency groups.
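A comparator for this ordering might look like the following sketch. The struct `FreqEntry` and the function names are assumptions made for illustration; a real table entry would carry whatever fields the hash table class defines.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

struct FreqEntry {
    char word[32];
    int count;
};

// qsort comparator: higher counts first, then alphabetical within a count.
int compareFreqEntries(const void* a, const void* b) {
    const FreqEntry* x = static_cast<const FreqEntry*>(a);
    const FreqEntry* y = static_cast<const FreqEntry*>(b);
    if (x->count != y->count) return (x->count < y->count) ? 1 : -1;
    return std::strcmp(x->word, y->word);
}

void sortByFrequency(FreqEntry* items, std::size_t n) {
    std::qsort(items, n, sizeof(FreqEntry), compareFreqEntries);
}
```

If unused slots carry a count of 0, sorting the whole table array in place pushes them to the end, so the first "distinct words" entries are the ones to print.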

Output the results to two files, with the same name as the input text file, but different extensions. For example, given an input file named textfile.txt, print the word concordance to a file named textfile.wrd, and the rank vs. frequency table to a file named textfile.csv. Output files should include headers with the input filename, the total number of words, and the total number of distinct words.
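Deriving an output name from the input name amounts to swapping the extension. The helper below is a sketch; `replaceExtension` is a hypothetical name, and treating '/' as the only directory separator is an assumption.

```cpp
#include <cstddef>
#include <cstring>

// Writes inputName with its extension replaced by newExt (for example ".wrd")
// into out. If inputName has no extension, newExt is appended.
// Returns false when the result does not fit in outCap bytes.
bool replaceExtension(const char* inputName, const char* newExt,
                      char* out, std::size_t outCap) {
    const char* dot = std::strrchr(inputName, '.');
    const char* slash = std::strrchr(inputName, '/');
    // A dot that belongs to a directory name is not an extension.
    if (dot != nullptr && slash != nullptr && dot < slash) dot = nullptr;

    std::size_t stem = (dot != nullptr)
        ? static_cast<std::size_t>(dot - inputName)
        : std::strlen(inputName);
    if (stem + std::strlen(newExt) + 1 > outCap) return false;

    std::memcpy(out, inputName, stem);   // copy everything before the dot
    std::strcpy(out + stem, newExt);     // append the new extension and '\0'
    return true;
}
```

For example, passing "textfile.txt" and ".wrd" produces "textfile.wrd".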

Program output

The output of your program should include the following:

Print the word frequencies to a file with the same name as the input file, but with the extension .wrd. For example, given an input text file file.txt, write the output to file.wrd. File header info should include the input filename, total number of words processed, and number of distinct words found.

Print words in frequency groups, sorted alphabetically within each group. Do not print empty frequency groups. Print multiple words per line, left justified within columns. For formatting purposes, you may assume that words will not exceed 15 characters in length.

Print rank and frequency information to a CSV (comma separated value) file, suitable for importing into an Excel spreadsheet. This file should have the same name as the input text file, with a .csv extension. Include the same file header info as in the .wrd file.

Print timing results to the console. This will give an indication of how efficiently your program executes.
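One simple way to obtain the timing figure is the standard `std::clock()` routine, which reports processor time rather than wall-clock time. The helper name `elapsedMsec` below is an assumption.

```cpp
#include <ctime>

// Converts two std::clock() readings into elapsed milliseconds.
double elapsedMsec(std::clock_t start, std::clock_t end) {
    return 1000.0 * static_cast<double>(end - start) / CLOCKS_PER_SEC;
}
```

A typical use is to call `std::clock()` before reading the file and again after the output files are written, then print the difference with one decimal place. On a 31-word input the result will usually round to 0.0 msec.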

Program execution

% zipf testfile.txt

Read 31 words from the file testfile.txt.

Inserted 20 distinct words into the hash table.

Elapsed time = 0.0 msec

Contents of input file: testfile.txt

This is a pile of nonsense, of no particular importance other than that of demonstrating how this program is supposed to work.

Tra la la la la, la la la la.

Contents of output file: testfile.wrd

Zipf's Law: word concordance

----------------------------

File: testfile.txt

Total words: 31

Distinct words: 20

Word Frequencies                                  Ranks   Avg Rank
----------------                                  -----   --------
Words occurring 8 times:                          1       1.0
la
Words occurring 3 times:                          2       2.0
of
Words occurring 2 times:                          3-4     3.5
is              this
Words occurring once:                             5-20    12.5
a               demonstrating   how             importance      no
nonsense        other           particular      pile            program
supposed        than            that            to              tra
work
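The column layout of a frequency group can be produced with a left-justified field width. The sketch below assumes 16-character columns (room for a 15-character word plus one space); `printColumns` and the number of words per line are illustrative choices, not specified by the assignment.

```cpp
#include <cstddef>
#include <cstdio>

// Prints n words to out, left-justified in 16-character columns,
// perLine words per row (perLine must be greater than zero).
// The last word on each row is written without trailing padding.
void printColumns(std::FILE* out, const char* const* words,
                  std::size_t n, std::size_t perLine) {
    for (std::size_t i = 0; i < n; ++i) {
        bool endOfRow = ((i + 1) % perLine == 0) || (i + 1 == n);
        if (endOfRow)
            std::fprintf(out, "%s\n", words[i]);
        else
            std::fprintf(out, "%-16s", words[i]);
    }
}
```

Calling it once per frequency group, right after the group's header line, keeps empty groups out of the file because nothing is printed for a count that never occurs.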

Contents of output file: testfile.csv

Zipf's Law: rank * freq = const

-------------------------------

File: testfile.txt

Total words: 31

Distinct words: 20

rank, freq, r*f
1.0, 8, 8.0
2.0, 3, 6.0
3.5, 2, 7.0
12.5, 1, 12.5
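Each data row of the rank table is the average rank, the frequency, and their product. A row formatter could be sketched as follows; `formatCsvRow` is a hypothetical helper, and the one-decimal format is inferred from the sample output.

```cpp
#include <cstddef>
#include <cstdio>

// Formats one "rank, freq, r*f" row into out (no trailing newline).
// Returns the number of characters the full row requires, as snprintf does.
int formatCsvRow(char* out, std::size_t outCap, double avgRank, int freq) {
    return std::snprintf(out, outCap, "%.1f, %d, %.1f",
                         avgRank, freq, avgRank * freq);
}
```

For the group of two-occurrence words in the sample, an average rank of 3.5 and a frequency of 2 give the row `3.5, 2, 7.0`.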
