Question
Zipf's Law is a curious observation about the statistical distribution of words in text: the frequency of any word is inversely proportional to its rank in the frequency table. Frequency is the number of times a word appears in the text. When words are ranked by frequency, the most frequent word is given rank 1, the next most frequent word rank 2, and so on. Ties are handled by assigning each tied word the middle value of the range of ranks the group spans. For example, in a two-way tie for most frequent word, each word is given the rank 1.5.
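The tie rule can be written as a one-line calculation. The helper below is only an illustrative sketch; the name average_rank and its parameters are assumptions, not part of the assignment.

```cpp
// Average rank shared by `count` tied words whose ranks start at `first_rank`.
// A two-way tie for first place spans ranks 1-2, so each word gets 1.5.
double average_rank(int first_rank, int count) {
    int last_rank = first_rank + count - 1;   // last rank covered by the group
    return (first_rank + last_rank) / 2.0;    // midpoint of the range
}
```

For instance, average_rank(1, 2) gives 1.5, and a group of 16 words starting at rank 5 gets 12.5.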
Write a C++ program to investigate the validity of Zipf's Law. Read in words from a text file, counting their frequency of occurrence. Print out the words sorted by frequency (highest frequency first). Within each frequency group, sort the words alphabetically. Then print a table that lists word rank, frequency, and rank * frequency to determine whether Zipf's Law holds true. According to Zipf's Law, the product of rank and frequency should be (roughly) constant.
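As a rough illustration of the rank * frequency table, the sketch below walks an array of frequencies that is assumed to be sorted in descending order already and writes one row per frequency group. The function name print_zipf_rows and the row format are assumptions chosen for this example.

```cpp
#include <cstdio>

// Writes one "rank, freq, r*f" row per frequency group to `out`.
// `freqs` must already be sorted in descending order; tied words share the
// average of the ranks they span. Returns the number of rows written.
int print_zipf_rows(const int* freqs, int n, std::FILE* out) {
    int rows = 0;
    int i = 0;
    while (i < n) {
        int j = i;
        while (j < n && freqs[j] == freqs[i]) ++j;  // [i, j) is one tie group
        double rank = ((i + 1) + j) / 2.0;          // ranks i+1 .. j, averaged
        std::fprintf(out, "%.1f, %d, %.1f\n", rank, freqs[i], rank * freqs[i]);
        ++rows;
        i = j;
    }
    return rows;
}
```

Called with the frequencies 8, 3, 2, 2 followed by sixteen 1s, it writes four rows whose products are 8.0, 6.0, 7.0 and 12.5.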
Test your program on this sample text file:
This is a pile of nonsense, of no particular importance other than that of demonstrating how this program is supposed to work. Tra la la la la, la la la la.
Implementation Details
The input text file consists of a stream of ASCII characters that you must parse into words. For this assignment, restrict words to strings of letters, possibly containing embedded single quotes (contractions). Words are separated from the surrounding text by anything that is not a letter. Make your program case-insensitive by converting all words to lower case. (Hint: the strtok() routine is useful for parsing text into words.)
When a great deal of searching is required, a hash table is an appropriate data structure. Hash tables offer O(1) average-case performance for find, insert, and delete operations, allowing you to insert new words and increment word frequencies very efficiently. Using a hash table, write an application to test Zipf's Law.
Implementation notes
Store the words and their frequencies in a hash table that you implement with a hash table class. (Implement your own hash table. Do not use any STL functionality.) Include a constructor, a destructor, and member functions for inserting, deleting, and finding items in the hash table. (I recognize that there are no deletions in this assignment. Humor me.) The hash function is up to you, but make it a good one for strings. Use open addressing with linear probing to resolve collisions.
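A minimal sketch of such a class follows, assuming FNV-1a as the string hash (the assignment leaves the choice open) and hypothetical names WordTable and Entry. It covers the constructor, destructor, find, and insert with linear probing. Deletion and rehashing are left out to keep the example short; deletion under open addressing would also need some way to mark removed slots so that probe chains are not broken.

```cpp
#include <cstring>

// FNV-1a hash of a NUL-terminated string (32-bit variant).
unsigned long fnv1a(const char* s) {
    unsigned long h = 2166136261UL;
    for (; *s; ++s) {
        h ^= (unsigned char)*s;
        h = (h * 16777619UL) & 0xFFFFFFFFUL;
    }
    return h;
}

struct Entry {
    char* word;   // nullptr marks an empty slot
    int   count;
};

class WordTable {
public:
    explicit WordTable(int capacity)
        : cap_(capacity), used_(0), slots_(new Entry[capacity]) {
        for (int i = 0; i < cap_; ++i) { slots_[i].word = nullptr; slots_[i].count = 0; }
    }
    ~WordTable() {
        for (int i = 0; i < cap_; ++i) delete[] slots_[i].word;
        delete[] slots_;
    }
    WordTable(const WordTable&) = delete;             // copying disabled in this sketch
    WordTable& operator=(const WordTable&) = delete;

    // Returns the entry for `w`, or nullptr if the word is absent.
    Entry* find(const char* w) const {
        Entry* e = &slots_[probe(w)];
        return e->word ? e : nullptr;
    }
    // Adds one occurrence of `w`. Assumes the table is never completely full.
    void insert(const char* w) {
        Entry& e = slots_[probe(w)];
        if (!e.word) {
            e.word = new char[std::strlen(w) + 1];
            std::strcpy(e.word, w);
            ++used_;
        }
        ++e.count;
    }
    int size() const { return used_; }

private:
    // Linear probing: step forward from the home slot until `w` or an empty slot.
    int probe(const char* w) const {
        int i = (int)(fnv1a(w) % (unsigned long)cap_);
        while (slots_[i].word && std::strcmp(slots_[i].word, w) != 0) i = (i + 1) % cap_;
        return i;
    }
    int cap_, used_;
    Entry* slots_;
};
```

Inserting "la" twice and "of" once into a WordTable of 1031 slots leaves size() at 2, with a count of 2 for "la".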
Start with a dynamically-allocated hash table of about 1K entries (remember, the hash table size should generally be prime). Whenever the hash table gets over 75% full, rehash the contents into a new hash table, approximately twice as large as the previous one. Print a message to the screen, notifying the user when rehashing occurs.
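The sizing rules can be sketched with a few small helpers. The names below (is_prime, next_prime, needs_rehash, grown_capacity) are hypothetical, and trial division is only one simple way to find a prime; moving the entries into the new table is not shown.

```cpp
// Trial-division primality test; adequate for table sizes in this range.
bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; (long long)d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime greater than or equal to n.
int next_prime(int n) {
    while (!is_prime(n)) ++n;
    return n;
}

// True when more than 75% of the slots are occupied (integer arithmetic only).
bool needs_rehash(int used, int capacity) {
    return 4LL * used > 3LL * capacity;
}

// Size of the replacement table: the first prime at or above twice the old size.
int grown_capacity(int capacity) {
    return next_prime(2 * capacity);
}
```

With these helpers, a table of "about 1K entries" could start at next_prime(1024), which is 1031, and grow to 2063 on the first rehash.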
Open addressing makes it easy to sort the hash table after counting the word frequencies. To sort the table, use the standard library routine qsort(). Words should be sorted first by frequency, and then alphabetically within frequency groups.
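A possible comparator for qsort() is sketched below. The WordCount struct and the function names are assumptions for illustration, and the array is assumed to contain only occupied entries; a real table would first need its empty slots compacted away or ordered last by the comparator.

```cpp
#include <cstdlib>
#include <cstring>

struct WordCount {
    const char* word;
    int count;
};

// qsort() comparator: higher counts first, then alphabetical order within a count.
int by_count_then_word(const void* a, const void* b) {
    const WordCount* x = static_cast<const WordCount*>(a);
    const WordCount* y = static_cast<const WordCount*>(b);
    if (x->count != y->count) return (x->count < y->count) ? 1 : -1;
    return std::strcmp(x->word, y->word);
}

// Sorts `n` entries in place.
void sort_words(WordCount* items, std::size_t n) {
    std::qsort(items, n, sizeof(WordCount), by_count_then_word);
}
```

Comparing the counts explicitly, rather than subtracting them, avoids any risk of integer overflow in the comparator.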
Output the results to two files, with the same name as the input text file but different extensions. For example, given an input file named textfile.txt, print the word concordance to a file named textfile.wrd and the rank vs. frequency table to a file named textfile.csv. Output files should include headers with the input filename, the total number of words, and the total number of distinct words.
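Deriving an output name from the input name might look like the sketch below. The helper with_extension is hypothetical, and it assumes '/' is the path separator.

```cpp
#include <cstdio>
#include <cstring>

// Writes `input` with its extension replaced by `ext` into `out`.
// If `input` has no extension, `ext` is simply appended.
// Returns false when the result does not fit in `cap` bytes.
bool with_extension(const char* input, const char* ext, char* out, std::size_t cap) {
    const char* dot = std::strrchr(input, '.');
    const char* slash = std::strrchr(input, '/');
    if (dot && slash && dot < slash) dot = nullptr;  // the dot is in a directory name
    std::size_t stem = dot ? (std::size_t)(dot - input) : std::strlen(input);
    int n = std::snprintf(out, cap, "%.*s%s", (int)stem, input, ext);
    return n >= 0 && (std::size_t)n < cap;
}
```

For example, with_extension("textfile.txt", ".wrd", buf, sizeof buf) fills buf with "textfile.wrd".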
Program output
The output of your program should include the following:
Print the word frequencies to a file with the same name as the input file, but with the extension .wrd. For example, given an input text file file.txt, write the output to file.wrd. File header info should include the input filename, total number of words processed, and number of distinct words found.
Print words in frequency groups, sorted alphabetically within each group. Do not print empty frequency groups. Print multiple words per line, left justified within columns. For formatting purposes, you may assume that words will not exceed 15 characters in length.
Print rank and frequency information to a CSV (comma separated value) file, suitable for importing into an Excel spreadsheet. This file should have the same name as the input text file, with a .csv extension. Include the same file header info as in the .wrd file.
Print timing results to the console. This will give an indication of how efficiently your program executes.
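The timing requirement could be met in several ways; the sketch below uses std::chrono::steady_clock from C++11, which is an assumption, since the assignment does not say which timer to use. The helper name elapsed_msec is hypothetical.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsed_msec(F work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The returned value can then be printed in the style of the sample run, for example with printf("Elapsed time = %.1f msec\n", ms).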
Program execution
% zipf testfile.txt
Read 31 words from the file testfile.txt.
Inserted 20 distinct words into the hash table.
Elapsed time = 0.0 msec
Contents of input file: testfile.txt
This is a pile of nonsense, of no particular importance other than that of demonstrating how this program is supposed to work.
Tra la la la la, la la la la.
Contents of output file: testfile.wrd
Zipf's Law: word concordance
----------------------------
File: testfile.txt
Total words: 31
Distinct words: 20
Word Frequencies              Ranks    Avg Rank
----------------              -----    --------
Words occurring 8 times:      1        1.0
la
Words occurring 3 times:      2        2.0
of
Words occurring 2 times:      3-4      3.5
is              this
Words occurring once:         5-20     12.5
a               demonstrating   how             importance      no              nonsense
other           particular      pile            program         supposed
than            that            to              tra             work
Contents of output file: testfile.csv
Zipf's Law: rank * freq = const
-------------------------------
File: testfile.txt
Total words: 31
Distinct words: 20
rank, freq, r*f
1.0, 8, 8.0
2.0, 3, 6.0
3.5, 2, 7.0
12.5, 1, 12.5