Question

1 Approved Answer

Posted on Jul 28, 2024

In C language, write a program that analyses text. Specifically, you will be counting the number of times that each specific word occurs in a

In C language, write a program that analyses text. Specifically, you will be counting the number of times that each specific word occurs in a document. This is sometimes called a "bag of words". This calculation is often the first part of text analysis, including document classification (including detecting spam) and sentiment analysis. You will be required to produce your code in a specific way, so that we can exercise your ability to use memory allocation and arrays.

Data

Download the "Data Folder" from https://archive.ics.uci.edu/ml/datasets/sms+spam+collection# . Do this now so that you have this datafile (even if "the internet goes down" later). Unzip the files and move the file called SMSSpamCollection into the directory in which you are putting the code for your assignment. Don't change the name of that file. Warning: the SMSSpamCollection file may contain some crude and potentially offensive language and in no way reflect the opinions of anyone associated with this course (these are random SMS messages sent by anonymous users and collected in a public dataset). If you do not want to process a file that may contain crude or potentially offensive language, please contact the instructor an he will arrange an alternate file for you to process.

Structures

You will be required to use the following two structures to represent the "bag of words" in the computer's memory.

struct word_count_struct { char *word; int count; }; struct bag_struct { struct word_count_struct *bag; int bag_size; int total_words; };

The first structure represents a single word and the number of times that the word occurs in memory. Note that the structure does not include memory to store the word, only a pointer, you will need to allocate the memory for the word and make the pointer point to it.

The second structure represents all the word and count pairs as well as the total number of such pairs. Note that the structure does not include memory to store the bag, only a pointer, you will need to allocate the memory for the bag by allocating an array of word_count_structs and make the pointer point to it.

Do not change the definitions of these structures.

Functions

You will be writing the following functions for this program:

struct bag_struct *new_bag();

This function will return a pointer to a newly allocated bag_struct with a bag_size of zero, and total_words of zero (you will need to malloc the struct bag_struct).

int bagsearch( struct bag_struct *bow, char *word );

This function will perform a binary search of the "bag" array contained in the struct bag_struct pointed to by the pointer bow. The binary search should look in the middle of the array for the "word". If the "word" is alphabetically before the middle word in the "bow", then the search should be repeated on the first "half" of the bow. This can be done by creating a new struct bag_struct with the same "bag" and "bag_size" equal to ~1/2 of the original bag_size and calling "bagsearch" recursively. If the "word" is alphabetically after the middle word in the "bow", then the search should be repeated on the second half of the bow. This can be done by creating a new struct bag_struct with "bag" pointing one position past the middle element, and "bag_size" equal to ~1/2 of the original bag_size. (HINT: do a few examples to figure out exactly how to calculate 1/2 (round up or round down) so that you search exactly the right part of the array, and no further.) If at any point the middle element matches "word", then the function should return the index of "word" in the bag_struct. If "word" is not located in the bag_struct, then the return index should refer to the first word that is alphabetically after "word".

char *get_word( char **string_ptr );

This function should loop over the characters in the string that is pointed to by string_ptr. It should start by reading and skipping any letters that are not uppercase or lowercase A-Z letters. Next it should read characters, convert them to lower case, and remember the lower case characters until it encounters something other than an uppercase or lowercase letter. It should store all characters encountered prior to the non-letter in a newly allocated string variable just large enough to hold the letters and a terminating '\0'. Increment the value pointed to by string_ptr so that the first character that it points to is the first non-alphabetic character after the last alphabetic character that you stored. Return the pointer to the newly allocated string variable, unless there are no alphabetic characters left in the string. In that case return NULL.

struct word_count_struct *new_word_count( char *word );

This function should allocate a new struct word_count_struct and return a pointer to it. Additionally, it should set the "word" pointer to point to the argument. It should set the "count" integer to 1. It should not allocate memory for, or copy the word (we will assume that this was already done by get_word).

void add_word( struct bag_struct *bow, char *word );

This function should search the "bow" for the provided "word", using bagsearch. If "word" is found in the "bow", it should increment "count" within the appropriate element of "bag". If "word" is not found in the "bow", then "bag" of "bow" should be reallocated to hold one additional "struct word_count_struct", a new "struct word_count_struct" should be created (with "new_word_count"). The words that belong after the argument "word" should be moved 1 position later, and the new "struct word_count_struct" inserted in its alphabetical location.

void add_text( struct bag_struct *bow, char *line );

This function should identify all the words in the string line (using the get word function), and add them to the bow.

void read_sms_data( struct bag_struct *bow, char *label );

This function should open the file: SMSSpamCollection and read it one line at a time. Each line will consist of either the word "ham" or the word "spam" followed by a tab ('\t') character. You may assume that the maximum length of text message that you can send is 918 characters. And that each line is terminated by a single newline character (' '). The variable label will always contain either "ham" or "spam". As you read the file, one line at a time, you should check if it starts with "ham" or "spam". If it does not match the provided "label", then you should ignore the line and move to the next line. If it matches the label, then you should add the line using the add_text function to the bow.

void differences( struct bag_struct *ham, struct bag_struct *spam );

This function should go through the entries of ham and the spam. It will need to go through them asynchronously --- that is, both files at the same time, but not necessarily at the same speed. At each step through the loop it should look at the "current" word from each bag_struct. If the two words are the same then the program should (floating point) divide their counts, by their respective total_words computing two word frequencies (how often that word occurs divided by the total number of words that are in the bag). If one word comes alphabetically before the other, then the alphabetically first word's count should be divided by its respective total_words to compute the frequency, while the second frequency should be calculated as 1 divided by the other bag's total words (this is called a pseudo count). If neither of the two frequencies is above 0.005 (i.e. 0.5%), then we discard the word as too rare to be useful. Next we will calculate the ratio of the two frequencies (divide one by the other) to calculate how much more often one word occurs compared to the other in the "ham" vs "spam" bags. If the ratio is less than 1, we will invert it (divide 1 by it). Finally, if the ratio is greater than 2.0, use the following formatting string "%20s %8.6f %8.6f %10.6f %s" to print the word, the ham frequency, the spam frequency, the final ratio and either the word "ham" or "spam" if the word occurs more frequently in in "ham" or "spam" respectively. After processing one word, move on to the next work. NOTE: advance your position in both bags if the two words matched; otherwise, only advance your position in the bag that had the alphabetically earlier word.

Program

Write a program that opens the SMSSpamCollection file and reads it once, collecting all the ham messages using read_sms_data, then rewinds the file and again reads it, this time collecting all the spam messages using read_sms_data. You will need two bag_struct for this (one for the spam and one for the ham). Call the differences function to display the words which are most indicative of the message type.

NOTE: you are encouraged to write any additional helper functions that help to support your code and make it easier to understand, or to debug what your code is doing.