
Question

Write program analyzeMessages(filename, minWordLengthToConsider = 1) that analyzes word frequencies in real-world text messages. Text file SMScollection.txt contains 5574 SMS messages. There is additional information about the contents of the file in the associated "readme" file readmeSMScollection.txt, written by the creators of the dataset. The data was originally from this no-longer-working link: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. Some information about the data set and its initial investigators is now here. Each line of the file represents one SMS/text message. The first item on every line is a label - 'ham' or 'spam' - indicating whether that line's SMS is considered spam or not. The rest of the line contains the text of the SMS/message. For example:

spam Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! Call ...
ham Sorry, I'll call later in meeting.
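
The line format described above (label first, message text after it) can be separated with a single split. This is a minimal sketch, not the assignment's official solution: the helper name split_label is hypothetical, and it assumes the label and the message are separated by whitespace (a tab or a space).

```python
def split_label(line):
    """Split one line of the file into (label, message_text).

    Assumption: the label ('ham' or 'spam') is the first
    whitespace-delimited item and the rest of the line is the message.
    """
    parts = line.strip().split(None, 1)  # split once, on any whitespace
    if not parts:                        # blank line: nothing to report
        return "", ""
    label = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return label, text


label, text = split_label("ham\tSorry, I'll call later in meeting.\n")
print(label)  # ham
print(text)   # Sorry, I'll call later in meeting.
```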

At the end, your program must print summary information, including at least:

  • the number of ham and number of spam messages
  • the total number of words found in ham messages and in spam messages
  • the number of unique words found in ham messages and in spam messages
  • information, for both ham and for spam, about the twelve (at least) most frequently occurring words that are at least minWordLengthToConsider characters long. This information must include both the count of the number of occurrences and the relative frequency of a word's occurrence as a percentage (how many times that word appears out of the total number of words in the relevant message set). For example, if "you" appeared 80 times in ham, out of 1250 total ham word occurrences, the frequency would be 6.4%.
  • the average length (in words, not characters) of ham messages and of spam messages
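
The relative-frequency requirement above can be sketched as follows. The function name top_words and its parameters are hypothetical, chosen only for illustration; it assumes the word counts are already in a dictionary and that the percentage is taken over all word occurrences in the message set, as in the "you" example.

```python
def top_words(counts, total_words, n=12, min_len=1):
    """Return up to n (word, count, percent) tuples, most frequent first.

    counts      -- dictionary mapping word -> number of occurrences
    total_words -- total word occurrences in the relevant message set
    min_len     -- ignore words shorter than this many characters
    """
    eligible = [(w, c) for w, c in counts.items() if len(w) >= min_len]
    # Sort by descending count; break ties alphabetically for stable output.
    eligible.sort(key=lambda wc: (-wc[1], wc[0]))
    result = []
    for word, count in eligible[:n]:
        percent = 100.0 * count / total_words if total_words else 0.0
        result.append((word, count, percent))
    return result


# The example from the assignment: 80 occurrences out of 1250 words.
for word, count, percent in top_words({"you": 80, "to": 50}, 1250):
    print(f"{word}: {count} ({percent:.1f}%)")
```

Dividing by the total for the whole message set, rather than by the number of eligible words, keeps the percentages comparable when minWordLengthToConsider changes.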

Feel free to compute and print out additional information as well. To accomplish this, your analyzeMessages function should:

  • read all of the data from the input file
  • extract individual words from the messages. This should include an effort to get rid of "extras" such as periods, commas, question and exclamation marks, and other characters that aren't part of a word. You should probably also ignore capitalization. Thus in the sample spam message above, you probably want to treat "Congrats!" as "congrats" in your frequency analysis. Note: the string strip() method is very useful for this. I recommend you do not use the replace() method.
  • build two dictionaries (Note: using dictionaries is required for full credit on this assignment), one for frequencies of words appearing in spam messages, one for frequencies of words from ham messages.
  • extract the most frequently occurring words (of length at least minWordLengthToConsider)
  • compute and print summary information
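
The steps above can be sketched end to end as shown here. This is one possible approach under stated assumptions, not a reference solution: the helper extract_words, the layout of the stats dictionary, the UTF-8 file encoding, and the choice to return the statistics in addition to printing them are all assumptions made for illustration. Punctuation is removed with strip(), as the assignment suggests, so characters inside a word (the apostrophe in "I'll") are kept.

```python
import string


def extract_words(text):
    """Lowercase the text and strip leading/trailing punctuation from each token."""
    words = []
    for token in text.lower().split():
        word = token.strip(string.punctuation)
        if word:  # skip tokens that were nothing but punctuation
            words.append(word)
    return words


def analyzeMessages(filename, minWordLengthToConsider=1):
    # One entry per label; "freq" is the required word -> count dictionary.
    stats = {label: {"messages": 0, "words": 0, "freq": {}}
             for label in ("ham", "spam")}

    # Assumption: the file is UTF-8 text; undecodable bytes are replaced.
    with open(filename, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.strip().split(None, 1)
            if not parts or parts[0] not in stats:
                continue  # skip blank or unlabeled lines
            s = stats[parts[0]]
            s["messages"] += 1
            text = parts[1] if len(parts) > 1 else ""
            for word in extract_words(text):
                s["words"] += 1
                s["freq"][word] = s["freq"].get(word, 0) + 1

    for label, s in stats.items():
        average = s["words"] / s["messages"] if s["messages"] else 0.0
        print(f"{label}: {s['messages']} messages, {s['words']} words, "
              f"{len(s['freq'])} unique, average length {average:.2f} words")
        eligible = [(w, c) for w, c in s["freq"].items()
                    if len(w) >= minWordLengthToConsider]
        eligible.sort(key=lambda wc: (-wc[1], wc[0]))
        for word, count in eligible[:12]:
            percent = 100.0 * count / s["words"]
            print(f"  {word}: {count} ({percent:.2f}%)")
    return stats
```

A call such as analyzeMessages("SMScollection.txt", 4) would then restrict the top-twelve listing to words of at least four characters, while the totals and averages still count every word.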

File With Text Messages: http://homepage.divms.uiowa.edu/~cremer/courses/cs1210/hw/hw5/SMScollection.txt
