
Question


FOR PYTHON

a. Write a program, hamAndSpam(smsFilename), that analyzes word frequencies in real-world text messages. Text file SMSSpamCollection.txt contains 5574 SMS messages. There is additional information about the contents of the file in the associated "readme" file readmeSMSSpamCollection.txt, written by the creators of the dataset. The data was originally from this no-longer-working link: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. Some information about the data set and its initial investigators is now here. Each line of the file represents one SMS/text message. The first item on every line is a label - 'ham' or 'spam' - indicating whether that line's SMS is considered spam or not. The rest of the line contains the text of the SMS/message. For example:

spam Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! Call ...
ham Sorry, I'll call later in meeting.
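As a minimal sketch of how one line could be separated into its label and its message text: the helper name parse_sms_line is hypothetical, and the code assumes only that the label is the first whitespace-delimited item on the line, which is what the description above states. It does not assume a particular separator character.

```python
def parse_sms_line(line):
    """Split one line of the file into (label, text).

    Hypothetical helper, not part of the assignment statement.
    Returns None for a blank line.
    """
    # split(None, 1) splits once on any run of whitespace,
    # so it works whether the label is followed by a space or a tab.
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    label = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return label, text
```

Calling parse_sms_line on the second sample line would give the label 'ham' and the rest of the line as the message text.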

At the end, your program must print summary information and information about the most frequent words in spam messages and the most frequent words in non-spam (ham) messages. It should also compute and compare the average lengths of spam and ham messages. I will not specify exactly what your output should be, but I will demonstrate sample output during the next lecture or two. I will also provide organizational hints and help for each of the parts. To accomplish this, your hamAndSpam function should:

read all of the data from the input file

extract individual words from the messages. This should include an effort to get rid of "extras" such as periods, commas, question and exclamation marks, and other characters that aren't part of a word. You should probably also ignore capitalization. Thus in the sample spam message above, you probably want to treat "Congrats!" as "congrats" in your frequency analysis.

build two dictionaries (required for full credit on this assignment), one for frequencies of words appearing in spam messages, one for frequencies of words from ham messages.

print summary information and some word frequency information about the data.
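The word-extraction and dictionary-building steps above can be sketched as follows. This is one possible approach, not the required one: the function names extract_words and count_words are hypothetical, messages are assumed to arrive as (label, text) pairs, and stripping string.punctuation from the ends of each token is just one way to remove "extras" (it keeps interior apostrophes, so "I'll" becomes "i'll").

```python
import string


def extract_words(text):
    """Return the lowercase words of a message with edge punctuation removed."""
    words = []
    for token in text.lower().split():
        # Remove punctuation from both ends only; "congrats!" -> "congrats".
        word = token.strip(string.punctuation)
        if word:  # skip tokens that were punctuation only
            words.append(word)
    return words


def count_words(messages):
    """Build two frequency dictionaries from (label, text) pairs.

    Assumption: any label other than 'spam' is treated as ham.
    """
    spam_counts = {}
    ham_counts = {}
    for label, text in messages:
        counts = spam_counts if label == "spam" else ham_counts
        for word in extract_words(text):
            counts[word] = counts.get(word, 0) + 1
    return spam_counts, ham_counts
```

Whether digits such as "1" and "2" should count as words is a design decision left open here; this sketch keeps them.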

Again, it is up to you to decide exactly what to print, though it must include some word frequency and message length information as mentioned above. Summary information might include the number of spam/non-spam messages, the total number of different words in spam and non-spam messages, the total number of words in each, and anything else that might be interesting (does spam or non-spam have longer average word length? longer message length?). Frequency information might be in the form of the top ten most used words in spam and in non-spam, along with a measure of their frequency (is an absolute number of occurrences a good measure, or might it be better to use a fraction/percentage of all occurrences in that type of message?). Possibly also consider printing information about most frequent words with more than, say, one or two or three letters - the results might be more enlightening. (You could also, but it is not required, compare the results with the list of 5000 most common English words of the file words5000.txt - most common word first - from HW4.)

b. Write a couple of sentences/short paragraph saying something about the results. Can you conclude something about spam vs. non-spam? Did you learn something? Put this answer as a comment at the top of your .py file. Thus, your file should look like:

# 1b. ... your answer here ...
# ....
#
#
def hamAndSpam(smsFilename):
    ...
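For the frequency and length reporting described in part a, a rough sketch of two helpers follows. The names top_words and average_length are hypothetical, the relative frequency is computed against all word occurrences in the dictionary (one of the measures the assignment suggests considering), and message length is measured in characters, which is an assumption; counting words per message would be an equally valid choice.

```python
def top_words(counts, n=10, min_length=1):
    """Return up to n (word, count, fraction) tuples, most frequent first.

    counts is a word -> occurrences dictionary. Words shorter than
    min_length are skipped, but the fraction is still relative to all
    occurrences in counts. Ties are broken alphabetically.
    """
    total = sum(counts.values())
    eligible = [(w, c) for w, c in counts.items() if len(w) >= min_length]
    eligible.sort(key=lambda pair: (-pair[1], pair[0]))
    return [(w, c, c / total) for w, c in eligible[:n]]


def average_length(texts):
    """Average number of characters per message; 0.0 for an empty list."""
    if not texts:
        return 0.0
    return sum(len(t) for t in texts) / len(texts)
```

Passing min_length=4, for instance, would restrict the report to words with more than three letters, as the assignment suggests trying.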
