Question
Machine learning-based SMS Spam Filtering
Project Statements
- Objective
For this project, you are asked to implement a detection program supporting Short Message Service (SMS) spam filtering. The main task is to design features that differentiate SMS spam messages from legitimate ones, and to run machine learning techniques (i.e., supervised learning) to classify SMS spam messages. Unlike email spam filtering, SMS spam filtering poses its own intrinsic problem because text messages are short (up to 160 characters). To complete this project successfully, you must devise robust and efficient detection features to solve this problem.
- Dataset
You will explore a real SMS Spam Collection Data Set, a corpus of mobile SMS messages labeled Spam/Legitimate. The SMS Spam collection contains a total of 1324 SMS messages, which is composed of 82 spam and 1002 legitimate messages. Each line holds one SMS message in two columns separated by a comma: the second column indicates the label, i.e., whether the SMS message is spam or legitimate (ham), and the first one is the content of the message (i.e., raw text). Table 1 shows some samples of SMS Spam/Legitimate messages in the given dataset.
Table 1: SMS Spam/legitimate message samples
Project Requirements
- Feature Extraction
Write a Java (or Python; other programming languages are allowed, but you will need to give me a demo) program that extracts the detection features from the raw text and generates a feature set file. You will program FOUR detection features and are responsible for justifying the implementation strategy for each of them in the report. If you use any references for the implementation, cite them and explain them concisely. For project 1, the detection features are those listed in the detection feature table below.
- Binary Classification
In WEKA, run different binary classifiers to identify SMS spam messages using the feature set you devised. You should run the FIVE classification algorithms listed below with cross-validation (by default, K = 10) and report your analysis of the experimental results. Specifically, you must report the best classifier together with its i) Accuracy, ii) True Positive Rate (TPR), and iii) False Positive Rate (FPR).
Label | Text |
spam | "For the most sparkling shopping breaks from 45 per person; call 0121 2025050 or visit www.shortbreaks.org.uk" |
spam | December only! Had your mobile 11mths+? You are entitled to update to the latest colour camera mobile for Free! Call The Mobile Update Co FREE on 08002986906 |
ham | Yup next stop. |
ham | No. I dont want to hear anything |
Detection feature | Description |
Number of Characters typed in Message | In Table 1 above, SMS spam messages appear to contain a larger number of characters than legitimate messages. From the SMS spammers' (e.g., marketers') perspective, they are likely to use as many of the available characters as possible, as long as the message does not exceed the SMS limit, in order to send enough information to customers for illicit profit. |
Number of Currency Symbols | To get mobile users to take the bait, SMS spammers might emphasize the prize (or cash) using a currency symbol (e.g., £) in the SMS message. This is typically different from legitimate messages. Here is an example: Please call our customer service representative on 0800 169 6031 between 10am-9pm as you have WON a guaranteed £1000 cash or £5000 prize! |
Number of Numeric Strings | One of the intrinsic markers of SMS spam is a CONTACT number or PROMOTION code. Since a phone number is sensitive information, it is unlikely to appear frequently in legitimate messages. (Example: PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08719899230 Identifier Code: 41685 Expires 07/11/04) |
The frequency of the most popular term/word |
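The four detection features in the table could be extracted along the following lines. This is only a minimal sketch, not the required implementation, and it rests on several assumptions the assignment does not fix: the set of currency symbols (`CURRENCY_SYMBOLS`), counting every maximal run of digits as one numeric string, and reading "most popular term" as the single most frequent word across the spam messages. All function and feature names are hypothetical.

```python
import re
from collections import Counter

# Assumption: which symbols count as "currency"; the assignment does not list them.
CURRENCY_SYMBOLS = "£$€¥"


def tokenize(message):
    """Lowercase the message and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", message.lower())


def most_popular_term(spam_messages):
    """Return the most frequent token across the spam messages ('' if none)."""
    counts = Counter(tok for msg in spam_messages for tok in tokenize(msg))
    return counts.most_common(1)[0][0] if counts else ""


def extract_features(message, popular_term):
    """Compute the four detection features for one raw SMS message."""
    return {
        "num_chars": len(message),
        "num_currency": sum(message.count(sym) for sym in CURRENCY_SYMBOLS),
        # Each maximal run of digits counts once, so "07/11/04" yields three.
        "num_numeric_strings": len(re.findall(r"\d+", message)),
        "popular_term_freq": tokenize(message).count(popular_term),
    }


if __name__ == "__main__":
    spam = ["Call now, call free", "free prize call"]
    term = most_popular_term(spam)
    print(term, extract_features("Yup next stop.", term))
```

A different reading of the numeric-string feature (for example, treating a phone number written with spaces as one string) would need a different regular expression, and you may want to exclude common stop words before picking the popular term. Whichever reading you choose should be justified in the report.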
- Decision Tree (J48)
- Multinomial Naive Bayes
- K-Nearest Neighbors
- SVM (LibSVM)
- RandomForest
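To make the three reported metrics concrete, the sketch below computes Accuracy, TPR, and FPR from actual and predicted labels, assuming spam is the positive class. WEKA prints these figures itself; the helper names here are hypothetical and serve only to show how the numbers relate to the confusion matrix.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, FN, FP, TN, treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def binary_metrics(tp, fn, fp, tn):
    """Return (accuracy, TPR, FPR); a rate with an empty denominator is 0.0."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # spam correctly flagged
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # ham wrongly flagged
    return accuracy, tpr, fpr
```

Because spam is the minority class in this dataset, accuracy alone can look high even when few spam messages are caught, which is why TPR and FPR are reported alongside it.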
- Source code: your Java (or Python) code and any shell script files (if applicable) that you use for testing. Your code must be well documented, with all major constructs (e.g., functions) clearly commented.
- README: high-level instructions for running your program. (1 page)
- Report: a well-documented summary that includes the justification of the detection features and your analysis of the experimental results. (at least 5 pages; the more formal the better.)
- Dataset: the feature set data, in ARFF format, to be tested
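For the ARFF feature set file, a minimal writer might look like the sketch below. The relation name, the attribute names, and the `{spam,ham}` class values are assumptions chosen for illustration; only the overall layout (`@relation`, `@attribute`, `@data`) comes from the ARFF format itself.

```python
# Assumption: hypothetical attribute names for the four detection features.
ATTRIBUTES = [
    "num_chars",
    "num_currency",
    "num_numeric_strings",
    "popular_term_freq",
]


def to_arff(rows, relation="sms_spam_features"):
    """Render (features_dict, label) pairs as the text of an ARFF file."""
    lines = ["@relation " + relation, ""]
    lines += ["@attribute " + name + " numeric" for name in ATTRIBUTES]
    # The class attribute is nominal; WEKA uses the last attribute as the class by default.
    lines.append("@attribute class {spam,ham}")
    lines += ["", "@data"]
    for features, label in rows:
        values = [str(features[name]) for name in ATTRIBUTES]
        lines.append(",".join(values + [label]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {"num_chars": 14, "num_currency": 0,
              "num_numeric_strings": 0, "popular_term_freq": 0}
    with open("features.arff", "w", encoding="utf-8") as handle:
        handle.write(to_arff([(sample, "ham")]))
```

Keeping the class attribute last matches WEKA's default of treating the final attribute as the class, so the file should load in the Explorer without extra configuration.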