Question

SMS Spam Classification: Detecting Unwanted Messages
Life Cycle of the Project
Steps to be Performed
Introduction
Problem Statement
Data Checks to Perform
Data Cleaning
EDA
Text Preprocessing
Model Training
Evaluation
Conclusion
Author Message
1. Introduction
This Kaggle notebook presents a step-by-step guide to building an SMS spam classification model using the SMS Spam Collection dataset. By the end of the notebook, you will have a classifier that flags unwanted messages so they can be filtered out of your inbox.
2. Problem Statement
The primary goal of this notebook is to develop a predictive model that accurately classifies incoming SMS messages as either ham (legitimate) or spam. We will use the SMS Spam Collection dataset, which is described as containing 5,574 labeled SMS messages; the CSV file loaded below contains 5,572 rows.
3. Data Checks to Perform
3.1 Import Necessary Libraries
# Importing necessary libraries
import numpy as np # For numerical operations
import pandas as pd # For data manipulation and analysis
import matplotlib.pyplot as plt # For data visualization
%matplotlib inline
# Importing WordCloud for text visualization
from wordcloud import WordCloud
# Importing NLTK for natural language processing
import nltk
from nltk.corpus import stopwords # For stopwords
# Downloading NLTK data
nltk.download('stopwords') # Downloading stopwords data
nltk.download('punkt') # Downloading tokenizer data
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
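The two downloads above fetch NLTK's English stopword list and the Punkt tokenizer models, which are typically used in a later text-preprocessing step. As a rough, dependency-free sketch of what such a step does, the snippet below lowercases a message, splits it into word tokens, and drops stopwords. The `SAMPLE_STOPWORDS` set and the `simple_preprocess` helper are hypothetical stand-ins for illustration; the notebook itself would presumably rely on `stopwords.words('english')` and `nltk.word_tokenize` instead.

```python
import re

# Hypothetical, tiny stand-in for stopwords.words('english')
SAMPLE_STOPWORDS = {"i", "he", "to", "the", "a", "in", "so", "then"}

def simple_preprocess(message):
    """Lowercase, tokenize on letter/digit/apostrophe runs, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return [tok for tok in tokens if tok not in SAMPLE_STOPWORDS]

cleaned = simple_preprocess("Nah I don't think he goes to usf, he lives around here though")
print(cleaned)
```

The regular expression is a simplification: a real tokenizer handles punctuation, contractions, and abbreviations more carefully.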
3.2 Load the Data
df = pd.read_csv('/kaggle/input/sms-spam-collection-dataset/spam.csv', encoding='latin1')
styled_df = df.head()
styled_df = styled_df.style.set_table_styles([
{"selector": "th", "props": [("color", 'black'),("background-color", "#FF00CC")]}
])
styled_df
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... nan nan nan
1 ham Ok lar... Joking wif u oni... nan nan nan
2 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's nan nan nan
3 ham U dun say so early hor... U c already then say... nan nan nan
4 ham Nah I don't think he goes to usf, he lives around here though nan nan nan
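The `encoding='latin1'` argument in the `read_csv` call is presumably there because the file contains bytes that are not valid UTF-8, which is the default encoding pandas tries. The sketch below illustrates the difference on a made-up byte string (the message text is an assumption, not taken from the dataset): the byte `0xA3` fails to decode as UTF-8 but maps to the pound sign in Latin-1.

```python
# Hypothetical message bytes; 0xA3 is the pound sign in Latin-1
raw = b"Win \xa3100 cash now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # A lone 0xA3 byte is not a valid UTF-8 sequence
    utf8_ok = False

decoded = raw.decode("latin1")
print(utf8_ok)
print(decoded)
```

Latin-1 assigns a character to every possible byte value, so decoding never fails, although it can silently produce wrong characters if the file was actually written in a different encoding.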
4. Data Cleaning
4.1 Data Info
df.info()
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB
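The `df.info()` output shows that the three `Unnamed` columns are almost entirely empty, which is why they can be dropped. As a small sketch, the non-null counts reported above can be turned into fill rates; the 1% cutoff used here is an arbitrary assumption for illustration, not a rule from the notebook.

```python
total_rows = 5572  # from the RangeIndex line of df.info()

# Non-null counts copied from the df.info() output
non_null = {"v1": 5572, "v2": 5572, "Unnamed: 2": 50, "Unnamed: 3": 12, "Unnamed: 4": 6}

# Share of rows that actually hold a value in each column
fill_rate = {col: count / total_rows for col, count in non_null.items()}

# Columns populated in fewer than 1% of rows (assumed threshold)
mostly_empty = [col for col, rate in fill_rate.items() if rate < 0.01]
print(mostly_empty)
```

On the live DataFrame, `df.notna().mean()` gives the same per-column fill rates directly.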
4.2 Drop the Columns
df.drop(columns =['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace = True)
styled_df = df.head(5).style
# Modify the color and background color of the table headers (th)
styled_df.set_table_styles([
{"selector": "th", "props": [("color", 'Black'),("background-color", "#FF00CC"),('font-weight', 'bold')]}
])
v1 v2
0 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives around here though
4.3 Rename the Columns
# Rename the columns to descriptive names
df.rename(columns ={'v1': 'target', 'v2': 'text'}, inplace = True)
4.4 Convert the Target Variable
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['target']= encoder.fit_transform(df['target'])
styled_df = df.head().style
# Modify the color and background color of the table headers (th)
styled_df.set_table_styles([
{"selector": "th", "props": [("color", 'Black'),("background-color", "#FF00CC"),('font-weight', 'bold')]}
])
target text
0 0 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1 0 Ok lar... Joking wif u oni...
2 1 Free entry in 2 a wkly comp to win FA Cup final t
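In the table above, the ham rows show a target of 0 and the spam row shows 1. `LabelEncoder` produces this mapping because it sorts the distinct class labels and numbers them in that order, so "ham" comes before "spam". The sketch below reproduces that behaviour in plain Python rather than calling scikit-learn; the `labels` list simply repeats the `v1` values of the first five rows.

```python
# v1 values of the first five rows shown earlier
labels = ["ham", "ham", "spam", "ham", "ham"]

# Mimic LabelEncoder: sort the distinct classes, then number them in order
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}

encoded = [mapping[label] for label in labels]
print(mapping)
print(encoded)
```

An explicit alternative that does not depend on alphabetical order would be `df['target'].map({'ham': 0, 'spam': 1})`, which makes the meaning of each code visible in the code itself.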
Can you explain this code?
