Question

Decision Tree, post-pruning and cost complexity parameter using sklearn 0.22 [10 points, Peer Review]
We will use a pre-processed natural language dataset in the CSV file "spamdata.csv" to classify emails as spam or not. Each row contains the word frequency for 54 words plus statistics on the longest "run" of capital letters.
Word frequency is given by:
$w_i = m_i / N$
where $w_i$ is the frequency for word $i$, $m_i$ is the number of times word $i$ appears in the email, and $N$ is the total number of words in the email.
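As a quick illustration of the definition above (count of a word divided by the total number of words in the email), here is a minimal sketch. The function name `word_frequencies`, the whitespace tokenization, and the sample text are assumptions for illustration only; the dataset's actual pre-processing is not described in the question.

```python
from collections import Counter


def word_frequencies(text, vocabulary):
    """Return count(word) / total_words for each word in `vocabulary`.

    Hypothetical helper: tokenizes on whitespace after lowercasing,
    which may differ from how spamdata.csv was actually built.
    """
    tokens = text.lower().split()
    total = len(tokens)
    counts = Counter(tokens)
    if total == 0:
        # Avoid division by zero for an empty email
        return {word: 0.0 for word in vocabulary}
    return {word: counts[word] / total for word in vocabulary}


# Seven tokens in total; "free" appears twice, "money" once, "hello" never
freqs = word_frequencies("Free money now claim your free prize",
                         ["free", "money", "hello"])
print(freqs)
```

Here "free" gets 2/7, "money" gets 1/7, and "hello" gets 0.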
We will use decision trees to classify the emails.
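For context on the post-pruning mentioned in the title, the sketch below shows how the cost complexity parameter `ccp_alpha` (available in sklearn since 0.22) shrinks a fitted tree. It uses synthetic data from `make_classification` as a stand-in, not the spam dataset, and picking the middle value of the pruning path is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; NOT the spam dataset from the question
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Fully grown tree (no pruning)
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alphas for minimal cost-complexity pruning, in ascending order
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = float(path.ccp_alphas[len(path.ccp_alphas) // 2])  # arbitrary pick

# Refit with the chosen alpha; larger alphas prune more aggressively
pruned_tree = DecisionTreeClassifier(
    random_state=0, ccp_alpha=alpha).fit(X_train, y_train)

print("nodes (unpruned):", full_tree.tree_.node_count)
print("nodes (pruned):  ", pruned_tree.tree_.node_count)
print("test accuracy (pruned):", pruned_tree.score(X_test, y_test))
```

In practice the alpha would be chosen by comparing validation scores across the whole path rather than taking the middle value.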
Part A [5 points]: Complete the function get_spam_dataset to read in values from the dataset and split the data into train and test sets.
My Code:
import pandas as pd
from sklearn.model_selection import train_test_split

def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1):
    '''
    get_spam_dataset
    Loads csv file located at "filepath". Shuffles the data and splits
    it so that you have (1-test_split)*100% training examples and
    (test_split)*100% testing examples.
    Args:
        filepath: location of the csv file
        test_split: fraction of the data to use as the testing split
    Returns:
        X_train, X_test, y_train, y_test, feature_names
        Note: feature_names is a list of all column names including isSpam.
        (in that order)
        first four are np.ndarray
    '''
    # your code here
    # Read the CSV file. An empty delimiter is rejected by pandas, so let
    # pandas detect the separator (sep=None requires the python engine).
    # Assumption: the first row holds the column names, since the docstring
    # asks for them; header=None would treat that row as data.
    data = pd.read_csv(filepath, sep=None, engine="python")
    # Column names as they appear in the file, including isSpam
    feature_names = list(data.columns)
    # Shuffle the data
    data = data.sample(frac=1, random_state=42).reset_index(drop=True)
    # Extract features and target variable (assumes isSpam is the last column)
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values
    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_split, random_state=42)
    return X_train, X_test, y_train, y_test, feature_names
# TO-DO: import the data set into five variables: X_train, X_test, y_train, y_test, label_names
# Uncomment and edit the line below to complete this task.
test_split = 0.1 # default test_split; change it if you'd like; ensure that this variable is used as an argument to your function
# your code here
X_train, X_test, y_train, y_test, label_names = get_spam_dataset(filepath="data/spamdata.csv", test_split=test_split)
# X_train, X_test, y_train, y_test, label_names = np.arange(5)
# Print the shapes of X_train and y_train
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
# Print label_names
print("Label names:", label_names)
It's returning the wrong answer. Can someone help?
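One likely cause worth checking, assuming the CSV starts with a row of column names (the docstring's mention of isSpam suggests it does): reading such a file with `header=None` turns the header line into a data row and leaves the real names unused. The small in-memory CSV below is a hypothetical stand-in, not the actual spamdata.csv.

```python
import io

import pandas as pd

# Hypothetical three-column stand-in for spamdata.csv
csv_text = (
    "word_freq_make,capital_run_length_longest,isSpam\n"
    "0.1,12,1\n"
    "0.0,3,0\n"
)

# Default: the first line is parsed as column names
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the first line is treated as an ordinary data row
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)          # (2, 3)
print(no_header.shape)            # (3, 3)
print(list(with_header.columns))  # real column names from the file
print(no_header.dtypes)           # columns are no longer numeric
```

With `header=None`, every column holds a mix of name strings and numbers, so the arrays passed to the classifier are no longer numeric and the row count is off by one. Generated names such as `word_freq_1` also would not match the names in the file.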
