Question

Posted on Sep 22, 2024

Decision Tree, post-pruning and cost complexity parameter using sklearn 0.22 [10 points, Peer Review]

We will use a pre-processed natural language dataset in the CSV file "spamdata.csv" to classify emails as spam or not. Each row contains the word frequency for 54 words plus statistics on the longest "run" of capital letters.

Word frequency is given by:

$$f_i = \frac{m_i}{N}$$

Where $f_i$ is the frequency for word $i$, $m_i$ is the number of times word $i$ appears in the email, and $N$ is the total number of words in the email.
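To make the definition concrete, here is a minimal sketch of the ratio (count of a word divided by the total word count) for a single email. The function name `word_frequencies`, the whitespace tokenization, and the lowercasing are assumptions made for illustration; the dataset's actual preprocessing is not described in the question and may differ.

```python
from collections import Counter


def word_frequencies(email_text, vocabulary):
    """Return {word: count / total_words} for each word in `vocabulary`.

    Hypothetical helper: tokenizes on whitespace after lowercasing,
    which may not match how spamdata.csv was actually built.
    """
    words = email_text.lower().split()
    total = len(words)
    if total == 0:
        # An empty email has no words, so every frequency is zero.
        return {word: 0.0 for word in vocabulary}
    counts = Counter(words)
    # Counter returns 0 for words that never appear.
    return {word: counts[word] / total for word in vocabulary}


# Seven words in total; "free" appears twice, "meeting" never.
freqs = word_frequencies(
    "Free money now claim your free prize",
    ["free", "money", "meeting"],
)
```

Here `freqs["free"]` is 2/7 and `freqs["meeting"]` is 0.0, matching the formula term by term.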

We will use decision trees to classify the emails.

Part A [5 points]: Complete the function get_spam_dataset to read in values from the dataset and split the data into train and test sets.

My Code:

import pandas as pd
from sklearn.model_selection import train_test_split

def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1):
    '''
    get_spam_dataset

    Loads csv file located at "filepath". Shuffles the data and splits
    it so that you have (1 - test_split) * 100% training examples and
    (test_split) * 100% testing examples.

    Args:
        filepath: location of the csv file
        test_split: percentage / 100 of the data should be the testing split

    Returns:
        X_train, X_test, y_train, y_test, feature_names
        Note: feature_names is a list of all column names including isSpam.
        (in that order)
        first four are np.ndarray
    '''
    # your code here
    # Read CSV file
    data = pd.read_csv(filepath, header=None, delimiter='')

    # Shuffle the data
    data = data.sample(frac=1, random_state=42).reset_index(drop=True)

    # Extract features and target variable
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values

    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_split, random_state=42)

    # Get feature names
    feature_names = [f"word_freq_{i}" for i in range(1, X.shape[1] + 1)]

    return X_train, X_test, y_train, y_test, feature_names

# TO-DO: import the data set into five variables: X_train, X_test, y_train, y_test, label_names
# Uncomment and edit the line below to complete this task.

test_split = 0.1  # default test_split; change it if you'd like; ensure that this variable is used as an argument to your function

# your code here
X_train, X_test, y_train, y_test, label_names = get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1)
# X_train, X_test, y_train, y_test, label_names = np.arange(5)

# Print the shapes of X_train and y_train
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)

# Print label_names
print("Label names:", label_names)

It's returning the wrong answer. Can someone help?
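A few things in the code above look like possible causes, though without the grader output this is only a guess. The docstring asks for feature_names to be the list of all column names including isSpam, but the code builds names of the form word_freq_1, word_freq_2, ... and leaves out the target column. Reading with header=None would also treat a header row, if the file has one, as a data row. The sketch below shows one way the function might look under two assumptions that should be checked against the actual file: that the first row holds the column names, and that the last column is isSpam. The `sep` parameter is a hypothetical addition, since the file's real delimiter is not known from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1, sep=","):
    """Load the CSV, then shuffle and split it into train and test sets.

    Assumptions (not confirmed by the question): the first row is a
    header, the last column is the isSpam label, and `sep` matches the
    file's delimiter. `sep` is a hypothetical extra parameter.
    """
    # Let pandas read the header row so the column names are preserved.
    data = pd.read_csv(filepath, sep=sep)

    # All column names, including the target, as the docstring requests.
    feature_names = list(data.columns)

    # Every column except the last is a feature; the last is the label.
    X = data.iloc[:, :-1].to_numpy()
    y = data.iloc[:, -1].to_numpy()

    # train_test_split shuffles before splitting, so a separate
    # data.sample(...) shuffle is not needed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_split, shuffle=True, random_state=42
    )
    return X_train, X_test, y_train, y_test, feature_names
```

When calling it, passing the variable (test_split=test_split) rather than the literal 0.1 keeps the call in line with the instruction in the notebook comment. A quick sanity check is that X_train.shape[1] + 1 equals len(label_names), since label_names includes isSpam.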
