Question

Posted on Sep 26, 2024

ASAP! Please use the R language!

Decision tree classification

You are provided two datasets from the 1994 US Census database: a training dataset (adult-train.csv) and a testing dataset (adult-test.csv). Each observation of the datasets has 15 attributes as described below. The class variable (response) is stored in the last attribute and indicates whether a person makes more than $50K per year. The attributes are as follows:

age: Age of the person (numeric).
workclass: Factor, one of Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: Final sampling weight (used by Census Bureau to handle over- and under-sampling of particular groups).
education: Factor, one of Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: Number of years of education (numeric).
marital-status: Factor, one of Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Factor, one of Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Factor, one of Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: Factor, one of White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Factor, one of Female, Male.
capital-gain: Continuous.
capital-loss: Continuous.
hours-per-week: Continuous.
native-country: Factor, one of United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
income: class variable (response), factor, one of >50K, <=50K.

(b) [...] using all of the predictors. Answer the following questions through model introspection:

(i) Name the top three important predictors in the model.
(ii) The first split is done on which predictor? What is the predicted class of the first node (the first node here refers to the root node)? What is the distribution of observations between the <=50K and >50K classes at the first node?
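One way to approach part (b) is sketched below. It is only a sketch under stated assumptions: the part of the question that introduces the model is missing, so the use of `rpart` with default settings is a guess, as is the assumption that the CSV files have a header row and a response column named `income`. The helper names `read_adult`, `fit_tree` and `inspect_tree` are hypothetical, and no data cleaning is done here.

```r
library(rpart)  # classification trees; a recommended package bundled with R

# Hypothetical helper: read one CSV, turning text columns into factors.
# Assumes a header row and a response column named `income`.
read_adult <- function(path) {
  read.csv(path, stringsAsFactors = TRUE, strip.white = TRUE)
}

# Fit a classification tree on all predictors with rpart's default settings.
fit_tree <- function(train) {
  rpart(income ~ ., data = train, method = "class")
}

# Pull out the quantities the sub-questions ask about.
inspect_tree <- function(model) {
  imp <- sort(model$variable.importance, decreasing = TRUE)
  classes <- attr(model, "ylevels")
  k <- length(classes)
  root <- model$frame$yval2[1, ]   # row 1 of the frame is the root node
  counts <- root[1 + seq_len(k)]   # columns 2..(k+1) hold per-class counts
  names(counts) <- classes
  list(
    top3 = names(imp)[seq_len(min(3, length(imp)))],
    first_split = as.character(model$frame$var[1]),
    root_class = classes[model$frame$yval[1]],
    root_counts = counts
  )
}

# Runs only when the training file named in the question is present.
if (file.exists("adult-train.csv")) {
  train <- read_adult("adult-train.csv")
  model <- fit_tree(train)
  print(inspect_tree(model))
}
```

The root node contains every training row, so `root_counts` should match `table(train$income)` whenever the tree is fit without priors or case weights.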

(c) Use the trained model from (b) to predict the test dataset. Answer the following questions based on the outcome of the prediction and examination of the confusion matrix (for floating point answers, assume 3 decimal place accuracy):

(i) What is the balanced accuracy of the model? (Note that in our test dataset, we have more observations of class <=50K than >50K. Thus, we are more interested in the balanced accuracy, instead of just accuracy. Balanced accuracy is calculated as the average of sensitivity and specificity.)
(ii) What is the balanced error rate of the model? (Again, because our test data is imbalanced, a balanced error rate makes more sense. Balanced error rate = 1.0 - balanced accuracy.)
(iii) What is the sensitivity? Specificity?
(iv) What is the AUC of the ROC curve? Plot the ROC curve.
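The metrics in part (c) can be computed in base R, as in the sketch below. The function names are hypothetical. The question does not say which class counts as "positive", so that choice is passed in as an argument. Swapping it exchanges sensitivity and specificity but leaves balanced accuracy unchanged. The AUC uses the rank (Mann-Whitney) formulation rather than any particular ROC package.

```r
# Sensitivity, specificity, balanced accuracy and balanced error rate.
# `positive` names the class treated as the positive class (an assumption).
class_metrics <- function(actual, predicted, positive) {
  tp <- sum(actual == positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  tn <- sum(actual != positive & predicted != positive)
  fp <- sum(actual != positive & predicted == positive)
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  bal <- (sens + spec) / 2
  c(sensitivity = sens, specificity = spec,
    balanced_accuracy = bal, balanced_error_rate = 1 - bal)
}

# AUC as the probability that a random positive outscores a random negative.
auc_rank <- function(actual, score, positive) {
  is_pos <- actual == positive
  r <- rank(score)  # average ranks, so ties count as one half
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# ROC points: one (FPR, TPR) pair per distinct score threshold.
roc_points <- function(actual, score, positive) {
  is_pos <- actual == positive
  thr <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thr, function(t) mean(score[is_pos] >= t))
  fpr <- sapply(thr, function(t) mean(score[!is_pos] >= t))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

plot_roc <- function(actual, score, positive) {
  pts <- roc_points(actual, score, positive)
  plot(pts$fpr, pts$tpr, type = "l",
       xlab = "False positive rate", ylab = "True positive rate")
  abline(0, 1, lty = 2)  # chance line
}

# Usage with an rpart model (assumes `model`, `test` and a ">50K" level exist):
# pred_class <- predict(model, test, type = "class")
# pred_prob  <- predict(model, test, type = "prob")[, ">50K"]
# round(class_metrics(test$income, pred_class, positive = ">50K"), 3)
# round(auc_rank(test$income, pred_prob, positive = ">50K"), 3)
# plot_roc(test$income, pred_prob, positive = ">50K")
```

Packages that report these metrics may use a different default positive class, so it is worth checking which level they treat as positive before comparing numbers.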

(d) Print the complexity table of the model you trained. Examine the complexity table and state whether the tree would benefit from pruning. If the tree would benefit from pruning, at what complexity level would you prune it? If the tree would not benefit from pruning, explain why you think this is the case.
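For part (d), and assuming the model is an `rpart` tree, the complexity table is available through `printcp()` or the `cptable` component. The sketch below uses one common rule: pick the row with the lowest cross-validated error (`xerror`) and prune only if that row is not already the largest tree. The helper names are hypothetical, and the `kyphosis` data bundled with `rpart` stands in for the census data so that the block runs on its own.

```r
library(rpart)

# Row of the complexity table with the lowest cross-validated error.
best_cp_row <- function(model) {
  which.min(model$cptable[, "xerror"])
}

# Prune to that row's CP value; return the model unchanged when the
# lowest xerror already belongs to the last (largest) tree in the table.
prune_if_useful <- function(model) {
  tab <- model$cptable
  best <- best_cp_row(model)
  if (best < nrow(tab)) prune(model, cp = tab[best, "CP"]) else model
}

# Stand-in example on rpart's bundled kyphosis data (not the census data).
set.seed(1)  # xerror comes from random cross-validation folds
demo_model <- rpart(Kyphosis ~ Age + Number + Start,
                    data = kyphosis, method = "class")
printcp(demo_model)
demo_pruned <- prune_if_useful(demo_model)
```

If `xerror` keeps falling all the way to the last row, the table gives no evidence that a smaller tree generalises better, which is the usual argument for leaving the tree unpruned. A stricter alternative is the one-standard-error rule, which picks the smallest tree whose `xerror` is within one `xstd` of the minimum.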

(e) Besides the class imbalance problem we see in the test dataset, we also have a class imbalance problem in the training dataset. To solve this class imbalance problem in the training dataset, we will use undersampling, i.e., we will undersample the majority class such that both classes have the same number of observations in the training dataset. Before doing this part of the assignment, please set your seed to the value shown below:

> set.seed(1122)

(i) In the training dataset, how many observations are in the class <=50K? How many are in the class >50K?
(ii) Create a new training dataset that has equal representation of both classes; i.e., the number of observations of class <=50K equals the number of observations of class >50K. Call this new training dataset. (Use the sample() method on the majority class to sample as many observations as there are in the minority class. Do not use any other method for undersampling as your results will not match expectation if you do so.)
(iii) Train a new model on the new training dataset, and then fit this model to the testing dataset. Answer the following questions based on the outcome of the prediction and examination of the confusion matrix (for floating point answers, assume 3 decimal place accuracy):
    (i) What is the balanced accuracy of this model?
    (ii) What is the balanced error rate of this model?
    (iii) What is the sensitivity? Specificity?
    (iv) What is the AUC of the ROC curve? Plot the ROC curve.
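The undersampling step in part (e) could look like the sketch below. The function name `undersample` and the response column name `income` are assumptions. The question does not say exactly how `sample()` should be applied, and sampling row indices (as here) versus sampling some other representation of the majority class can draw different rows for the same seed, so the numbers may not match the expected answers.

```r
# Undersample the majority class so both classes have the same row count.
# `response` is the name of the class column (assumed to be "income").
undersample <- function(train, response = "income") {
  counts <- table(train[[response]])
  minority <- names(counts)[which.min(counts)]
  majority <- names(counts)[which.max(counts)]
  if (minority == majority) return(train)  # classes already balanced
  maj_rows <- which(train[[response]] == majority)
  min_rows <- which(train[[response]] == minority)
  # sample() without replacement: as many majority rows as minority rows
  keep <- sample(maj_rows, length(min_rows))
  train[c(keep, min_rows), ]
}

# Runs only when the training file named in the question is present.
if (file.exists("adult-train.csv")) {
  train <- read.csv("adult-train.csv", stringsAsFactors = TRUE,
                    strip.white = TRUE)
  print(table(train$income))           # part (i): counts per class
  set.seed(1122)                       # seed given in the question
  balanced <- undersample(train)
  print(table(balanced$income))        # both counts should now be equal
}
```

The balanced data frame can then be used to fit a new tree and to recompute the same confusion-matrix metrics and AUC on the untouched test set.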
