Question

ASAP! Please use R language!!!
Decision tree classification

You are provided two datasets from the 1994 US Census database: a training dataset (adult-train.csv) and a testing dataset (adult-test.csv). Each observation of the datasets has 15 attributes as described below. The class variable (response) is stored in the last attribute and indicates whether a person makes more than $50K per year. The attributes are as follows:

age: Age of the person (numeric).
workclass: Factor, one of Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: Final sampling weight (used by the Census Bureau to handle over- and under-sampling of particular groups).
education: Factor, one of Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: Number of years of education (numeric).
marital-status: Factor, one of Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Factor, one of Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Factor, one of Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: Factor, one of White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Factor, one of Female, Male.
capital-gain: Continuous.
capital-loss: Continuous.
hours-per-week: Continuous.
native-country: Factor, one of United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
income: Class variable (response), factor, one of >50K, <=50K.

(b) ... using all of the predictors. Answer the following questions through model introspection:
(i) Name the top three important predictors in the model.
(ii) The first split is done on which predictor? What is the predicted class of the first node (the first node here refers to the root node)? What is the distribution of observations between the <=50K and >50K classes at the first node?

(c) Use the trained model from (b) to predict the test dataset. Answer the following questions based on the outcome of the prediction and examination of the confusion matrix (for floating point answers, assume 3 decimal place accuracy):
(i) What is the balanced accuracy of the model? (Note that in our test dataset, we have more observations of class <=50K than of class >50K. Thus, we are more interested in the balanced accuracy, instead of just accuracy. Balanced accuracy is calculated as the average of sensitivity and specificity.)
(ii) What is the balanced error rate of the model? (Again, because our test data is imbalanced, a balanced error rate makes more sense. Balanced error rate = 1.0 - balanced accuracy.)
(iii) What is the sensitivity? Specificity?
(iv) What is the AUC of the ROC curve? Plot the ROC curve.

(d) Print the complexity table of the model you trained. Examine the complexity table and state whether the tree would benefit from pruning. If the tree would benefit from pruning, at what complexity level would you prune it? If the tree would not benefit from pruning, provide the reason why you think this is the case.

(e) Besides the class imbalance problem we see in the test dataset, we also have a class imbalance problem in the training dataset. To solve this class imbalance problem in the training dataset, we will use undersampling, i.e., we will undersample the majority class such that both classes have the same number of observations in the training dataset. Before doing this part of the assignment, please set your seed to the value shown below:
> set.seed(1122)
(i) In the training dataset, how many observations are in the class <=50K? How many are in the class >50K?
(ii) Create a new training dataset that has equal representation of both classes; i.e., the number of observations of class <=50K equals the number of observations of class >50K. Call this new training dataset. (Use the sample() method on the majority class to sample as many observations as there are in the minority class. Do not use any other method for undersampling as your results will not match expectation if you do so.)
(iii) Train a new model on the new training dataset, and then fit this model to the testing dataset. Answer the following questions based on the outcome of the prediction and examination of the confusion matrix (for floating point answers, assume 3 decimal place accuracy):
(i) What is the balanced accuracy of this model?
(ii) What is the balanced error rate of this model?
(iii) What is the sensitivity? Specificity?
(iv) What is the AUC of the ROC curve? Plot the ROC curve.
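The balanced metrics defined in (c) and the undersampling step in (e)(ii) can be illustrated with a minimal base-R sketch. This is not the expected solution: the helper names balanced_metrics and undersample_majority, the toy data, and the choice of ">50K" as the positive class are all assumptions made here for illustration, and how sample() is applied to the real training data may differ from what the grader expects.

```r
# Hypothetical helper: sensitivity, specificity, balanced accuracy and
# balanced error rate from actual and predicted labels.
# Treating ">50K" as the positive class is an assumption.
balanced_metrics <- function(actual, predicted, positive = ">50K") {
  tp <- sum(predicted == positive & actual == positive)
  fn <- sum(predicted != positive & actual == positive)
  tn <- sum(predicted != positive & actual != positive)
  fp <- sum(predicted == positive & actual != positive)
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  balanced_accuracy <- (sensitivity + specificity) / 2
  # Round to 3 decimal places, as the question asks
  round(c(sensitivity = sensitivity,
          specificity = specificity,
          balanced_accuracy = balanced_accuracy,
          balanced_error_rate = 1 - balanced_accuracy), 3)
}

# Hypothetical helper: undersample the majority class with sample() so that
# both classes have as many rows as the minority class.
# Assumes exactly two classes and more than one majority row.
undersample_majority <- function(df, class_col = "income") {
  counts <- table(as.character(df[[class_col]]))
  if (length(unique(counts)) == 1) return(df)  # already balanced
  minority <- names(counts)[which.min(counts)]
  majority <- names(counts)[which.max(counts)]
  minority_idx <- which(df[[class_col]] == minority)
  majority_idx <- which(df[[class_col]] == majority)
  keep_majority <- sample(majority_idx, length(minority_idx))
  df[c(minority_idx, keep_majority), , drop = FALSE]
}

set.seed(1122)  # seed value given in the assignment

# Toy data standing in for adult-train.csv (values are made up)
toy <- data.frame(
  age = c(25, 38, 52, 41, 29, 60, 33, 47),
  income = c("<=50K", "<=50K", ">50K", "<=50K",
             "<=50K", ">50K", "<=50K", "<=50K"),
  stringsAsFactors = TRUE
)
balanced <- undersample_majority(toy)
print(table(balanced$income))

# Toy predictions standing in for a model's output on the test set
actual    <- c(">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K")
predicted <- c(">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K")
metrics <- balanced_metrics(actual, predicted)
print(metrics)
```

On the toy predictions, one of the two >50K cases and three of the four <=50K cases are classified correctly, so sensitivity is 0.5, specificity is 0.75, balanced accuracy is 0.625 and the balanced error rate is 0.375.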
