Question
ASAP! Please use the R language! Decision tree classification on the 1994 US Census database
Decision tree classification

You are provided two datasets from the 1994 US Census database: a training dataset (adulttrain.csv) and a testing dataset (adulttest.csv). Each observation of the datasets has attributes as described below. The class variable (response) is stored in the last attribute and indicates whether a person makes more than $50K per year. The attributes are as follows:

age: Age of the person (numeric)
workclass: Factor, one of Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: Final sampling weight (used by the Census Bureau to handle over- and undersampling of particular groups)
education: Factor, one of Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
educationnum: Number of years of education (numeric)
maritalstatus: Factor, one of Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Factor, one of Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Factor, one of Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: Factor, one of White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Factor, one of Female, Male
capitalgain: Continuous
capitalloss: Continuous
hoursperweek: Continuous
nativecountry: Factor, one of United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
income: Class variable (response), factor, one of <=50K, >50K.

... using all of the predictors.
Answer the following questions through model introspection:
  i. Name the top three important predictors in the model.
  ii. The first split is done on which predictor? What is the predicted class of the first node (the first node here refers to the root node)? What is the distribution of observations between the two income classes (<=50K and >50K) at the first node?

(c) Use the trained model from (b) to predict the test dataset. Answer the following questions based on the outcome of the prediction and examination of the confusion matrix (for floating point answers, assume ... decimal place accuracy):
  i. What is the balanced accuracy of the model? Note that in our test dataset, we have more observations of class <=50K. Thus, we are more interested in the balanced accuracy than in plain accuracy. Balanced accuracy is calculated as the average of sensitivity and specificity.
  ii. What is the balanced error rate of the model? Again, because our test data is imbalanced, a balanced error rate makes more sense. Balanced error rate = 1 - balanced accuracy.
  iii. What is the sensitivity? Specificity?
  iv. What is the AUC of the ROC curve? Plot the ROC curve.

(d) Print the complexity table of the model you trained. Examine the complexity table and state whether the tree would benefit from pruning. If the tree would benefit from pruning, at what complexity level would you prune it? If the tree would not benefit from pruning, provide the reason why you think this is the case.

(e) Besides the class imbalance problem we see in the test dataset, we also have a class imbalance problem in the training dataset. To solve this class imbalance problem in the training dataset, we will use undersampling, i.e., we will undersample the majority class such that both classes have the same number of observations in the training dataset.
Before doing this part of the assignment, please set your seed to the value shown below:

set.seed(...)

  i. In the training dataset, how many observations are in each class (<=50K and >50K)?
  ii. Create a new training dataset that has equal representation of both classes, i.e., the same number of observations of each class. Call this the new training dataset. Use the sample() method on the majority class to sample as many observations as there are in the minority class. Do not use any other method for undersampling, as your results will not match expectation if you do so.
  iii. Train a new model on the new training dataset, and then fit this model to the testing dataset. Answer the following questions based on the outcome of the prediction and examination of the confusion matrix (for floating point answers, assume ... decimal place accuracy):
    i. What is the balanced accuracy of this model?
    ii. What is the balanced error rate of this model?
    iii. What is the sensitivity? Specificity?
    iv. What is the AUC of the ROC curve? Plot the ROC curve.
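
The workflow the question describes (fit a tree, inspect it, score it on the test set, read the complexity table, then undersample and refit) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the assignment's reference solution: it runs on synthetic stand-in data because the CSV files are not reproduced here, the column names age, hours and sex are hypothetical, both seeds are placeholders, and treating <=50K (the first factor level) as the positive class is an assumption. The helpers class_metrics, auc_rank and evaluate are illustrative names; the AUC is computed with the rank-based (Mann-Whitney) formula rather than a dedicated ROC package.

```r
# Sketch only: synthetic stand-in for adulttrain.csv / adulttest.csv.
# Column names and the data-generating rule are hypothetical.
library(rpart)

set.seed(1)  # arbitrary seed for the synthetic data, NOT the assignment's value
make_data <- function(n) {
  age   <- sample(18:70, n, replace = TRUE)
  hours <- sample(20:60, n, replace = TRUE)
  sex   <- factor(sample(c("Female", "Male"), n, replace = TRUE))
  p_high <- plogis(-6.5 + 0.08 * age + 0.05 * hours)  # made-up rule
  income <- factor(ifelse(runif(n) < p_high, ">50K", "<=50K"),
                   levels = c("<=50K", ">50K"))
  data.frame(age, hours, sex, income)
}
train <- make_data(3000)
test  <- make_data(1500)

# Fit a classification tree using all of the predictors.
fit <- rpart(income ~ ., data = train, method = "class")

# Model introspection: importance ranking and the root node.
top3        <- head(sort(fit$variable.importance, decreasing = TRUE), 3)
root_split  <- as.character(fit$frame$var[1])            # predictor of the first split
root_class  <- attr(fit, "ylevels")[fit$frame$yval[1]]   # predicted class at the root
root_counts <- table(train$income)  # the root holds every training row

# Sensitivity, specificity, balanced accuracy and balanced error rate.
class_metrics <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)
  fn <- sum(predicted != positive & actual == positive)
  tn <- sum(predicted != positive & actual != positive)
  fp <- sum(predicted == positive & actual != positive)
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  bal  <- (sens + spec) / 2
  c(sensitivity = sens, specificity = spec,
    balanced_accuracy = bal, balanced_error_rate = 1 - bal)
}

# Rank-based AUC; tied scores receive average ranks.
auc_rank <- function(score, is_positive) {
  r     <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Score a fitted tree on new data (positive class is an assumption).
evaluate <- function(model, newdata, positive = "<=50K") {
  pred  <- predict(model, newdata = newdata, type = "class")
  score <- predict(model, newdata = newdata, type = "prob")[, positive]
  c(class_metrics(newdata$income, pred, positive),
    auc = auc_rank(score, newdata$income == positive))
}
res_full <- evaluate(fit, test)

# Complexity table; one common heuristic prunes at the CP with minimal xerror.
cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(fit, cp = best_cp)

# Undersample the majority class with sample(), without replacement.
set.seed(123)  # placeholder; use the seed value given in the assignment
counts        <- table(train$income)
minority_rows <- which(train$income == names(which.min(counts)))
majority_rows <- which(train$income == names(which.max(counts)))
keep          <- sample(majority_rows, size = length(minority_rows))
train_balanced <- train[c(minority_rows, keep), ]

fit_balanced <- rpart(income ~ ., data = train_balanced, method = "class")
res_balanced <- evaluate(fit_balanced, test)
```

Printing res_full and res_balanced side by side shows how undersampling shifts sensitivity and specificity; printcp(fit) prints the same complexity table in a formatted form. The exact rows drawn by sample() depend on what is sampled (row indices here) and on the seed, so numbers may differ from the assignment's expected values if its intended procedure differs. For the ROC plot, a package such as pROC (roc, auc, plot) is one option, assuming it is installed.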