Question
Problem 1. Download the data set salary-class.csv. This data set (drawn from census data) will be
used to predict whether a person has income more or less than $50K (this is 1990 data). The fields are
as follows:
AGE: Age of person
EMPLOYER: area of employment (government, private, etc.)
DEGREE: Highest academic degree
MSTATUS: Marital Status
JOBTYPE: type of job (clerical, cleaners, etc.)
SEX: male or female
C-GAIN: capital gain claimed on taxes last year
C-LOSS: capital loss claimed on taxes last year
HOURS: average hours per week worked
COUNTRY: country of origin
INCOME: ≤ 50K or > 50K
Use R to answer the questions below.
(a) Check if there are any missing values or outliers in the dataset. If there are, how do you handle
them? (simply include your R code. No need for explanations)
(b) How does income vary for different ages? (You can draw a histogram or boxplot of age for the different
classes of income and compare the results.)
(c) How does income vary across different countries?
(d) How does income vary between men and women?
(e) What is the income of most females under the age of 30 who work in the private sector? Is it
more than 50K or less than 50K?
(f) Divide the data into 60% training and 40% testing sets.
(g) Create the default C&R decision tree. How many leaves are in the tree?
(h) What are the major predictors of INCOME? Justify your choice. How can you get this information from the software?
(i) Give three rules that describe who is likely to have an INCOME > 50K and who is likely to have
an INCOME ≤ 50K. These rules should be relevant (support of at least 5% in the training sample)
and strong (confidence of more than 75% for "> 50K" or more than 90% for "≤ 50K"). If there are no three
rules that meet these criteria, give the three best rules you can.
(j) Create two more C&R trees. The first is just like the default tree except you do not "prune tree
to avoid overfitting" (you need to let the model grow to its full depth). The other does prune,
but you require 500 records in a parent branch and 100 records in a child branch. You can also
play with the complexity parameter. How do the three trees differ? Briefly explain.
(k) Which of your tree models seems most accurate on the training data? Which seems most
accurate on the test data?
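
The workflow in parts (a), (f), (g), (h), (j) and (k) could be sketched in R roughly as follows. This is only an illustration under several assumptions, not a reference solution: it uses the rpart package as a stand-in for a "C&R" (CART) tree, it runs on a small synthetic data frame with made-up values instead of salary-class.csv, the class labels "<=50K" and ">50K" are hypothetical, and mapping "500 records in a parent branch and 100 in a child branch" to minsplit and minbucket is one possible reading of those settings.

```r
# Sketch only: the data below are synthetic, not salary-class.csv.
library(rpart)  # CART-style trees; ships with R as a recommended package

set.seed(1)
n <- 2000
# With the real file one might instead use (the "?" missing marker is an assumption):
#   salary <- read.csv("salary-class.csv", na.strings = c("", "?", "NA"))
# read.csv would by default rename a "C-GAIN" column to "C.GAIN".
salary <- data.frame(
  AGE    = sample(17:80, n, replace = TRUE),
  SEX    = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  HOURS  = sample(10:70, n, replace = TRUE),
  C.GAIN = ifelse(runif(n) < 0.1, round(rexp(n, 1 / 5000)), 0)
)
# Hypothetical rule for the class label, so the trees have some signal to find.
score <- 0.04 * (salary$AGE - 40) + 0.05 * (salary$HOURS - 40) +
  2 * (salary$C.GAIN > 5000) + rnorm(n)
salary$INCOME <- factor(ifelse(score > 0.8, ">50K", "<=50K"))

# (a) Count missing values per column, then drop incomplete rows.
na_counts <- colSums(is.na(salary))
salary <- na.omit(salary)

# (f) 60% training / 40% testing split.
train_idx <- sample(seq_len(nrow(salary)), size = round(0.6 * nrow(salary)))
train <- salary[train_idx, ]
test  <- salary[-train_idx, ]

# (g) Default classification tree; leaves are the rows of $frame marked "<leaf>".
fit_default <- rpart(INCOME ~ ., data = train, method = "class")
n_leaves <- sum(fit_default$frame$var == "<leaf>")

# (h) Variable importance as reported by rpart.
importance <- fit_default$variable.importance

# (j) An unpruned tree (cp = 0, smallest node sizes) and a constrained tree.
fit_full <- rpart(INCOME ~ ., data = train, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))
fit_big  <- rpart(INCOME ~ ., data = train, method = "class",
                  control = rpart.control(minsplit = 500, minbucket = 100))

# (k) Share of correctly classified rows on the training and test sets.
accuracy <- function(fit, data) {
  mean(predict(fit, newdata = data, type = "class") == data$INCOME)
}
models <- list(default = fit_default, full = fit_full, constrained = fit_big)
acc <- sapply(models, function(m) {
  c(train = accuracy(m, train), test = accuracy(m, test))
})
print(n_leaves)
print(importance)
print(round(acc, 3))
```

For the exploratory parts (b) to (e), calls such as boxplot(AGE ~ INCOME, data = salary) or prop.table(table(salary$SEX, salary$INCOME), margin = 1) would be one way to compare the income classes, assuming the column names match those in the field list.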