
Question


Problem 1. Download the data set salary-class.csv. This data set (drawn from census data) will be

used to predict whether a person has income more or less than $50K (this is 1990 data). The fields are

as follows:

AGE: Age of person

EMPLOYER: area of employment (government, private, etc.)

DEGREE: Highest academic degree

MSTATUS: Marital Status

JOBTYPE: type of job (clerical, cleaners, etc.)

SEX: male or female

C-GAIN: capital gain claimed on taxes last year

C-LOSS: capital loss claimed on taxes last year

HOURS: average hours per week worked

COUNTRY: country of origin

INCOME: ≤ 50K or > 50K
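
As a rough sketch of how these fields might be read into R and checked for missing values and outliers: the loader below assumes the file's headers match the field names above and that missing entries appear as blanks or "?", neither of which the problem statement confirms. The function names `load_salary` and `iqr_outliers` are hypothetical, and a tiny synthetic file stands in for salary-class.csv so the code runs on its own.

```r
# Hypothetical loader. Assumptions: headers match the field list, and missing
# values are coded as blank or "?" (not confirmed by the problem statement).
load_salary <- function(path) {
  read.csv(path,
           na.strings = c("", "NA", "?"),  # treated as missing (assumed coding)
           strip.white = TRUE,             # drop padding around field values
           stringsAsFactors = TRUE,        # categorical fields become factors
           check.names = FALSE)            # keep headers such as C-GAIN as-is
}

# Tiny synthetic file so the sketch runs without salary-class.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "AGE,EMPLOYER,SEX,HOURS,INCOME",
  "39,Private,Male,40,<=50K",
  "25,?,Female,35,<=50K",
  "52,Government,Female,,>50K",
  "90,Private,Male,99,>50K"
), tmp)

salary <- load_salary(tmp)

# Number of missing values in each column.
na_counts <- colSums(is.na(salary))
print(na_counts)

# Flag numeric outliers with the 1.5 * IQR rule (one common convention,
# not necessarily the one the course expects).
iqr_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}
```

With `check.names = FALSE`, a column such as C-GAIN has to be referenced with backticks, for example ``salary$`C-GAIN` ``. Whether flagged rows should be dropped, capped, or kept is a judgment call that depends on the field.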

Use R to answer the questions below.

(a) Check if there are any missing values or outliers in the dataset. If there are, how do you handle

them? (simply include your R code. No need for explanations)

(b) How does income vary for different ages? (You can draw histogram or boxplot of age for different

classes of income and compare the results).

(c) How does income vary across different countries?

(d) How does income vary between men and women?

(e) What is the income of most women under the age of 30 who work in the private sector? Is it

more than 50K or less than 50K?

(f) Divide the data into 60% training and 40% testing sets.

(g) Create the default C&R decision tree. How many leaves are in the tree?

(h) What are the major predictors of INCOME? Justify your choice. How can you get this information from the software?

(i) Give three rules that describe who is likely to have an INCOME > 50K and who is likely to have

an INCOME ≤ 50K. These rules should be relevant (support of at least 5% in the training sample)

and strong (confidence of more than 75% for "> 50K" or more than 90% for "≤ 50K"). If there are no three

rules that meet these criteria, give the three best rules you can.

(j) Create two more C&R trees. The first is just like the default tree except you do not "prune tree

to avoid overfitting" (you need to let the model grow to its full depth). The other does prune,

but you require 500 records in a parent branch and 100 records in a child branch. You can also

play with the complexity parameter. How do the three trees differ? Briefly explain.

(k) Which of your tree models seems most accurate on the training data? Which seems most

accurate on the test data?
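
Parts (f), (g), (h), (j), and (k) could be approached along the following lines. This is only a sketch under stated assumptions: it uses the `rpart` package as the R counterpart of a C&R tree, maps "500 records in a parent branch and 100 in a child branch" to `minsplit = 500` and `minbucket = 100`, and runs on synthetic data rather than salary-class.csv, so the numbers it prints say nothing about the real data set. The helpers `n_leaves` and `accuracy` are hypothetical names.

```r
library(rpart)

# Synthetic stand-in for salary-class.csv (assumption: the real columns are
# named as in the field list and INCOME is a two-level factor).
set.seed(1)
n <- 2000
salary <- data.frame(
  AGE   = sample(18:70, n, replace = TRUE),
  HOURS = sample(20:60, n, replace = TRUE),
  SEX   = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
p_high <- plogis(-6 + 0.08 * salary$AGE + 0.05 * salary$HOURS)
salary$INCOME <- factor(ifelse(runif(n) < p_high, ">50K", "<=50K"))

# (f) 60% training / 40% testing split.
train_idx <- sample(n, size = round(0.6 * n))
train <- salary[train_idx, ]
test  <- salary[-train_idx, ]

# (g) Default classification tree; leaves are the rows of $frame marked "<leaf>".
fit_default <- rpart(INCOME ~ ., data = train, method = "class")
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# (h) Variable importance as reported by rpart (NULL if the tree has no splits).
importance <- fit_default$variable.importance

# (j) Unpruned tree: cp = 0 and minimal node sizes let it grow as far as rpart allows.
fit_full <- rpart(INCOME ~ ., data = train, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# (j) Restricted tree: 500 records to split a parent, at least 100 in each child.
fit_small <- rpart(INCOME ~ ., data = train, method = "class",
                   control = rpart.control(minsplit = 500, minbucket = 100))

# (k) Share of correctly classified records on a given data set.
accuracy <- function(fit, data) {
  mean(predict(fit, newdata = data, type = "class") == data$INCOME)
}

fits <- list(default = fit_default, full = fit_full, restricted = fit_small)
results <- data.frame(
  leaves = sapply(fits, n_leaves),
  train  = sapply(fits, accuracy, data = train),
  test   = sapply(fits, accuracy, data = test)
)
print(results)
```

Comparing the `train` and `test` columns side by side is what makes overfitting visible: a fully grown tree typically scores highest on the training set without necessarily doing best on the test set. For part (i), printing a fitted tree with `print(fit_default)` lists each node's split condition, record count, and class proportions, from which support and confidence can be read off.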
