Question

Posted on May 15, 2024

The data set we'll be using in the midterm assignment, ClaimsData.csv, is structured to represent a sample of patients in the Medicare program, which provides health insurance to Americans aged 65 and older, as well as some younger people with certain medical conditions. The observations represent a 1% random sample of Medicare beneficiaries, limited to those still alive at the end of 2008.

Our independent variables are from 2008, and we will be predicting cost in 2009. They are the patient's age in years at the end of 2008, and then several binary variables indicating whether or not the patient had diagnosis codes for a particular disease or related disorder in 2008: alzheimers, arthritis, cancer, chronic obstructive pulmonary disease (copd), depression, diabetes, heart.failure, ischemic heart disease (ihd), kidney disease, osteoporosis, and stroke. Each of these variables takes the value 1 if the patient had a diagnosis code for the particular disease and the value 0 otherwise.

The variable reimbursement2008 is the total amount of Medicare reimbursements for this patient in 2008, and reimbursement2009 is the total value of all Medicare reimbursements for the patient in 2009.

The variable bucket2008 is the cost bucket the patient fell into in 2008, and bucket2009 is the cost bucket the patient fell into in 2009. These cost buckets are defined using thresholds determined by the data supplier. The first cost bucket contains patients with costs less than $3,000, the second contains patients with costs between $3,000 and $8,000, the third contains patients with costs between $8,000 and $19,000, the fourth contains patients with costs between $19,000 and $55,000, and the fifth contains patients with costs greater than $55,000.
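The bucket thresholds above can be sketched in R as a small lookup. This is only an illustration: the helper name cost_bucket is hypothetical, and because the text does not say which bucket a cost exactly equal to a threshold belongs to, the sketch assumes it goes into the higher bucket.

```r
# Thresholds taken from the bucket definitions in the text.
bucket_breaks <- c(3000, 8000, 19000, 55000)

# cost_bucket() is a hypothetical helper, not part of the assignment.
cost_bucket <- function(reimbursement) {
  # findInterval() counts how many thresholds are <= the cost,
  # so adding 1 gives bucket numbers 1 through 5.
  # Assumption: a cost exactly on a threshold falls in the higher bucket.
  findInterval(reimbursement, bucket_breaks) + 1L
}

# Made-up example costs, one or two per bucket.
example_costs <- c(0, 2999, 3000, 12500, 54999, 60000)
print(cost_bucket(example_costs))
```

In the real data the buckets are already supplied as bucket2008 and bucket2009, so a helper like this would only be useful for checking them against the reimbursement columns.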

Q1: Calculate the percentage of patients in each bucket by creating a table of the variable bucket2009 and dividing by the number of rows in Claims.

Our goal will be to predict the cost bucket the patient fell into in 2009 using a CART model. But before we build our model, we need to split our data into a training set (ClaimsTrain) and a testing set (ClaimsTest). Therefore, load the package caTools, set the random seed to 88 so that we all get the same split, and set SplitRatio to 0.6.
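A minimal sketch of these two steps, the bucket percentages and the 60/40 split, might look like the following. The data frame here is a made-up stand-in so the snippet runs without the file; in the assignment, Claims would come from reading ClaimsData.csv. Splitting on bucket2009 is an assumption, since the text does not say which variable is passed to sample.split.

```r
library(caTools)  # provides sample.split()

# Toy stand-in for Claims <- read.csv("ClaimsData.csv"); all values are made up.
Claims <- data.frame(
  age        = c(67, 71, 85, 90, 66, 78, 73, 69, 88, 80),
  bucket2009 = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
)

# Share of patients in each 2009 cost bucket: counts divided by row count.
bucket_share <- table(Claims$bucket2009) / nrow(Claims)
print(bucket_share)

# Fix the seed so everyone gets the same split, then split 60/40.
# Assumption: the split is stratified on the outcome, bucket2009.
set.seed(88)
spl <- sample.split(Claims$bucket2009, SplitRatio = 0.6)
ClaimsTrain <- subset(Claims, spl == TRUE)   # rows flagged TRUE go to training
ClaimsTest  <- subset(Claims, spl == FALSE)  # the rest go to testing
```

sample.split returns a logical vector with one entry per row, which is why the two subsets together always cover the whole data frame.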

Q2: What is the average age of patients in the training set, ClaimsTrain?

Q3: What proportion of people in the training set (ClaimsTrain) had at least one diagnosis code for diabetes?

The baseline method would predict that the cost bucket for a patient in 2009 will be the same as it was in 2008. Its accuracy on the test set can be computed by creating a classification matrix, and it is 0.68.
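The two training-set questions and the baseline accuracy could be computed along these lines. The small data frames are made-up stand-ins for the real ClaimsTrain and ClaimsTest, so the numbers printed here are not the assignment's answers and will not reproduce the 0.68 figure.

```r
# Toy stand-ins for the real ClaimsTrain / ClaimsTest; all values are made up.
ClaimsTrain <- data.frame(age = c(70, 82, 66, 90), diabetes = c(1, 0, 1, 0))
ClaimsTest <- data.frame(
  bucket2008 = c(1, 1, 2, 3, 1, 5),
  bucket2009 = c(1, 2, 2, 3, 1, 4)
)

# Average age in the training set.
mean_age <- mean(ClaimsTrain$age)

# diabetes is coded 0/1, so its mean is the proportion with a diagnosis code.
diabetes_share <- mean(ClaimsTrain$diabetes)

# Classification matrix for the baseline: rows are actual 2009 buckets,
# columns are the baseline prediction (the 2008 bucket).
# Fixing the factor levels keeps the table square even if a bucket is absent.
bucket_levels <- 1:5
baseline_table <- table(
  actual   = factor(ClaimsTest$bucket2009, levels = bucket_levels),
  baseline = factor(ClaimsTest$bucket2008, levels = bucket_levels)
)

# Accuracy is the diagonal (correct predictions) over all test rows.
baseline_accuracy <- sum(diag(baseline_table)) / nrow(ClaimsTest)
print(baseline_table)
print(baseline_accuracy)
```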

Our goal will be to create a CART model that has an accuracy higher than 68%.

Build your CART model and name it ClaimsTree. Independent variables you should use: age, arthritis, alzheimers, cancer, copd, depression, diabetes, heart.failure, ihd, kidney, osteoporosis, stroke, bucket