Question
Use the R script below to answer questions 3 through 7.
Questions:
3. Model #1 - First Logistic Regression Model
Reporting Results
Report the results of the regression model. Address the following questions in your analysis. Round all numbers to four decimal places.
- Write the general form and the prediction equation of the logistic multiple regression model for heart disease (target) using variables age (age), resting blood pressure (trestbps), exercise-induced angina (exang), and maximum heart rate achieved (thalach). Note: Use the equation editor to write the regression equation.
- Now write the prediction model equation in terms of the natural log of odds to express the beta terms in linear form. Note: Use the equation editor to write the regression equation.
- What do the following terms, from the general form of the model above, mean in terms of an individual having a heart disease?
- Create the logistic regression model. Write the prediction model equation (in terms of the natural log of odds) using outputs obtained from your R script. Round all figures to four decimal places.
- Interpret the estimated coefficient of the maximum heart rate achieved variable.
Evaluating Model Significance
Evaluate the significance of the regression model. Address the following questions in your analysis. Round all numbers to four decimal places.
- Perform the Hosmer-Lemeshow goodness of fit test to assess whether the model is appropriate for the data set. Identify the null and alternative hypotheses, the test statistic, and the P-value. Use a 5% level of significance.
- Which terms are significant in the model based on Wald's test? Use a 5% level of significance.
- Obtain the confusion matrix and report the counts for true positives, true negatives, false positives, and false negatives.
- Report the following:
- Accuracy
- Precision
- Recall
- Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and explain what it illustrates.
- What is the value of AUC? Interpret what this value represents.
Making Predictions Using Model
Make predictions using the regression model. Address the following questions in your analysis. Round all numbers to four decimal places.
- What is the probability of an individual having heart disease who is 50 years old, has a resting blood pressure of 122, has exercise-induced angina, and has a maximum heart rate of 140? Find the odds of this event occurring.
- What is the probability of an individual having heart disease who is 50 years old, has a resting blood pressure of 130, does not have exercise-induced angina, and has a maximum heart rate of 165? Find the odds of this event occurring.
- Comment on the two predictions. What can be deduced based on the probabilities and the odds?
4. Model #2 - Second Logistic Regression Model
Reporting Results
Report the results of the regression model. Address the following questions in your analysis. Round all numbers to four decimal places.
- Write the general form and the prediction equation of the logistic multiple regression model for heart disease (target) using variables age (age), resting blood pressure (trestbps), type of chest pain experienced (cp), and maximum heart rate achieved (thalach). Include the quadratic term for age and the interaction term between age and maximum heart rate achieved. Note that this general form should be written in terms of E(y), exponents, and β_i (where i = 1, 2, ...). Note: Use the equation editor to write the regression equation.
- Now write the prediction equation of this model in terms of the natural log of odds to express the beta terms in linear form. Note: Use the equation editor to write the regression equation.
- Create the logistic regression model. Write the prediction model equation (in terms of the natural log of odds) using outputs obtained from your R script. Round all figures to four decimal places.
Evaluating Model Significance
Evaluate the significance of the regression model. Address the following questions in your analysis. Round all numbers to four decimal places.
- Perform the Hosmer-Lemeshow goodness of fit test to assess whether the model is appropriate for the data set. Identify the null and alternative hypotheses, the test statistic, and the P-value. Use a 5% level of significance.
- Which terms are significant in the model based on Wald's test? Use a 5% level of significance.
- Obtain the confusion matrix and report the counts for true positives, true negatives, false positives, and false negatives.
- Report the following:
- Accuracy
- Precision
- Recall
- Obtain the Receiver Operating Characteristic (ROC) curve. Interpret the graph and explain what it illustrates.
- What is the value of AUC? Interpret what this value represents.
Making Predictions Using Model
Make predictions using the regression model. Address the following questions in your analysis. Round all numbers to four decimal places.
- What is the probability of an individual having heart disease who is 50 years old, has a resting blood pressure of 115, does not experience chest pain, and has maximum heart rate of 133? Find the odds of this event occurring.
- What is the probability of an individual having heart disease who is 50 years old, has a resting blood pressure of 125, experiences typical angina, and has maximum heart rate of 155? Find the odds of this event occurring.
- Comment on the two predictions. What can be deduced based on the probabilities and the odds?
5. Random Forest Classification Model
Reporting Results
Report the results of the random forest classification model. Address the following questions in your analysis. Round all numbers to four decimal places.
- Split the heart disease data set into training and testing sets using 85% and 15% split, respectively. Use set.seed(6522048). How many rows are in the original data set, and in the training and validation sets?
- Graph the training and testing error against the number of trees using a classification random forest model for the presence of heart disease (target) using variables age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), and number of major vessels (ca). Use a maximum of 150 trees. Use set.seed(6522048).
- What is the optimal number of trees for the random forest model?
Evaluating the Utility of the model
Evaluate the utility of the random forest classification model. Address the following questions in your analysis. Round all numbers to four decimal places.
- Using the appropriate number of trees found, create the classification random forest model for the presence of heart disease (target) using variables age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), and number of major vessels (ca). Obtain the confusion matrix for the training set and report the accuracy, precision, and recall.
- Obtain the confusion matrix for the testing set and report the accuracy, precision, and recall.
6. Random Forest Regression Model
Reporting Results
Report the results of the random forest regression model. Address the following questions in your analysis. Round all numbers to four decimal places.
- Split the heart disease data set into training and testing sets using 80% and 20% split, respectively. Use set.seed(6522048). How many rows are in the original data set, and the training and validation sets?
- Graph the mean squared error against the number of trees for a random forest regression model for maximum heart rate achieved using age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), and number of major vessels (ca). Use a maximum of 80 trees. Use set.seed(6522048).
- What is the optimal number of trees for the random forest model?
Evaluating the Utility of the Random Forest Regression Model
Evaluate the utility of the random forest regression model. Address the following questions in your analysis. Round all numbers to four decimal places.
- Using the appropriate number of trees found, create the random forest regression model for maximum heart rate achieved using age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), and number of major vessels (ca).
- What is the root mean squared error for the training set?
- What is the root mean squared error for the testing set?
7. Conclusion
Describe the results of the statistical analyses clearly, using proper descriptions of statistical terms and concepts. Fully describe what these results mean for your scenario.
- Which of the two logistic regression models would you choose to predict heart disease? Briefly summarize your findings in plain language.
- Would you recommend using the random forest classification model instead of the logistic regression model? Why or why not?
- What is the practical importance of the analyses that were performed?
code:
Project Two: Logistic Regression and Random Forests
For Project Two, you have been asked to create different models analyzing a Heart Disease data set. Before beginning work on the project, be sure to read through the Project Two Guidelines and Rubric to understand what you need to do and how you will be graded on this assignment. Be sure to carefully review the Project Two Summary Report template, which contains all of the questions that you will need to answer about the regression analyses you are performing.
For this project, you will be writing all the scripts yourself. You may reference the textbook and your previous work on the problem sets to help you write the scripts.
Scenario
You are a data analyst researching risk factors for heart disease at a university hospital. You have access to a large set of historical data that you can use to analyze patterns between different health indicators (e.g. fasting blood sugar, maximum heart rate, etc.) and the presence of heart disease. You have been asked to create different logistic regression models that predict whether or not a person is at risk for heart disease. A model like this could eventually be used to evaluate medical records and look for risks that might not be obvious to human doctors. You have also been asked to create a classification random forest model to predict the risk of heart disease and a regression random forest model to predict the maximum heart rate achieved.
There are several variables in this data set, but you will be working with the following important variables:
Variable | What does it represent? |
---|---|
age | The person's age in years |
sex | The person's sex (1 = male, 0 = female) |
cp | The type of chest pain experienced (0=no pain, 1=typical angina, 2=atypical angina, 3=non-anginal pain) |
trestbps | The person's resting blood pressure |
chol | The person's cholesterol measurement in mg/dl |
fbs | The person's fasting blood sugar is greater than 120 mg/dl (1 = true, 0 = false) |
restecg | Resting electrocardiographic measurement (0=normal, 1=having ST-T wave abnormality, 2=showing probable or definite left ventricular hypertrophy by Estes' criteria) |
thalach | The person's maximum heart rate achieved |
exang | Exercise-induced angina (1=yes, 0=no) |
oldpeak | ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot) |
slope | The slope of the peak exercise ST segment (1=upsloping, 2=flat, 3=downsloping) |
ca | The number of major vessels (0-3) |
target | Heart disease (0=no, 1=yes) |
Install Libraries
In the following code block, you will install appropriate libraries to use in this project.
Click the Run button on the toolbar to run this code.
Note: The code section below will first install three R packages: "ResourceSelection", "pROC" and "rpart.plot". Please do not move to the next step until the packages are fully installed. This will take some time. Once the installation is complete, this step will print "Installation complete!".
In[1]:
print("This step will first install three R packages. Please wait until the packages are fully installed.")
print("Once the installation is complete, this step will print 'Installation complete!'")
install.packages("ResourceSelection")
install.packages("pROC")
install.packages("rpart.plot")
print("Installation complete!")
[1] "This step will first install three R packages. Please wait until the packages are fully installed."
[1] "Once the installation is complete, this step will print 'Installation complete!'"
Installing package into '/home/codio/R/x86_64-pc-linux-gnu-library/3.4' (as 'lib' is unspecified)
Installing package into '/home/codio/R/x86_64-pc-linux-gnu-library/3.4' (as 'lib' is unspecified)
Installing package into '/home/codio/R/x86_64-pc-linux-gnu-library/3.4' (as 'lib' is unspecified)
[1] "Installation complete!"
Prepare Your Data Set
In the following code block, you have been given the R code to prepare your data set.
Click the Run button on the toolbar to run this code.
In[3]:
heart_data <- read.csv(file="heart_disease.csv", header=TRUE, sep=",")

# Converting appropriate variables to factors
heart_data <- within(heart_data, {
  target <- factor(target)
  sex <- factor(sex)
  cp <- factor(cp)
  fbs <- factor(fbs)
  restecg <- factor(restecg)
  exang <- factor(exang)
  slope <- factor(slope)
  ca <- factor(ca)
  thal <- factor(thal)
})

head(heart_data, 10)

print("Number of variables")
ncol(heart_data)

print("Number of rows")
nrow(heart_data)
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
62 | 1 | 2 | 130 | 231 | 0 | 1 | 146 | 0 | 1.8 | 1 | 3 | 3 | 1 |
58 | 0 | 0 | 130 | 197 | 0 | 1 | 131 | 0 | 0.6 | 1 | 0 | 2 | 1 |
60 | 0 | 3 | 150 | 240 | 0 | 1 | 171 | 0 | 0.9 | 2 | 0 | 2 | 1 |
63 | 1 | 0 | 140 | 187 | 0 | 0 | 144 | 1 | 4.0 | 2 | 2 | 3 | 0 |
62 | 1 | 0 | 120 | 267 | 0 | 1 | 99 | 1 | 1.8 | 1 | 2 | 3 | 0 |
63 | 0 | 2 | 135 | 252 | 0 | 0 | 172 | 0 | 0.0 | 2 | 0 | 2 | 1 |
43 | 1 | 0 | 150 | 247 | 0 | 1 | 171 | 0 | 1.5 | 2 | 0 | 2 | 1 |
42 | 1 | 2 | 120 | 240 | 1 | 1 | 194 | 0 | 0.8 | 0 | 0 | 3 | 1 |
59 | 1 | 2 | 126 | 218 | 1 | 1 | 134 | 0 | 2.2 | 1 | 1 | 1 | 0 |
48 | 1 | 0 | 124 | 274 | 0 | 0 | 166 | 0 | 0.5 | 1 | 0 | 3 | 0 |
[1] "Number of variables"
14
[1] "Number of rows"
303
Model #1 - First Logistic Regression Model
You have been asked to create the logistic regression model for heart disease (target) using the variables age (age), resting blood pressure (trestbps), exercise-induced angina (exang), and maximum heart rate achieved (thalach). Before writing any code, review Section 3 of the Summary Report template to see the questions you will be answering about your logistic regression model.
Run your scripts to get the outputs of your regression analysis. Then use the outputs to answer the questions in your summary report.
Note: Use the + (plus) button to add new code blocks, if needed.
In[4]:
# Create the first logistic regression model
print("Logistic regression model 1")
logit1 <- glm(target ~ age + trestbps + exang + thalach, data=heart_data, family="binomial")
summary(logit1)
[1] "Logistic regression model 1"
Call:
glm(formula = target ~ age + trestbps + exang + thalach, family = "binomial",
    data = heart_data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0935  -0.7944   0.4954   0.8133   2.2343

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.021121   1.784194  -0.572   0.5671
age         -0.017549   0.017144  -1.024   0.3060
trestbps    -0.014888   0.008337  -1.786   0.0741 .
exang1      -1.624981   0.305774  -5.314 1.07e-07 ***
thalach      0.031095   0.007275   4.274 1.92e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 417.64  on 302  degrees of freedom
Residual deviance: 323.14  on 298  degrees of freedom
AIC: 333.14

Number of Fisher Scoring iterations: 4
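Reading the coefficient estimates from the summary above and rounding them to four decimal places, the fitted model can be sketched in log-odds form as follows. The symbol $\hat{\pi}$ for the estimated probability of heart disease is notation assumed here, and exang is treated as an indicator equal to 1 when exercise-induced angina is present (the `exang1` row) and 0 otherwise.

```latex
\ln\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right)
  = -1.0211 - 0.0175\,\text{age} - 0.0149\,\text{trestbps}
    - 1.6250\,\text{exang} + 0.0311\,\text{thalach}
```

Under this fit, each additional unit of thalach multiplies the estimated odds of heart disease by roughly $e^{0.0311} \approx 1.0316$, holding the other variables fixed.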
In[5]:
# Install the ResourceSelection package if not already installed
if (!require(ResourceSelection)) {
  install.packages("ResourceSelection")
}

# Load the ResourceSelection package
library(ResourceSelection)
Loading required package: ResourceSelection
ResourceSelection 0.3-6 	 2023-06-27
In[6]:
# Hosmer-Lemeshow Goodness of Fit Test
print("Hosmer-Lemeshow Goodness of Fit Test")
hl1 = hoslem.test(logit1$y, fitted(logit1), g=10)
hl1

# Prediction for logistic regression model 1
default_model_data1 <- heart_data[c('age', 'trestbps', 'exang', 'thalach')]
pred1 <- predict(logit1, newdata=default_model_data1, type='response')
[1] "Hosmer-Lemeshow Goodness of Fit Test"
Hosmer and Lemeshow goodness of fit (GOF) test

data:  logit1$y, fitted(logit1)
X-squared = 9.192, df = 8, p-value = 0.3264
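The reported P-value can be cross-checked from the test statistic and degrees of freedom, since the Hosmer-Lemeshow statistic is compared against a chi-squared distribution. The following is a minimal sketch that only uses base R; the variable names are chosen here for illustration, and the statistic is copied from the printed (rounded) output above.

```r
# Values copied from the Hosmer-Lemeshow output for Model 1
hl_statistic <- 9.192
hl_df <- 8
alpha <- 0.05  # 5% level of significance

# Upper-tail chi-squared probability for the observed statistic
hl_p_value <- pchisq(hl_statistic, df = hl_df, lower.tail = FALSE)
print(round(hl_p_value, 4))

# H0: the model fits the data adequately; Ha: the model does not fit
decision <- if (hl_p_value < alpha) "reject H0" else "fail to reject H0"
print(decision)
```

Because the P-value exceeds 0.05, the null hypothesis of adequate fit is not rejected for this model.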
In[7]:
# Install and load the pROC package if not already installed
if (!require(pROC)) {
  install.packages("pROC")
}

# Load the pROC package
library(pROC)
Loading required package: pROC
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following objects are masked from 'package:stats':

    cov, smooth, var
In[15]:
# Create confusion matrix for logistic regression model 1
depvar_pred1 = as.factor(ifelse(pred1 >= 0.5, '1', '0'))
conf.matrix1 <- table(heart_data$target, depvar_pred1)
print("Confusion Matrix for Model 1")
print(conf.matrix1)

# ROC and AUC for logistic regression model 1
roc1 <- roc(heart_data$target, pred1)
print("AUC for Model 1")
round(auc(roc1), 4)
plot(roc1, legacy.axes=TRUE, main="ROC Curve for Model 1")
[1] "Confusion Matrix for Model 1"
   depvar_pred1
      0   1
  0  89  49
  1  31 134
Setting levels: control = 0, case = 1
Setting direction: controls < cases
[1] "AUC for Model 1"
0.8007
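Accuracy, precision, and recall follow directly from the four cells of the confusion matrix. The sketch below rebuilds the Model 1 matrix by hand from the counts printed above (rows are actual classes, columns are predicted classes, as in `table(heart_data$target, depvar_pred1)`) and treats class 1, heart disease, as the positive class. The object names are illustrative and not part of the original script.

```r
# Model 1 confusion matrix, rows = actual, columns = predicted
conf_matrix <- matrix(c(89, 31, 49, 134), nrow = 2,
                      dimnames = list(actual = c("0", "1"),
                                      predicted = c("0", "1")))

tn <- conf_matrix["0", "0"]  # true negatives
fp <- conf_matrix["0", "1"]  # false positives
fn <- conf_matrix["1", "0"]  # false negatives
tp <- conf_matrix["1", "1"]  # true positives

metrics <- c(
  accuracy  = (tp + tn) / sum(conf_matrix),
  precision = tp / (tp + fp),
  recall    = tp / (tp + fn)
)
print(round(metrics, 4))
```

The same calculation applies to any 2x2 table laid out this way, so it can be reused for the other models by swapping in their counts.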
In[8]:
# Predictions using logistic regression model 1
newdata1 <- data.frame(age=50, trestbps=122, exang=factor(1, levels=c(0, 1)), thalach=140)
print("Prediction for individual 1 using Model 1")
round(predict(logit1, newdata1, type='response'), 4)

newdata2 <- data.frame(age=50, trestbps=130, exang=factor(0, levels=c(0, 1)), thalach=165)
print("Prediction for individual 2 using Model 1")
round(predict(logit1, newdata2, type='response'), 4)
[1] "Prediction for individual 1 using Model 1"
1:0.2716
[1] "Prediction for individual 2 using Model 1"
1:0.7853
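The two predicted probabilities can be reproduced by hand from the Model 1 coefficients, which also yields the odds requested in the report. This is a sketch under the assumption that the coefficients printed in the model summary are used as-is; the helper `predict_heart_disease` is a hypothetical name introduced here, not part of the original script.

```r
# Coefficients copied from the Model 1 summary output
coefs <- c(intercept = -1.021121, age = -0.017549, trestbps = -0.014888,
           exang1 = -1.624981, thalach = 0.031095)

# exang is 1 if exercise-induced angina is present, 0 otherwise
predict_heart_disease <- function(age, trestbps, exang, thalach) {
  log_odds <- coefs[["intercept"]] + coefs[["age"]] * age +
    coefs[["trestbps"]] * trestbps + coefs[["exang1"]] * exang +
    coefs[["thalach"]] * thalach
  c(log_odds = log_odds,
    probability = plogis(log_odds),  # inverse logit: 1 / (1 + exp(-log_odds))
    odds = exp(log_odds))            # equivalent to p / (1 - p)
}

person1 <- predict_heart_disease(age = 50, trestbps = 122, exang = 1, thalach = 140)
person2 <- predict_heart_disease(age = 50, trestbps = 130, exang = 0, thalach = 165)
print(round(rbind(person1, person2), 4))
```

Odds below 1 mean heart disease is estimated to be less likely than not, and odds above 1 mean it is more likely than not, which matches the probabilities falling on either side of 0.5.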
Model #2 - Second Logistic Regression Model
You have been asked to create a logistic regression model for heart disease (target) using the variables age of the individual (age), resting blood pressure (trestbps), type of chest pain (cp), and maximum heart rate achieved (thalach). You also have to include the quadratic term for age and the interaction term between age and maximum heart rate achieved. Before writing any code, review Section 4 of the Summary Report template to see the questions you will be answering about your model.
Run your scripts to get the outputs of your analysis. Then use the outputs to answer the questions in your summary report.
Note: Use the + (plus) button to add new code blocks, if needed.
In[9]:
# Create the second logistic regression model
print("Logistic regression model 2")
logit2 <- glm(target ~ age + trestbps + cp + thalach + I(age^2) + age:thalach, data=heart_data, family="binomial")
summary(logit2)
[1] "Logistic regression model 2"
Call:
glm(formula = target ~ age + trestbps + cp + thalach + I(age^2) +
    age:thalach, family = "binomial", data = heart_data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6961  -0.7537   0.2925   0.7123   2.3058

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.556e+01  1.054e+01  -1.476  0.13988
age          1.744e-01  2.669e-01   0.653  0.51357
trestbps    -1.958e-02  8.978e-03  -2.181  0.02916 *
cp1          1.913e+00  4.437e-01   4.313 1.61e-05 ***
cp2          2.037e+00  3.473e-01   5.867 4.45e-09 ***
cp3          1.777e+00  5.477e-01   3.245  0.00117 **
thalach      1.363e-01  5.119e-02   2.663  0.00775 **
I(age^2)     8.424e-04  1.750e-03   0.481  0.63025
age:thalach -1.867e-03  8.909e-04  -2.095  0.03616 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 417.64  on 302  degrees of freedom
Residual deviance: 293.67  on 294  degrees of freedom
AIC: 311.67

Number of Fisher Scoring iterations: 5
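As a sketch of how this output maps onto the log-odds equation, the estimates above can be written out as follows. The symbol $\hat{\pi}$ for the estimated probability of heart disease is notation assumed here, and $\text{cp}_1$, $\text{cp}_2$, $\text{cp}_3$ are indicators equal to 1 when cp takes the value 1, 2, or 3, with cp = 0 (no pain) as the baseline. The summary prints only four significant figures, so trailing digits such as those in $-15.5600$ reflect the printed precision; values taken from `coef(logit2)` may differ slightly.

```latex
\ln\!\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right)
  = -15.5600 + 0.1744\,\text{age} - 0.0196\,\text{trestbps}
    + 1.9130\,\text{cp}_1 + 2.0370\,\text{cp}_2 + 1.7770\,\text{cp}_3
    + 0.1363\,\text{thalach} + 0.0008\,\text{age}^2
    - 0.0019\,(\text{age}\times\text{thalach})
```

Because of the interaction term, the effect of thalach on the log-odds is not a single number: it equals the thalach coefficient plus the interaction coefficient times age, so it shrinks as age increases.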
In[10]:
# Hosmer-Lemeshow Goodness of Fit Test
print("Hosmer-Lemeshow Goodness of Fit Test")
hl2 = hoslem.test(logit2$y, fitted(logit2), g=10)
hl2

# Prediction for logistic regression model 2
default_model_data2 <- heart_data[c('age', 'trestbps', 'cp', 'thalach')]
pred2 <- predict(logit2, newdata=default_model_data2, type='response')
[1] "Hosmer-Lemeshow Goodness of Fit Test"
Hosmer and Lemeshow goodness of fit (GOF) test

data:  logit2$y, fitted(logit2)
X-squared = 6.0481, df = 8, p-value = 0.6418
In[20]:
# Create confusion matrix for logistic regression model 2
depvar_pred2 = as.factor(ifelse(pred2 >= 0.5, '1', '0'))
conf.matrix2 <- table(heart_data$target, depvar_pred2)
print("Confusion Matrix for Model 2")
print(conf.matrix2)

# ROC and AUC for logistic regression model 2
roc2 <- roc(heart_data$target, pred2)
print("AUC for Model 2")
round(auc(roc2), 4)
plot(roc2, legacy.axes=TRUE, main="ROC Curve for Model 2")
[1] "Confusion Matrix for Model 2"
   depvar_pred2
      0   1
  0 102  36
  1  36 129
Setting levels: control = 0, case = 1
Setting direction: controls < cases
[1] "AUC for Model 2"
0.8478
In[11]:
# Predictions using logistic regression model 2
newdata3 <- data.frame(age=50, trestbps=115, cp=factor(0, levels=c(0, 1, 2, 3)), thalach=133)
print("Prediction for individual 1 using Model 2")
print(round(predict(logit2, newdata3, type='response'), 4))

newdata4 <- data.frame(age=50, trestbps=125, cp=factor(1, levels=c(0, 1, 2, 3)), thalach=155)
print("Prediction for individual 2 using Model 2")
print(round(predict(logit2, newdata4, type='response'), 4))
[1] "Prediction for individual 1 using Model 2"
     1
0.2188
[1] "Prediction for individual 2 using Model 2"
     1
0.8007
Random Forest Classification Model
You have been asked to create a random forest classification model for the presence of heart disease (target) using the variables age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), and number of major vessels (ca). Before writing any code, review Section 5 of the Summary Report template to see the questions you will be answering about your model.
Run your scripts to get the outputs of your regression analysis. Then use the outputs to answer the questions in your summary report.
Note: Use the + (plus) button to add new code blocks, if needed.
In[16]:
# Installing necessary packages if not already installed
if (!require("randomForest")) install.packages("randomForest", dependencies=TRUE)
Loading required package: randomForest
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
In[17]:
set.seed(6522048)

# Partition the dataset into training and test data
samp.size = floor(0.85*nrow(heart_data))

# Training set
print("Number of rows for the Training set")
train_ind = sample(seq_len(nrow(heart_data)), size=samp.size)
train.data = heart_data[train_ind,]
nrow(train.data)

# Testing set
print("Number of rows for the Testing set")
test.data = heart_data[-train_ind,]
nrow(test.data)
[1] "Number of rows for the Training set"
257
[1] "Number of rows for the Testing set"
46
In[18]:
# Plotting training and testing errors
# Note: the formula below also includes slope, which is not in the variable list given for this model
train = c()
test = c()
trees = c()

for (i in seq(from=1, to=150, by=1)) {
  trees <- c(trees, i)
  set.seed(6522048)
  model_rf1 <- randomForest(target ~ age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree=i)

  # Classification error on the training set
  train.data.predict <- predict(model_rf1, train.data, type="class")
  conf.matrix1 <- table(train.data$target, train.data.predict)
  train_error = 1-(sum(diag(conf.matrix1)))/sum(conf.matrix1)
  train <- c(train, train_error)

  # Classification error on the testing set
  test.data.predict <- predict(model_rf1, test.data, type="class")
  conf.matrix2 <- table(test.data$target, test.data.predict)
  test_error = 1-(sum(diag(conf.matrix2)))/sum(conf.matrix2)
  test <- c(test, test_error)
}

# Training error in red and testing error in blue, matching the legend
plot(trees, train, type="l", ylim=c(0,1), col="red", xlab="Number of Trees", ylab="Classification Error")
lines(trees, test, type="l", col="blue")
legend('topright', legend=c('training set','testing set'), col=c("red","blue"), lwd=2)
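Once the error curves are stored as vectors, one simple heuristic for choosing the number of trees is to take the smallest tree count at which the testing error reaches its minimum. The sketch below illustrates this with `which.min`; the error values are hypothetical placeholders, not results from the heart disease data, and in practice the plot is often read by eye for the point where the testing error levels off.

```r
# Hypothetical testing-error values for ntree = 1..8 (placeholders only)
test_error <- c(0.35, 0.30, 0.28, 0.24, 0.26, 0.24, 0.25, 0.27)
trees <- seq_along(test_error)

# which.min returns the index of the first minimum, i.e. the smallest
# number of trees that achieves the lowest testing error
best_index <- which.min(test_error)
optimal_trees <- trees[best_index]
print(optimal_trees)
```

With the real `test` vector from the loop above, `trees[which.min(test)]` would give the corresponding candidate.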
In[20]:
# Optimal number of trees
model_rf1 <- randomForest(target ~ age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree=20)

print('Confusion Matrix: TRAINING set based on Random Forest model built using 20 trees')
train.data.predict <- predict(model_rf1, train.data, type="class")
conf.matrix1 <- table(train.data$target, train.data.predict)[,c('0','1')]
rownames(conf.matrix1) <- paste("Actual", rownames(conf.matrix1), sep=": ")
colnames(conf.matrix1) <- paste("Prediction", colnames(conf.matrix1), sep=": ")
format(conf.matrix1, justify="centre", digit=2)

print('Confusion Matrix: TESTING set based on Random Forest model built using 20 trees')
test.data.predict <- predict(model_rf1, test.data, type="class")
conf.matrix2 <- table(test.data$target, test.data.predict)[,c('0','1')]
rownames(conf.matrix2) <- paste("Actual", rownames(conf.matrix2), sep=": ")
colnames(conf.matrix2) <- paste("Prediction", colnames(conf.matrix2), sep=": ")
format(conf.matrix2, justify="centre", digit=2)
[1] "Confusion Matrix: TRAINING set based on Random Forest model built using 20 trees"
 | Prediction: 0 | Prediction: 1 |
---|---|---|
Actual: 0 | 120 | 0 |
Actual: 1 | 2 | 135 |
[1] "Confusion Matrix: TESTING set based on Random Forest model built using 20 trees"
 | Prediction: 0 | Prediction: 1 |
---|---|---|
Actual: 0 | 12 | 6 |
Actual: 1 | 6 | 22 |
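Putting the training and testing metrics side by side makes the gap between the two sets easy to see. The following sketch takes the counts from the two random forest confusion matrices above, with class 1 as the positive class; `classification_metrics` is a hypothetical helper introduced for illustration.

```r
# Accuracy, precision, and recall from the four confusion-matrix counts
classification_metrics <- function(tn, fp, fn, tp) {
  c(accuracy  = (tp + tn) / (tn + fp + fn + tp),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# Counts read from the TRAINING and TESTING confusion matrices (20 trees)
training <- classification_metrics(tn = 120, fp = 0, fn = 2, tp = 135)
testing  <- classification_metrics(tn = 12,  fp = 6, fn = 6, tp = 22)

comparison <- round(rbind(training, testing), 4)
print(comparison)
```

A training accuracy near 1 alongside a noticeably lower testing accuracy suggests the forest fits the training data much more closely than it generalizes.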
Random Forest Regression Model
You have been asked to create a random forest regression model for maximum heart rate achieved using the variables age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), cholesterol measurement (chol), resting electrocardiographic measurement (restecg), exercise-induced angina (exang), and number of major vessels (ca). Before writing any code, review Section 6 of the Summary Report template to see the questions you will be answering about your model.
Run your scripts to get the outputs of your analysis. Then use the outputs to answer the questions in your summary report.
Note: Use the + (plus) button to add new code blocks, if needed.
In[21]:
# Random Forest Regression Model
set.seed(6522048)

# Partition the dataset into training (80%) and testing (20%) sets
samp.size = floor(0.80*nrow(heart_data))
train_ind = sample(seq_len(nrow(heart_data)), size=samp.size)
train.data = heart_data[train_ind,]
test.data = heart_data[-train_ind,]

print("Number of rows for the Training set")
nrow(train.data)

print("Number of rows for the Testing set")
nrow(test.data)
[1] "Number of rows for the Training set"
242
[1] "Number of rows for the Testing set"
61
In[22]:
# Plotting RMSE
# Note: the formula below also includes slope, which is not in the variable list given for this model
train = c()
test = c()
trees = c()

for (i in seq(from=1, to=80, by=1)) {
  set.seed(6522048)
  trees <- c(trees, i)
  model_rf2 <- randomForest(thalach ~ age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree=i)

  # RMSE on the training set
  pred <- predict(model_rf2, newdata=train.data, type='response')
  rmse_train <- sqrt(sum((pred-train.data$thalach)^2)/length(pred))
  train <- c(train, rmse_train)

  # RMSE on the testing set
  pred <- predict(model_rf2, newdata=test.data, type='response')
  rmse_test <- sqrt(sum((pred-test.data$thalach)^2)/length(pred))
  test <- c(test, rmse_test)
}

plot(trees, train, type="l", col="red", xlab="Number of Trees", ylab="RMSE")
lines(trees, test, type="l", col="blue")
legend('topright', legend=c('training set','testing set'), col=c("red","blue"), lwd=2)
In[24]:
# Model with optimal number of trees
model_rf2 <- randomForest(thalach ~ age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, ntree=10)

print('Training set: RMSE based on Random Forest model built using 10 trees')
pred <- predict(model_rf2, newdata=train.data, type='response')
sqrt(sum((pred-train.data$thalach)^2)/length(pred))

print('Testing set: RMSE based on Random Forest model built using 10 trees')
pred <- predict(model_rf2, newdata=test.data, type='response')
sqrt(sum((pred-test.data$thalach)^2)/length(pred))
[1] "Training set: RMSE based on Random Forest model built using 10 trees"
10.5576659248575
[1] "Testing set: RMSE based on Random Forest model built using 10 trees"
19.809910202949
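The RMSE expression used in the script, `sqrt(sum((pred - actual)^2) / length(pred))`, is the square root of the mean squared prediction error. A small helper makes that explicit; `rmse` is a hypothetical function name, and the vectors below are toy values chosen only to show the arithmetic, not data from the heart disease set.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy example: errors are 2, -3, and 4, so the mean squared error is 29 / 3
actual_rate    <- c(150, 160, 170)
predicted_rate <- c(148, 163, 166)
toy_rmse <- rmse(actual_rate, predicted_rate)
print(round(toy_rmse, 4))
```

RMSE is expressed in the units of the response, so the values reported above are in the same units as thalach.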
In[25]:
# Decision Tree Model
library(rpart)
library(rpart.plot)

model_dt <- rpart(target ~ age+sex+cp+trestbps+chol+restecg+exang+slope+ca, data=train.data, method="class")
prp(model_dt)

train.data.predict <- predict(model_dt, newdata=train.data, type='class')
conf.matrix3 <- table(train.data$target, train.data.predict)
print('Confusion Matrix: TRAINING set based on Decision Tree')
rownames(conf.matrix3) <- paste("Actual", rownames(conf.matrix3), sep=": default=")
colnames(conf.matrix3) <- paste("Prediction", colnames(conf.matrix3), sep=": default=")
format(conf.matrix3, justify="centre", digit=2)

test.data.predict <- predict(model_dt, newdata=test.data, type='class')
conf.matrix4 <- table(test.data$target, test.data.predict)
print('Confusion Matrix: TESTING set based on Decision Tree')
rownames(conf.matrix4) <- paste("Actual", rownames(conf.matrix4), sep=": default=")
colnames(conf.matrix4) <- paste("Prediction", colnames(conf.matrix4), sep=": default=")
format(conf.matrix4, justify="centre", digit=2)
[1] "Confusion Matrix: TRAINING set based on Decision Tree"
 | Prediction: default=0 | Prediction: default=1 |
---|---|---|
Actual: default=0 | 98 | 16 |
Actual: default=1 | 13 | 115 |
[1] "Confusion Matrix: TESTING set based on Decision Tree"
 | Prediction: default=0 | Prediction: default=1 |
---|---|---|
Actual: default=0 | 15 | 9 |
Actual: default=1 | 9 | 28 |