Question: I have an issue with this step: "Compare RMSE, MAPE, and mean error" returns NaN. I think I know what the problem is (a missing variable when medv is used) but can't figure out how to fix it.

Here is the info

The file BostonHousing.csv below contains information collected by the US Bureau of the Census concerning housing in the area of Boston, Massachusetts. You will use it to practice data mining in R. The dataset includes information on 506 census housing tracts in the Boston area.

The goal is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms. The dataset below contains 13 predictors, and the response is the median house price (MEDV).

*1) Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for?

*2) Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. Write the equation for predicting the median house price from the predictors in the model.

*3) Using the estimated regression model, what median house price is predicted for a tract in the Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6?

*4) Reduce the number of predictors:

* a) Which predictors are likely to be measuring the same thing among the 13 predictors? Discuss the relationships among INDUS, NOX, and TAX.

* b) Compute the correlation table for the 12 numerical predictors and search for highly correlated pairs. These have potential redundancy and can cause multicollinearity. Choose which ones to remove based on the above table.

* c) Use stepwise regression with the three options (backward, forward, both) to reduce the remaining predictors as follows: Run stepwise on the training set. Choose the top model from each stepwise run. Then use each of these models separately to predict the validation set. Compare RMSE, MAPE, and mean error, as well as lift charts. Finally, describe the best model.

Dataset: https://github.com/selva86/datasets/blob/master/BostonHousing.csv

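# install.packages() installs a package; installed.packages() only lists what is already installed.
# Run this once if the packages are missing:
# install.packages(c("ggpubr", "Metrics", "PerformanceAnalytics", "DescTools"))

library(correlation)
library(ggplot2)
library(dplyr)
library(broom)
library(DescTools)
library(Metrics)  # provides rmse() and mape()

# Read the CSV picked in the file dialog and store the result
housing.df <- read.csv(file.choose())

# The assignment uses upper-case names (MEDV, CRIM, ...), but the CSV may use lower-case ones.
# R is case-sensitive, so normalize the names once here.
names(housing.df) <- toupper(names(housing.df))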

# (1) Why should the data be partitioned into training and validation sets? What will the training set be used for?
# What will the validation set be used for?

# We partition the data into training and validation sets to assess how well
# the model generalizes to tracts it has not seen.
# The training set is used to fit the model, i.e. to estimate the relationships
# between the predictors and MEDV. The validation set is held out and used
# to measure prediction error and to compare candidate models.
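The partitioning idea can be sketched in a few lines of base R. This is only an illustration on the built-in mtcars data, not the Boston data; the seed, the 70% share, and the names train_set, valid_set, and valid_rmse are assumptions made for the example.

```r
# Minimal sketch: 70/30 split, fit on training rows, score on validation rows.
set.seed(42)                                   # arbitrary seed so the split is reproducible
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))  # row numbers for the training set
train_set <- mtcars[train_idx, ]
valid_set <- mtcars[-train_idx, ]              # everything not used for training

fit <- lm(mpg ~ wt + hp, data = train_set)     # the model only ever sees train_set
pred <- predict(fit, newdata = valid_set)      # predictions for unseen rows
valid_rmse <- sqrt(mean((valid_set$mpg - pred)^2))
print(valid_rmse)
```

The key point is that the error is computed from predict() on the held-out rows rather than from the fitted values of the training data.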

# (2) Fit a multiple linear regression model to the median house price (MEDV)
# as a function of CRIM, CHAS, and RM. Write the equation for
# predicting the median house price from the predictors in the model.

# This model is fit to the full data set. The 70%/30% training/validation split
# is used later, for the stepwise regression in part (d).

reg <-lm(MEDV~CRIM+CHAS+RM, data=housing.df)

summary(reg)

# The fitted regression model is MEDV = -28.81 - 0.261*CRIM + 3.76*CHAS + 8.28*RM

# (3) Using the estimated regression model, what median house price is predicted for a tract in the
# Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6?

# MEDV = -28.81 - 0.261*0.1 + 3.76*0 + 8.28*6

coef(reg) %*% c(1, 0.1, 0, 6)

# The predicted MEDV is about 20.83. MEDV is recorded in thousands of dollars,
# so the predicted median house price is about $20,832.
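As a quick check of the arithmetic, the prediction can be reproduced by hand from the rounded coefficients quoted in the equation. The vector names below are made up for the illustration; because the coefficients are rounded, the result differs slightly from what the unrounded fitted model returns.

```r
# Rounded coefficients from the equation: intercept, CRIM, CHAS, RM
coefs <- c(intercept = -28.81, CRIM = -0.261, CHAS = 3.76, RM = 8.28)

# New tract: 1 for the intercept, CRIM = 0.1, CHAS = 0 (not on the river), RM = 6
x_new <- c(1, 0.1, 0, 6)

pred_medv <- sum(coefs * x_new)   # -28.81 - 0.0261 + 0 + 49.68
print(pred_medv)                  # in thousands of dollars
print(pred_medv * 1000)           # in dollars
```

With the rounded values this gives 20.8439, i.e. roughly $20,844, against about $20,832 from the unrounded coefficients.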

# (4)(a) Reduce the number of predictors: which predictors are likely to be measuring the same thing among the 13 predictors?
# (b) Discuss the relationships among INDUS, NOX, and TAX.

# (a) Some of the predictors are likely to measure the same thing in different ways.
# Examples are ZN, INDUS, and TAX: each one relates to the land and housing in a tract.

indus=housing.df$INDUS

nox=housing.df$NOX

tax=housing.df$TAX

d=data.frame(indus,nox,tax)

cor(d)

# The correlation between indus and nox is .7637, between indus and tax is .7208,
# and between nox and tax is .6680.

# (b) INDUS, NOX, and TAX are highly positively correlated with one another: tracts with
# a higher proportion of non-retail business land tend to have higher pollution and higher taxes.

# (c) Compute the correlation table for the 12 numerical predictors

# and search for highly correlated pairs. These have potential redundancy and can cause multi-collinearity.

# Choose which ones to remove based on the above table

crim=housing.df$CRIM

zn=housing.df$ZN

chas=housing.df$CHAS

rm=housing.df$RM

age=housing.df$AGE

dis=housing.df$DIS

rad=housing.df$RAD

ptratio=housing.df$PTRATIO

lstat=housing.df$LSTAT

b=housing.df$B

# The 12 numerical predictors: CHAS is a 0/1 dummy variable, so it is left out and B is included
data=data.frame(crim,zn,indus,nox,rm,age,dis,rad,tax,ptratio,b,lstat)

cor(data)

# There is a high positive correlation between nox and indus = 0.7637

# There is a high positive correlation between rad and tax = 0.91022

# There is a high negative correlation between dis and nox = -0.7692

# We might remove the nox predictor according to the given matrix
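Scanning a 12-by-12 correlation table by eye is error-prone, so the highly correlated pairs can also be pulled out programmatically. The sketch below uses the built-in mtcars data rather than the Boston data, and the 0.7 cutoff and the name high_pairs are assumptions chosen for the example.

```r
# Correlation matrix for a few numeric columns of a built-in data set
cm <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Blank out the diagonal and the upper triangle so each pair appears once
cm[upper.tri(cm, diag = TRUE)] <- NA

# Row/column positions of the entries whose absolute correlation exceeds the cutoff
cutoff <- 0.7
idx <- which(abs(cm) > cutoff, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs[order(-abs(high_pairs$r)), ])
```

Applied to the predictor data frame, the same steps would list pairs such as rad and tax without reading the whole table.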

# (d) Use stepwise regression with the three options (backward, forward, both)

# to reduce the remaining predictors as follows: Run stepwise on the training set.

# Choose the top model from each stepwise run. Then use each of these models separately to predict the validation set.

# Compare RMSE, MAPE, and mean error, as well as lift charts. Finally, describe the best model.

# Stepwise regression

# The models with minimum AIC are:

# Backward: medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + lstat

# Forward: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat

# Both: medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + lstat

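library(Metrics)  # rmse() and mape()

set.seed(1)  # make the random split reproducible
spec = c(train = .7, validate = .3)

df = data.frame(housing.df)
# Column names are case-sensitive. If the file has MEDV, lm() cannot find medv,
# validate$medv is NULL, and rmse() on NULL returns NaN. Use lower-case names throughout.
names(df) = tolower(names(df))

# Randomly label each row as train or validate
g = sample(cut(
  seq(nrow(df)),
  nrow(df)*cumsum(c(0,spec)),
  labels = names(spec)
))
res = split(df,g)
train=res$train
validate=res$validate

# Run stepwise on the training set, not on the validation set
full = lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + lstat, data=train)
null = lm(medv ~ 1, data=train)

model_backward = step(full, direction = "backward")
# Forward selection has to start from the intercept-only model and be given an upper scope
model_forward = step(null, scope = list(lower = ~1, upper = formula(full)), direction = "forward")
model_both = step(null, scope = list(lower = ~1, upper = formula(full)), direction = "both")
summary(model_backward)

# Compare the models on validation-set predictions, not on fitted values
models = list(backward = model_backward, forward = model_forward, both = model_both)
for (m in names(models)) {
  pred = predict(models[[m]], newdata = validate)
  # mape() returns a fraction; multiply by 100 for a percentage
  cat(m, "RMSE:", rmse(validate$medv, pred),
      "MAPE:", mape(validate$medv, pred),
      "mean error:", mean(validate$medv - pred), "\n")
}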
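The three validation metrics can also be computed by hand in base R, which makes it easier to see where a NaN comes from. The sketch below uses made-up numbers and hypothetical helper names (mean_error, rmse_manual, mape_manual). It assumes the NaN is caused by a column-name case mismatch (MEDV in the file, medv in the code), which is consistent with the missing-variable error but is not confirmed by the post.

```r
# Hand-written versions of the three metrics (hypothetical helper names)
mean_error  <- function(actual, predicted) mean(actual - predicted)
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape_manual <- function(actual, predicted) 100 * mean(abs((actual - predicted) / actual))

actual    <- c(20, 25, 40)   # made-up validation values
predicted <- c(22, 24, 36)   # made-up predictions

me_val   <- mean_error(actual, predicted)    # (-2 + 1 + 4) / 3
rmse_val <- rmse_manual(actual, predicted)   # sqrt((4 + 1 + 16) / 3)
mape_val <- mape_manual(actual, predicted)   # 100 * (0.10 + 0.04 + 0.10) / 3
print(c(mean_error = me_val, rmse = rmse_val, mape = mape_val))

# Asking for a column under the wrong case returns NULL, not an error ...
toy <- data.frame(MEDV = actual)
wrong_case <- toy$medv
print(is.null(wrong_case))

# ... and a metric computed on NULL averages an empty vector, which gives NaN
nan_result <- rmse_manual(wrong_case, predicted)
print(nan_result)
```

If names(validate) shows MEDV rather than medv, using the exact name (or lower-casing the names once after reading the file) should make the metric return a number.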
