Question

Posted on Sep 21, 2024

The file BostonHousing.csv below contains information collected by the US Bureau of the Census concerning housing in the area of Boston, Massachusetts. You will use it to practice data mining in R. The dataset includes information on 506 census housing tracts in the Boston area.

The goal is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms. The dataset below contains 13 predictors, and the response is the median house price (MEDV).

1) Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for?

2) Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. Write the equation for predicting the median house price from the predictors in the model.

3) Using the estimated regression model, what median house price is predicted for a tract in the Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6?

4) Reduce the number of predictors:

a) Which predictors are likely to be measuring the same thing among the 13 predictors? Discuss the relationships among INDUS, NOX, and TAX.

b) Compute the correlation table for the 12 numerical predictors and search for highly correlated pairs. These have potential redundancy and can cause multicollinearity. Choose which ones to remove based on this table.

c) Use stepwise regression with the three options (backward, forward, both) to reduce the remaining predictors as follows: Run stepwise on the training set. Choose the top model from each stepwise run. Then use each of these models separately to predict the validation set. Compare RMSE, MAPE, and mean error, as well as lift charts. Finally, describe the best model.
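Steps 1 through 4b can be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: it uses the `Boston` data frame from the MASS package as a stand-in for BostonHousing.csv (its column names are lowercase), and the 60/40 split, the seed, and the 0.7 correlation cutoff are arbitrary choices made for illustration.

```r
library(MASS)  # provides the Boston data frame (stand-in for BostonHousing.csv)

housing <- Boston

# 1) Partition: the training set is used to fit models, the validation set
#    to measure how well they predict tracts they have not seen.
set.seed(1)  # arbitrary seed, only for reproducibility
train_rows <- sample(nrow(housing), size = round(0.6 * nrow(housing)))
train    <- housing[train_rows, ]
validate <- housing[-train_rows, ]

# 2) Multiple linear regression of medv on crim, chas, and rm
fit3 <- lm(medv ~ crim + chas + rm, data = train)
coef(fit3)  # intercept and slopes for the prediction equation

# 3) Prediction for a tract not on the river, crime rate 0.1, 6 rooms
new_tract <- data.frame(crim = 0.1, chas = 0, rm = 6)
pred_new <- predict(fit3, newdata = new_tract)
pred_new

# 4b) Correlation table for the numerical predictors (chas is a dummy)
predictors <- setdiff(names(housing), c("medv", "chas"))
cor_table <- round(cor(train[, predictors]), 2)

# List each pair once, keeping those with |r| above the illustrative cutoff
high <- which(abs(cor_table) > 0.7 & upper.tri(cor_table), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_table)[high[, "row"]],
  var2 = colnames(cor_table)[high[, "col"]],
  r    = cor_table[high]
)
high_pairs
```

The coefficients printed by `coef(fit3)` give the prediction equation in the form MEDV = b0 + b1*CRIM + b2*CHAS + b3*RM. The exact values depend on which rows land in the training set.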

Dataset: https://github.com/selva86/datasets/blob/master/BostonHousing.csv

My code: https://github.com/MyGitHub2120/Housing-Codes

CRIM Per capita crime rate by town

ZN Proportion of residential land zoned for lots over 25,000 sq ft

INDUS Proportion of nonretail business acres per town

CHAS Charles River dummy variable (= 1 if tract bounds river; = 0 otherwise)

NOX Nitric oxide concentration (parts per 10 million)

RM Average number of rooms per dwelling

AGE Proportion of owner-occupied units built prior to 1940

DIS Weighted distances to five Boston employment centers

RAD Index of accessibility to radial highways

TAX Full-value property-tax rate per $10,000

PTRATIO Pupil/teacher ratio by town

LSTAT Percentage lower status of the population

MEDV Median value of owner-occupied homes in $1000s

I am having a problem with the last question (stepwise regression with RMSE and MAPE). I installed the PerformanceAnalytics package, but I keep getting these errors:

Error in rmse(validate$medv, model_backward$fitted.values) : could not find function "rmse"

Error in mape(validate$medv, model_backward$fitted.values) : could not find function "mape"
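One likely cause, offered as an assumption rather than a diagnosis: as far as I know, `rmse()` and `mape()` are not functions that PerformanceAnalytics provides (the Metrics package has functions with those names), and any package also has to be attached with `library()` after it is installed. The sketch below sidesteps the dependency by defining the measures in base R. It also calls `predict()` on the validation set, because `model_backward$fitted.values` holds fitted values for the training rows and generally will not line up with `validate$medv`. The `Boston` data from MASS stands in for the CSV, and the split, seed, and the `mean_error` helper name are illustrative choices. Lift charts are not covered here.

```r
library(MASS)  # Boston data frame, used here as a stand-in for the CSV

set.seed(1)  # arbitrary seed, only for reproducibility
train_rows <- sample(nrow(Boston), size = round(0.6 * nrow(Boston)))
train    <- Boston[train_rows, ]
validate <- Boston[-train_rows, ]

# Base-R versions of the accuracy measures (actual first, predicted second)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape <- function(actual, predicted) 100 * mean(abs((actual - predicted) / actual))
mean_error <- function(actual, predicted) mean(actual - predicted)

# The full and intercept-only models bound the stepwise search
full_model <- lm(medv ~ ., data = train)
null_model <- lm(medv ~ 1, data = train)
search_scope <- list(lower = formula(null_model), upper = formula(full_model))

model_backward <- step(full_model, direction = "backward", trace = 0)
model_forward  <- step(null_model, direction = "forward",
                       scope = search_scope, trace = 0)
model_both     <- step(null_model, direction = "both",
                       scope = search_scope, trace = 0)

models <- list(backward = model_backward,
               forward  = model_forward,
               both     = model_both)

# Score each model on the validation set with predict(), not fitted.values
results <- t(sapply(models, function(m) {
  pred <- predict(m, newdata = validate)
  c(RMSE = rmse(validate$medv, pred),
    MAPE = mape(validate$medv, pred),
    ME   = mean_error(validate$medv, pred))
}))
round(results, 3)
```

Each row of `results` is one stepwise model and each column is one measure, so the three models can be compared side by side. If the Metrics package is preferred instead, `library(Metrics)` should make `rmse()` and `mape()` available with the same argument order (actual, then predicted); note that its `mape()` returns a fraction rather than a percentage.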