Question

1 Approved Answer

Posted on Oct 12, 2020

Libraries

You will need scikit-learn to obtain the data and split it into training and test sets, and NumPy to run linear regression. You may also wish to use pandas DataFrames to examine and work with the data, but this is not a requirement. Use Matplotlib’s pyplot framework or pandas to visualize your results.

Note that you may not use scikit-learn to run linear regression for your answers. You may, however, wish to use the results of sklearn.linear_model.LinearRegression to spot-check the answers you receive from your NumPy code.

You may reuse code from the Jupyter notebooks accompanying the textbook and from the documentation for the libraries. All other code and the results of experiments should be your own.

Dataset

The scikit-learn sklearn.datasets module includes some small datasets for experimentation. In this project we will use the Boston house prices dataset to try and predict the median value of a home given several features of its neighborhood.

See the section on scikit-learn in Sergiy Kolesnikov’s blog article Datasets in Python to see how to load this dataset and examine it using pandas DataFrames.

Reminder: while you will use scikit-learn to obtain the dataset, your linear regression implementation must use NumPy directly.

Experiments

Run the following experiments in a Jupyter notebook, performing each action in a code cell and answering each question in a Markdown cell.

1. Load and examine the Boston dataset’s features, target values, and description.

2. Use sklearn.model_selection.train_test_split() to split the features and values into separate training and test sets. Use 80% of the original data as a training set, and 20% for testing.

3. Create a scatterplot of the training set showing the relationship between the feature LSTAT and the target value MEDV. Does the relationship appear to be linear?

4. With LSTAT as X and MEDV as t, use np.linalg.inv() to compute w for the training set. What is the equation for MEDV as a linear function of LSTAT?

5. Use w to add a line to your scatter plot from experiment (3). How well does the model appear to fit the training set?

6. Use w to find the response for each value of the LSTAT attribute in the test set, then compute the test MSE for the model.

7. Now add an x² column to LSTAT’s x column in the training set, then repeat experiments (4), (5), and (6) for MEDV as a quadratic function of LSTAT. Does the quadratic polynomial do a better job of predicting the values in the test set?

8. Repeat experiment (4) with all 13 input features as X and using np.linalg.lstsq(). (See the Appendix to Linear regression in vector and matrix format for details of why we need to switch away from np.linalg.inv(), and the notes for np.linalg.solve() for why we shouldn’t use that either.) Does adding additional features improve the performance on the test set compared to using only LSTAT?

9. Now add x² columns for all 13 features, and repeat experiment (8). Does adding quadratic features improve the performance on the test set compared to using only linear features?

10. Compute the training MSE for experiments (8) and (9) and compare it to the test MSE. What explains the difference?

11. Repeat experiments (9) and (10), adding x³ columns in addition to the existing x and x² columns for each feature. Does the cubic polynomial do a better job of predicting the values in the training set? Does it do a better job of predicting the values in the test set?
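As a rough illustration of the mechanics behind experiments (4), (6), and (7), here is a minimal NumPy-only sketch. It is not the assignment's solution: it runs on synthetic data as a stand-in for LSTAT and MEDV, it splits by a shuffled index instead of calling train_test_split(), and the helper names (design_matrix, fit_normal_equations, fit_lstsq, mse) are made up for this example. It assumes the usual least-squares solution w = (XᵀX)⁻¹Xᵀt with a column of ones for the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one feature (x) and a target (t); NOT the Boston data.
n = 200
x = rng.uniform(1.0, 35.0, size=n)
t = 40.0 - 1.5 * x + 0.02 * x**2 + rng.normal(0.0, 2.0, size=n)

# Simple 80/20 split by shuffled index (the assignment uses train_test_split).
idx = rng.permutation(n)
train, test = idx[:160], idx[160:]


def design_matrix(x, degree):
    """Columns [1, x, x^2, ..., x^degree] for a single feature."""
    return np.column_stack([x**d for d in range(degree + 1)])


def fit_normal_equations(X, t):
    """w = (X^T X)^{-1} X^T t, computed with np.linalg.inv."""
    return np.linalg.inv(X.T @ X) @ X.T @ t


def fit_lstsq(X, t):
    """The same least-squares problem solved with np.linalg.lstsq."""
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w


def mse(X, t, w):
    """Mean squared error of the predictions X @ w against t."""
    residuals = X @ w - t
    return float(np.mean(residuals**2))


# Fit a linear (degree 1) and a quadratic (degree 2) model on the training
# rows, then score each on both the training rows and the held-out rows.
results = {}
for degree in (1, 2):
    X_train = design_matrix(x[train], degree)
    X_test = design_matrix(x[test], degree)
    w = fit_normal_equations(X_train, t[train])
    results[degree] = {
        "w": w,
        "train_mse": mse(X_train, t[train], w),
        "test_mse": mse(X_test, t[test], w),
    }
    print(degree, w, results[degree]["train_mse"], results[degree]["test_mse"])
```

For the multi-feature experiments, the same mse helper works unchanged; only the design matrix grows, and fit_lstsq would take the place of fit_normal_equations, as the assignment asks.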

Step by Step Solution

There are 3 Steps involved in it

Step: 1

1. Load and examine the Boston dataset's features, target values, and description. The Boston dataset is a collection of data points describing various features of homes in the city of Boston. It includes informatio...