Question
(a) [5 pts] Load and preprocess the data using Pandas or Numpy and, if necessary, preprocessing functions from scikit-learn. For this problem you do not need to normalize or standardize the data. However, you may need to handle missing values by imputing those values based on variable means. Compute and display basic statistics (mean, standard deviation, min, max, etc.) for the variables in the data set. Separate the target attribute for regression. Use scikit-learn to create a randomized split of the data. Set aside the test portion; the training data partition will be used for cross-validation on the various tasks specified below.
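A minimal sketch of part (a) is shown below; the file name data.csv, the target column name target, and the 80/20 split ratio are assumptions, not part of the original problem.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data set (file name and target column name are placeholders)
df = pd.read_csv("data.csv")

# Impute missing values in numeric columns with the column (variable) means
df = df.fillna(df.mean(numeric_only=True))

# Basic statistics: mean, std, min, max, quartiles for each variable
print(df.describe())

# Separate the target attribute from the predictors
X = df.drop(columns=["target"])
y = df["target"]

# Randomized train/test split; the training partition is reused for cross-validation below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```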
(b) [pts] Perform standard multiple linear regression on the data using the implementation from the corresponding chapter of MLA. Compute the RMSE values on the full training data partition. Also, plot the correlation between the predicted and actual values of the target attribute. Display the obtained regression coefficients (weights) and plot them using matplotlib. Finally, perform k-fold cross-validation on the training partition and compare the cross-validation RMSE to the training RMSE (for cross-validation, you should use the KFold module from sklearn.model_selection). Note: if you cannot get the book's version working, use scikit-learn's LinearRegression instead, for a point deduction.
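If the book's implementation is unavailable, the scikit-learn fallback could look roughly like the sketch below; it reuses X_train/y_train from part (a), and the 10-fold setting is an assumption since the fold count did not survive in the problem text.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

lr = LinearRegression().fit(X_train, y_train)

# RMSE on the full training partition
pred_train = lr.predict(X_train)
rmse_train = np.sqrt(np.mean((y_train - pred_train) ** 2))
print("Training RMSE:", rmse_train)

# Correlation between predicted and actual target values
plt.scatter(y_train, pred_train, alpha=0.5)
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.show()

# Regression coefficients (weights)
plt.bar(range(len(lr.coef_)), lr.coef_)
plt.xticks(range(len(lr.coef_)), X_train.columns, rotation=90)
plt.tight_layout()
plt.show()

# k-fold cross-validation RMSE on the training partition (k = 10 assumed)
kf = KFold(n_splits=10, shuffle=True, random_state=42)
neg_mse = cross_val_score(lr, X_train, y_train, cv=kf, scoring="neg_mean_squared_error")
print("Cross-validation RMSE:", np.sqrt(-neg_mse).mean())
```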
(c) [pts] Feature Selection: use the scikit-learn regression model from sklearn.linear_model with a subset of features to perform linear regression. For feature selection, write a script or function that takes as input the training data, the target variable, the model, and any other parameters you find necessary, and returns the optimal percentage of the most informative features to use. Your approach should use k-fold cross-validation on the training data (you can choose the value of k). You can use feature_selection.SelectPercentile to find the most informative variables. Show the list of the most informative variables and their weights. Note: since this is regression, not classification, you should use feature_selection.f_regression as the scoring function rather than chi2. Next, plot the model's mean absolute error values on cross-validation relative to the percentage of selected features (see scikit-learn's metrics.mean_absolute_error). In order to use cross_validation.cross_val_score with regression, you'll need to pass scoring='neg_mean_absolute_error' to it as a parameter. Hint: for an example of a similar feature selection process, please review the class example notebook. Also, review the scikit-learn documentation for feature selection.
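One possible shape for the feature-selection helper is sketched below; the percentile grid (10% to 100% in steps of 10), the k = 5 folds, and fitting the selector on the whole training partition (rather than inside each fold via a Pipeline) are all simplifying assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def best_feature_percentile(X, y, model, percentiles=range(10, 101, 10), cv=5):
    """Return the percentage of top features with the lowest cross-validated MAE."""
    mae = []
    for p in percentiles:
        X_sel = SelectPercentile(score_func=f_regression, percentile=p).fit_transform(X, y)
        neg_mae = cross_val_score(model, X_sel, y, cv=cv, scoring="neg_mean_absolute_error")
        mae.append(-neg_mae.mean())
    best_p = list(percentiles)[int(np.argmin(mae))]
    return best_p, mae

best_p, mae = best_feature_percentile(X_train, y_train, LinearRegression())
print("Optimal percentage of features:", best_p)

# Most informative variables and their weights under the best percentile
selector = SelectPercentile(score_func=f_regression, percentile=best_p).fit(X_train, y_train)
selected = X_train.columns[selector.get_support()]
weights = LinearRegression().fit(X_train[selected], y_train).coef_
print(dict(zip(selected, weights)))

# Cross-validation MAE versus percentage of selected features
plt.plot(list(range(10, 101, 10)), mae, marker="o")
plt.xlabel("Percent of features selected")
plt.ylabel("Cross-validation MAE")
plt.show()
```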
(d) [pts] Next, using the original train/test split from part (a), perform Ridge Regression and Lasso Regression using the corresponding modules from sklearn.linear_model. In each case, perform systematic model selection to identify the optimal alpha parameter. You should create a function that takes as input the data and target variable; the parameter to vary and a list of its values; the model to be trained; and any other relevant input needed to determine the optimal value for the specified parameter. The model selection process should perform k-fold cross-validation (k should be a parameter, but you can select a value of k for this problem). For each model, you should also plot the error values on the training and cross-validation splits across the specified values of the alpha parameter. Finally, using the best alpha values, train the model on the full training data and evaluate it on the set-aside test data. Discuss your observations and conclusions, especially about the impact of alpha on the bias-variance tradeoff. Hint: for an example of a similar model selection process, please review the class example notebook.
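The parameter-selection function might be structured as in the sketch below (shown for Ridge; rerunning it with Lasso() is analogous). The alpha grid spanning 10^-3 to 10^3 and the k = 5 default are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def select_parameter(X, y, param_name, param_values, model, k=5):
    """Return the parameter value with the lowest mean cross-validation RMSE,
    plus the training and validation RMSE curves over the candidate values."""
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    train_rmse, cv_rmse = [], []
    for value in param_values:
        model.set_params(**{param_name: value})
        tr, va = [], []
        for train_idx, val_idx in kf.split(X):
            model.fit(X[train_idx], y[train_idx])
            tr.append(np.sqrt(mean_squared_error(y[train_idx], model.predict(X[train_idx]))))
            va.append(np.sqrt(mean_squared_error(y[val_idx], model.predict(X[val_idx]))))
        train_rmse.append(np.mean(tr))
        cv_rmse.append(np.mean(va))
    best = param_values[int(np.argmin(cv_rmse))]
    return best, train_rmse, cv_rmse

alphas = np.logspace(-3, 3, 13)  # assumed search range for alpha
best_alpha, tr_err, cv_err = select_parameter(
    X_train.to_numpy(), y_train.to_numpy(), "alpha", alphas, Ridge())

# Training vs. cross-validation error across the alpha values
plt.semilogx(alphas, tr_err, label="training RMSE")
plt.semilogx(alphas, cv_err, label="cross-validation RMSE")
plt.xlabel("alpha")
plt.ylabel("RMSE")
plt.legend()
plt.show()

# Retrain on the full training data with the best alpha, then evaluate on the held-out test set
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("Ridge test RMSE:", np.sqrt(mean_squared_error(y_test, final.predict(X_test))))
# Repeat the whole procedure with Lasso() to compare the two penalties
```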
(e) [pts] Next, perform regression using Stochastic Gradient Descent. For this part, you should use the SGDRegressor module from sklearn.linear_model. Again, start by creating a randomized train/test split. SGDRegressor requires that features be standardized (centered to zero mean and scaled by the standard deviation). Prior to fitting the model, perform the scaling using StandardScaler from sklearn.preprocessing. For this problem, perform a grid search using GridSearchCV from sklearn.grid_search. Your grid search should compare combinations of the two penalty parameters ('l1' and 'l2') and different values of alpha (alpha could vary from 0.0001, which is the default, to relatively large values). Using the best parameters, apply the model to the set-aside test data. Finally, perform model selection similar to part (d) above to find the best "l1_ratio" parameter using SGDRegressor with the "elasticnet" penalty parameter. Note: "l1_ratio" is the Elastic Net mixing parameter, with 0 <= l1_ratio <= 1; l1_ratio = 0 corresponds to the L2 penalty, l1_ratio = 1 to the L1 penalty; it defaults to 0.15. Using the best mixing ratio, apply the Elastic Net model to the set-aside test data. Provide a summary of your findings from the above experiments.
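A sketch of the SGD experiments is below. In recent scikit-learn releases GridSearchCV lives in sklearn.model_selection rather than sklearn.grid_search, so that import is used here; the alpha grid's upper end, the 5-fold setting, the l1_ratio grid, and the neg_mean_squared_error scoring are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error

# Fresh randomized train/test split; fit the scaler on the training data only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Grid search over the penalty type and the regularization strength alpha
param_grid = {
    "penalty": ["l1", "l2"],
    "alpha": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0],  # 0.0001 is the default; upper end assumed
}
grid = GridSearchCV(SGDRegressor(max_iter=1000, random_state=42), param_grid,
                    cv=5, scoring="neg_mean_squared_error")
grid.fit(X_tr_s, y_tr)
print("Best parameters:", grid.best_params_)
print("Test RMSE:", np.sqrt(mean_squared_error(y_te, grid.predict(X_te_s))))

# Elastic Net: search the l1_ratio mixing parameter (0 = L2 only, 1 = L1 only, default 0.15)
en_grid = GridSearchCV(
    SGDRegressor(penalty="elasticnet", alpha=grid.best_params_["alpha"],
                 max_iter=1000, random_state=42),
    {"l1_ratio": np.linspace(0.0, 1.0, 11)}, cv=5, scoring="neg_mean_squared_error")
en_grid.fit(X_tr_s, y_tr)
print("Best l1_ratio:", en_grid.best_params_["l1_ratio"])
print("Elastic Net test RMSE:", np.sqrt(mean_squared_error(y_te, en_grid.predict(X_te_s))))
```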