Question

[5 pts] Load and preprocess the data using Pandas or NumPy and, if necessary, preprocessing functions from scikit-learn. For this problem you do not need to normalize or standardize the data. However, you may need to handle missing values by imputing them based on variable means. Compute and display basic statistics (mean, standard deviation, min, max, etc.) for the variables in the data set. Separate the target attribute for regression. Use scikit-learn to create a randomized 20%-80% split of the data. Set aside the 20% test portion; the 80% training partition will be used for cross-validation on the various tasks specified below.
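A minimal sketch of part (a), assuming the data sits in a CSV file named data.csv with a numeric target column named target; both names are placeholders for the actual assignment data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# "data.csv" and "target" are hypothetical names for this sketch
df = pd.read_csv("data.csv")

# impute missing values with the per-column means
df = df.fillna(df.mean(numeric_only=True))

# basic statistics: mean, std, min, max, quartiles
print(df.describe())

# separate the target attribute from the predictors
X = df.drop(columns=["target"])
y = df["target"]

# randomized 80% training / 20% set-aside test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```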
[10 pts] Perform standard multiple linear regression on the data using the implementation from Ch. 8 of MLA. Compute the RMSE values on the full training data (the 80% partition). Also, plot the correlation between the predicted and actual values of the target attribute. Display the obtained regression coefficients (weights) and plot them using matplotlib. Finally, perform 10-fold cross-validation on the training partition and compare the cross-validation RMSE to the training RMSE (for cross-validation, you should use the KFold module from sklearn.model_selection). [Note: if you cannot get the book's version working, use scikit-learn's LinearRegression instead for a 3 pt deduction.]
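One possible sketch of part (b), using the scikit-learn fallback the assignment permits in place of the MLA Ch. 8 code, and assuming the X_train/y_train split from part (a):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

lr = LinearRegression().fit(X_train, y_train)
pred = lr.predict(X_train)
print("training RMSE:", np.sqrt(mean_squared_error(y_train, pred)))

# correlation between predicted and actual target values
plt.scatter(y_train, pred, s=10)
plt.xlabel("actual"); plt.ylabel("predicted"); plt.show()

# regression coefficients (weights)
plt.bar(range(len(lr.coef_)), lr.coef_)
plt.xlabel("feature index"); plt.ylabel("weight"); plt.show()

# 10-fold cross-validation RMSE on the training partition
kf = KFold(n_splits=10, shuffle=True, random_state=42)
cv_rmse = []
for tr, va in kf.split(X_train):
    m = LinearRegression().fit(X_train.iloc[tr], y_train.iloc[tr])
    mse = mean_squared_error(y_train.iloc[va], m.predict(X_train.iloc[va]))
    cv_rmse.append(np.sqrt(mse))
print("10-fold CV RMSE:", np.mean(cv_rmse))
```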
[15 pts] Feature selection: use the scikit-learn regression model from sklearn.linear_model with a subset of features to perform linear regression. For feature selection, write a script or function that takes as input the training data; the target variable; the model; and any other parameters you find necessary, and returns the optimal percentage of the most informative features to use. Your approach should use k-fold cross-validation on the training data (you can use k=5). You can use feature_selection.SelectPercentile to find the most informative variables. Show the list of the most informative variables and their weights. [Note: since this is regression, not classification, you should use feature_selection.f_regression as the scoring function rather than chi2.] Next, plot the model's mean absolute error values on cross-validation relative to the percentage of selected features (see scikit-learn's metrics.mean_absolute_error). In order to use model_selection.cross_val_score with regression, you will need to pass it scoring='neg_mean_absolute_error' as a parameter. [Hint: for an example of a similar feature selection process, review the class example notebook and the scikit-learn documentation on feature selection.]
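A sketch of the feature-selection loop in part (c); the function name and the 10%-step percentile grid are choices of this sketch, not requirements of the assignment:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def best_feature_percentile(X, y, model, percentiles=range(10, 101, 10), k=5):
    """Return the percentile of top features minimizing k-fold CV MAE."""
    maes = []
    for p in percentiles:
        Xp = SelectPercentile(f_regression, percentile=p).fit_transform(X, y)
        mae = -cross_val_score(model, Xp, y, cv=k,
                               scoring="neg_mean_absolute_error").mean()
        maes.append(mae)
    return list(percentiles)[int(np.argmin(maes))], maes

best_p, maes = best_feature_percentile(X_train, y_train, LinearRegression())

# CV mean absolute error vs. percentage of selected features
plt.plot(list(range(10, 101, 10)), maes, marker="o")
plt.xlabel("percent of features kept"); plt.ylabel("CV MAE"); plt.show()

# most informative variables and their regression weights
sel = SelectPercentile(f_regression, percentile=best_p).fit(X_train, y_train)
cols = X_train.columns[sel.get_support()]
weights = LinearRegression().fit(X_train[cols], y_train).coef_
print(dict(zip(cols, weights)))
```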
[10 pts] Next, using the original train-test split from part (a), perform Ridge regression and Lasso regression using the modules from sklearn.linear_model. In each case, perform systematic model selection to identify the optimal alpha parameter. You should create a function that takes as input the data and target variable; the parameter to vary and a list of its values; the model to be trained; and any other relevant input needed to determine the optimal value for the specified parameter. The model selection process should perform k-fold cross-validation (k should be a parameter, but you can select k=5 for this problem). For each model, you should also plot the error values on the training and cross-validation splits across the specified values of the alpha parameter. Finally, using the best alpha values, train the model on the full training data and evaluate it on the set-aside test data. Discuss your observations and conclusions, especially about the impact of alpha on the bias-variance trade-off. [Hint: for an example of a similar model selection process, review the class example notebook.]
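A sketch of the generic model-selection helper described in part (d); the helper name select_param and the logarithmic alpha grid are assumptions of this sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def select_param(X, y, model_cls, param, values, k=5):
    """k-fold CV over values of one parameter; returns best value and errors."""
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    train_err, cv_err = [], []
    for v in values:
        tr_s, va_s = [], []
        for tr, va in kf.split(X):
            m = model_cls(**{param: v}).fit(X.iloc[tr], y.iloc[tr])
            tr_s.append(np.sqrt(mean_squared_error(y.iloc[tr], m.predict(X.iloc[tr]))))
            va_s.append(np.sqrt(mean_squared_error(y.iloc[va], m.predict(X.iloc[va]))))
        train_err.append(np.mean(tr_s)); cv_err.append(np.mean(va_s))
    return values[int(np.argmin(cv_err))], train_err, cv_err

alphas = np.logspace(-3, 2, 20)
for cls in (Ridge, Lasso):
    best_a, tr_e, cv_e = select_param(X_train, y_train, cls, "alpha", alphas)
    plt.semilogx(alphas, tr_e, label="train RMSE")
    plt.semilogx(alphas, cv_e, label="CV RMSE")
    plt.title(cls.__name__); plt.xlabel("alpha"); plt.legend(); plt.show()
    final = cls(alpha=best_a).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, final.predict(X_test)))
    print(cls.__name__, "best alpha:", best_a, "test RMSE:", rmse)
```

Small alphas typically reproduce the unregularized fit (low bias, higher variance), while large alphas shrink the weights toward zero (higher bias, lower variance); the cross-validation curve usually bottoms out between the two extremes.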
[10 pts] Next, perform regression using stochastic gradient descent. For this part, you should use the SGDRegressor module from sklearn.linear_model. Again, start by creating a randomized 80%-20% train-test split. SGDRegressor requires that features be standardized (with zero mean and scaled by standard deviation). Prior to fitting the model, perform the scaling using StandardScaler from sklearn.preprocessing. For this problem, perform a grid search using GridSearchCV from sklearn.model_selection. Your grid search should compare combinations of two penalty parameters ('l2', 'l1') and different values of alpha (alpha could vary from 0.0001, which is the default, to relatively large values, say 10). Using the best parameters, apply the model to the set-aside test data. Finally, perform model selection (similar to part (d) above) to find the best l1_ratio parameter using SGDRegressor with the 'elasticnet' penalty. [Note: l1_ratio is the Elastic Net mixing parameter, with 0 <= l1_ratio <= 1; l1_ratio=0 corresponds to the L2 penalty, l1_ratio=1 to the L1 penalty; it defaults to 0.15.] Using the best mixing ratio, apply the Elastic Net model to the set-aside test data. Provide a summary of your findings from the above experiments.
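Finally, a sketch of part (e) with GridSearchCV from sklearn.model_selection (the module that replaced the removed sklearn.grid_search); the max_iter setting and the alpha grid are choices of this sketch:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training split only, then apply to both splits
scaler = StandardScaler().fit(X_train)
Xtr, Xte = scaler.transform(X_train), scaler.transform(X_test)

# first search: grid over penalty type and regularization strength
grid = GridSearchCV(
    SGDRegressor(max_iter=1000, random_state=42),
    {"penalty": ["l2", "l1"],
     "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10]},
    cv=5, scoring="neg_mean_absolute_error")
grid.fit(Xtr, y_train)
print("best params:", grid.best_params_)
print("test R^2:", grid.best_estimator_.score(Xte, y_test))

# second search: best l1_ratio under the elastic net penalty
en = GridSearchCV(
    SGDRegressor(penalty="elasticnet", max_iter=1000, random_state=42),
    {"l1_ratio": np.linspace(0.0, 1.0, 11)},
    cv=5, scoring="neg_mean_absolute_error")
en.fit(Xtr, y_train)
print("best l1_ratio:", en.best_params_)
print("test R^2:", en.best_estimator_.score(Xte, y_test))
```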
