Posted on Sep 21, 2024

Assignment 2

In this assignment you'll explore the relationship between model complexity and generalization performance, by adjusting key parameters of various supervised learning models. Part 1 of this assignment will look at regression and Part 2 will look at classification.

Part 1 - Regression

First, run the following block to set up the variables needed for later sections.

In [ ]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

# You can use this function to help you visualize the dataset by
# plotting a scatterplot of the data points
# in the training and test sets.
def part1_scatter():
    import matplotlib.pyplot as plt
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);

# NOTE: Uncomment the function call below to visualize the data, but be sure
# to **re-comment it before submitting this assignment to the autograder**.
#part1_scatter()

Question 1

Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model.) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store them in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.

The figure above shows the fitted models plotted on top of the original data (using plot_one()).

This function should return a numpy array with shape (4, 100)
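
One possible way to approach Question 1 is sketched below. It is an illustration and not a confirmed reference solution: the function name answer_one and the variable grid are assumptions, and the setup cell is repeated so the sketch runs on its own. The 1-D inputs are reshaped to column vectors because PolynomialFeatures expects a 2-D array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Recreate the setup cell so this sketch is self-contained.
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)


def answer_one():
    # Hypothetical name; the assignment does not fix the function name here.
    grid = np.linspace(0, 10, 100).reshape(-1, 1)  # 100 evaluation points
    predictions = np.zeros((4, 100))
    for row, degree in enumerate([1, 3, 6, 9]):
        poly = PolynomialFeatures(degree=degree)
        # Expand the single input column into polynomial terms up to `degree`.
        X_poly = poly.fit_transform(X_train.reshape(-1, 1))
        model = LinearRegression().fit(X_poly, y_train)
        # Apply the same expansion to the grid before predicting.
        predictions[row] = model.predict(poly.transform(grid))
    return predictions


result_one = answer_one()
```

Each row of result_one holds the fitted curve for one degree, in the order 1, 3, 6, 9.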

Question 2

Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 0 through 9. For each model compute the $R^2$ (coefficient of determination) regression score on the training data as well as the test data, and return both of these arrays in a tuple.

This function should return one tuple of numpy arrays (r2_train, r2_test). Both arrays should have shape (10,).
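
A minimal sketch of Question 2 follows, assuming the same setup cell as above (repeated here so the block runs alone). The name answer_two is an assumption. The score method of LinearRegression returns the coefficient of determination, so no separate metric call is needed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Recreate the setup cell so this sketch is self-contained.
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)


def answer_two():
    # Hypothetical name; one entry per polynomial degree 0..9.
    r2_train = np.zeros(10)
    r2_test = np.zeros(10)
    for degree in range(10):
        poly = PolynomialFeatures(degree=degree)
        X_tr = poly.fit_transform(X_train.reshape(-1, 1))
        X_te = poly.transform(X_test.reshape(-1, 1))
        model = LinearRegression().fit(X_tr, y_train)
        # .score() on a regressor is the R^2 coefficient of determination.
        r2_train[degree] = model.score(X_tr, y_train)
        r2_test[degree] = model.score(X_te, y_test)
    return (r2_train, r2_test)


r2_train, r2_test = answer_two()
```

The degree-0 model predicts the training mean everywhere, so its training score sits at zero by definition of $R^2$.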

Question 3

Based on the $R^2$ scores from question 2 (degree levels 0 through 9), what degree level corresponds to a model that is underfitting? What degree level corresponds to a model that is overfitting? What choice of degree level would provide a model with good generalization performance on this dataset?

Hint: Try plotting the $R^2$ scores from question 2 to visualize the relationship between degree level and $R^2$. Remember to comment out the import matplotlib line before submission.

This function should return one tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization). There might be multiple correct solutions; however, you only need to return one possible solution, for example (1,2,3).
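
Besides plotting, the three degrees can be read off the score arrays with a simple heuristic. The sketch below is one such heuristic and not the assignment's grading rule: it treats the lowest training score as underfitting, the largest train/test gap as overfitting, and the highest test score as good generalization. The helper pick_degrees and the toy score arrays are made up purely for illustration; they are not results from this dataset.

```python
import numpy as np


def pick_degrees(r2_train, r2_test):
    # Hypothetical helper: index i of each array is the score at degree i.
    r2_train = np.asarray(r2_train, dtype=float)
    r2_test = np.asarray(r2_test, dtype=float)
    underfitting = int(np.argmin(r2_train))           # fits even the training data poorly
    overfitting = int(np.argmax(r2_train - r2_test))  # largest train/test gap
    good = int(np.argmax(r2_test))                    # best score on held-out data
    return (underfitting, overfitting, good)


# Made-up scores for degrees 0..4, used only to exercise the helper.
toy_train = np.array([0.0, 0.4, 0.9, 0.95, 0.99])
toy_test = np.array([-0.4, 0.1, 0.85, 0.6, -0.5])
choice = pick_degrees(toy_train, toy_test)
```

On real scores it is still worth checking the plot, since several degrees can be defensible answers.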

Question 4

Training models on high degree polynomial features can result in overly complex models that overfit, so we often use regularized versions of the model to constrain model complexity, as we saw with Ridge and Lasso linear regression.

For this question, train two models: a non-regularized LinearRegression model (default parameters) and a regularized Lasso Regression model (with parameters alpha=0.01, max_iter=10000), both on polynomial features of degree 12. Return the $R^2$ score of each model on the test set.

This function should return one tuple (LinearRegression_R2_test_score, Lasso_R2_test_score).
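
The comparison in Question 4 could be sketched as follows, again repeating the setup cell so the block is self-contained. The name answer_four is an assumption. Because the degree-12 features are left unscaled, Lasso may emit a convergence warning even with max_iter=10000; the sketch leaves that as is rather than adding scaling the assignment does not ask for.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Recreate the setup cell so this sketch is self-contained.
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)


def answer_four():
    # Hypothetical name; both models see the same degree-12 features.
    poly = PolynomialFeatures(degree=12)
    X_tr = poly.fit_transform(X_train.reshape(-1, 1))
    X_te = poly.transform(X_test.reshape(-1, 1))

    linreg = LinearRegression().fit(X_tr, y_train)
    lasso = Lasso(alpha=0.01, max_iter=10000).fit(X_tr, y_train)

    # R^2 of each model on the held-out test points.
    return (linreg.score(X_te, y_test), lasso.score(X_te, y_test))


linreg_r2, lasso_r2 = answer_four()
```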

Part 2 - Classification

Here's an application of machine learning that could save your life! For this section of the assignment we will be working with the UCI Mushroom Data Set stored in readonly/mushrooms.csv. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:

Attribute Information:

cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises?: bruises=t, no=f
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill-attachment: attached=a, descending=d, free=f, notched=n
gill-spacing: close=c, crowded=w, distant=d
gill-size: broad=b, narrow=n
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape: enlarging=e, tapering=t
stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil-type: partial=p, universal=u
veil-color: brown=n, orange=o, white=w, yellow=y
ring-number: none=n, one=o, two=t
ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

The data in the mushrooms dataset is currently encoded with strings. These values will need to be encoded to numeric to work with sklearn. We'll use pd.get_dummies to convert the categorical variables into indicator variables.
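
To make the encoding step concrete, here is a small sketch of pd.get_dummies on a toy frame. The three rows and the choice of columns are invented for illustration and are not taken from mushrooms.csv; the way the label column is separated afterwards is likewise an assumption about how one might proceed.

```python
import pandas as pd

# Toy frame using a few of the attribute codes listed above (made-up rows).
toy = pd.DataFrame({
    "class": ["p", "e", "e"],
    "odor": ["f", "n", "a"],
    "gill-size": ["n", "b", "b"],
})

# Each string column becomes one indicator column per distinct value.
encoded = pd.get_dummies(toy)

# Assumed split: use the "poisonous" indicator as the label, the rest as features.
y_toy = encoded["class_p"]
X_toy = encoded.drop(columns=["class_e", "class_p"])
```

Every category gets its own column named column_value, for example odor_f for a foul odor, which is why feature names in the encoded data carry a suffix.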

Question 5

Using X_train2 and y_train2 from the preceding cell, train a DecisionTreeClassifier with default parameters and random_state=0. What are the 5 most important features found by the decision tree?

As a reminder, the feature names are available in the X_train2.columns property, and the order of the features in X_train2.columns matches the order of the feature importance values in the classifier's feature_importances_ property.

This function should return a list of length 5 containing the feature names in descending order of importance.

Note: remember that you also need to set random_state in the DecisionTreeClassifier.
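
The pairing of column names with importance values can be sketched as below. Since X_train2 and y_train2 are defined in a cell not shown here, the sketch uses a synthetic indicator frame instead; the helper name top_features and the columns f0 through f5 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def top_features(X, y, k=5):
    # Hypothetical helper: returns the k most important column names.
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    # feature_importances_ follows the same order as X.columns.
    importances = pd.Series(clf.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(k).index.tolist()


# Synthetic stand-in for the encoded mushroom data: six random 0/1 columns,
# with the label copied from column "f2" so one feature clearly dominates.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.randint(0, 2, size=(200, 6)),
                      columns=[f"f{i}" for i in range(6)])
y_demo = X_demo["f2"]

top5 = top_features(X_demo, y_demo)
```

With the real data, the call would presumably be top_features(X_train2, y_train2).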

Question 6

For this question, we're going to use the validation_curve function in sklearn.model_selection to determine training and test scores for a Support Vector Classifier (SVC) with varying parameter values. Recall that the validation_curve function, in addition to taking an initialized unfitted classifier object, takes a dataset as input and does its own internal train-test splits to compute results.

Because creating a validation curve requires fitting multiple models, for performance reasons this question will use just a subset of the original mushroom dataset: please use the variables X_subset and y_subset as input to the validation curve function (instead of X_mush and y_mush) to reduce computation time.

The initialized unfitted classifier object we'll be using is a Support Vector Classifier with radial basis kernel. So your first step is to create an SVC object with default parameters (i.e. kernel='rbf', C=1) and random_state=0. Recall that the kernel width of the RBF kernel is controlled using the gamma parameter.

With this classifier, and the dataset in X_subset, y_subset, explore the effect of gamma on classifier accuracy by using the validation_curve function to find the training and test scores for 6 values of gamma from 0.0001 to 10 (i.e. np.logspace(-4,1,6)). Recall that you can specify what scoring metric you want validation_curve to use by setting the "scoring" parameter. In this case, we want to use "accuracy" as the scoring metric.

For each level of gamma, validation_curve will fit 3 models on different subsets of the data, returning two 6x3 (6 levels of gamma x 3 fits per level) arrays of the scores for the training and test sets.

Find the mean score across the three models for each level of gamma for both arrays, creating two arrays of length 6, and return a tuple with the two arrays.

For example, if one of your arrays of scores is

array([[ 0.5, 0.4, 0.6], [ 0.7, 0.8, 0.7], [ 0.9, 0.8, 0.8], [ 0.8, 0.7, 0.8], [ 0.7, 0.6, 0.6], [ 0.4, 0.6, 0.5]])

it should then become

array([ 0.5, 0.73333333, 0.83333333, 0.76666667, 0.63333333, 0.5])

This function should return one tuple of numpy arrays (training_scores, test_scores) where each array in the tuple has shape (6,).
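
The steps of Question 6 can be sketched as follows. X_subset and y_subset come from a cell not shown here, so the sketch substitutes synthetic data from make_classification; the name answer_six is also an assumption. The text above describes 3 fits per gamma level, so cv=3 is passed explicitly, because the default number of folds in validation_curve differs between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in for X_subset / y_subset (assumption, not the mushroom data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)


def answer_six(X, y):
    # Hypothetical name; returns mean train/test accuracy per gamma value.
    gammas = np.logspace(-4, 1, 6)
    train_scores, test_scores = validation_curve(
        SVC(kernel="rbf", C=1, random_state=0),
        X, y,
        param_name="gamma",
        param_range=gammas,
        scoring="accuracy",
        cv=3,  # 3 fits per gamma level, as described in the question
    )
    # Both arrays are 6x3; average across the 3 folds (axis=1) for each gamma.
    return (train_scores.mean(axis=1), test_scores.mean(axis=1))


training_scores, test_scores = answer_six(X_demo, y_demo)

# The worked example from the question, reduced the same way.
example = np.array([[0.5, 0.4, 0.6], [0.7, 0.8, 0.7], [0.9, 0.8, 0.8],
                    [0.8, 0.7, 0.8], [0.7, 0.6, 0.6], [0.4, 0.6, 0.5]])
example_means = example.mean(axis=1)
```

Averaging along axis=1 collapses the fold dimension, leaving one value per gamma level.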

Question 7

Based on the scores from question 6, what gamma value corresponds to a model that is underfitting (and has the worst test set accuracy)? What gamma value corresponds to a model that is overfitting (and has the worst test set accuracy)? What choice of gamma would be the best choice for a model with good generalization performance on this dataset (high accuracy on both training and test set)?

Hint: Try plotting the scores from question 6 to visualize the relationship between gamma and accuracy. Remember to comment out the import matplotlib line before submission.

This function should return one tuple with the gamma values in this order: (Underfitting, Overfitting, Good_Generalization). Please note there is only one correct solution.