
Question


(Machine Learning) In Python3

Don't use preprocessing from sklearn


3.5 Ridge Regression (i.e., Linear Regression with L2 Regularization)

When we have a large number of features compared to instances, regularization can help control overfitting. Ridge regression is linear regression with L2 regularization. The regularization term is sometimes called a penalty term. The objective function for ridge regression is

    J(θ) = (1/m) Σ_{i=1}^{m} (θ^T x_i − y_i)^2 + λ θ^T θ,

where λ is the regularization parameter, which controls the degree of regularization. Note that the bias parameter is being regularized as well. We will address that below.

1. Compute the gradient of J(θ) and write down the expression for updating θ in the gradient descent algorithm. (Matrix/vector expression; no summations, please.) One possible matrix form is sketched after this list.
2. Implement compute_regularized_square_loss_gradient.
3. Implement regularized_grad_descent. (A hedged implementation sketch for items 2 and 3 follows the starter code below.)
4. For regression problems, we may prefer to leave the bias term unregularized. One approach is to change J(θ) so that the bias is separated out from the other parameters and left unregularized. Another approach that can achieve approximately the same thing is to use a very large number B, rather than 1, for the extra bias dimension. Explain why making B large decreases the effective regularization on the bias term, and how we can make that regularization as weak as we like (though not zero). A brief sketch of this reasoning also follows the list.
5. (Optional) Develop a formal statement of the claim in the previous problem, and prove the statement.
6. (Optional) Try various values of B to see what performs best on the test set.
7. Now fix B = 1. Choosing a reasonable step size (or using backtracking line search), find the θ that minimizes J(θ) over a range of λ. Plot the training average square loss and the test average square loss (just the average square loss part, without the regularization, in each case) as a function of λ. Your goal is to find the λ that gives the minimum average square loss on the test set. It is hard to predict what λ that will be, so start your search very broadly, looking over several orders of magnitude, for example λ ∈ {10^-7, 10^-5, 10^-3, 10^-1, 1, 10, 100}. Once you find a range that works better, keep zooming in. You may want to have log(λ) on the x-axis rather than λ. (A small example sweep follows the implementation sketch below.) [If you like, you may use sklearn to help with the hyperparameter search.]
8. What θ would you select for deployment and why?
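For item 1, one possible matrix/vector form (a sketch, assuming the average square loss (1/m)||Xθ − y||^2 written above, with X the num_instances × num_features design matrix; check the scaling against your course's conventions):

    ∇J(θ) = (2/m) X^T (Xθ − y) + 2λθ,

and the corresponding gradient descent update with step size α is

    θ ← θ − α [ (2/m) X^T (Xθ − y) + 2λθ ].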
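For item 4, one way to sketch the reasoning (an illustration, not necessarily the expected formal argument): suppose the extra bias dimension of every input is set to B instead of 1, and call its weight θ_b. The model's bias contribution is then B·θ_b, so to realize a given bias value c the weight only needs to be θ_b = c/B, and the penalty it incurs is λ·θ_b^2 = (λ/B^2)·c^2. The effective regularization strength on the bias is therefore λ/B^2, which can be made as weak as we like by taking B large, though it never becomes exactly zero.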
import numpy as np

### The gradient of regularized batch gradient descent
def compute_regularized_square_loss_gradient(X, y, theta, lambda_reg):
    """
    Compute the gradient of the L2-regularized average square loss function given X, y and theta

    Args:
        X - the feature vector, 2D numpy array of size (num_instances, num_features)
        y - the label vector, 1D numpy array of size (num_instances)
        theta - the parameter vector, 1D numpy array of size (num_features)
        lambda_reg - the regularization coefficient

    Returns:
        grad - gradient vector, 1D numpy array of size (num_features)
    """

### Regularized batch gradient descent
def regularized_grad_descent(X, y, alpha=0.05, lambda_reg=10**-2, num_step=1000):
    """
    Args:
        X - the feature vector, 2D numpy array of size (num_instances, num_features)
        y - the label vector, 1D numpy array of size (num_instances)
        alpha - step size in gradient descent
        lambda_reg - the regularization coefficient
        num_step - number of steps to run

    Returns:
        theta_hist - the history of the parameter vector, 2D numpy array of size (num_step+1, num_features);
                     for instance, theta in step 0 should be theta_hist[0], and theta after the final step is theta_hist[-1]
        loss_hist - the history of the average square loss function without the regularization term, 1D numpy array of size (num_step+1)
    """
    num_instances, num_features = X.shape[0], X.shape[1]
    theta = np.zeros(num_features)                      # Initialize theta
    theta_hist = np.zeros((num_step+1, num_features))   # Initialize theta_hist
    loss_hist = np.zeros(num_step+1)                    # Initialize loss_hist
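The functions above are starter code that the assignment asks you to complete. Below is a minimal sketch of one possible implementation of items 2 and 3, assuming the objective J(θ) = (1/m)||Xθ − y||^2 + λθ^Tθ given above; the names match the starter code, but treat it as an illustration rather than the official solution.

import numpy as np

def compute_regularized_square_loss_gradient(X, y, theta, lambda_reg):
    """Gradient of (1/m)*||X theta - y||^2 + lambda_reg * theta^T theta (assumed objective)."""
    num_instances = X.shape[0]
    residual = X.dot(theta) - y                          # shape (num_instances,)
    grad = (2.0 / num_instances) * X.T.dot(residual)     # gradient of the average square loss
    grad += 2.0 * lambda_reg * theta                     # gradient of the L2 penalty
    return grad

def regularized_grad_descent(X, y, alpha=0.05, lambda_reg=10**-2, num_step=1000):
    """Batch gradient descent; loss_hist stores the unregularized average square loss."""
    num_instances, num_features = X.shape[0], X.shape[1]
    theta = np.zeros(num_features)
    theta_hist = np.zeros((num_step + 1, num_features))
    loss_hist = np.zeros(num_step + 1)
    for step in range(num_step + 1):
        theta_hist[step] = theta
        loss_hist[step] = np.mean((X.dot(theta) - y) ** 2)   # average square loss, no penalty term
        if step < num_step:                                   # record the final iterate without stepping past it
            grad = compute_regularized_square_loss_gradient(X, y, theta, lambda_reg)
            theta = theta - alpha * grad
    return theta_hist, loss_hist

For item 7, a possible broad first sweep over λ could look like the following; X_train, y_train, X_test, and y_test are hypothetical names for data you would load and split yourself.

# X_train, y_train, X_test, y_test: assumed to be loaded elsewhere (hypothetical names)
for lambda_reg in [1e-7, 1e-5, 1e-3, 1e-1, 1, 10, 100]:
    theta_hist, train_loss_hist = regularized_grad_descent(X_train, y_train, alpha=0.05, lambda_reg=lambda_reg)
    theta = theta_hist[-1]                                    # parameters after the final step
    test_loss = np.mean((X_test.dot(theta) - y_test) ** 2)    # unregularized average square loss on the test set
    print(lambda_reg, train_loss_hist[-1], test_loss)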
