Question
(Machine Learning) In Python3
Don't use preprocessing from sklearn
3.6 Stochastic Gradient Descent

When the training data set is very large, evaluating the gradient of the objective function can take a long time, since it requires looking at each training example to take a single gradient step. When the objective function takes the form of an average of many values, such as $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} f_i(\theta)$ (as it does in the empirical risk), stochastic gradient descent (SGD) can be very effective. In SGD, rather than taking $-\nabla J(\theta)$ as our step direction, we take $-\nabla f_i(\theta)$ for some $i$ chosen uniformly at random from $\{1, \ldots, m\}$. The approximation is poor, but we will show it is unbiased. In machine learning applications, each $f_i(\theta)$ would be the loss on the $i$-th example (and of course we'd typically write $n$ instead of $m$ for the number of training points). In practical implementations for ML, the data points are randomly shuffled, and then we sweep through the whole training set one by one, performing an update for each training example individually. One pass through the data is called an epoch. Note that each epoch of SGD touches as much data as a single step of batch gradient descent. You can use the same ordering for each epoch, though optionally you could investigate whether reshuffling after each epoch affects the convergence speed.

1. Show that the ridge regression objective function can be written in the form $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} f_i(\theta)$ by giving an expression for $f_i(\theta)$ that makes the two expressions equivalent.
2. Show that the stochastic gradient $\nabla f_i(\theta)$, for $i$ chosen uniformly at random from $\{1, \ldots, m\}$, is an unbiased estimator of $\nabla J(\theta)$. In other words, show that $\mathbb{E}[\nabla f_i(\theta)] = \nabla J(\theta)$ for any $\theta$. (Hint: It will be easier, notationally, to prove this for a general $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} f_i(\theta)$, rather than for the specific case of ridge regression. You can start by writing down an expression for $\mathbb{E}[\nabla f_i(\theta)]$...)
3. Write down the update rule for $\theta$ in SGD for the ridge regression objective function.
4. Implement stochastic_grad_descent. (Note: You could potentially generalize the code you wrote for batch gradient descent to handle minibatches of any size, including 1, but this is not necessary.)
5. Use SGD to find the $\theta$ that minimizes the ridge regression objective for the $\lambda$ and $B$ that you selected in the previous problem. (If you could not solve the previous problem, choose $\lambda = 10^{-2}$ and $B = 1$.) Try a few fixed step sizes (at least try $\eta \in \{0.05, 0.005\}$). Note that SGD may not converge with a fixed step size; simply note your results. Next try step sizes that decrease with the step number $t$ according to the schedules $\eta_t = C/t$ and $\eta_t = C/\sqrt{t}$.
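One way parts 1-3 are often argued, as a sketch rather than the assignment's official solution: the explicit ridge form below, $f_i(\theta) = (\theta^\top x_i - y_i)^2 + \lambda\,\theta^\top\theta$, is an assumption, since the ridge objective from the previous problem is not reproduced above.

With that choice, $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} f_i(\theta)$ recovers the ridge objective, because the penalty $\lambda\,\theta^\top\theta$ appears identically in every term and averaging leaves it unchanged. For part 2, with $i$ uniform on $\{1,\ldots,m\}$ and using linearity of the gradient,

$$\mathbb{E}\big[\nabla f_i(\theta)\big] \;=\; \sum_{i=1}^{m} \frac{1}{m}\,\nabla f_i(\theta) \;=\; \nabla\!\Big(\frac{1}{m}\sum_{i=1}^{m} f_i(\theta)\Big) \;=\; \nabla J(\theta).$$

The corresponding SGD update for part 3, with step size $\eta_t$ at update $t$, would then be

$$\theta \;\leftarrow\; \theta - \eta_t\Big(2\,(\theta^\top x_i - y_i)\,x_i + 2\lambda\,\theta\Big).$$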
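For parts 4-5, here is a minimal sketch of what stochastic_grad_descent might look like, assuming a design matrix X of shape (m, d) with any bias/offset column (the B-valued feature from the earlier problem) already included, labels y of length m, and the ridge form of f_i(theta) used above. The function and argument names are illustrative, not the assignment's required signature.

import numpy as np


def ridge_loss(X, y, theta, lambda_reg):
    # Full ridge objective J(theta) = (1/m) * ||X theta - y||^2 + lambda * theta^T theta,
    # used only to track progress after each epoch.
    m = X.shape[0]
    residual = X @ theta - y
    return residual @ residual / m + lambda_reg * (theta @ theta)


def stochastic_grad_descent(X, y, lambda_reg=1e-2, eta=0.005, num_epochs=100, reshuffle=True):
    # eta may be a fixed float, or a callable t -> step size (t = 1, 2, ...),
    # so decreasing schedules like eta_t = C/t or C/sqrt(t) can be passed in directly.
    m, d = X.shape
    theta = np.zeros(d)
    loss_hist = np.zeros(num_epochs)   # objective value recorded at the end of each epoch
    order = np.arange(m)
    t = 0                              # global update counter, used by decreasing schedules
    for epoch in range(num_epochs):
        if reshuffle:
            np.random.shuffle(order)   # optional: reshuffle the sweep order each epoch
        for i in order:
            t += 1
            step = eta(t) if callable(eta) else eta
            # Gradient of the single-example term f_i(theta) = (x_i^T theta - y_i)^2 + lambda * theta^T theta
            grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i] + 2.0 * lambda_reg * theta
            theta = theta - step * grad_i
        loss_hist[epoch] = ridge_loss(X, y, theta, lambda_reg)
    return theta, loss_hist


# Example usage on synthetic data (purely illustrative):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)
    theta_fixed, hist_fixed = stochastic_grad_descent(X, y, eta=0.005)
    theta_sched, hist_sched = stochastic_grad_descent(X, y, eta=lambda t: 0.1 / np.sqrt(t))
    print(hist_fixed[-1], hist_sched[-1])

Accepting eta as either a number or a callable lets one function cover both the fixed step sizes and the $C/t$ and $C/\sqrt{t}$ schedules; comparing loss_hist across runs is one way to note whether a given fixed step size fails to converge.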