3 Regularization: Ridge and Lasso
Frequently, we are interested in finding 'simpler' models over more complex ones. This is essentially Occam's Razor (simpler models are better models), but complexity can mean different things. Two possible examples of this in regression:
When there are multiple sets of weights that produce the same output, we generally prefer models with smaller weights, so as not to amplify any noise in the data, and so as not to depend too heavily on any single component of the data. Simpler here means smaller weights.
When a model has weights that are close to zero, it may indicate that the components those weights are scaling are negligible, possibly random noise, and should be excluded. We could consider 'pruning' those components or features, or pushing those weights all the way to zero. Simpler here means weights that are too small are deleted, i.e., set to zero.
To achieve both of these goals, we utilize what's known as regularization: we augment the loss function so that we penalize models that are more complex and reward models that are simpler.
Ridge Regression: In this case, we take the loss function to be
$\mathcal{L}_{\mathrm{ridge}}(w) = \mathcal{L}_{\mathrm{data}}(w) + \lambda \sum_j w_j^2.$
This loss will be smaller when not only is the loss on the data set small, but the weights are also small. Models with smaller weights will be preferred over models with the same error but larger weights. The constant $\lambda$ determines exactly how strong the pressure toward small weights will be.
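As a concrete illustration (not part of the original assignment), here is a minimal sketch of a ridge loss for a linear model in PyTorch; the function and argument names are illustrative, and the data loss is assumed to be mean squared error.

```python
import torch

def ridge_loss(X, y, w, b, lam):
    # Data loss: mean squared error of a linear model X @ w + b.
    pred = X @ w + b
    mse = torch.mean((y - pred) ** 2)
    # Ridge penalty: lambda times the sum of squared weights (bias left unpenalized).
    return mse + lam * torch.sum(w ** 2)
```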
Lasso Regression: In this case, we take the loss function to be
$\mathcal{L}_{\mathrm{lasso}}(w) = \mathcal{L}_{\mathrm{data}}(w) + \lambda \sum_j |w_j|.$
By adding this penalty term, weights are pressured to be smaller, but weights that would be large without it are only pressured to be slightly smaller, while weights that would be small without it are pushed all the way to zero. The threshold for this behavior is set by $\lambda$: the larger $\lambda$ is, the more weights are pushed all the way to zero. We'll talk about this more in class and recitation, but it serves to automatically 'prune' features or components of the data that are below a certain significance.
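For comparison, a sketch of the corresponding lasso loss under the same assumptions as the ridge sketch above (linear model, mean squared error, illustrative names):

```python
import torch

def lasso_loss(X, y, w, b, lam):
    # Data loss: mean squared error of a linear model X @ w + b.
    pred = X @ w + b
    mse = torch.mean((y - pred) ** 2)
    # Lasso penalty: lambda times the sum of absolute weights (bias left unpenalized).
    return mse + lam * torch.sum(torch.abs(w))
```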
Regularization Bonus: Why is it worth keeping the offset (bias) term out of the regularization penalty terms?
Problem: For the training data and testing data, consider fitting a model by minimizing the ridge regression loss for various values of $\lambda$. Plot the testing loss, without the ridge penalty, as a function of $\lambda$. Are you able to improve the generalization of your model? What range of $\lambda$ seems useful?
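One possible shape for this experiment, as a hedged sketch rather than the intended solution: the tensors X_train, y_train, X_test, y_test are assumed to already exist, the model is assumed linear, and the optimizer settings are placeholders.

```python
import torch
import matplotlib.pyplot as plt

def fit_ridge(X, y, lam, steps=2000, lr=1e-2):
    # Fit a linear model by gradient descent on MSE + lam * sum(w_j^2).
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = X @ w + b
        loss = torch.mean((y - pred) ** 2) + lam * torch.sum(w ** 2)
        loss.backward()
        opt.step()
    return w.detach(), b.detach()

lams = [10.0 ** k for k in range(-4, 3)]
test_losses = []
for lam in lams:
    w, b = fit_ridge(X_train, y_train, lam)
    pred = X_test @ w + b
    # The testing loss is reported WITHOUT the ridge penalty.
    test_losses.append(torch.mean((y_test - pred) ** 2).item())

plt.semilogx(lams, test_losses, marker="o")
plt.xlabel("lambda")
plt.ylabel("test MSE (no penalty)")
plt.show()
```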
Problem: For the training data and testing data, consider fitting a model by minimizing the lasso regression loss for various values of $\lambda$. Show that as $\lambda$ goes up, the number of nonzero terms in the fitted model goes down. What features or components of the data are most persistent, surviving across a range of $\lambda$? How does that compare with what we know about how the output actually depends on the inputs? (A sketch of one way to run this sweep follows the plot list below.)
For this problem, you should generate at least these two plots:
As a function of $\lambda$, plot the number of nonzero weights that survive after training.
For each weight, plot (as dots) the $\lambda$ at which that weight goes to zero, killing that feature.
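A rough sketch of the sweep behind the first plot, under the same assumptions as above (X_train and y_train assumed to exist, linear model, plain gradient descent). Note that plain SGD on an L1 penalty rarely lands exactly on zero, so a small threshold stands in for "zero" here; proximal or coordinate-descent solvers would produce exact zeros.

```python
import torch
import matplotlib.pyplot as plt

def fit_lasso(X, y, lam, steps=5000, lr=1e-2):
    # Fit a linear model by gradient descent on MSE + lam * sum(|w_j|).
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = X @ w + b
        loss = torch.mean((y - pred) ** 2) + lam * torch.sum(torch.abs(w))
        loss.backward()
        opt.step()
    return w.detach(), b.detach()

lams = [10.0 ** k for k in range(-4, 2)]
nonzero_counts = []
for lam in lams:
    w, _ = fit_lasso(X_train, y_train, lam)
    # Treat a weight as "zero" once it falls below a small threshold.
    nonzero_counts.append(int((w.abs() > 1e-3).sum()))

plt.semilogx(lams, nonzero_counts, marker="o")
plt.xlabel("lambda")
plt.ylabel("number of surviving (nonzero) weights")
plt.show()
```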
Bonus: Note that the lasso regression term does not have a well-defined gradient everywhere, because of the sharp point of the absolute value function. How does PyTorch handle this when calculating gradients? Do some research.
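One way to start that research is a quick experiment. The snippet below checks what gradient PyTorch reports for |x| right at the kink; current versions return the subgradient 0 there, since the backward pass of abs uses the sign of the input.

```python
import torch

x = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)
torch.abs(x).sum().backward()
print(x.grad)  # expected: tensor([-1., 0., 1.]) -- sign(x), with 0 chosen at x = 0
```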