3 Regularization: Ridge and Lasso
Frequently, we are interested in finding 'simpler' models over more complex ones. This is essentially Occam's Razor (simpler models are better models), but complexity can mean different things. Two possible examples of this in regression:
When there are multiple sets of weights that produce the same output, we generally prefer models with smaller weights, so as not to amplify any noise in the data, and so as not to depend too heavily on any single component of the data. Simpler here means smaller weights.
When a model has weights that are close to zero, it may indicate that the components those weights are scaling are negligible, possibly random noise, and should be excluded. We could consider 'pruning' those components or features, or pushing those weights all the way to zero. Simpler here means weights that are too small are deleted, i.e., set to zero.
To achieve both of these goals, we utilize what's known as regularization: we augment the loss function so that we penalize models that are more complex and reward models that are simpler.
Ridge Regression: In this case, we take the loss function to be
$\mathcal{L}_{\mathrm{ridge}}(w) = \mathcal{L}_{\mathrm{data}}(w) + \lambda \sum_j w_j^2.$
This loss will be smaller when not only is the loss on the data set small, but the weights are also small. Models with smaller weights will be preferred over models with the same error but larger weights. The constant $\lambda$ determines exactly how strong the pressure toward small weights will be.
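As a concrete illustration (not part of the original assignment), here is a minimal sketch of a ridge loss for a linear model in PyTorch; the function and argument names are illustrative, and the data loss is assumed to be mean squared error.

```python
import torch

def ridge_loss(X, y, w, b, lam):
    # Data loss: mean squared error of a linear model X @ w + b.
    pred = X @ w + b
    mse = torch.mean((y - pred) ** 2)
    # Ridge penalty: lambda times the sum of squared weights (bias left unpenalized).
    return mse + lam * torch.sum(w ** 2)
```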
Lasso Regression: In this case, we take the loss function to be
$\mathcal{L}_{\mathrm{lasso}}(w) = \mathcal{L}_{\mathrm{data}}(w) + \lambda \sum_j |w_j|.$
By adding this penalty term, weights are pressured to be smaller, but weights that would be large without it are only pressured to be slightly smaller, while weights that would be small without it are pushed all the way to zero. The threshold for this behavior is set by $\lambda$: the larger $\lambda$ is, the more weights are pushed all the way to zero. We'll talk about this more in class and recitation, but it serves to automatically 'prune' features or components of the data that are below a certain significance.
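For comparison, a sketch of the corresponding lasso loss under the same assumptions as the ridge sketch above (linear model, mean squared error, illustrative names):

```python
import torch

def lasso_loss(X, y, w, b, lam):
    # Data loss: mean squared error of a linear model X @ w + b.
    pred = X @ w + b
    mse = torch.mean((y - pred) ** 2)
    # Lasso penalty: lambda times the sum of absolute weights (bias left unpenalized).
    return mse + lam * torch.sum(torch.abs(w))
```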
Regularization Bonus: Why is it worth keeping the offset (bias) term out of the regularization penalty terms?
Problem: For the training data and testing data, consider fitting a model by minimizing the ridge regression loss for various values of $\lambda$. Plot the testing loss, without the ridge penalty, as a function of $\lambda$. Are you able to improve the generalization of your model? What range of $\lambda$ seems useful?
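One possible shape for this experiment, as a hedged sketch rather than the intended solution: the tensors X_train, y_train, X_test, y_test are assumed to already exist, the model is assumed linear, and the optimizer settings are placeholders.

```python
import torch
import matplotlib.pyplot as plt

def fit_ridge(X, y, lam, steps=2000, lr=1e-2):
    # Fit a linear model by gradient descent on MSE + lam * sum(w_j^2).
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = X @ w + b
        loss = torch.mean((y - pred) ** 2) + lam * torch.sum(w ** 2)
        loss.backward()
        opt.step()
    return w.detach(), b.detach()

lams = [10.0 ** k for k in range(-4, 3)]
test_losses = []
for lam in lams:
    w, b = fit_ridge(X_train, y_train, lam)
    pred = X_test @ w + b
    # The testing loss is reported WITHOUT the ridge penalty.
    test_losses.append(torch.mean((y_test - pred) ** 2).item())

plt.semilogx(lams, test_losses, marker="o")
plt.xlabel("lambda")
plt.ylabel("test MSE (no penalty)")
plt.show()
```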
Problem: For the training data and testing data, consider fitting a model by minimizing the lasso regression loss for various values of $\lambda$. Show that as $\lambda$ goes up, the number of nonzero terms in the fitted model goes down. What features or components of the data are most persistent, surviving across a range of $\lambda$? How does that compare with what we know about how the output actually depends on the inputs? (A sketch of one way to run this sweep follows the plot list below.)
For this problem, you should generate at least these two plots:
As a function of $\lambda$, plot the number of nonzero weights that survive after training.
For each weight, plot (as dots) the $\lambda$ at which that weight goes to zero, killing that feature.
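A rough sketch of the sweep behind the first plot, under the same assumptions as above (X_train and y_train assumed to exist, linear model, plain gradient descent). Note that plain SGD on an L1 penalty rarely lands exactly on zero, so a small threshold stands in for "zero" here; proximal or coordinate-descent solvers would produce exact zeros.

```python
import torch
import matplotlib.pyplot as plt

def fit_lasso(X, y, lam, steps=5000, lr=1e-2):
    # Fit a linear model by gradient descent on MSE + lam * sum(|w_j|).
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = X @ w + b
        loss = torch.mean((y - pred) ** 2) + lam * torch.sum(torch.abs(w))
        loss.backward()
        opt.step()
    return w.detach(), b.detach()

lams = [10.0 ** k for k in range(-4, 2)]
nonzero_counts = []
for lam in lams:
    w, _ = fit_lasso(X_train, y_train, lam)
    # Treat a weight as "zero" once it falls below a small threshold.
    nonzero_counts.append(int((w.abs() > 1e-3).sum()))

plt.semilogx(lams, nonzero_counts, marker="o")
plt.xlabel("lambda")
plt.ylabel("number of surviving (nonzero) weights")
plt.show()
```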
Bonus: Note that the lasso regression term does not have a well-defined gradient everywhere, because of the sharp point of the absolute value function. How does PyTorch handle this when calculating gradients? Do some research.
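One way to start that research is a quick experiment. The snippet below checks what gradient PyTorch reports for |x| right at the kink; current versions return the subgradient 0 there, since the backward pass of abs uses the sign of the input.

```python
import torch

x = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)
torch.abs(x).sum().backward()
print(x.grad)  # expected: tensor([-1., 0., 1.]) -- sign(x), with 0 chosen at x = 0
```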