Question: It is possible to define a regularizer to minimize ∑e (loss(Ŷ(e), Y(e)) + λ ∗ regularizer(Ŷ)) rather than formula (7.5) (page 303). How is this different from the existing regularizer? [Hint: Think about how this affects multiple datasets or cross validation.]
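A minimal sketch of the difference, assuming a squared-error loss and an L2 regularizer on a weight vector w (these specifics are illustrative, not the book's definitions). Because regularizer(Ŷ) does not depend on the example e, moving it inside the sum multiplies it by the number of examples:

    import numpy as np

    def objective_7_5(w, X, y, lam):
        # Formula (7.5): sum of per-example losses, plus the regularizer
        # added once, outside the sum.
        losses = (X @ w - y) ** 2
        return np.sum(losses) + lam * np.sum(w ** 2)

    def objective_alternative(w, X, y, lam):
        # Alternative: the regularizer is added inside the sum, once per
        # example. Since it does not depend on e, this equals
        # np.sum(losses) + len(y) * lam * np.sum(w ** 2).
        losses = (X @ w - y) ** 2
        return np.sum(losses + lam * np.sum(w ** 2))

    rng = np.random.default_rng(0)
    X, y, w = rng.normal(size=(50, 3)), rng.normal(size=50), np.ones(3)
    # The two objectives differ by (len(y) - 1) * lam * np.sum(w ** 2),
    # here 49 * 0.1 * 3.0 = 14.7:
    print(objective_alternative(w, X, y, 0.1) - objective_7_5(w, X, y, 0.1))

So on n examples the alternative behaves like formula (7.5) with λ replaced by n ∗ λ: the effective amount of regularization grows with the size of the dataset.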
Suppose λ is set by k-fold cross validation, and then the model is learned for the whole dataset. How would the algorithm differ between the original way(s) of defining a regularizer and this alternative way? [Hint: A different number of examples is used for the regularization during cross validation than in the full dataset; does this matter?] Which works better in practice?
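A hedged sketch of why the dataset size matters here (the numbers below are assumptions for illustration, not from the book). With k-fold cross validation, λ is chosen while training on roughly (k−1)/k of the data; when the model is then relearned on all n examples under formula (7.5), the summed loss gains more terms but the penalty does not, so the cross-validated λ effectively regularizes less. The per-example form keeps the penalty's relative weight constant:

    n, k = 1000, 10
    m = n * (k - 1) // k  # examples in each cross-validation training split
    lam = 0.1             # lambda selected by cross validation (assumed)
    reg = 5.0             # regularizer value of a candidate model (assumed)
    avg_loss = 2.0        # assumed average per-example loss

    def penalty_share_7_5(num_examples):
        # Formula (7.5): one penalty term against num_examples loss terms,
        # so its relative weight shrinks as the dataset grows.
        return lam * reg / (num_examples * avg_loss)

    def penalty_share_alternative(num_examples):
        # Per-example penalty: num_examples copies of the penalty, so its
        # relative weight is independent of the dataset size.
        return num_examples * lam * reg / (num_examples * avg_loss)

    print(penalty_share_7_5(m), penalty_share_7_5(n))                  # shrinks
    print(penalty_share_alternative(m), penalty_share_alternative(n))  # unchanged

Under formula (7.5) one would therefore rescale the cross-validated λ (e.g. by n/m) before learning on the whole dataset, whereas with the per-example regularizer the same λ can be reused directly; that convenience is one reason the per-example form can be preferable in practice.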
