Question
2. Which properties of the Lasso path generalize to other loss functions?

Recall we showed the optimality conditions for a Lasso solution:

$$\hat\beta(\lambda)_k \neq 0 \;\Rightarrow\; X_k^T\big(Y - X\hat\beta(\lambda)\big) = \frac{\lambda}{2}\,\mathrm{sgn}\big(\hat\beta(\lambda)_k\big) \tag{1}$$

$$\hat\beta(\lambda)_k = 0 \;\Rightarrow\; \big|X_k^T\big(Y - X\hat\beta(\lambda)\big)\big| \leq \frac{\lambda}{2} \tag{2}$$

$$\forall k: \;\; \big|X_k^T\big(Y - X\hat\beta(\lambda)\big)\big| \leq \frac{\lambda}{2} \tag{3}$$

where, as we noted in class, $-2X^T\big(Y - X\hat\beta(\lambda)\big) = \frac{\partial RSS(\beta)}{\partial \beta}\Big|_{\beta=\hat\beta(\lambda)}$ is the derivative of the loss function.

We noted in class the following properties of the set of solutions $\{\hat\beta(\lambda) : 0 \leq \lambda \leq \infty\}$:

i. All the variables in the solution are "highly correlated" with the current residual, from (1) above, and all the variables with zero coefficients are "less correlated" with the current residual, from (2)-(3) above.

ii. The solution path $\{\hat\beta(\lambda) : 0 \leq \lambda \leq \infty\}$ as a function of $\lambda$ can be described by a collection of "breakpoints" $\infty > \lambda_1 > \lambda_2 > \dots > \lambda_K > 0$ such that the set $A_k$ of active variables with non-zero coefficients is fixed for all solutions $\hat\beta(\lambda)$ with $\lambda_k \geq \lambda \geq \lambda_{k+1}$.

iii. $\hat\beta(\lambda)$ is a piecewise linear function; in other words, for $\lambda$ in this range we have $\hat\beta(\lambda) = \hat\beta(\lambda_k) + v_k(\lambda_k - \lambda)$, for a vector $v_k$ we explicitly derived in class.

Assume now that we want to build a different type of model with a different convex and infinitely differentiable loss function, say a logistic regression model for a binary classification task, and add a lasso penalty to that:

$$\hat\beta(\lambda) = \arg\min_\beta \sum_{i=1}^{n} \log\big\{1 + \exp\{-y_i x_i^T \beta\}\big\} + \lambda\|\beta\|_1$$

We would like to investigate which of the properties above still hold for the solution of this problem.

(a) Using simple arguments about derivatives and sub-derivatives, as we used in class for the quadratic loss case, argue that three conditions like (1)-(3) can be written for this case too, with the appropriate derivative replacing the empirical correlation. Derive these expressions explicitly for the logistic case.

(b) Explain clearly why this implies that properties (i) and (ii) still hold (for (ii), you may find the continuity of the derivative useful).

(c) Does the piecewise linearity still hold? A clear intuitive explanation is sufficient here.
Hint: Consider how we obtained the linearity in $\lambda$ for squared loss in class by decomposing the correlation vector $X^T(Y - X\beta) = X^TY - X^TX\beta$.
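To make the squared-loss conditions and the hint concrete, here is a minimal numerical sketch, not part of the original question. It assumes the special case of a design matrix with orthonormal columns, where the minimizer of RSS plus the lasso penalty is given by soft-thresholding of X^T Y at lambda/2. The function names `soft_threshold_lasso` and `check_conditions`, and the synthetic data, are hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_threshold_lasso(X, Y, lam):
    """Minimizer of ||Y - X b||^2 + lam * ||b||_1.

    ASSUMPTION: X has orthonormal columns (X^T X = I); the closed form
    below is not valid for a general design matrix.
    """
    z = X.T @ Y  # this is the X^T Y term from the hint's decomposition
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)


def check_conditions(X, Y, beta, lam, tol=1e-9):
    """Return booleans for conditions (1), (2), (3) at a candidate beta."""
    corr = X.T @ (Y - X @ beta)  # correlations with the current residual
    active = beta != 0
    cond1 = np.allclose(corr[active], lam / 2.0 * np.sign(beta[active]), atol=tol)
    cond2 = bool(np.all(np.abs(corr[~active]) <= lam / 2.0 + tol))
    cond3 = bool(np.all(np.abs(corr) <= lam / 2.0 + tol))
    return cond1, cond2, cond3


# Synthetic data (illustrative only): orthonormal columns via a QR factorization.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(50, 5)))
Y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.01 * rng.normal(size=50)

for lam in (0.5, 1.0, 3.0):
    beta = soft_threshold_lasso(X, Y, lam)
    print(lam, np.flatnonzero(beta), check_conditions(X, Y, beta, lam))
```

In this special case the correlation vector is X^T Y minus beta, which is affine in beta, so on any range of lambda where the active set does not change, the solution moves linearly in lambda. The sketch covers only the squared-loss setting of the hint; it says nothing by itself about the logistic loss.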