Question
Problem 2. (20 points) In this problem we aim at generalizing the logistic regression algorithm to solve the multiclass classification problem, the setting where the label space includes three or more classes, i.e., $\mathcal{Y} = \{1, 2, \ldots, K\}$ for $K \ge 3$. To do so, consider training data $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where the feature vectors are $x_i \in \mathbb{R}^d$ and $y_i \in \mathcal{Y}$ for $i = 1, 2, \ldots, n$. Here we assume that we have $K$ different classes and each input $x_i$ is a $d$-dimensional vector. We wish to fit a linear model in a similar spirit to logistic regression, using the softmax function, instead of the logistic function, to link the linear inputs to the categorical output. One way to generalize the binary model to the multiclass case is to have $K$ sets of parameters $w_k$ for $k = 1, 2, \ldots, K$, and compute the probability distribution of the output as follows (for simplicity we ignore the intercept parameters $b_k$):

$$P[y = k \mid x, w_1, \ldots, w_K] = \frac{e^{w_k^\top x}}{\sum_{j=1}^{K} e^{w_j^\top x}}, \quad \text{for } k = 1, \ldots, K.$$

Note that we indeed have a probability distribution, as $\sum_k P[y = k \mid x, w_1, \ldots, w_K] = 1$. To make the model identifiable, we will fix $w_K = 0$, which means we have $K - 1$ sets of parameters $\{w_1, \ldots, w_{K-1}\}$, where each one is a $d$-dimensional vector, to learn. To see this, note that the parameter vector for class $K$ can be inferred from the rest, since the probabilities over all labels sum to one (similar to binary classification, where we only had a single parameter vector). To make a prediction, we can extend the binary prediction rule to the multiclass setting by letting the prediction function be

$$f^{(\text{multiclass})}(x) = \operatorname*{argmax}_{k \in \{1, 2, \ldots, K\}} P[y = k \mid x].$$

As in logistic regression, we will assume that each $y_i$ is i.i.d., which leads to the following likelihood function:

$$P[y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n; w_1, \ldots, w_K] = \prod_{i=1}^{n} P[y_i \mid x_i, w_1, \ldots, w_K].$$

a) Write down the log-likelihood function explicitly and simplify it as much as you can to derive the training loss that needs to be optimized.
b) Compute the gradient of the log-likelihood with respect to each $w_k$ and simplify it.
c) Derive the stochastic gradient descent (SGD) update for multiclass logistic regression.
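A sketch of where parts (a)–(c) lead, using the standard softmax cross-entropy derivation (for compactness the formulas keep all $K$ vectors rather than fixing $w_K = 0$; $\mathbf{1}\{\cdot\}$ denotes the indicator):

```latex
% Part (a): log-likelihood, obtained by taking the log of the product
\log L(w_1,\ldots,w_K)
  = \sum_{i=1}^{n} \Big( w_{y_i}^\top x_i - \log \sum_{j=1}^{K} e^{w_j^\top x_i} \Big)

% Part (b): gradient with respect to w_k
\nabla_{w_k} \log L
  = \sum_{i=1}^{n} \big( \mathbf{1}\{y_i = k\} - P[y = k \mid x_i] \big)\, x_i

% Part (c): SGD update for a sampled index i and step size \eta
% (gradient ascent on the log-likelihood, i.e. descent on the negative)
w_k \leftarrow w_k + \eta \big( \mathbf{1}\{y_i = k\} - P[y = k \mid x_i] \big)\, x_i
```

The training loss in (a) is the negative of this log-likelihood, so minimizing it is equivalent to maximizing the likelihood.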
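As a minimal runnable sketch of the SGD update in part (c): the gradient of the per-example negative log-likelihood with respect to $w_k$ is $(p_k - \mathbf{1}\{y = k\})\,x$, where $p_k$ is the softmax probability. The function and variable names below are illustrative, not from the problem statement, and the toy data is made up for the usage example.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sgd_step(W, x, y, lr):
    """One SGD step for multiclass logistic regression.

    W  : (K, d) weight matrix, one row per class
    x  : (d,) feature vector
    y  : true class index in {0, ..., K-1}
    lr : step size eta
    """
    p = softmax(W @ x)       # predicted class probabilities, shape (K,)
    grad = np.outer(p, x)    # row k holds p_k * x
    grad[y] -= x             # true class gets (p_y - 1) * x
    return W - lr * grad     # descend the negative log-likelihood

# Toy usage: 3 classes, 2 features (hypothetical data, not from the problem).
rng = np.random.default_rng(0)
W = np.zeros((3, 2))
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
for xi, yi in zip(X, y):
    W = sgd_step(W, xi, yi, lr=0.1)
```

Note that the update touches every row of `W` on each step, because the softmax couples all $K$ class scores through its normalizing denominator.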