
Question


Problem 2. (20 points) In this problem we aim at generalizing the logistic regression algorithm to solve the multi-class classification problem, the setting where the label space includes three or more classes, i.e., $\mathcal{Y} = \{1, 2, \ldots, K\}$ for $K \geq 3$. To do so, consider training data $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where the feature vectors are $x_i \in \mathbb{R}^d$ and $y_i \in \mathcal{Y}$ for $i = 1, 2, \ldots, n$. Here we assume that we have $K$ different classes and each input $x_i$ is a $d$-dimensional vector. We wish to fit a linear model in a similar spirit to logistic regression, but we will use the softmax function, instead of the logistic function, to link the linear inputs to the categorical output. One way to generalize the binary model to the multiclass case is to have $K$ sets of parameters $w_k$ for $k = 1, 2, \ldots, K$ and compute the probability distribution of the output as follows (for simplicity we ignore the intercept parameters $b_k$):

$$\mathbb{P}[y = k \mid x; w_1, \ldots, w_K] = \frac{e^{w_k^\top x}}{\sum_{j=1}^{K} e^{w_j^\top x}} \quad \text{for } k = 1, \ldots, K.$$

Note that we indeed have a probability distribution, as $\sum_{k=1}^{K} \mathbb{P}[y = k \mid x; w_1, \ldots, w_K] = 1$. To make the model identifiable, we will fix $w_K = 0$, which means we have $K - 1$ sets of parameters $\{w_1, \ldots, w_{K-1}\}$, each a $d$-dimensional vector, to learn. To see this, note that the parameter vector for class $K$ can be inferred from the rest, since the probabilities of all labels sum up to one (similar to binary classification, where we only had a single parameter vector). To make a prediction, we can extend the binary prediction rule to the multiclass setting by letting the prediction function be

$$f^{(\text{multiclass})}(x) = \operatorname*{argmax}_{k \in \{1, 2, \ldots, K\}} \mathbb{P}[y = k \mid x].$$

As in logistic regression, we will assume that the $y_i$ are i.i.d., which leads to the following likelihood function:

$$\mathbb{P}[y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n; w_1, \ldots, w_K] = \prod_{i=1}^{n} \mathbb{P}[y_i \mid x_i; w_1, \ldots, w_K].$$

a) Please write down explicitly the log-likelihood function and simplify it as much as you can to derive the training loss that needs to be optimized.

b) Compute the gradient of the likelihood with respect to each $w_k$ and simplify it.

c) Derive the stochastic gradient descent (SGD) update for multiclass logistic regression.
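The expert's step-by-step solution is not reproduced here. As a reference point, below is a minimal derivation sketch for parts (a)-(c), assuming the standard maximum-likelihood treatment of the softmax model defined above; it is a sketch, not necessarily the graded solution.

```latex
% (a) Log-likelihood: take logs of the product and substitute the softmax model.
\ell(w_1,\dots,w_K)
  = \sum_{i=1}^{n} \log \mathbb{P}[y_i \mid x_i; w_1,\dots,w_K]
  = \sum_{i=1}^{n} \left( w_{y_i}^{\top} x_i - \log \sum_{j=1}^{K} e^{w_j^{\top} x_i} \right)

% The training loss to minimize is the negative log-likelihood (cross-entropy):
L(w_1,\dots,w_K)
  = \sum_{i=1}^{n} \left( \log \sum_{j=1}^{K} e^{w_j^{\top} x_i} - w_{y_i}^{\top} x_i \right)

% (b) Gradient of the log-likelihood with respect to w_k, where 1[.] is the
% indicator function and the softmax term is the predicted probability of class k:
\nabla_{w_k} \ell
  = \sum_{i=1}^{n} \left( \mathbb{1}[y_i = k]
      - \mathbb{P}[y = k \mid x_i; w_1,\dots,w_K] \right) x_i

% (c) SGD: sample one example (x_i, y_i) and ascend its log-likelihood term
% with step size \eta, for k = 1,...,K-1 (keeping w_K = 0 fixed):
w_k \leftarrow w_k
  + \eta \left( \mathbb{1}[y_i = k]
      - \mathbb{P}[y = k \mid x_i; w_1,\dots,w_K] \right) x_i
```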
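For concreteness, here is a small runnable sketch of the resulting algorithm in Python/NumPy. Everything in it is illustrative rather than part of the original problem: the function names (softmax_probs, predict, sgd_step), the learning rate, and the synthetic data are all assumptions, and for simplicity it learns all K parameter vectors instead of pinning w_K to zero as the identifiability argument suggests.

```python
import numpy as np

def softmax_probs(W, x):
    """P[y = k | x] for k = 1..K under the softmax model above.

    W : (K, d) array whose rows are the parameter vectors w_1..w_K.
    x : (d,) feature vector.
    """
    scores = W @ x
    scores -= scores.max()          # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def predict(W, x):
    """f^(multiclass)(x) = argmax_k P[y = k | x] (classes 0-indexed here)."""
    return int(np.argmax(softmax_probs(W, x)))

def sgd_step(W, x, y, lr=0.1):
    """One SGD update on example (x, y): for every class k,
    w_k <- w_k + lr * (1[y == k] - P[y = k | x]) * x."""
    p = softmax_probs(W, x)
    onehot = np.zeros(W.shape[0])
    onehot[y] = 1.0
    W += lr * np.outer(onehot - p, x)   # stacks all K per-class updates
    return W

# Tiny usage example on synthetic data (illustrative only).
rng = np.random.default_rng(0)
K, d, n = 3, 4, 200
W_true = rng.normal(size=(K, d))
X = rng.normal(size=(n, d))
y = np.array([predict(W_true, x) for x in X])   # labels from a planted model

W = np.zeros((K, d))
for epoch in range(20):
    for i in rng.permutation(n):
        sgd_step(W, X[i], y[i])
acc = np.mean([predict(W, x) for x in X] == y)
print(f"training accuracy: {acc:.2f}")
```

The update inside sgd_step is the part-(c) rule written in matrix form: the outer product of the residual (one-hot label minus predicted probabilities) with x stacks the per-class updates into a single (K, d) step.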


