\[
\log \prod_{i=1}^m p(x^{(i)}\mid\theta)\,p(\theta) = \log p(\theta) + \sum_{i=1}^m \log p(x^{(i)}\mid\theta)
\]
\[
= \log p(\theta) + \sum_{i=1}^m \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}\mid\theta)
\]
\[
= \log p(\theta) + \sum_{i=1}^m \log \sum_{z^{(i)}} Q_i(z^{(i)})\,\frac{p(x^{(i)}, z^{(i)}\mid\theta)}{Q_i(z^{(i)})}
\]
\[
\geq \log p(\theta) + \sum_{i=1}^m \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}\mid\theta)}{Q_i(z^{(i)})},
\]

where we just did straightforward substitutions and rewritings, and the last step is given by Jensen's inequality. Requiring the inequality to be tight gives us the E-step:

\[
Q_i(z^{(i)}) = p(z^{(i)}\mid x^{(i)}; \theta).
\]

For the M-step we maximize the lower bound, i.e.

\[
\theta := \arg\max_\theta \left[ \log p(\theta) + \sum_{i=1}^m \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}\mid\theta)}{Q_i(z^{(i)})} \right].
\]

The M-step is tractable, since it only requires maximizing a linear combination of tractable concave terms $\log p(x, z\mid\theta)$ and $\log p(\theta)$.

2. [22 points] EM application

Consider the following problem. There are P papers submitted to a machine learning conference. Each of R reviewers reads each paper and gives it a score indicating how good he/she thought that paper was. We let $x^{(pr)}$ denote the score that reviewer r gave to paper p. A high score means the reviewer liked the paper and represents a recommendation from that reviewer that it be accepted for the conference. A low score means the reviewer did not like the paper.

We imagine that each paper has some "intrinsic," true value that we denote by $\mu_p$, where a large value means it's a good paper. Each reviewer is trying to estimate, based on reading the paper, what $\mu_p$ is; the score reported $x^{(pr)}$ is then reviewer r's guess of $\mu_p$.

However, some reviewers are just generally inclined to think all papers are good and tend to give all papers high scores; other reviewers may be particularly nasty and tend to give low scores to everything. (Similarly, different reviewers may have different amounts of variance in the way they review papers, making some reviewers more consistent/reliable than others.) We let $\nu_r$ denote the "bias" of reviewer r. A reviewer with bias $\nu_r$ is one whose scores generally tend to be $\nu_r$ higher than they should be.

All sorts of different random factors influence the reviewing process, and hence we will use a model that incorporates several sources of noise. Specifically, we assume that reviewers' scores are generated by a random process given as follows:

\[
y^{(pr)} \sim \mathcal{N}(\mu_p, \sigma_p^2), \qquad
z^{(pr)} \sim \mathcal{N}(\nu_r, \tau_r^2), \qquad
x^{(pr)} \mid y^{(pr)}, z^{(pr)} \sim \mathcal{N}(y^{(pr)} + z^{(pr)}, \sigma^2).
\]

The variables $y^{(pr)}$ and $z^{(pr)}$ are independent; the variables $(x, y, z)$ for different paper-reviewer pairs are also jointly independent. Also, we only ever observe the $x^{(pr)}$'s; thus, the $y^{(pr)}$'s and $z^{(pr)}$'s are all latent random variables.

We would like to estimate the parameters $\mu_p, \sigma_p^2, \nu_r, \tau_r^2$. If we obtain good estimates of the papers' "intrinsic values" $\mu_p$, these can then be used to make acceptance/rejection decisions for the conference.

We will estimate the parameters by maximizing the marginal likelihood of the data $\{x^{(pr)};\ p = 1, \ldots, P,\ r = 1, \ldots, R\}$. This problem has latent variables $y^{(pr)}$ and $z^{(pr)}$, and the maximum likelihood problem cannot be solved in closed form. So, we will use EM. Your task is to derive the EM update equations. Your final E and M step updates should consist only of addition/subtraction/multiplication/division/log/exp/sqrt of scalars, and addition/subtraction/multiplication/inverse/determinant of matrices. For simplicity, you need to treat only $\{\mu_p, \sigma_p^2;\ p = 1, \ldots, P\}$ and $\{\nu_r, \tau_r^2;\ r = 1, \ldots, R\}$ as parameters, i.e., treat $\sigma^2$ (the conditional variance of $x^{(pr)}$ given $y^{(pr)}$ and $z^{(pr)}$) as a fixed, known constant.
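To make the generative process concrete, here is a minimal NumPy simulation sketch. It is not part of the problem statement: the problem sizes, seed, and ground-truth parameter values below are made up for illustration; only the sampling structure comes from the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem size and ground-truth parameters (illustrative only).
P, R = 5, 3                      # papers, reviewers
sigma2 = 0.1                     # fixed, known conditional variance sigma^2
mu = rng.normal(5.0, 1.0, P)     # intrinsic paper values mu_p
sigma2_p = np.full(P, 0.5)       # paper variances sigma_p^2
nu = rng.normal(0.0, 1.0, R)     # reviewer biases nu_r
tau2_r = np.full(R, 0.3)         # reviewer variances tau_r^2

# Sample according to the model:
#   y^(pr) ~ N(mu_p, sigma_p^2),  z^(pr) ~ N(nu_r, tau_r^2),
#   x^(pr) | y, z ~ N(y^(pr) + z^(pr), sigma^2).
y = rng.normal(mu[:, None], np.sqrt(sigma2_p)[:, None], (P, R))
z = rng.normal(nu[None, :], np.sqrt(tau2_r)[None, :], (P, R))
x = rng.normal(y + z, np.sqrt(sigma2))   # only x is observed; y, z are latent
```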
(a) In this part, we will derive the E-step:

(i) The joint distribution $p(y^{(pr)}, z^{(pr)}, x^{(pr)})$ has the form of a multivariate Gaussian density. Find its associated mean vector and covariance matrix in terms of the parameters $\mu_p$, $\sigma_p^2$, $\nu_r$, $\tau_r^2$, and $\sigma^2$. [Hint: Recognize that $x^{(pr)}$ can be written as $x^{(pr)} = y^{(pr)} + z^{(pr)} + \epsilon^{(pr)}$, where $\epsilon^{(pr)} \sim \mathcal{N}(0, \sigma^2)$ is independent Gaussian noise.]

(ii) Derive an expression for $Q_{pr}(y^{(pr)}, z^{(pr)}) = p(y^{(pr)}, z^{(pr)} \mid x^{(pr)})$ (E-step), using the rules for conditioning on subsets of jointly Gaussian random variables (see the notes on Factor Analysis).

(b) Derive the M-step updates to the parameters $\{\mu_p, \nu_r, \sigma_p^2, \tau_r^2\}$. [Hint: It may help to express the lower bound on the likelihood in terms of an expectation with respect to $(y^{(pr)}, z^{(pr)})$ drawn from a distribution with density $Q_{pr}(y^{(pr)}, z^{(pr)})$.]

Remark. In a recent machine learning conference, John Platt (whose SMO algorithm you've seen) implemented a method quite similar to this one to estimate the papers' true scores $\mu_p$. (There, the problem was a bit more complicated because not all reviewers reviewed every paper, but the essential ideas are the same.) Because the model tried to estimate and correct for reviewers' biases $\nu_r$, its estimates of $\mu_p$ were significantly more useful for making accept/reject decisions than the reviewers' raw scores for a paper.

Answer: Let $\theta$ denote the whole set of parameters we are estimating. Then the EM steps for our problem are (at a high level):

(a) (E-step) For each p, r, set $Q_{pr}(y^{(pr)}, z^{(pr)}) = p(y^{(pr)}, z^{(pr)} \mid x^{(pr)}; \theta)$.

(b) (M-step) Set $\theta := \arg\max_\theta \sum_{p=1}^P \sum_{r=1}^R \mathbb{E}_{Q_{pr}(Y^{(pr)}, Z^{(pr)})}\!\left[\log p(x^{(pr)}, Y^{(pr)}, Z^{(pr)}; \theta)\right]$.

Now it's a matter of working out how these updates can actually be computed.

For the E-step, if we use Bayes's rule to compute $p(y^{(pr)}, z^{(pr)} \mid x^{(pr)})$, we'll get integrals of Gaussians in the denominator, which are tough to compute. Instead, observe that

\[
p(y^{(pr)}, z^{(pr)}, x^{(pr)}) = p(y^{(pr)}, z^{(pr)})\, p(x^{(pr)} \mid y^{(pr)}, z^{(pr)}) = p(y^{(pr)})\, p(z^{(pr)})\, p(x^{(pr)} \mid y^{(pr)}, z^{(pr)})
\]

is the product of three Gaussian densities, so it is itself a multivariate Gaussian density. Therefore, the joint distribution $p(y^{(pr)}, z^{(pr)}, x^{(pr)})$ is some type of normal distribution, so we can use the rules for conditioning Gaussians to compute the conditional.

To get a form for the joint density, we'll exploit the fact that a multivariate Gaussian density is fully parameterized by its mean vector and covariance matrix. To compute the mean vector, we'll rewrite $x^{(pr)}$ in the following way: $x^{(pr)} = y^{(pr)} + z^{(pr)} + \epsilon^{(pr)}$, where $\epsilon^{(pr)} \sim \mathcal{N}(0, \sigma^2)$ is independent Gaussian noise. [Footnote: To see why this follows from the definition in the problem statement, observe that the probability that $\epsilon^{(pr)} = x^{(pr)} - y^{(pr)} - z^{(pr)}$ takes on any specific value $\epsilon$ is $p(\epsilon^{(pr)} = \epsilon \mid y^{(pr)}, z^{(pr)}) = p(x^{(pr)} - y^{(pr)} - z^{(pr)} = \epsilon \mid y^{(pr)}, z^{(pr)}) = p(x^{(pr)} = \epsilon + y^{(pr)} + z^{(pr)} \mid y^{(pr)}, z^{(pr)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1}{2\sigma^2}\epsilon^2\right)$, which does not depend on either $y^{(pr)}$ or $z^{(pr)}$; hence $\epsilon^{(pr)}$ can be regarded as independent zero-mean Gaussian noise with variance $\sigma^2$.] Then,

\[
\mathbb{E}[y^{(pr)}] = \mu_p, \qquad \mathbb{E}[z^{(pr)}] = \nu_r,
\]
and
\[
\mathbb{E}[x^{(pr)}] = \mathbb{E}[y^{(pr)} + z^{(pr)} + \epsilon^{(pr)}] = \mathbb{E}[y^{(pr)}] + \mathbb{E}[z^{(pr)}] + \mathbb{E}[\epsilon^{(pr)}] = \mu_p + \nu_r + 0 = \mu_p + \nu_r.
\]

To compute the covariance matrix, observe that $\mathrm{Var}(y^{(pr)}) = \sigma_p^2$, $\mathrm{Var}(z^{(pr)}) = \tau_r^2$, and $\mathrm{Cov}(y^{(pr)}, z^{(pr)}) = \mathrm{Cov}(z^{(pr)}, y^{(pr)}) = 0$ (since $y^{(pr)}$ and $z^{(pr)}$ are independent). Also, since $y^{(pr)}$, $z^{(pr)}$, and $\epsilon^{(pr)}$ are independent, we have

\[
\mathrm{Var}(x^{(pr)}) = \mathrm{Var}(y^{(pr)} + z^{(pr)} + \epsilon^{(pr)}) = \mathrm{Var}(y^{(pr)}) + \mathrm{Var}(z^{(pr)}) + \mathrm{Var}(\epsilon^{(pr)}) = \sigma_p^2 + \tau_r^2 + \sigma^2.
\]

Finally,

\[
\mathrm{Cov}(y^{(pr)}, x^{(pr)}) = \mathrm{Cov}(x^{(pr)}, y^{(pr)}) = \mathrm{Cov}(y^{(pr)} + z^{(pr)} + \epsilon^{(pr)},\, y^{(pr)})
= \mathrm{Cov}(y^{(pr)}, y^{(pr)}) + \mathrm{Cov}(z^{(pr)}, y^{(pr)}) + \mathrm{Cov}(\epsilon^{(pr)}, y^{(pr)}) = \sigma_p^2 + 0 + 0 = \sigma_p^2,
\]

where the second-to-last equality follows from the independence of $y^{(pr)}$, $z^{(pr)}$, and $\epsilon^{(pr)}$. Similarly, we can show that $\mathrm{Cov}(z^{(pr)}, x^{(pr)}) = \mathrm{Cov}(x^{(pr)}, z^{(pr)}) = \tau_r^2$. This allows us to write

\[
\begin{pmatrix} y^{(pr)} \\ z^{(pr)} \\ x^{(pr)} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_p \\ \nu_r \\ \mu_p + \nu_r \end{pmatrix},
\begin{pmatrix}
\sigma_p^2 & 0 & \sigma_p^2 \\
0 & \tau_r^2 & \tau_r^2 \\
\sigma_p^2 & \tau_r^2 & \sigma_p^2 + \tau_r^2 + \sigma^2
\end{pmatrix}
\right).
\]

Now we can use the standard results for conditioning on subsets of variables for Gaussians (from the Factor Analysis notes) to obtain

\[
Q_{pr}(y^{(pr)}, z^{(pr)}) = \mathcal{N}\!\left(
\begin{pmatrix} \mu_{pr,Y} \\ \mu_{pr,Z} \end{pmatrix},
\begin{pmatrix} \Sigma_{pr,YY} & \Sigma_{pr,YZ} \\ \Sigma_{pr,ZY} & \Sigma_{pr,ZZ} \end{pmatrix}
\right),
\]

where

\[
\mu_{pr} = \begin{pmatrix} \mu_{pr,Y} \\ \mu_{pr,Z} \end{pmatrix}
= \begin{pmatrix}
\mu_p + \dfrac{\sigma_p^2}{\sigma_p^2 + \tau_r^2 + \sigma^2}\,(x^{(pr)} - \mu_p - \nu_r) \\[2ex]
\nu_r + \dfrac{\tau_r^2}{\sigma_p^2 + \tau_r^2 + \sigma^2}\,(x^{(pr)} - \mu_p - \nu_r)
\end{pmatrix}
\tag{1}
\]

\[
\Sigma_{pr} = \begin{pmatrix} \Sigma_{pr,YY} & \Sigma_{pr,YZ} \\ \Sigma_{pr,ZY} & \Sigma_{pr,ZZ} \end{pmatrix}
= \frac{1}{\sigma_p^2 + \tau_r^2 + \sigma^2}
\begin{pmatrix}
\sigma_p^2(\tau_r^2 + \sigma^2) & -\sigma_p^2\tau_r^2 \\
-\sigma_p^2\tau_r^2 & \tau_r^2(\sigma_p^2 + \sigma^2)
\end{pmatrix}.
\tag{2}
\]
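Since the E-step in (1)-(2) involves only scalar arithmetic, it vectorizes directly. Below is a small NumPy sketch of the per-pair posterior moments; the function and array names are my own and assume (P, R)-shaped arrays as in the simulation sketch earlier, not anything fixed by the solution.

```python
import numpy as np

def e_step(x, mu, nu, sigma2_p, tau2_r, sigma2):
    """Posterior moments of (y, z) given x, per equations (1)-(2).

    x: (P, R) observed scores; mu, sigma2_p: (P,); nu, tau2_r: (R,); sigma2: scalar.
    Returns the conditional means mu_Y, mu_Z and (co)variances S_YY, S_ZZ, S_YZ, all (P, R).
    """
    s = sigma2_p[:, None] + tau2_r[None, :] + sigma2             # sigma_p^2 + tau_r^2 + sigma^2
    resid = x - mu[:, None] - nu[None, :]                        # x^(pr) - mu_p - nu_r
    mu_Y = mu[:, None] + sigma2_p[:, None] / s * resid           # equation (1), first component
    mu_Z = nu[None, :] + tau2_r[None, :] / s * resid             # equation (1), second component
    S_YY = sigma2_p[:, None] * (tau2_r[None, :] + sigma2) / s    # equation (2), diagonal entries
    S_ZZ = tau2_r[None, :] * (sigma2_p[:, None] + sigma2) / s
    S_YZ = -sigma2_p[:, None] * tau2_r[None, :] / s              # off-diagonal (unused by the M-step)
    return mu_Y, mu_Z, S_YY, S_ZZ, S_YZ
```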
For the M-step, an important realization is that the $Q_{pr}$ distribution is defined in terms of $\theta^{(t)}$, while we want to choose the parameters for the next time step, $\theta^{(t+1)}$. This means that the parameters of the $Q_{pr}$ distributions are constant with respect to the parameters we wish to maximize. Maximizing the expected log-likelihood, we have (letting $\mathbb{E}_Q[\cdot]$ denote the expectation with respect to $Q_{pr}(y^{(pr)}, z^{(pr)})$ for each p and r, respectively):

\[
\theta = \arg\max_\theta \sum_{p=1}^P \sum_{r=1}^R \mathbb{E}_Q\!\left[\log p(x^{(pr)}, y^{(pr)}, z^{(pr)}; \theta)\right]
\]
\[
= \arg\max_\theta \sum_{p=1}^P \sum_{r=1}^R \mathbb{E}_Q\!\left[\log\!\left(\frac{1}{(2\pi)^{3/2}\,\sigma_p\,\tau_r\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x^{(pr)} - y^{(pr)} - z^{(pr)})^2}\, e^{-\frac{1}{2\sigma_p^2}(y^{(pr)} - \mu_p)^2}\, e^{-\frac{1}{2\tau_r^2}(z^{(pr)} - \nu_r)^2}\right)\right]
\]
\[
= \arg\max_\theta \sum_{p=1}^P \sum_{r=1}^R \mathbb{E}_Q\!\left[\log\frac{1}{\sigma_p\tau_r} - \frac{1}{2\sigma_p^2}(y^{(pr)} - \mu_p)^2 - \frac{1}{2\tau_r^2}(z^{(pr)} - \nu_r)^2\right]
\]
\[
= \arg\max_\theta \sum_{p=1}^P \sum_{r=1}^R \mathbb{E}_Q\!\left[\log\frac{1}{\sigma_p\tau_r} - \frac{1}{2\sigma_p^2}\big((y^{(pr)})^2 - 2y^{(pr)}\mu_p + \mu_p^2\big) - \frac{1}{2\tau_r^2}\big((z^{(pr)})^2 - 2z^{(pr)}\nu_r + \nu_r^2\big)\right]
\]
\[
= \arg\max_\theta \sum_{p=1}^P \sum_{r=1}^R \left[\log\frac{1}{\sigma_p\tau_r} - \frac{1}{2\sigma_p^2}\big(\mathbb{E}_Q[(y^{(pr)})^2] - 2\mathbb{E}_Q[y^{(pr)}]\mu_p + \mu_p^2\big) - \frac{1}{2\tau_r^2}\big(\mathbb{E}_Q[(z^{(pr)})^2] - 2\mathbb{E}_Q[z^{(pr)}]\nu_r + \nu_r^2\big)\right]
\]
\[
= \arg\max_\theta \sum_{p=1}^P \sum_{r=1}^R \left[\log\frac{1}{\sigma_p\tau_r} - \frac{1}{2\sigma_p^2}\big(\Sigma_{pr,YY} + \mu_{pr,Y}^2 - 2\mu_{pr,Y}\mu_p + \mu_p^2\big) - \frac{1}{2\tau_r^2}\big(\Sigma_{pr,ZZ} + \mu_{pr,Z}^2 - 2\mu_{pr,Z}\nu_r + \nu_r^2\big)\right],
\]

where in the third line we dropped the normalization constant and the $\frac{1}{2\sigma^2}(x^{(pr)} - y^{(pr)} - z^{(pr)})^2$ term, since neither depends on the parameters being maximized (recall that $\sigma^2$ is treated as a fixed, known constant), and where the equality in the last line follows from $\mathbb{E}_Q[y^{(pr)}] = \mu_{pr,Y}$ and $\mathbb{E}_Q[(y^{(pr)})^2] = \big(\mathbb{E}_Q[(y^{(pr)})^2] - \mathbb{E}_Q[y^{(pr)}]^2\big) + \mathbb{E}_Q[y^{(pr)}]^2 = \Sigma_{pr,YY} + \mu_{pr,Y}^2$ (and similarly for $\mathbb{E}_Q[z^{(pr)}]$ and $\mathbb{E}_Q[(z^{(pr)})^2]$).

Setting the derivatives with respect to the parameters $\mu_p$, $\nu_r$, $\sigma_p$, $\tau_r$ to zero:

\[
-\frac{1}{2\sigma_p^2}\sum_{r=1}^R (2\mu_p - 2\mu_{pr,Y}) = 0 \;\Longrightarrow\; \mu_p = \frac{1}{R}\sum_{r=1}^R \mu_{pr,Y} \tag{3}
\]
\[
-\frac{1}{2\tau_r^2}\sum_{p=1}^P (2\nu_r - 2\mu_{pr,Z}) = 0 \;\Longrightarrow\; \nu_r = \frac{1}{P}\sum_{p=1}^P \mu_{pr,Z} \tag{4}
\]
\[
\sum_{r=1}^R\left(-\frac{1}{\sigma_p} + \frac{1}{\sigma_p^3}\big(\Sigma_{pr,YY} + \mu_{pr,Y}^2 - 2\mu_{pr,Y}\mu_p + \mu_p^2\big)\right) = 0 \;\Longrightarrow\; \sigma_p^2 = \frac{1}{R}\sum_{r=1}^R \big(\Sigma_{pr,YY} + \mu_{pr,Y}^2 - 2\mu_{pr,Y}\mu_p + \mu_p^2\big) \tag{5}
\]
\[
\sum_{p=1}^P\left(-\frac{1}{\tau_r} + \frac{1}{\tau_r^3}\big(\Sigma_{pr,ZZ} + \mu_{pr,Z}^2 - 2\mu_{pr,Z}\nu_r + \nu_r^2\big)\right) = 0 \;\Longrightarrow\; \tau_r^2 = \frac{1}{P}\sum_{p=1}^P \big(\Sigma_{pr,ZZ} + \mu_{pr,Z}^2 - 2\mu_{pr,Z}\nu_r + \nu_r^2\big) \tag{6}
\]

Using the above results, we can restate our E and M steps in terms of actual computations:

(a) (E-step) For each p, r, compute $\mu_{pr}$ and $\Sigma_{pr}$ using equations (1) and (2).

(b) (M-step) Compute $\mu_p$, $\nu_r$, $\sigma_p^2$, $\tau_r^2$ using equations (3), (4), (5), (6).
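Putting the two steps together, a full EM iteration is a handful of array operations. The sketch below reuses the hypothetical e_step above and implements updates (3)-(6); the initialization heuristics in the commented loop are my own and not part of the solution, and in practice one would also monitor the marginal log-likelihood for convergence.

```python
def m_step(mu_Y, mu_Z, S_YY, S_ZZ):
    """Parameter updates (3)-(6), given the E-step moments (all arrays of shape (P, R))."""
    mu_new = mu_Y.mean(axis=1)        # (3): mu_p = (1/R) sum_r mu_{pr,Y}
    nu_new = mu_Z.mean(axis=0)        # (4): nu_r = (1/P) sum_p mu_{pr,Z}
    # (5)/(6): Sigma + mu_pr^2 - 2*mu_pr*m + m^2 = Sigma + (mu_pr - m)^2
    sigma2_p_new = (S_YY + (mu_Y - mu_new[:, None]) ** 2).mean(axis=1)
    tau2_r_new = (S_ZZ + (mu_Z - nu_new[None, :]) ** 2).mean(axis=0)
    return mu_new, nu_new, sigma2_p_new, tau2_r_new

# One possible EM loop on the simulated scores x (illustrative initialization):
# mu, nu = x.mean(axis=1), x.mean(axis=0) - x.mean()
# sigma2_p, tau2_r = np.ones(P), np.ones(R)
# for _ in range(100):
#     mu_Y, mu_Z, S_YY, S_ZZ, _ = e_step(x, mu, nu, sigma2_p, tau2_r, sigma2)
#     mu, nu, sigma2_p, tau2_r = m_step(mu_Y, mu_Z, S_YY, S_ZZ)
```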
3. [14 points] PCA

In class, we showed that PCA finds the "variance maximizing" directions onto which to project the data. In this problem, we find another interpretation of PCA. Suppose we are given a set of points $\{x^{(1)}, \ldots, x^{(m)}\}$. Let us assume that we have, as usual, preprocessed the data to have zero mean and unit variance in each coordinate. For a given unit-length vector $u$, let $f_u(x)$ be the projection of point $x$ onto the direction given by $u$. I.e., if $V = \{\alpha u : \alpha \in \mathbb{R}\}$, then

\[
f_u(x) = \arg\min_{v \in V} \|x - v\|^2.
\]

Show that the unit-length vector $u$ that minimizes the mean squared error between the projected points and the original points corresponds to the first principal component for the data. I.e., show that

\[
\arg\min_{u : u^T u = 1} \sum_{i=1}^m \left\|x^{(i)} - f_u(x^{(i)})\right\|_2^2
\]

gives the first principal component.

Remark. If we are asked to find a k-dimensional subspace onto which to project the data so as to minimize the sum of squares distance between the original data and their projections, then we should choose the k-dimensional subspace spanned by the first k principal components of the data. This problem shows that this result holds for the case of k = 1.
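The derivation itself is not included in this excerpt, but the claim is easy to sanity-check numerically. The following sketch, assuming NumPy and made-up toy data, compares the reconstruction error of the first principal component (top eigenvector of the empirical covariance) against many random unit directions; the principal component should attain the smallest error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, standardized to zero mean and unit variance per coordinate, as the problem assumes.
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.3, 0.0], [0.0, 1.0, 0.2], [0.0, 0.0, 0.5]])
X = (X - X.mean(axis=0)) / X.std(axis=0)

def reconstruction_error(u):
    """Sum of squared distances between points and their projections onto span{u}."""
    u = u / np.linalg.norm(u)
    proj = np.outer(X @ u, u)          # f_u(x^(i)) = (u^T x^(i)) u for each row of X
    return np.sum((X - proj) ** 2)

# First principal component = top eigenvector of the empirical covariance matrix.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / X.shape[0])
u_pca = eigvecs[:, -1]

# The PCA direction should beat (or tie) every randomly sampled unit direction.
candidates = rng.normal(size=(1000, 3))
best_random = min(reconstruction_error(u) for u in candidates)
print(reconstruction_error(u_pca) <= best_random + 1e-9)   # expected: True
```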