
Question

Korea Advanced Institute of Science and Technology
Department of Electrical Engineering & Computer Science
EE531 Statistical Learning Theory, Spring 2016
Assignment I
Issued: Mar. 19, 2016    Due: Apr. 2, 2016

Policy: Group study is encouraged; however, the assignment you hand in must be your own work. Anyone suspected of copying others will be penalized. The homework will take a considerable amount of time, so start early. There are MATLAB problems in the assignment, and they may require some self-study.

1. A family of probability distributions called the exponential family has the special form

    $p(x|\eta) = h(x)\, g(\eta)\, \exp\{\eta^\top u(x)\}$    (1)

where $x$ may be a scalar or vector random variable and may be continuous or discrete. Here $\eta$ is called the natural parameter of the distribution and can be scalar or vector, and $u(x)$ is some function of $x$. The function $g(\eta)$ can be interpreted as the coefficient that ensures the distribution is normalized, and therefore it satisfies

    $g(\eta) \int h(x) \exp\{\eta^\top u(x)\}\, dx = 1$,

where the integration is replaced by summation if $x$ is a discrete variable.

Consider a single binary random variable $x \in \{0, 1\}$ whose probability of $x = 1$ is denoted by the parameter $\mu$, so that $p(x = 1|\mu) = \mu$ with $0 \le \mu \le 1$. The probability distribution over $x$, known as the Bernoulli distribution, can then be written as

    $p(x|\mu) = \mathrm{Bern}(x|\mu) = \mu^x (1 - \mu)^{1-x}$.

Expressing the right-hand side as the exponential of the logarithm, we have

    $p(x|\mu) = \exp\{x \ln \mu + (1 - x)\ln(1 - \mu)\} = (1 - \mu)\exp\!\left\{\ln\!\left(\frac{\mu}{1 - \mu}\right) x\right\}$.    (2)

Comparison with (1) allows us to identify

    $u(x) = x$, $h(x) = 1$, $g(\eta) = \dfrac{1}{1 + \exp(\eta)}$, where $\eta = \ln\!\left(\dfrac{\mu}{1 - \mu}\right)$.

For the following distributions, prove membership in the exponential family by finding $u(x)$, $h(x)$ and $g(\eta)$.

(i) Gaussian:
    $p(x|\mu, \sigma^2) = \mathcal{N}(x|\mu, \sigma^2) = \dfrac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{-\dfrac{(x - \mu)^2}{2\sigma^2}\right\}$

(ii) Beta: $\mu \in [0, 1]$, $a > 0$, $b > 0$,
    $p(\mu|a, b) = \mathrm{Beta}(\mu|a, b) = \dfrac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \mu^{a-1}(1 - \mu)^{b-1}$,
where $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du$ (properties: $\Gamma(1) = 1$, $\Gamma(z + 1) = z\Gamma(z)$, $\Gamma(n + 1) = n!$).

(iii) Multinomial ($\mu = [\mu_1, \dots, \mu_K]$):
    $p(m_1, m_2, \dots, m_K|\mu, N) = \mathrm{Mult}(m_1, m_2, \dots, m_K|\mu, N) = \dfrac{N!}{m_1!\, m_2! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}$,
where $\sum_{k=1}^{K} m_k = N$ and $0 \le \mu_k \le 1$, $\sum_k \mu_k = 1$.

2. Consider a binomial random variable $x$ given by

    $\mathrm{Bin}(m|N, \mu) = \dbinom{N}{m} \mu^m (1 - \mu)^{N-m}$

with prior distribution for $\mu$ given by the beta distribution

    $\mathrm{Beta}(\mu|\alpha, \beta) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1 - \mu)^{\beta-1}$ if $0 < \mu < 1$, and $0$ otherwise.

(i) Suppose we have observed $m$ occurrences of $x = 1$ and $l$ occurrences of $x = 0$. Show that the posterior mean value of $x$ lies between the prior mean and the maximum likelihood estimate of $\mu$. To do this, show that the posterior mean can be written as $\lambda$ times the prior mean plus $(1 - \lambda)$ times the maximum likelihood estimate, where $0 \le \lambda \le 1$. From this, what can you infer about the relationship between the prior, posterior and maximum likelihood solutions?

(ii) [MATLAB]
a. First, set the prior distribution for $\mu$. Plot the beta distribution with $\alpha = 2$, $\beta = 8$.
b. Now use the given data matrix to calculate the likelihood. Load data.mat; you will obtain a 10x10 data matrix. Each row is one data sequence of 10 binary values generated independently from a Bernoulli distribution with $\mu = 0.5$, so the likelihood of each data sequence can be calculated using the binomial distribution. Write code that reads each row, calculates the likelihood, and updates the posterior distribution iteratively (10 iterations in total). At each iteration, plot the prior (the posterior from the previous step), the likelihood, and the posterior distribution. Discuss the result. One possible sketch is given after this problem.
c. At each iteration, calculate the prior mean, the posterior mean and the ML estimate of $\mu$. Is the result compatible with (i)? You can check whether $\lambda$ lies between 0 and 1.
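The following MATLAB sketch shows one possible way to carry out the sequential update described in 2(ii); it is a starting point, not the graded solution. It assumes data.mat stores the 10x10 binary matrix in a variable named data (the variable name is an assumption), and it exploits Beta-Binomial conjugacy so each posterior is again a Beta density.

```matlab
% Sketch of problem 2(ii): sequential Beta-Binomial updating.
% Assumes data.mat holds a 10x10 binary matrix in a variable called "data"
% (the variable name is a guess; adjust it to whatever the file contains).
load('data.mat');          % -> data (10x10, entries in {0,1})

a = 2; b = 8;              % prior Beta(alpha = 2, beta = 8)
mu = linspace(0, 1, 200);  % grid for plotting densities over mu

for t = 1:size(data, 1)
    row = data(t, :);
    m = sum(row == 1);                     % successes in this sequence
    l = sum(row == 0);                     % failures
    N = m + l;

    prior      = mu.^(a-1) .* (1-mu).^(b-1) / beta(a, b);
    likelihood = nchoosek(N, m) * mu.^m .* (1-mu).^(N-m);

    % Conjugate update: posterior is Beta(a + m, b + l)
    a_post = a + m;  b_post = b + l;
    posterior = mu.^(a_post-1) .* (1-mu).^(b_post-1) / beta(a_post, b_post);

    subplot(2, 5, t);
    plot(mu, prior, mu, likelihood, mu, posterior);
    title(sprintf('iteration %d', t));

    % Quantities asked for in 2(ii)c
    prior_mean = a / (a + b);
    post_mean  = a_post / (a_post + b_post);
    mu_ML      = m / N;
    fprintf('iter %2d: prior mean %.3f  ML %.3f  posterior mean %.3f\n', ...
            t, prior_mean, mu_ML, post_mean);

    a = a_post;  b = b_post;  % yesterday's posterior becomes today's prior
end
```

The printed values give a direct check of part (i): at every iteration the posterior mean should fall between the prior mean and the ML estimate $m/N$.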
3. In classification, the loss function we usually want to minimize is the 0/1 loss

    $l(f(x), y) = \mathbf{1}\{f(x) \neq y\}$,

where $f(x), y \in \{0, 1\}$ (i.e. binary classification). In this problem we consider the effect of using an asymmetric loss function:

    $l_{\alpha,\beta}(f(x), y) = \alpha\, \mathbf{1}\{f(x) = 1, y = 0\} + \beta\, \mathbf{1}\{f(x) = 0, y = 1\}$.

Under this loss function, the two types of errors receive different weights, determined by $\alpha, \beta > 0$.

(i) (10 pts) Determine the Bayes optimal classifier $f^\ast(x)$, i.e. the classifier that achieves minimum Bayes risk assuming $P(x, y)$ is known, for the loss $l_{\alpha,\beta}$ where $\alpha, \beta > 0$.

(ii) (10 pts) Suppose that the class $y = 0$ is extremely uncommon (i.e. $P(y = 0)$ is small). This means the classifier $f(x) = 1$ for all $x$ will have a good (low) risk. To overcome this imbalance, we may try to put the two classes on an even footing by considering the risk

    $R = P(f(x) = 1 | y = 0) + P(f(x) = 0 | y = 1)$.

Show how minimizing this risk is equivalent to choosing certain values of $\alpha, \beta$ and minimizing the risk under the loss function $l_{\alpha,\beta}$.

(iii) (10 pts) Consider the following classification problem. We first choose the label $Y \sim \mathrm{Bernoulli}(\tfrac{1}{2})$, which is 1 with probability $\tfrac{1}{2}$. If $Y = 1$, then $X \sim \mathrm{Bernoulli}(p)$; otherwise, $X \sim \mathrm{Bernoulli}(q)$. Assume that $p > q$. What is the Bayes optimal classifier, and what is its risk?

4. Consider the spam-mail example that was covered in lecture. For your benefit, it is restated with slightly different notation but in essentially the same way. We are given labeled training data $D = \{X_i, y_i\}$, where the class label $y_i = c \in \{0, 1\}$ with $y_i = 1$ for spam mail and $y_i = 0$ for ham mail, and $X_i = [x_{i1}, x_{i2}, \dots, x_{iK}]$ is an indicator vector over all the words in the dictionary, i.e. $x_{ij} = 1$ if the $j$-th word exists in the mail and $x_{ij} = 0$ if it does not. There are $K$ words in the dictionary. The probability of the $i$-th word in class $c$ is denoted $\theta_{i,c}$ (to be clear, class refers to either spam ($c = 1$) or ham ($c = 0$)), and it was estimated (Bayes' estimate, the conditional mean) from the training data $D$ as

    $\hat{\theta}_{i,c} = \dfrac{n_{i,c} + 1}{n_c + 2}$,    (3)

where $n_c$ is the count of class-$c$ mails and $n_{i,c}$ is the count of the $i$-th word in class $c$. Then, using the 0-1 loss and minimizing the Bayes risk (mean risk) defined as

    $E[L(\hat{y}, y)] = \sum_{\hat{y}} \sum_{y} p(\hat{y}, y) L(\hat{y}, y)$,    (4)

where

    $L(\hat{y}, y) = 0$ if $\hat{y} = y$, and $1$ if $\hat{y} \neq y$,    (5)

the decision rule (Bayes classifier)

    $\hat{y} = \arg\max_{y} p(y|X)$    (6)

was derived.

(i) What assumptions are made in the estimation of Eq. (3)? Be exact in your statement (e.g. something has this probability distribution and the estimate is the (...) of the distribution).

(ii) Derive the decision rule Eq. (6) from Eq. (4).

(iii) Write the decision rule in terms of $p(X|y)$ and $p(y)$. Then assume that the occurrences of words are independent of one another and write the decision rule in terms of $p(x_i|y)$ and $p(y)$, where $X = [x_1, \dots, x_K]$.

(iv) Compute the likelihood as

    $p(X|y, D) = \int p(X, \theta|y, D)\, d\theta$    (7)
    $\phantom{p(X|y, D)} = \int p(X|\theta, y, D)\, p(\theta|D)\, d\theta$    (8)
    $\phantom{p(X|y, D)} = \int p(X|\theta, y)\, p(\theta|D)\, d\theta$,    (9)

where it is assumed that $p(X|\theta, y, D) = p(X|\theta, y)$. Assume that the occurrences of words are independent of one another. You may need the following: $\mathrm{Beta}(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \theta^{a-1}(1 - \theta)^{b-1}$ and $\Gamma(n + 1) = n\Gamma(n)$.
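As a concrete illustration of the smoothed estimator in Eq. (3) and the word-independence decision rule asked for in 4(iii), here is a small MATLAB sketch. The variables X, y and the synthetic data are hypothetical placeholders (the assignment supplies no data for this problem), so treat it only as a sketch of the computation, not as the expected derivation.

```matlab
% Sketch of the naive Bayes classifier in problem 4.
% X: n-by-K binary word-indicator matrix, y: n-by-1 labels in {0,1}.
% Both are hypothetical placeholders; fake data is generated just so the
% sketch runs end to end.
n = 200; K = 50;
X = double(rand(n, K) < 0.3);
y = double(rand(n, 1) < 0.4);

theta = zeros(K, 2);               % theta(i, c+1) = p(x_i = 1 | y = c)
prior = zeros(1, 2);
for c = 0:1
    nc  = sum(y == c);             % count of class-c mails
    nic = sum(X(y == c, :), 1);    % count of word i in class c
    theta(:, c+1) = (nic + 1)' / (nc + 2);   % Bayes estimate, Eq. (3)
    prior(c+1) = nc / n;
end

% Decision rule of Eq. (6) under word independence, evaluated in log space:
% yhat = argmax_c  log p(y=c) + sum_i [x_i log theta_ic + (1-x_i) log(1-theta_ic)]
x = X(1, :);                       % classify the first mail as a demo
score = zeros(1, 2);
for c = 0:1
    score(c+1) = log(prior(c+1)) + ...
                 x * log(theta(:, c+1)) + (1 - x) * log(1 - theta(:, c+1));
end
[~, idx] = max(score);
yhat = idx - 1;
fprintf('predicted class: %d\n', yhat);
```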
5. Taxicab (tramcar) problem. Suppose you arrive in a new city and see a taxi numbered 100. How many taxis are there in this city? Let us assume taxis are numbered sequentially as integers starting from 0, up to some unknown upper bound $\theta$. (We number taxis from 0 for simplicity; we could also count from 1 without changing the analysis.) Hence the likelihood function is $p(x) = U(0, \theta)$, the uniform distribution. The goal is to estimate $\theta$. The maximum likelihood estimate for $\mathrm{Unif}(0, \theta)$ is $\hat{\theta} = \max(D)$, but this is unsuitable for predicting future data since it puts zero probability mass outside the training data. The conjugate prior of the uniform distribution is the Pareto distribution, $p(\theta) = \mathrm{Pareto}(\theta|b, K)$. Given a Pareto prior, the joint distribution of $\theta$ and $D = (x_1, \dots, x_N)$ is

    $p(D, \theta) = \dfrac{K b^K}{\theta^{N+K+1}}\, \mathbf{1}(\theta \ge \max(D))$.    (10)

Let $m = \max(D)$. The evidence (the probability that all $N$ samples came from the same uniform distribution) is

    $p(D) = \displaystyle\int_{\max(m, b)}^{\infty} \dfrac{K b^K}{\theta^{N+K+1}}\, d\theta$    (11)
    $\phantom{p(D)} = \dfrac{K}{(N+K)\, b^N}$ if $m \le b$, and $\dfrac{K b^K}{(N+K)\, m^{N+K}}$ if $m > b$.    (12)

(i) Suppose we see one taxi numbered 100, so $D = \{100\}$, $m = 100$, $N = 1$. Using an (improper) non-informative prior on $\theta$ of the form $p(\theta) = \mathrm{Pareto}(\theta|0, 0) \propto 1/\theta$, what is the posterior $p(\theta|D)$?

(ii) Compute the posterior mean, mode and median number of taxis in the city, if these quantities exist.

(iii) Rather than computing a point estimate of the number of taxis, we can compute the predictive density over the next taxicab number using

    $p(D'|D, \alpha) = \int p(D'|\theta)\, p(\theta|D, \alpha)\, d\theta = p(D'|\beta)$,    (13)

where $\alpha = (b, K)$ are the hyper-parameters and $\beta = (c, N+K)$ are the updated hyper-parameters. Now consider the case $D = \{m\}$ and $D' = \{x\}$. Using equation (12), write down an expression for

    $p(x|D, \alpha)$.    (14)

As above, use the non-informative prior $b = K = 0$.

(iv) Use the predictive density formula to compute the probability that the next taxi you see (say, the next day) has number 100, 50, or 150, i.e. compute $p(x = 100|D, \alpha)$, $p(x = 50|D, \alpha)$, and $p(x = 150|D, \alpha)$.

(v) Briefly describe (1-2 sentences) some ways we might make the model more accurate at prediction.

6. Posterior predictive for the Dirichlet-multinomial.

(i) Suppose we compute the empirical distribution over the letters of the Roman alphabet plus the space character (a distribution over 27 values) from 2000 samples. Suppose we see the letter "e" 260 times. What is $p(x_{2001} = e|D)$, if we assume $\mathrm{Dir}(\alpha_1, \dots, \alpha_{27})$ with $\alpha_k = 10$ for all $k$?

(ii) Suppose that in the 2000 samples we saw "e" 260 times, "a" 100 times, and "p" 87 times. What is $p(x_{2001} = p, x_{2002} = a|D)$, if we assume $\mathrm{Dir}(\alpha_1, \dots, \alpha_{27})$ with $\alpha_k = 10$ for all $k$? Show your work.
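For problem 6, the standard Dirichlet-multinomial posterior predictive is $p(x_{N+1} = k \mid D) = (N_k + \alpha_k)/(N + \sum_j \alpha_j)$. The MATLAB lines below simply evaluate that formula for the counts given; they are a numerical sanity check under that standard result, not a substitute for showing your work.

```matlab
% Sketch for problem 6: posterior predictive under a Dir(10,...,10) prior
% over 27 symbols, using the standard Dirichlet-multinomial result
%   p(x_{N+1} = k | D) = (N_k + alpha_k) / (N + sum_j alpha_j).
alpha_k = 10; Ksym = 27;        % symmetric Dirichlet hyperparameter, 27 symbols
N = 2000;                       % total observed characters

p_e = (260 + alpha_k) / (N + Ksym * alpha_k);       % part (i), letter 'e'
fprintf('p(x_2001 = e | D) = %.4f\n', p_e);

% Part (ii): the two predictives chain, with the second one conditioned on
% the first draw (so the total count grows by one).
p_pa = (87 + alpha_k)  / (N     + Ksym * alpha_k) * ...
       (100 + alpha_k) / (N + 1 + Ksym * alpha_k);
fprintf('p(x_2001 = p, x_2002 = a | D) = %.5f\n', p_pa);
```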
7. BIC for Gaussians. The Bayesian information criterion (BIC) is a penalized log-likelihood that can be used for model selection. It is defined as

    $\mathrm{BIC} = \log p(D|\hat{\theta}_{ML}) - \dfrac{d}{2} \log N$,    (15)

where $d$ is the number of free parameters in the model and $N$ is the number of samples. In this question, we will see how to use this to choose between a full-covariance Gaussian and a Gaussian with a diagonal covariance. Obviously the full-covariance Gaussian has higher likelihood, but it may not be "worth" the extra parameters if the improvement over a diagonal covariance matrix is too small, so we use the BIC score to choose the model. We can write

    $\log p(D|\hat{\Sigma}, \hat{\mu}) = -\dfrac{N}{2} \mathrm{tr}(\hat{\Sigma}^{-1} \hat{S}) - \dfrac{N}{2} \log|\hat{\Sigma}|$    (16)

    $\hat{S} = \dfrac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})^\top$    (17)

where $\hat{S}$ is the scatter matrix (empirical covariance), the trace of a matrix is the sum of its diagonal elements, and we have used the trace trick.

(i) Derive the BIC score for a Gaussian in $D$ dimensions with a full covariance matrix. Simplify your answer as much as possible, exploiting the form of the MLE. Be sure to specify the number of free parameters $d$.

(ii) Derive the BIC score for a Gaussian in $D$ dimensions with a diagonal covariance matrix. Be sure to specify the number of free parameters $d$. Hint: for the diagonal case, the ML estimate of $\Sigma$ is the same as $\hat{\Sigma}_{ML}$ except that the off-diagonal terms are zero:

    $\hat{\Sigma}_{\mathrm{diag}} = \mathrm{diag}(\hat{\Sigma}_{ML}(1, 1), \dots, \hat{\Sigma}_{ML}(D, D))$.    (18)
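To make Eqs. (15)-(17) concrete, the sketch below scores a full-covariance and a diagonal-covariance Gaussian by BIC on synthetic data. The data, dimensionality and free-parameter counts used ($d = D + D(D+1)/2$ for the full model, $d = 2D$ for the diagonal one) are illustrative assumptions; parts (i)-(ii) ask you to derive and justify these yourself. Terms common to both models are dropped, as in Eq. (16).

```matlab
% Sketch for problem 7: BIC comparison of full vs. diagonal covariance Gaussians.
% Synthetic data only; sample size and dimension are arbitrary choices.
rng(0);
N = 500; D = 3;
X = randn(N, D) * [1 0.8 0; 0 1 0; 0 0 2];   % correlated Gaussian samples

Xc = bsxfun(@minus, X, mean(X, 1));
S  = (Xc' * Xc) / N;                 % scatter matrix (MLE covariance), Eq. (17)

% Maximized log-likelihoods per Eq. (16); at the MLE, tr(Sigma^{-1} S) = D
% for both the full and the diagonal model, and constants are dropped.
ll_full = -N/2 * (log(det(S))       + D);
ll_diag = -N/2 * (sum(log(diag(S))) + D);

d_full = D + D*(D+1)/2;              % mean plus symmetric covariance entries
d_diag = 2*D;                        % mean plus variances
bic_full = ll_full - d_full/2 * log(N);
bic_diag = ll_diag - d_diag/2 * log(N);
fprintf('BIC full: %.1f   BIC diagonal: %.1f\n', bic_full, bic_diag);
```

With strongly correlated data the full model should win despite its penalty; with nearly independent coordinates the diagonal model's smaller $d$ tips the score the other way, which is the trade-off the problem is probing.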


