Question
Homework 2, due Tuesday March 1
COMS 4721 Spring 2016

⋆ ⋆ ⋆ Do two of { Problem 1, Problem 2, Problem 3 }, and also Problem 4. ⋆ ⋆ ⋆

Problem 1 (Margins; 10 points). The Perceptron convergence theorem is often stated in the following way. If $R := \max_{(x,y) \in S} \|x\|_2$, and $u^\star \in \mathbb{R}^d$ satisfies $\|u^\star\|_2 = 1$ and $y \langle u^\star, x \rangle \ge \gamma$ for all $(x, y) \in S$, for some $\gamma > 0$, then Perceptron halts after at most $(R/\gamma)^2$ iterations.

Explain why this theorem is the same as what was presented in lecture. Specifically, let $w^\star$ be the vector from the Perceptron convergence theorem as given in lecture, with length $\|w^\star\|_2$ as small as possible, and let $u^\star$ be the vector from the present version of the theorem such that $\gamma$ is as large as possible. What is the relationship between $w^\star$ and $u^\star$, and between $\|w^\star\|_2$ and $\gamma$? What is the shortest distance from a data point $x$ from $S$ to the (homogeneous) hyperplane with normal vector $w^\star$? Give succinct (but precise) explanations for your answers.
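As a hedged worked sketch of the equivalence (this assumes the lecture's version requires $y \langle w^\star, x \rangle \ge 1$ for all $(x,y) \in S$ with a mistake bound of $R^2 \|w^\star\|_2^2$; check that against your own notes):

```latex
% Assumption: the lecture version uses the margin condition
% y <w*, x> >= 1 and the mistake bound R^2 ||w*||_2^2.
% Normalizing w* gives the unit vector of the present version:
u^\star = \frac{w^\star}{\|w^\star\|_2},
\qquad
y\langle u^\star, x\rangle
  = \frac{y\langle w^\star, x\rangle}{\|w^\star\|_2}
  \;\ge\; \frac{1}{\|w^\star\|_2} \;=:\; \gamma,
\qquad\text{so}\qquad
\left(\frac{R}{\gamma}\right)^2 = R^2 \,\|w^\star\|_2^2 .
```

Under this reading, the shortest distance from a point $x$ in $S$ to the hyperplane $\{z : \langle w^\star, z \rangle = 0\}$ is $|\langle w^\star, x \rangle| / \|w^\star\|_2 \ge 1/\|w^\star\|_2 = \gamma$.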
Problem 2 (Features; 10 points). It is common to pre-process the feature vectors in $\mathbb{R}^d$ before passing them to a learning algorithm. Two simple ways to pre-process are as follows.

Centering: Subtract the mean $\mu := \frac{1}{|S|} \sum_{(x,y) \in S} x$ (of the training data) from every feature vector: $x \mapsto x - \mu$.

Standardization: Perform centering, and then divide every feature by the per-feature standard deviation $\sigma_i := \sqrt{\frac{1}{|S|} \sum_{(x,y) \in S} (x_i - \mu_i)^2}$: that is, $(x_1, x_2, \ldots, x_d) \mapsto \bigl( \frac{x_1 - \mu_1}{\sigma_1}, \frac{x_2 - \mu_2}{\sigma_2}, \ldots, \frac{x_d - \mu_d}{\sigma_d} \bigr)$.

(The same transformations should be applied to all feature vectors you encounter, including any future test points.)

For each of the following learning algorithms, and each of the above pre-processing transformations, explain whether or not the transformation can affect the resulting learned classifier.

(a) The classifier based on the generative model where class conditional distributions are multivariate Gaussian distributions with a fixed covariance equal to the identity matrix $I$. Assume MLE is used for parameter estimation.

(b) The 1-NN classifier using Euclidean distance.

(c) The greedy decision tree learning algorithm with axis-aligned splits. (For concreteness, assume Gini index is used as the uncertainty measure, and the algorithm stops after 20 leaf nodes.)

(d) Empirical Risk Minimization: the (intractable) algorithm that finds the linear classifier (both the weight vector and threshold) that has the smallest training error rate.

You should assume the following: (i) the per-feature standard deviations are never zero; and (ii) there are never any "ties" whenever you do an arg max or an arg min. Also ignore computational and numerical precision issues.
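A minimal NumPy sketch of the two transforms (the function names are my own; the key point, per the parenthetical above, is that $\mu$ and $\sigma$ are computed on the training set only and then reused on all future points):

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on training data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)  # assumed nonzero, per assumption (i)
    return mu, sigma

def center(X, mu):
    """Centering: x -> x - mu."""
    return X - mu

def standardize(X, mu, sigma):
    """Standardization: x_i -> (x_i - mu_i) / sigma_i."""
    return (X - mu) / sigma

# Usage: fit on the training data, then apply the *same* transform everywhere.
# mu, sigma = fit_standardizer(X_train)
# X_train_std = standardize(X_train, mu, sigma)
# X_test_std  = standardize(X_test, mu, sigma)
```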
Problem 3 (Covariance matrices; 10 points). Let $X$ be a mean-zero random vector in $\mathbb{R}^d$ (so $\mathbb{E}(X) = 0$). Let $\Sigma := \mathbb{E}(XX^\top)$ be the covariance matrix of $X$, and suppose its eigenvalues are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Let $\sigma^2 > 0$ be a positive number.

(a) What are the eigenvalues of $\Sigma + \sigma^2 I$?

(b) What are the eigenvalues of $(\Sigma + \sigma^2 I)^{-1}$?

(c) Suppose $R \in \mathbb{R}^{d \times d}$ is an invertible symmetric matrix that satisfies $RR = \Sigma + \sigma^2 I$. Let $v_1 \in \mathbb{R}^d$ be an eigenvector of $\Sigma$ corresponding to its largest eigenvalue $\lambda_1$. What is the variance of the random variable $v_1^\top R^{-1} X$? Explain your answer succinctly (but precisely).

Hint: The answers can be given in terms of $\sigma^2$ and the eigenvalues of $\Sigma$. (A numerical sanity check is sketched after Problem 4 below.)

Problem 4 (Linear classifiers; 40 points). Download the spam data set spam.mat from Courseworks. This is the data set described in the ESL text for a binary classification problem of predicting whether an e-mail is spam or not. The training data and test data are in the usual format. You can read about the features in ESL (Chapter 9.1, pages 300-301).

Write a MATLAB or Python script that, using only the training data, tries out six different methods for learning linear classifiers, and ultimately selects (via ten-fold cross validation, sketched below) and uses one of these methods. (Think of the learning method itself as a "hyperparameter"!) The six methods to try are the following.

1. Averaged-Perceptron with 64 passes through the data. Averaged-Perceptron is like Voted-Perceptron (which uses Online Perceptron), except instead of forming the final classifier as a majority vote over the various linear classifiers, you simply form a single linear classifier by averaging the weight vectors and thresholds of the various linear classifiers. Important: Before each pass of Online Perceptron, you should randomly shuffle the order of the training examples. (A sketch of this method appears below.)

2. Logistic regression model with MLE for parameter estimation.

3. Generative model classifier where class conditional distributions are multivariate Gaussian distributions with a shared covariance matrix for all classes. Use MLE for parameter estimation.

4. Same as above, except with arbitrary Gaussians (i.e., each class with its own covariance matrix).

5 & 6. Averaged-Perceptron and logistic regression as above, with feature map $\varphi \colon \mathbb{R}^{57} \to \mathbb{R}^{1710}$ given by $\varphi(x) := (x_1, x_2, \ldots, x_{57},\ x_1^2, x_2^2, \ldots, x_{57}^2,\ x_1 x_2, x_1 x_3, \ldots, x_1 x_{57}, \ldots, x_{56} x_{57})$. No need to use the kernel trick here; a sketch of the feature map appears below.

Note that we want to learn linear classifiers that are possibly non-homogeneous: so you should learn a weight vector $w \in \mathbb{R}^{57}$ (or $\mathbb{R}^{1710}$) and a threshold value $t \in \mathbb{R}$.

You should write your own code for implementing Online Perceptron, Averaged-Perceptron, and cross-validation (put these in separate files). The code (and the driver script that runs everything) should be easy to read, commented as necessary. For the other learning methods, you can use existing library implementations, but you are responsible for the correctness of these implementations (as per the specifications from the course lectures).

Report the cross-validation error rates for all methods, the training error rate of the classifier learned by the selected method (and state which method was chosen), and the test error rate for the learned classifier. Submit all code and scripts you write yourself, and give references to any existing software you use for the other learning algorithms.
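For Problem 3, a quick numerical sanity check is easy to write. This is a hedged illustration only: it picks the symmetric PSD square root as one matrix $R$ satisfying $RR = \Sigma + \sigma^2 I$, and the value $\lambda_1 / (\lambda_1 + \sigma^2)$ checked in part (c) is my own computation, not an official answer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma2 = 4, 0.5

# Build a random symmetric PSD covariance matrix Sigma.
A = rng.standard_normal((d, d))
Sigma = A @ A.T

lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]  # eigenvalues, descending

# (a) Eigenvalues of Sigma + sigma^2 I are lambda_i + sigma^2.
shifted = np.sort(np.linalg.eigvalsh(Sigma + sigma2 * np.eye(d)))[::-1]
print(np.allclose(shifted, lam + sigma2))  # True

# (b) Eigenvalues of (Sigma + sigma^2 I)^{-1} are 1 / (lambda_i + sigma^2).
inv_eigs = np.linalg.eigvalsh(np.linalg.inv(Sigma + sigma2 * np.eye(d)))
print(np.allclose(np.sort(inv_eigs), np.sort(1.0 / (lam + sigma2))))  # True

# (c) With R the symmetric square root of Sigma + sigma^2 I,
# Var(v1^T R^{-1} X) = v1^T R^{-1} Sigma R^{-1} v1, and since v1 is an
# eigenvector of Sigma (hence of R^{-1}), this is lambda_1 / (lambda_1 + sigma2).
w, V = np.linalg.eigh(Sigma + sigma2 * np.eye(d))
R_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T    # inverse symmetric square root
_, U = np.linalg.eigh(Sigma)
v1 = U[:, -1]                                  # eigenvector for largest eigenvalue
var = v1 @ R_inv @ Sigma @ R_inv @ v1
print(np.isclose(var, lam[0] / (lam[0] + sigma2)))  # True
```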
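For method 1, here is a hedged sketch of Online Perceptron and Averaged-Perceptron, with labels assumed to be in $\{-1, +1\}$ and the classifier taken as $x \mapsto \mathrm{sign}(\langle w, x \rangle - t)$. It averages the $(w, t)$ snapshot after every example; confirm against the lecture's exact definition of which intermediate classifiers get averaged.

```python
import numpy as np

def online_perceptron_pass(X, y, w, t, rng):
    """One pass of Online Perceptron over shuffled training data.

    Yields the (w, t) snapshot after each example so the caller can
    average them. Labels y[i] are assumed to be in {-1, +1}.
    """
    order = rng.permutation(len(y))        # shuffle before each pass
    for i in order:
        if y[i] * (X[i] @ w - t) <= 0:     # mistake: update
            w = w + y[i] * X[i]
            t = t - y[i]
        yield w, t

def averaged_perceptron(X, y, passes=64, seed=0):
    """Averaged-Perceptron: average the weight vectors and thresholds of the
    intermediate classifiers instead of majority-voting (cf. Voted-Perceptron)."""
    rng = np.random.default_rng(seed)
    w, t = np.zeros(X.shape[1]), 0.0
    w_sum, t_sum, n = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(passes):
        for w, t in online_perceptron_pass(X, y, w, t, rng):
            w_sum += w
            t_sum += t
            n += 1
    return w_sum / n, t_sum / n

# Prediction: sign(X @ w_avg - t_avg).
```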
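For methods 5 & 6, the feature map $\varphi$ can be materialized explicitly since no kernel trick is required. A sketch, assuming the input is an $n \times 57$ array; note that $57 + 57 + \binom{57}{2} = 57 + 57 + 1596 = 1710$, matching $\varphi$'s codomain:

```python
import numpy as np
from itertools import combinations

def quadratic_features(X):
    """Map each row x in R^57 to
    (x_1,...,x_57, x_1^2,...,x_57^2, x_1 x_2, x_1 x_3, ..., x_56 x_57)."""
    n, d = X.shape
    pairs = list(combinations(range(d), 2))   # C(d,2) index pairs, lexicographic
    cross = np.stack([X[:, i] * X[:, j] for i, j in pairs], axis=1)
    return np.hstack([X, X**2, cross])        # d + d + C(d,2) columns

# For d = 57 this yields n x 1710 arrays for the Averaged-Perceptron and
# logistic regression runs of methods 5 & 6.
```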
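Finally, a hedged sketch of the ten-fold cross-validation selection. Here `method` is assumed to be a function that takes training arrays and returns a predictor; the fold-splitting convention is one reasonable choice, not a prescribed one.

```python
import numpy as np

def cross_validation_error(method, X, y, k=10, seed=0):
    """Estimate the error rate of `method` by k-fold cross validation.

    `method(X_train, y_train)` is assumed to return a function
    `predict(X) -> labels`. Returns the average held-out error rate.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)             # k roughly equal folds
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        predict = method(X[train_idx], y[train_idx])
        errors.append(np.mean(predict(X[test_idx]) != y[test_idx]))
    return float(np.mean(errors))

# Selection: compute the CV error for each of the six methods, pick the one
# with the smallest error, retrain it on the full training set, then report
# its training and test error rates as the problem asks.
```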