
Question


Peter Orbanz  porbanz@stat.columbia.edu
Benjamin Reddy  bmr2136@columbia.edu
Statistical Machine Learning (W4400) Spring 2013
http://stat.columbia.edu/porbanz/teaching/W4400/

Homework 3
Due: 29 October 2013

Homework submission: We will collect your homework at the beginning of class on the due date. If you cannot attend class that day, you can leave your solution in my postbox in the Department of Statistics, 10th floor SSW, at any time before then.

Problem 1 (Convex hulls and classifiers)

Suppose we are given two training sets $C_1 = \{x_{11}, \ldots, x_{1m}\}$ and $C_2 = \{x_{21}, \ldots, x_{2n}\}$ in $\mathbb{R}^d$. In class, we have derived the support vector machine based on the idea that separating the two classes means separating their convex hulls. In this problem, you should check that this idea is valid.

Homework problems:

1. Show that if the convex hulls of $C_1$ and $C_2$ intersect, the two sets cannot be linearly separable.

2. Show that the converse is also true: if the convex hulls do not intersect, the classes are linearly separable.

To do so, recall the definition of the convex hull: if $C = \{x_1, \ldots, x_m\}$ is a set of points in $\mathbb{R}^d$, its convex hull $\mathrm{conv}(C)$ is the set of all points $x \in \mathbb{R}^d$ which can be represented as "convex combinations" of points in $C$, that is, points which can be written as

$$ x = \sum_{i=1}^{m} \alpha_i x_i \qquad \text{where } \alpha_i \geq 0 \text{ and } \sum_{i=1}^{m} \alpha_i = 1. $$

The convex hulls of the two sets are

$$ \mathrm{conv}(C_1) = \Big\{\, x = \sum_{i=1}^{m} \alpha_i x_{1i} \;\Big|\; \alpha_i \geq 0,\ \sum_{i=1}^{m} \alpha_i = 1 \,\Big\} $$

and

$$ \mathrm{conv}(C_2) = \Big\{\, x = \sum_{j=1}^{n} \beta_j x_{2j} \;\Big|\; \beta_j \geq 0,\ \sum_{j=1}^{n} \beta_j = 1 \,\Big\}. $$

Recall also that $C_1$ and $C_2$ are linearly separable if and only if there is a vector $v_H \in \mathbb{R}^d$ and a scalar $c \in \mathbb{R}$ such that the affine hyperplane defined by $(v_H, c)$ classifies the sets correctly:

$$ \langle x, v_H \rangle - c \;\begin{cases} > 0 & x \in C_1 \\ < 0 & x \in C_2. \end{cases} $$

Problem 2 (Combining kernels)

It was already mentioned in class that kernel functions can be combined and modified in various ways to obtain new kernel functions. In this problem, we will convince ourselves that this is indeed true in two simple cases. Recall that a function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if there is some function $\phi : \mathbb{R}^d \to \mathcal{F}$, into some space $\mathcal{F}$ with scalar product $\langle\,\cdot\,,\,\cdot\,\rangle_{\mathcal{F}}$, such that $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$ for all $x, x' \in \mathbb{R}^d$.

Homework problems: Let $k_1(x, x')$ and $k_2(x, x')$ be kernels on $\mathbb{R}^d$.

1. Show that, for any positive real number $a$, $k(x, x') := a\,k_1(x, x')$ is a kernel.

2. Show that $k(x, x') := k_1(x, x')\,k_2(x, x')$ is a kernel.

3. Show that, for any positive integer $p$, $k(x, x') := k_1(x, x')^p$ is a kernel.

Problem 3 (Boosting)

The objective of this problem is to implement the AdaBoost algorithm. We will test the algorithm on handwritten digits from the USPS data set.

AdaBoost: Assume we are given a training sample $(x^{(i)}, y_i)$, $i = 1, \ldots, n$, where the $x^{(i)}$ are data values in $\mathbb{R}^d$ and $y_i \in \{-1, +1\}$ are class labels. Along with the training data, we provide the algorithm with a training routine for some classifier $c$ (the "weak learner"). Here is the AdaBoost algorithm for the two-class problem:

1. Initialize weights: $w_i = \frac{1}{n}$.

2. For $b = 1, \ldots, B$:

   (a) Train a weak learner $c_b$ on the weighted training data.

   (b) Compute error: $\epsilon_b := \frac{\sum_{i=1}^{n} w_i \mathbb{I}\{y_i \neq c_b(x^{(i)})\}}{\sum_{i=1}^{n} w_i}$.

   (c) Compute voting weights: $\alpha_b = \log\!\left(\frac{1 - \epsilon_b}{\epsilon_b}\right)$.

   (d) Recompute weights: $w_i = w_i \exp\!\left(\alpha_b \mathbb{I}\{y_i \neq c_b(x^{(i)})\}\right)$.

3. Return classifier $\hat{c}_B(x^{(i)}) = \mathrm{sgn}\!\left(\sum_{b=1}^{B} \alpha_b c_b(x^{(i)})\right)$.
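To make the loop concrete before turning to the weak learners, here is a minimal R sketch of steps 1-3; it is not part of the original assignment. The function name adaboost and the argument B are illustrative, and the sketch assumes weak-learner routines train(X, w, y) and classify(X, pars) with the interface required further below.

adaboost <- function(X, y, B) {
  # Minimal sketch of the boosting loop above (assumed helper names; see
  # the interface for train() and classify() specified below).
  n       <- ncol(X)            # training vectors are the columns of X
  w       <- rep(1 / n, n)      # step 1: initialize weights
  alpha   <- numeric(B)
  allPars <- vector("list", B)
  for (b in 1:B) {                                     # step 2
    allPars[[b]] <- train(X, w, y)                     # (a) train weak learner
    miss     <- as.numeric(y != classify(X, allPars[[b]]))
    eps      <- sum(w * miss) / sum(w)                 # (b) weighted error
    alpha[b] <- log((1 - eps) / eps)                   # (c) voting weight
    w        <- w * exp(alpha[b] * miss)               # (d) recompute weights
  }
  list(alpha = alpha, allPars = allPars)               # inputs for the aggregated classifier
}

The returned alpha and allPars are exactly the arguments expected by the aggregated classifier agg_class described below.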
Decision stumps: Recall that a stump classifier $c$ is defined by

$$ c(x \mid j, \theta, m) := \begin{cases} +m & x_j > \theta \\ -m & \text{otherwise.} \end{cases} \qquad (1) $$

Since the stump ignores all entries of $x$ except $x_j$, it is equivalent to a linear classifier defined by an affine hyperplane. The plane is orthogonal to the $j$th axis, with which it intersects at $x_j = \theta$. The orientation of the hyperplane is determined by $m \in \{-1, +1\}$. We will employ stumps as weak learners in our boosting algorithm. To train stumps on weighted data, use the learning rule

$$ (j^*, \theta^*) := \arg\min_{j, \theta} \frac{\sum_{i=1}^{n} w_i \mathbb{I}\{y_i \neq c(x^{(i)} \mid j, \theta, m)\}}{\sum_{i=1}^{n} w_i}. \qquad (2) $$

In the implementation of your training routine, first determine an optimal parameter $\theta_j^*$ for each dimension $j = 1, \ldots, d$, and then select the $j^*$ for which the cost term in (2) is minimal.

Homework problems:

1. Implement the AdaBoost algorithm in R. The algorithm requires two auxiliary functions, to train and evaluate the weak learner. We also need a third function which implements the resulting boosting classifier. We will use decision stumps as weak learners, but a good implementation of the boosting algorithm should permit you to easily plug in arbitrary weak learners. To make sure that is possible, please use function calls of the following form:

   - pars <- train(X, w, y) for the weak learner training routine, where X is a matrix the columns of which are the training vectors $x^{(1)}, \ldots, x^{(n)}$, and w and y are vectors containing the weights and class labels. The output pars is a list which contains the parameters specifying the resulting classifier. (In our case, pars will be the triplet $(j, \theta, m)$ which specifies the decision stump.)

   - label <- classify(X, pars) for the classification routine, which evaluates the weak learner on X using the parametrization pars.

   - A function c_hat <- agg_class(X, alpha, allPars) which evaluates the boosting classifier ("aggregated classifier") on X. The argument alpha denotes the vector of voting weights and allPars contains the parameters of all weak learners.

2. Implement the functions train and classify for decision stumps.

3. Run your algorithm on the USPS data (the digit data we used in Homework 2) and evaluate your results using cross validation. More precisely, your AdaBoost algorithm returns a classifier that is a combination of B weak learners. Since AdaBoost learns incrementally, we can evaluate it at every iteration b by considering the sum up to the b-th weak learner. At each iteration, perform 5-fold cross validation to estimate the training and test error of the current classifier (that is, the errors measured on the cross-validation training and test sets, respectively).

4. Plot the training error and the test error as a function of b.

Submission. Please make sure your solution contains the following:

- Your implementation for train, classify and agg_class.
- Your implementation of AdaBoost.
- Plots of your results (training error and cross-validated test error).
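For illustration only, and not as a model solution, here is one possible R sketch of the three routines for decision stumps with the interface required above. The exhaustive scan over the observed values of each coordinate as candidate thresholds is an assumption of this sketch; it is one reasonable way to realize the optimization in rule (2).

train <- function(X, w, y) {
  # Search over (j, theta, m) for the stump minimizing the weighted
  # misclassification cost of rule (2); columns of X are training vectors.
  d <- nrow(X)
  best <- list(cost = Inf)
  for (j in 1:d) {
    for (theta in unique(X[j, ])) {            # candidate thresholds in dimension j
      for (m in c(-1, 1)) {                    # both orientations of the stump
        pred <- ifelse(X[j, ] > theta, m, -m)  # stump classifier (1)
        cost <- sum(w * (pred != y)) / sum(w)
        if (cost < best$cost)
          best <- list(j = j, theta = theta, m = m, cost = cost)
      }
    }
  }
  list(j = best$j, theta = best$theta, m = best$m)   # pars = (j, theta, m)
}

classify <- function(X, pars) {
  # Evaluate the stump (1) on every column of X.
  ifelse(X[pars$j, ] > pars$theta, pars$m, -pars$m)
}

agg_class <- function(X, alpha, allPars) {
  # Weighted vote sgn(sum_b alpha_b c_b(x)) over all weak learners.
  votes <- sapply(seq_along(alpha), function(b) alpha[b] * classify(X, allPars[[b]]))
  sign(rowSums(as.matrix(votes)))
}

Evaluating the boosting classifier at iteration b, as required for the cross-validation plots, amounts to calling agg_class with alpha and allPars truncated to their first b entries.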
