Questions and Answers of Pattern Recognition And Machine Learning
13.2 Stability analysis for L2-regularized conditional Maxent. (a) Give an upper bound on the stability of L2-regularized conditional Maxent in terms of the sample size and … (Hint: use the …)
13.1 Extension to Bregman divergences. (a) Show how conditional Maxent models can be extended by using arbitrary Bregman divergences instead of the (unnormalized) relative entropy. (b) Prove a duality …
12.5 L2-regularization. Let w be the solution of Maxent with a norm-2 squared regularization. (a) Prove the following inequality: ‖w‖₂ ≤ 2r… (Hint: you could compare the values of the objective …)
12.4 Extension to Bregman divergences. Derive theoretical guarantees for the extensions discussed in section 12.8. What additional property is needed for the Bregman divergence so that your learning …
12.3 Dual of norm-2 squared regularized Maxent. Derive the dual formulation of the norm-2 squared regularized Maxent optimization shown in equation (12.16).
12.2 Lagrange duality. Derive the dual problem of the Maxent problem and justify it carefully in the case of the stricter constraint of positivity for the distribution p: p(x) > 0 for all x ∈ X.
12.1 Convexity. Prove directly that the function w ↦ log Z(w) = log(Σ_{x∈X} e^{w(x)}) is convex (Hint: compute its Hessian).
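For exercise 12.1, one way to carry out the hint, treating w as the vector (w(x))_{x∈X} (a sketch; if the book's Maxent setting writes Z(w) in terms of a feature map, the computation is analogous): the Hessian of log Z is a covariance matrix and therefore positive semidefinite.

```latex
% Writing p_w(x) = e^{w(x)} / Z(w) with Z(w) = \sum_{x \in X} e^{w(x)}:
\frac{\partial \log Z(w)}{\partial w(x)} = p_w(x),
\qquad
\frac{\partial^2 \log Z(w)}{\partial w(x)\,\partial w(x')}
  = p_w(x)\,\mathbf{1}_{x = x'} - p_w(x)\, p_w(x').
% Hence \nabla^2 \log Z(w) = \operatorname{diag}(p_w) - p_w p_w^\top, which satisfies
% v^\top \nabla^2 \log Z(w)\, v = \mathrm{Var}_{X \sim p_w}[v(X)] \ge 0, so \log Z is convex.
```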
11.11 On-line quadratic SVR. Derive an on-line algorithm for the quadratic SVR algorithm (provide the full pseudocode).
11.10 On-line Lasso. Use the formulation (11.33) of the optimization problem of Lasso and stochastic gradient descent (see section 8.3.1) to show that the problem can be solved using the on-line …
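The formulation (11.33) referenced in exercise 11.10 is not reproduced on this page. As a rough illustration of the stochastic-gradient route the exercise points to, here is a minimal sketch of stochastic subgradient descent on a standard Lasso objective (1/m) Σ_i (w·x_i − y_i)² + λ‖w‖₁; the function name, step-size schedule, and normalization are assumptions, not the book's.

```python
import numpy as np

def lasso_sgd(X, y, lam=0.1, epochs=50, eta0=0.1, seed=0):
    """Stochastic subgradient descent on
    (1/m) * sum_i (w.x_i - y_i)^2 + lam * ||w||_1  (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            t += 1
            eta = eta0 / np.sqrt(t)                                   # decaying step size
            grad = 2 * (w @ X[i] - y[i]) * X[i] + lam * np.sign(w)    # subgradient
            w -= eta * grad
    return w

# Toy usage: recover a sparse weight vector from noisy linear measurements.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [1.0, -2.0, 0.5]
y = X @ w_true + 0.01 * rng.normal(size=200)
print(np.round(lasso_sgd(X, y, lam=0.05), 2))
```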
11.9 Leave-one-out error. In general, the computation of the leave-one-out error can be very costly since, for a sample of size m, it requires training the algorithm m times. The objective of this …
11.8 Optimal kernel matrix. Suppose that, in addition to optimizing the dual variables α ∈ R^m, as in (11.16), we also wish to optimize over the entries of the PDS kernel matrix K ∈ R^{m×m}: min_{K⪰0} max_α …
11.7 SVR dual formulations. Give a detailed and carefully justified derivation of the dual formulations of the SVR algorithm both for the ε-insensitive loss and the quadratic ε-insensitive loss.
11.6 SVR and squared loss. Assuming that 2r ≤ 1, use theorem 11.13 to derive a generalization bound for the squared loss.
11.5 Huber loss. Derive the primal and dual optimization problem used to solve the SVR problem with the Huber loss: L_c(ξ_i) = ½ξ_i² if |ξ_i| ≤ c, and c|ξ_i| − ½c² otherwise, where ξ_i = w·Φ(x_i) + b − …
11.4 Perturbed kernels. Suppose two different kernel matrices, K and K′, are used to train two kernel ridge regression hypotheses with the same regularization parameter λ. In this problem, we will …
11.3 Linear regression. (a) What condition is required on the data X in order to guarantee that XX⊤ is invertible? (b) Assume the problem is under-determined. Then, we can choose a solution w such that …
11.2 Pseudo-dimension of linear functions. Let H be the set of all linear functions in dimension d, i.e., h(x) = w⊤x for some w ∈ R^d. Show that Pdim(H) = d.
11.1 Pseudo-dimension and monotonic functions. Assume that φ is a strictly monotonic function and let φ∘H be the family of functions defined by φ∘H = {φ(h(·)) : h ∈ H}, where H is some set of …
10.10 k-partite weight function. Show how the weight function ω can be defined so that L_ω encodes the natural loss function associated with a k-partite ranking scenario.
10.9 Deviation bound for the AUC. Let h be a fixed scoring function used to rank the points of X. Use Hoeffding's bound to show that with high probability the AUC of h for a finite sample is close to its …
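For exercise 10.9, the following standard definitions set up the question (the notation m₊, m₋ for the number of positive and negative points is an assumption, not necessarily the book's): the empirical AUC is a pairwise average, and Hoeffding's inequality controls the deviation of bounded averages from their expectations.

```latex
% Empirical AUC of a fixed scoring function h, for m_+ positive points x_1,\dots,x_{m_+}
% and m_- negative points x'_1,\dots,x'_{m_-} (ties ignored for simplicity):
\widehat{\mathrm{AUC}}(h)
  = \frac{1}{m_+ m_-} \sum_{i=1}^{m_+} \sum_{j=1}^{m_-} \mathbf{1}_{h(x_i) > h(x'_j)}.
% Hoeffding's inequality for i.i.d. random variables Z_1,\dots,Z_n with values in [0,1]:
\Pr\!\left[\,\Bigl|\tfrac{1}{n}\textstyle\sum_{k=1}^{n} Z_k - \mathbb{E}[Z_1]\Bigr| \ge \epsilon\right]
  \le 2\, e^{-2 n \epsilon^2}.
```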
10.8 Multipartite ranking. Consider the ranking scenario in a k-partite setting where X is partitioned into k subsets X₁, …, X_k with k ≥ 1. The bipartite case (k = 2) is already specifically …
10.7 Bipartite ranking. Suppose that we use a binary classifier for ranking in the bipartite setting. Prove that if the error of the binary classifier is ε, then that of the ranking it induces is also …
10.6 Margin-maximization ranking. Give a linear programming (LP) algorithm returning a linear hypothesis for pairwise ranking based on margin maximization.
10.5 RankPerceptron. Adapt the Perceptron algorithm to derive a pairwise ranking algorithm based on a linear scoring function. Assume that the training sample is linearly separable for pairwise …
10.4 Margin maximization and RankBoost. Give an example showing that RankBoost does not achieve the maximum margin, as in the case of AdaBoost.
10.3 Empirical margin loss of RankBoost. Derive an upper bound on the empirical pairwise ranking margin loss of RankBoost similar to that of theorem 7.7 for AdaBoost.
10.2 On-line ranking. Give an on-line version of the SVM-based ranking algorithm presented in section 10.3.
10.1 Uniform margin bound for ranking. Use theorem 10.1 to derive a margin-based learning bound for ranking that holds uniformly for all ρ > 0 (see the similar binary classification bounds of theorem 5.9 …
9.6 Give an example where the generalization error of each of the k(k−1)/2 binary classifiers h_{ll′}, l ≠ l′, used in the definition of the OVO technique is r and that of the OVO hypothesis (k − …
9.5 Decision trees. Show that the VC-dimension of a binary decision tree with n nodes in dimension N is in O(n log N).
9.4 Multi-class algorithm based on RankBoost. This problem requires familiarity with the material presented both in this chapter and in chapter 10. An alternative boosting-type multi-class …
9.3 Alternative multi-class boosting algorithm. Consider the objective function G defined for any sample S = ((x₁, y₁), …, (x_m, y_m)) ∈ (X × Y)^m and α = (α₁, …, α_n) ∈ R^n, n ≥ 1, by G(α) = Σ_{i=1}^m …
9.2 Multi-class classification with kernel-based hypotheses constrained by an L_p norm. Use corollary 9.4 to define alternative multi-class classification algorithms with kernel-based hypotheses …
9.1 Generalization bounds for multi-label case. Use similar techniques to those used in the proof of theorem 9.2 to derive a margin-based learning bound in the multi-label case.
8.11 On-line to batch: kernel Perceptron margin bound. In this problem, we give a margin-based generalization guarantee for the kernel Perceptron algorithm. Let h₁, …, h_T be the sequence of …
8.9 General inequality. In this exercise we generalize the result of exercise 8.7 by using a more general inequality: log(1 − x) ≥ −x − αx² for some 0 < α < 2. (a) First prove that the …
8.8 Polynomial weighted algorithm. The objective of this problem is to show how another regret minimization algorithm can be defined and studied. Let L be a loss function convex in its first argument …
8.7 Second-order regret bound. Consider the randomized algorithm that differs from the RWM algorithm only by the weight update, i.e., w_{t+1,i} ← (1 − (1 − β)l_{t,i})w_{t,i}, t ∈ [T], which is applied to all i …
8.6 Margin Perceptron. Given a training sample S that is linearly separable with a maximum margin ρ > 0, theorem 8.8 states that the Perceptron algorithm run cyclically over S is guaranteed to …
8.5 On-line SVM algorithm. Consider the algorithm described in figure 8.11. Show that this algorithm corresponds to the stochastic gradient descent technique applied to the SVM problem (5.24) with …
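Figure 8.11 and problem (5.24) are not reproduced on this page. As a point of comparison for exercise 8.5, here is a minimal sketch of stochastic subgradient descent applied to a standard soft-margin SVM objective (λ/2)‖w‖² + (1/m) Σ_i max(0, 1 − y_i w·x_i); the function name, the 1/(λt) step size, and the omission of the bias term are assumptions, not the book's exact algorithm.

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=100, seed=0):
    """Stochastic subgradient descent on the (bias-free) soft-margin SVM objective
    (lam/2)||w||^2 + (1/m) * sum_i max(0, 1 - y_i w.x_i)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(m):
            t += 1
            eta = 1.0 / (lam * t)                  # decaying step size
            if y[i] * (w @ X[i]) < 1:              # hinge loss is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                                  # only the regularizer contributes
                w = (1 - eta * lam) * w
    return w

# Toy usage on a linearly separable sample.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = svm_sgd(X, y)
print("learned w:", w, "| training errors:", int(np.sum(np.sign(X @ w) != y)))
```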
8.4 Tightness of lower bound. Is the lower bound of theorem 8.5 tight? Explain why, or show a counterexample.
8.3 Sparse instances. Suppose each input vector x_t, t ∈ [T], coincides with the t-th unit vector of R^T. How many updates are required for the Perceptron algorithm to converge? Show that the number of …
8.2 Generalized mistake bound. Theorem 8.8 presents a margin bound on the maximum number of updates for the Perceptron algorithm for the special case η = 1. Consider now the general Perceptron update …
8.1 Perceptron lower bound. Let S be a labeled sample of m points in R^N with x_i = ((−1)^i, …, (−1)^i, (−1)^{i+1}, 0, …, 0), whose first i components are nonzero, and y_i = (−1)^{i+1} (8.30). Show that the …
7.12 Empirical margin loss boosting. As discussed in the chapter, AdaBoost can be viewed as coordinate descent applied to a convex upper bound on the empirical error. Here, we consider an algorithm …
7.10 Boosting in the presence of unknown labels. Consider the following variant of the classification problem where, in addition to the positive and negative labels +1 and −1, points may be labeled …
7.9 AdaBoost example. In this exercise we consider a concrete example that consists of eight training points and eight weak classifiers. (a) Define an m×n matrix M where M_ij = y_i h_j(x_i), i.e., M_ij = +1 if …
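The eight training points and eight weak classifiers of exercise 7.9 are given in the textbook and are not reproduced here. The short sketch below only illustrates how the matrix M with M_ij = y_i h_j(x_i) is assembled; the 1-D points, labels, and threshold classifiers are hypothetical placeholders.

```python
import numpy as np

# Hypothetical toy data (placeholders, not the exercise's actual eight points).
x = np.array([-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0])   # m = 8 points
y = np.array([-1, -1, +1, +1, +1, +1, -1, -1])                # labels in {-1, +1}

# Eight threshold ("stump") weak classifiers h_j(x) = sign(x - theta_j).
thetas = np.array([-2.5, -1.5, -0.75, 0.0, 0.75, 1.5, 2.5, 3.5])
H = np.where(x[:, None] > thetas[None, :], 1, -1)             # H[i, j] = h_j(x_i)

# M[i, j] = y_i * h_j(x_i): +1 if h_j classifies x_i correctly, -1 otherwise.
M = y[:, None] * H
print(M)
```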
7.8 Simplified AdaBoost. Suppose we simplify AdaBoost by setting the parameter α_t to a fixed value α_t = α > 0, independent of the boosting round t. (a) Let γ be such that (1/2 − ε_t) ≥ γ > 0. Find the best …
7.7 Noise-tolerant AdaBoost. AdaBoost may be significantly overfitting in the presence of noise, in part due to the high penalization of misclassified examples. To reduce this effect, one could use …
7.6 Fix ε ∈ (0, 1/2). Let the training sample be defined by m points in the plane with m/4 negative points all at coordinate (1, 1), another m/4 negative points all at coordinate (−1, −1), …
7.5 Define the unnormalized correlation of two vectors x and x′ as the inner product between these vectors. Prove that the distribution vector (D_{t+1}(1), …, D_{t+1}(m)) defined by AdaBoost and the …
7.4 Weighted instances. Let the training sample be S = ((x₁, y₁), …, (x_m, y_m)). Suppose we wish to penalize differently errors made on x_i versus x_j. To do that, we associate some non-negative …
7.3 Update guarantee. Assume that the main weak learner assumption of AdaBoost holds. Let h_t be the base learner selected at round t. Show that the base learner h_{t+1} selected at round t + 1 must be …
7.2 Alternative objective functions. This problem studies boosting-type algorithms defined with objective functions different from that of AdaBoost. We assume that the training data are given as m …
7.1 VC-dimension of the hypothesis set of AdaBoost. Prove the upper bound on the VC-dimension of the hypothesis set F_T of AdaBoost after T rounds of boosting, as stated in equation (7.9).
6.22 Anomaly detection. For this problem, consider a Hilbert space H with associated feature map Φ: X → H and kernel K(x, x′) = Φ(x)·Φ(x′). (a) First, let us consider finding the smallest enclosing …
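For part (a) of exercise 6.22, a standard way to write the smallest-enclosing-ball problem is shown below; the variable names c (center) and r (radius) are assumptions rather than the book's notation.

```latex
% Smallest enclosing ball of \{\Phi(x_1), \dots, \Phi(x_m)\} in the Hilbert space H:
\min_{c \in H,\; r \ge 0} \; r^2
\quad \text{subject to} \quad
\|\Phi(x_i) - c\|^2 \le r^2, \qquad i = 1, \dots, m.
```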
6.21 Mercer's condition. Let X ⊂ R^N be a compact set and K: X × X → R a continuous kernel function. Prove that if K verifies Mercer's condition (theorem 6.2), then it is PDS. (Hint: assume that K is …
6.20 n-gram kernel. Show that for all n ≥ 1, and any n-gram kernel K_n, K_n(x, y) can be computed in linear time O(|x| + |y|) for all strings x, y, assuming n and the alphabet size are constants.
6.19 Sequence kernels. Let X = {a, c, g, t}. To classify DNA sequences using SVMs, we wish to define a kernel between sequences defined over X. We are given a finite set I ⊂ X* of non-coding regions …
6.18 Metrics and kernels. Let X be a non-empty set and K: X × X → R be a negative definite symmetric kernel such that K(x, x) = 0 for all x ∈ X. (a) Show that there exists a Hilbert space H and a mapping …
6.17 Relationship between NDS and PDS kernels. Prove the statement of theorem 6.17. (Hint: use the fact that if K is PDS then exp(K) is also PDS, along with theorem 6.16.)
6.16 Fraud detection. To prevent fraud, a credit-card company decides to contact Professor Villebanque and provides him with a random list of several thousand fraudulent and non-fraudulent events. …
6.15 Image classification kernel. For α ≥ 0, the kernel K_α: (x, x′) ↦ Σ_{k=1}^N min(|x_k|^α, |x′_k|^α) (6.30) over R^N × R^N is used in image classification. Show that K_α is PDS for all α ≥ 0. To do so, proceed as …
6.14 Classifier-based kernel. Let S be a training sample of size m. Assume that S has been generated according to some probability distribution D(x, y), where (x, y) ∈ X × {−1, +1}. (a) Define the …
6.13 High-dimensional mapping. Let Φ: X → H be a feature mapping such that the dimension N of H is very large and let K: X × X → R be a PDS kernel defined by K(x, x′) = E_{i∼D}[[Φ(x)]_i [Φ(x′)]_i], …
6.12 Explicit polynomial kernel mapping. Let K be a polynomial kernel of degree d, i.e., K: R^N × R^N → R, K(x, x′) = (x·x′ + c)^d, with c > 0. Show that the dimension of the feature space associated …
6.11 Explicit mappings. (a) Denote a data set x₁, …, x_m and a kernel K(x_i, x_j) with a Gram matrix K. Assuming K is positive semidefinite, give a map Φ(·) such that K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩ …
6.10 For any p > 0, let K_p be the kernel defined over R₊ × R₊ by K_p(x, y) = e^{−(x+y)^p} (6.25). Show that K_p is positive definite symmetric (PDS) iff p ≤ 1. (Hint: you can use the fact that if K is NDS, …
6.9 Let H be a Hilbert space with the corresponding dot product ⟨·, ·⟩. Show that the kernel K defined over H × H by K(x, y) = 1 − ⟨x, y⟩ is negative definite.
6.8 Is the kernel K defined over R^n × R^n by K(x, y) = ‖x − y‖^{3/2} PDS? Is it NDS?
6.7 Define a difference kernel as K(x, x′) = |x − x′| for x, x′ ∈ R. Show that this kernel is not positive definite symmetric (PDS).
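Not a proof for exercise 6.7, but a quick numerical sanity check (with an arbitrarily chosen set of points): the Gram matrix of K(x, x′) = |x − x′| already has a negative eigenvalue on three real points, so the kernel cannot be PDS.

```python
import numpy as np

# Gram matrix of K(x, x') = |x - x'| on an arbitrary set of points.
x = np.array([0.0, 1.0, 2.0])
K = np.abs(x[:, None] - x[None, :])      # K[i, j] = |x_i - x_j|
eigvals = np.linalg.eigvalsh(K)
print(K)
print("eigenvalues:", eigvals)           # the smallest one is negative
assert eigvals.min() < 0                 # hence K is not positive semidefinite
```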
6.6 Show that the following kernels K are NDS: (a) K(x, y) = [sin(x − y)]² over R × R. (b) K(x, y) = log(x + y) over (0, +∞) × (0, +∞).
6.5 Set kernel. Let X be a finite set and let K₀ be a PDS kernel over X. Show that the kernel K′ defined by ∀A, B ∈ 2^X, K′(A, B) = Σ_{x∈A, x′∈B} K₀(x, x′) is a PDS kernel.
6.4 Symmetric difference kernel. Let X be a finite set. Show that the kernel K defined over 2^X, the set of subsets of X, by ∀A, B ∈ 2^X, K(A, B) = exp(−½|A Δ B|), where A Δ B is the symmetric difference of A and B …
6.3 Graph kernel. Let G = (V, E) be an undirected graph with vertex set V and edge set E. V could represent a set of documents or biosequences and E the set of connections between them. Let w[e] ∈ R …
6.2 Show that the following kernels K are PDS: (a) K(x, y) = cos(x − y) over R × R. (b) K(x, y) = cos(x² − y²) over R × R. (c) For all integers n > 0, K(x, y) = Σ_{i=1}^N cos^n(x_i² − y_i²) over R^N …
6.1 Let K: X × X → R be a PDS kernel, and let f: X → R be a positive function. Show that the kernel K′ defined for all x, y ∈ X by K′(x, y) = K(x, y)/(f(x)f(y)) is a PDS kernel.
5.7 VC-dimension of canonical hyperplanes. The objective of this problem is to derive a bound on the VC-dimension of canonical hyperplanes that does not depend on the dimension of the feature space. Let S …
5.6 Sparse SVM. One can give two types of arguments in favor of the SVM algorithm: one based on the sparsity of the support vectors, another based on the notion of margin. Suppose that instead of …
5.5 SVMs hands-on. (a) Download and install the libsvm software library from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. (b) Download the satimage data set found …
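Exercise 5.5 is meant to be done with the libsvm command-line tools themselves. As a rough illustration of the same kind of experiment, here is a sketch using scikit-learn's SVC, which is built on libsvm; the file name and the parameter grid are placeholders, not the exercise's prescribed values.

```python
from sklearn.datasets import load_svmlight_file       # reads libsvm-format files
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder file name: point this at the satimage data from part (b),
# stored in libsvm/svmlight format.
X, y = load_svmlight_file("satimage.scale")
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for C in [0.1, 1.0, 10.0]:                             # illustrative grid only
    clf = SVC(kernel="rbf", C=C, gamma="scale")        # SVC wraps libsvm
    clf.fit(X_tr, y_tr)
    print(f"C={C}: test accuracy = {clf.score(X_te, y_te):.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
```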
5.4 Sequential minimal optimization (SMO). The SMO algorithm is an optimization algorithm introduced to speed up the training of SVMs. SMO reduces a (potentially) large quadratic programming (QP) …
5.3 Importance-weighted SVM. Suppose you wish to use SVMs to solve a learning problem where some training data points are more important than others. More formally, assume that each training point …
5.2 Tighter Rademacher bound. Derive the following tighter version of the bound of theorem 5.9: for any δ > 0, with probability at least 1 − δ, for all h ∈ H and ρ ∈ (0, 1] the following holds: R(h) ≤ …
5.1 Soft margin hyperplanes. The function of the slack variables used in the optimization problem for soft margin hyperplanes has the form ξ ↦ Σ_{i=1}^m ξ_i. Instead, we could use ξ ↦ Σ_{i=1}^m ξ_i^p, with …
4.5 Same questions as in exercise 4.4, with the loss of h: X → R at point (x, y) ∈ X × {−1, +1} defined instead to be 1_{yh(x)…}
4.4 In this problem, the loss of h: X → R at point (x, y) ∈ X × {−1, +1} is defined to be 1_{yh(x)≤0}. (a) Define the Bayes classifier and a Bayes scoring function h for this loss. (b) Express the excess …
4.3 Show that for the squared hinge loss, Φ(u) = max(0, 1 + u)², the statement of theorem 4.7 holds with s = 2 and c = 1/2, and therefore that the excess error can be upper bounded as follows: R(h) …
4.2 Show that for the squared loss, Φ(u) = (1 + u)², the statement of theorem 4.7 holds with s = 2 and c = 1/2, and therefore that the excess error can be upper bounded as follows: R(h) − …
4.1 For any hypothesis set H, show that the following inequalities hold: E_{S∼D^m}[R̂_S(h_S^ERM)] ≤ inf_{h∈H} R(h) ≤ E_{S∼D^m}[R(h_S^ERM)] (4.13).
3.31 Generalization bound based on covering numbers. Let H be a family of functions mapping X to a subset of real numbers Y ⊆ R. For any ε > 0, the covering number N(H, ε) of H for the L∞ norm is the …
3.30 VC-dimension generalization bound: realizable case. In this exercise we show that the bound given in corollary 3.19 can be improved to O((d log(m/d))/m) in the realizable setting. Assume we are …
3.29 Infinite VC-dimension. (a) Show that if a concept class C has infinite VC-dimension, then it is not PAC-learnable. (b) In the standard PAC-learning scenario, the learning algorithm receives all …
3.28 VC-dimension of convex combinations. Let H be a family of functions mapping from an input space X to {−1, +1} and let T be a positive integer. Give an upper bound on the VC-dimension of the …
3.27 VC-dimension of neural networks. Let C be a concept class over R^r with VC-dimension d. A C-neural network with one intermediate layer is a concept defined over R^n that can be represented by a …
3.26 Symmetric functions. A function h: {0, 1}^n → {0, 1} is symmetric if its value is uniquely determined by the number of 1's in the input. Let C denote the set of all symmetric functions. (a) …
3.25 VC-dimension of symmetric difference of concepts. For two sets A and B, let A Δ B denote the symmetric difference of A and B, i.e., A Δ B = (A ∪ B) − (A ∩ B). Let H be a non-empty family of subsets of X …
3.24 VC-dimension of union of concepts. Let A and B be two sets of functions mapping from X into {0, 1}, and assume that both A and B have finite VC-dimension, with VCdim(A) = d_A and VCdim(B) = d_B. Let …
3.23 VC-dimension of intersection concepts. (a) Let C₁ and C₂ be two concept classes. Show that for any concept class C = {c₁ ∩ c₂ : c₁ ∈ C₁, c₂ ∈ C₂}, Π_C(m) ≤ Π_{C₁}(m) Π_{C₂}(m) (3.53). (b) Let C be a …
3.22 VC-dimension of intersection of halfspaces. Consider the class C_k of convex intersections of k halfspaces. Give lower and upper bound estimates for VCdim(C_k).
3.21 VC-dimension of union of halfspaces. Provide an upper bound on the VC-dimension of the class of hypotheses described by unions of k halfspaces.