Questions and Answers of Pattern Recognition And Machine Learning

6.22 ( ) Consider a regression problem with N training set input vectors x1, . . . , xN and L test set input vectors xN+1, . . . , xN+L, and suppose we define a Gaussian process prior over functions
6.21 ( ) www Consider a Gaussian process regression model in which the kernel function is defined in terms of a fixed set of nonlinear basis functions. Show that the predictive distribution is
6.20 ( ) www Verify the results (6.66) and (6.67).
6.19 ( ) Another viewpoint on kernel regression comes from a consideration of regression problems in which the input variables as well as the target variables are corrupted with additive noise.
6.18 () Consider a Nadaraya-Watson model with one input variable x and one target variable t having Gaussian components with isotropic covariances, so that the covariance matrix is given by $\sigma^2 I$
6.17 ( ) www Consider the sum-of-squares error function (6.39) for data having noisy inputs, where ν(ξ) is the distribution of the noise. Use the calculus of variations to minimize this error
6.16 ( ) Consider a parametric model governed by the parameter vector w together with a data set of input values x1, . . . , xN and a nonlinear feature mapping φ(x). Suppose that the dependence of
6.15 () By considering the determinant of a 2 × 2 Gram matrix, show that a positive-definite kernel function $k(x, x')$ satisfies the Cauchy-Schwarz inequality $k(x_1, x_2)^2 \leq k(x_1, x_1)\, k(x_2, x_2)$. (6.96)
6.14 () www Write down the form of the Fisher kernel, defined by (6.33), for the case of a distribution p(x|μ) = N(x|μ, S) that is Gaussian with mean μ and fixed covariance S.
6.13 () Show that the Fisher kernel, defined by (6.33), remains invariant if we make a nonlinear transformation of the parameter vector θ → ψ(θ), where the function ψ(·) is invertible and
6.12 ( ) www Consider the space of all possible subsets A of a given fixed set D. Show that the kernel function (6.27) corresponds to an inner product in a feature space of dimensionality $2^{|D|}$ defined
6.11 () By making use of the expansion (6.25), and then expanding the middle factor as a power series, show that the Gaussian kernel (6.23) can be expressed as the inner product of an
6.10 () Show that an excellent choice of kernel for learning a function f(x) is given by $k(x, x') = f(x) f(x')$ by showing that a linear learning machine based on this kernel will always find a
6.9 () Verify the results (6.21) and (6.22) for constructing valid kernels.
6.8 () Verify the results (6.19) and (6.20) for constructing valid kernels.
6.7 () www Verify the results (6.17) and (6.18) for constructing valid kernels.
6.6 () Verify the results (6.15) and (6.16) for constructing valid kernels.
6.5 () www Verify the results (6.13) and (6.14) for constructing valid kernels.
6.4 () In Appendix C, we give an example of a matrix that has positive elements but that has a negative eigenvalue and hence that is not positive definite. Find an example of the converse property,
6.3 () The nearest-neighbour classifier (Section 2.5.2) assigns a new input vector x to the same class as that of the nearest input vector xn from the training set, where in the simplest case, the
6.2 ( ) In this exercise, we develop a dual formulation of the perceptron learning algorithm. Using the perceptron learning rule (4.55), show that the learned weight vector w can be written as a
6.1 ( ) www Consider the dual formulation of the least squares linear regression problem given in Section 6.1. Show that the solution for the components an of the vector a can be expressed as a
5.41 ( ) By following analogous steps to those given in Section 5.7.1 for regression networks, derive the result (5.183) for the marginal likelihood in the case of a network having a cross-entropy
5.40 () www Outline the modifications needed to the framework for Bayesian neural networks, discussed in Section 5.7.3, to handle multiclass problems using networks having softmax output-unit
5.39 () www Make use of the Laplace approximation result (4.135) to show that the evidence function for the hyperparameters α and β in the Bayesian neural network model can be approximated by
5.38 () Using the general result (2.115), derive the predictive distribution (5.172) for the Laplace approximation to the Bayesian neural network model.
5.37 () Verify the results (5.158) and (5.160) for the conditional mean and variance of the mixture density network model.
5.36 () Derive the result (5.157) for the derivative of the error function with respect to the network output activations controlling the component variances in the mixture density network.
5.35 () Derive the result (5.156) for the derivative of the error function with respect to the network output activations controlling the component means in the mixture density network.
5.34 () www Derive the result (5.155) for the derivative of the error function with respect to the network output activations controlling the mixing coefficients in the mixture density network.
5.33 () Write down a pair of equations that express the Cartesian coordinates (x1, x2) for the robot arm shown in Figure 5.18 in terms of the joint angles θ1 and θ2 and the lengths L1 and L2 of the
5.32 ( ) Show that the derivatives of the mixing coefficients {πk}, defined by (5.146), with respect to the auxiliary parameters {ηj} are given by $\frac{\partial \pi_k}{\partial \eta_j} = \delta_{jk}\pi_j - \pi_j \pi_k$. (5.208) Hence, by
5.31 () Verify the result (5.143).
5.30 () Verify the result (5.142).
5.29 () www Verify the result (5.141).
5.28 () www Consider a neural network, such as the convolutional network discussed in Section 5.5.6, in which multiple weights are constrained to have the same value. Discuss how the standard
5.27 ( ) www Consider the framework for training with transformed data in the special case in which the transformation consists simply of the addition of random noise x → x + ξ where ξ has a
5.26 ( ) Consider a multilayer perceptron with arbitrary feed-forward topology, which is to be trained by minimizing the tangent propagation error function (5.127) in which the regularizing function
5.25 ( ) www Consider a quadratic error function of the form $E = E_0 + \frac{1}{2}(w - w^\star)^{\mathrm{T}} H (w - w^\star)$ (5.195) where $w^\star$ represents the minimum, and the Hessian matrix H is positive definite and constant.
5.24 () Verify that the network function defined by (5.113) and (5.114) is invariant under the transformation (5.115) applied to the inputs, provided the weights and biases are simultaneously
5.23 ( ) Extend the results of Section 5.4.5 for the exact Hessian of a two-layer network to include skip-layer connections that go directly from inputs to outputs.
5.22 ( ) Derive the results (5.93), (5.94), and (5.95) for the elements of the Hessian matrix of a two-layer feed-forward network by application of the chain rule of calculus.
5.21 ( ) Extend the expression (5.86) for the outer product approximation of the Hessian matrix to the case of K > 1 output units. Hence, derive a recursive expression analogous to (5.87) for
5.20 () Derive an expression for the outer product approximation to the Hessian matrix for a network having K outputs with a softmax output-unit activation function and a cross-entropy error
5.19 () www Derive the expression (5.85) for the outer product approximation to the Hessian matrix for a network having a single output with a logistic sigmoid output-unit activation function and a
5.18 () Consider a two-layer network of the form shown in Figure 5.1 with the addition of extra parameters corresponding to skip-layer connections that go directly from the inputs to the outputs. By
5.17 () Consider a squared loss function of the form $E = \frac{1}{2} \iint \{y(x, w) - t\}^2 \, p(x, t) \, dx \, dt$ (5.193) where y(x,w) is a parametric function such as a neural network. The result (1.89) shows that the
5.16 () The outer product approximation to the Hessian matrix for a neural network using a sum-of-squares error function is given by (5.84). Extend this result to the case of multiple outputs.
5.15 ( ) In Section 5.3.4, we derived a procedure for evaluating the Jacobian matrix of a neural network using a backpropagation procedure. Derive an alternative formalism for finding the Jacobian
5.14 () By making a Taylor expansion, verify that the terms that are $O(\epsilon)$ cancel on the right-hand side of (5.69).
5.13 () Show that as a consequence of the symmetry of the Hessian matrix H, the number of independent elements in the quadratic error function (5.28) is given by W(W + 3)/2.
5.12 ( ) www By considering the local Taylor expansion (5.32) of an error function about a stationary point $w^\star$, show that the necessary and sufficient condition for the stationary point to be a local
5.11 ( ) www Consider a quadratic error function defined by (5.32), in which the Hessian matrix H has an eigenvalue equation given by (5.33). Show that the contours of constant error are ellipses
5.10 () www Consider a Hessian matrix H with eigenvector equation (5.33). By setting the vector v in (5.39) equal to each of the eigenvectors ui in turn, show that H is positive definite if, and
5.9 () www The error function (5.21) for binary classification problems was derived for a network having a logistic-sigmoid output activation function, so that $0 \leq y(x, w) \leq 1$, and data having target
5.8 () We saw in (4.88) that the derivative of the logistic sigmoid activation function can be expressed in terms of the function value itself. Derive the corresponding result for the ‘tanh’
5.7 () Show the derivative of the error function (5.24) with respect to the activation ak for output units having a softmax activation function satisfies (5.18).
5.6 () www Show that the derivative of the error function (5.21) with respect to the activation ak for an output unit having a logistic sigmoid activation function satisfies (5.18).
5.5 () www Show that maximizing likelihood for a multiclass neural network model in which the network outputs have the interpretation yk(x,w) = p(tk = 1|x) is equivalent to the minimization of the
5.4 ( ) Consider a binary classification problem in which the target values are t ∈ {0, 1}, with a network output y(x,w) that represents p(t = 1|x), and suppose that there is a probability that
5.3 ( ) Consider a regression problem involving multiple target variables in which it is assumed that the distribution of the targets, conditioned on the input vector x, is a Gaussian of the form
5.2 () www Show that maximizing the likelihood function under the conditional distribution (5.16) for a multioutput neural network is equivalent to minimizing the sum-of-squares error function
5.1 ( ) Consider a two-layer network function of the form (5.7) in which the hidden-unit nonlinear activation functions g(·) are given by logistic sigmoid functions of the form σ(a) = {1 +
4.26 ( ) In this exercise, we prove the relation (4.152) for the convolution of a probit function with a Gaussian distribution. To do this, show that the derivative of the left-hand side with respect
4.25 ( ) Suppose we wish to approximate the logistic sigmoid σ(a) defined by (4.59) by a scaled probit function Φ(λa), where Φ(a) is defined by (4.114). Show that if λ is chosen so that the
4.24 ( ) Use the results from Section 2.3.2 to derive the result (4.151) for the marginalization of the logistic regression model with respect to a Gaussian posterior distribution over the parameters
4.23 ( ) www In this exercise, we derive the BIC result (4.139) starting from the Laplace approximation to the model evidence given by (4.137). Show that if the prior over parameters is Gaussian of
4.22 () Using the result (4.135), derive the expression (4.137) for the log model evidence under the Laplace approximation.
4.21 () Show that the probit function (4.114) and the erf function (4.115) are related by (4.116).
4.20 ( ) Show that the Hessian matrix for the multiclass logistic regression problem, defined by (4.110), is positive semidefinite. Note that the full Hessian matrix for this problem is of size MK
4.19 () www Write down expressions for the gradient of the log likelihood, as well as the corresponding Hessian matrix, for the probit regression model defined in Section 4.3.5. These are the
4.18 () Using the result (4.91) for the derivatives of the softmax activation function, show that the gradients of the cross-entropy error (4.108) are given by (4.109).
4.17 () www Show that the derivatives of the softmax activation function (4.104), where the ak are defined by (4.105), are given by (4.106).
4.16 () Consider a binary classification problem in which each observation xn is known to belong to one of two classes, corresponding to t = 0 and t = 1, and suppose that the procedure for
4.15 ( ) Show that the Hessian matrix H for the logistic regression model, given by (4.97), is positive definite. Here R is a diagonal matrix with elements yn(1 − yn), and yn is the output of the
4.14 () Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding a vector w whose decision boundary $w^{\mathrm{T}}\phi(x) = 0$ separates
4.13 () www By making use of the result (4.88) for the derivative of the logistic sigmoid, show that the derivative of the error function (4.90) for the logistic regression model is given by (4.91).
4.12 () www Verify the relation (4.88) for the derivative of the logistic sigmoid function defined by (4.59).
4.11 ( ) Consider a classification problem with K classes for which the feature vector φ has M components each of which can take L discrete states. Let the values of the components be represented by
4.10 ( ) Consider the classification model of Exercise 4.9 and now suppose that the class-conditional densities are given by Gaussian distributions with a shared covariance matrix, so that p(φ|Ck) =
4.9 () www Consider a generative classification model for K classes defined by prior class probabilities p(Ck) = πk and general class-conditional densities p(φ|Ck) where φ is the input feature
4.8 () Using (4.57) and (4.58), derive the result (4.65) for the posterior class probability in the two-class generative model with Gaussian densities, and verify the results (4.66) and (4.67) for
4.7 () www Show that the logistic sigmoid function (4.59) satisfies the property $\sigma(-a) = 1 - \sigma(a)$ and that its inverse is given by $\sigma^{-1}(y) = \ln\{y/(1 - y)\}$.
4.6 () Using the definitions of the between-class and within-class covariance matrices given by (4.27) and (4.28), respectively, together with (4.34) and (4.36) and the choice of target values
4.5 () By making use of (4.20), (4.23), and (4.24), show that the Fisher criterion (4.25) can be written in the form (4.26).
4.4 () www Show that maximization of the class separation criterion given by (4.23) with respect to w, using a Lagrange multiplier to enforce the constraint $w^{\mathrm{T}}w = 1$, leads to the result that w ∝
4.3 ( ) Extend the result of Exercise 4.2 to show that if multiple linear constraints are satisfied simultaneously by the target vectors, then the same constraints will also be satisfied by the
4.2 ( ) www Consider the minimization of a sum-of-squares error function (4.15), and suppose that all of the target vectors in the training set satisfy a linear constraint $a^{\mathrm{T}}t_n + b = 0$ (4.157) where
4.1 ( ) Given a set of data points {xn}, we can define the convex hull to be the set of all points x given by $x = \sum_n \alpha_n x_n$ (4.156) where $\alpha_n \geq 0$ and $\sum_n \alpha_n = 1$. Consider a second set of points {yn}
3.24 ( ) Repeat the previous exercise but now use Bayes’ theorem in the form $p(t) = \frac{p(t|w, \beta)\, p(w, \beta)}{p(w, \beta|t)}$ (3.119) and then substitute for the prior and posterior distributions and the
3.23 ( ) www Show that the marginal probability of the data, in other words the model evidence, for the model described in Exercise 3.12 is given by $p(t) = \frac{1}{(2\pi)^{N/2}} \frac{b_0^{a_0}}{b_N^{a_N}}$
3.22 ( ) Starting from (3.86) verify all of the steps needed to show that maximization of the log marginal likelihood function (3.86) with respect to β leads to the re-estimation equation (3.95).
3.21 ( ) An alternative way to derive the result (3.92) for the optimal value of α in the evidence framework is to make use of the identity $\frac{d}{d\alpha} \ln |A| = \mathrm{Tr}\left(A^{-1} \frac{d}{d\alpha} A\right)$. (3.117) Prove this identity
3.20 ( ) www Starting from (3.86) verify all of the steps needed to show that maximization of the log marginal likelihood function (3.86) with respect to α leads to the re-estimation equation (3.92).
3.19 ( ) Show that the integration over w in the Bayesian linear regression model gives the result (3.85). Hence show that the log marginal likelihood is given by (3.86).
3.18 ( ) www By completing the square over w, show that the error function (3.79) in Bayesian linear regression can be written in the form (3.80).
3.17 () Show that the evidence function for the Bayesian linear regression model can be written in the form (3.78) in which E(w) is defined by (3.79).
3.16 ( ) Derive the result (3.86) for the log evidence function p(t|α, β) of the linear regression model by making use of (2.115) to evaluate the integral (3.77) directly.
3.15 () www Consider a linear basis function model for regression in which the parameters α and β are set using the evidence framework. Show that the function E(mN) defined by (3.82) satisfies the
3.14 ( ) In this exercise, we explore in more detail the properties of the equivalent kernel defined by (3.62), where SN is defined by (3.54). Suppose that the basis functions φj(x) are linearly
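
Two of the identities stated in full in the list above can be sanity-checked numerically before attempting a proof: the sigmoid properties in Exercise 4.7 and the derivative (5.208) in Exercise 5.32. The sketch below is only a numerical check, not a derivation. It assumes that (4.59) is the usual logistic sigmoid 1/(1 + exp(−a)) and that (5.146) defines the mixing coefficients as a softmax over the auxiliary parameters η; neither equation is reproduced on this page, and all function names are made up for illustration.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, assumed form of (4.59): 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def logit(y):
    """Candidate inverse from Exercise 4.7: ln(y / (1 - y))."""
    return math.log(y / (1.0 - y))


def softmax(eta):
    """Softmax over a list eta (assumed form of (5.146))."""
    m = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def jacobian_analytic(eta):
    """Entry [k][j] is delta_jk * pi_j - pi_j * pi_k, as in (5.208)."""
    pi = softmax(eta)
    size = len(pi)
    return [
        [(pi[j] if j == k else 0.0) - pi[j] * pi[k] for j in range(size)]
        for k in range(size)
    ]


def jacobian_numeric(eta, h=1e-6):
    """Entry [k][j] approximates d pi_k / d eta_j by central differences."""
    size = len(eta)
    jac = [[0.0] * size for _ in range(size)]
    for j in range(size):
        up = list(eta)
        down = list(eta)
        up[j] += h
        down[j] -= h
        pi_up, pi_down = softmax(up), softmax(down)
        for k in range(size):
            jac[k][j] = (pi_up[k] - pi_down[k]) / (2.0 * h)
    return jac


def max_abs_diff(m1, m2):
    """Largest absolute entrywise difference between two square matrices."""
    return max(abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


if __name__ == "__main__":
    a = 0.7
    print("sigma(-a) - (1 - sigma(a)):", sigmoid(-a) - (1.0 - sigmoid(a)))
    print("logit(sigma(a)) - a:", logit(sigmoid(a)) - a)
    eta = [0.2, -1.0, 0.5]
    diff = max_abs_diff(jacobian_analytic(eta), jacobian_numeric(eta))
    print("max |analytic - numeric| for (5.208):", diff)
```

Agreement at a handful of test points does not prove either identity, but a large discrepancy would point to a slip in the algebra or in the assumed forms of (4.59) and (5.146).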
