Questions and Answers of Pattern Recognition and Machine Learning
14. [24] Let us consider the analysis carried out for defining the random weight initialization, which is summarized by Eq. (5.6.113). Generalize this property to the case of a generic neuron, either
13. [M16] Given a random variable W with uniform distribution in [−w .. + w], calculate its variance.
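For reference, a standard derivation (a sketch, not taken from the book's solution): for W uniform on [−w, +w] the mean is zero, so

```latex
\operatorname{Var}(W) = \mathbb{E}[W^2] - \mathbb{E}[W]^2
  = \int_{-w}^{+w} \frac{u^2}{2w}\,du - 0
  = \frac{1}{2w}\cdot\frac{2w^3}{3}
  = \frac{w^2}{3}.
```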
12. [M21] Let us consider Chebyshev’s inequality (5.6.111). First, discuss the reason why Ai can be regarded as a random variable with normal distribution. Second, provide a sharper bound that
11. [29] Suppose we are given an OHL net in a configuration in which all its weights are equal. Discuss the evolution of the learning process when beginning from this configuration by expressing the
10. [HM48] Beginning from the arguments given in Section 5.6.1, give formal conditions under which the learning in a deep network with rectifier units is local minima free.
9. [HM46] Let us consider condition (5.6.110). The proof that the error function is local minima free given in Section 5.6.1 assumes that X1 is a full rank matrix. While this is generally true
8. [M21] Let us consider an OHL net in the configuration where all its weights (including the biases) are null. Prove that this stationary point is not a local minimum.
7. [M21] In Section 5.6.1 the case of all null weights for neurons with the hyperbolic tangent has been discussed. It has been proven that both G1 and G2 are null for OHL networks. However, the
6. [M34] Based on the analyses of Section 5.6.1, complete the proof that OHL nets with a single output for linearly-separable examples yield local minima free error functions by proving that the point
5. [M39] Extend the result given in Section 5.6.1 on local minima error surfaces under the hypothesis of OHL nets and linearly-separable examples to the case of deep nets, that is, with multiple
4. [M19] Extend the result given in Section 5.6.1 on local minima error surfaces under the hypothesis of OHL nets and linearly-separable examples to the case of multiple outputs.
3. [HM45] Based on the XOR5 learning task, let us consider any associated learning task such that A and C are fixed and B, D, E can move under the constraint of keeping the separation structure of
2. [M28] Look at Fig. 5.20C and prove that an infinitesimal rotation of the separation lines yields changes of a_{5,b}, a_{5,d}, and a_{5,e} that are infinitesimals of higher order with respect to a_{5,a} and
1. [M31] Let us consider the following loading problem: L = {((−1, 0), 0.1), ((1, 0), 0.9), ((0, 1), 0.9), ((0, −5), 0.9)}. This is clearly a linearly-separable problem. Suppose we are using an
7. [M23] Given L = (Σ²I − ∇²)ⁿ, prove that its Green function is a spline.
6. [M16] Consider the perceptual space X = R and the regularization operator P = d²/dx². Prove that L = d⁴/dx⁴ and g(x) = |x|³.
5. [M15] Prove that for any given polynomial kernel k(x − z), there is no regularization operator L = PP such that Lk(α) = δ(α), that is, there is no Green function associated with k(x − z).
3. [M25] Consider the following minimization problem: E(f) = ∫_{−1}^{+1} f′²(x) / (1 − f(x)²) dx (4.4.107), where the function f : [−1..1] → R is expected to satisfy the boundary conditions f(−1) = 0
2. [22] Discuss the relation between the formulation of learning according to Eq. (4.4.84), where f_k is defined according to Eq. (4.4.85), and MMP with soft-constraints when choosing the hinge
18. [C19] On the basis of the discussion of local minima in the error function, carry out an experimental analysis to plot the error surface of the error function of a single neuron for
17. [C24] Plot the L-curve for Exercise 14 using a one-layer threshold-linear machine with regularization.
16. [C22] Use ridge regression in Exercise 12 using different polynomials and plot the effective number of parameters while changing μ.
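A minimal sketch of how the effective number of parameters can be tracked while changing μ, using the standard ridge-regression formula df(μ) = Σ_i s_i²/(s_i² + μ) computed from the singular values of the design matrix; the random matrix below is only a stand-in for the polynomial features of Exercise 12.

```python
import numpy as np

def effective_parameters(X, mu):
    """Effective number of parameters of ridge regression with penalty mu:
    df(mu) = trace(X (X^T X + mu I)^{-1} X^T) = sum_i s_i^2 / (s_i^2 + mu)."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + mu)))

# Stand-in design matrix; replace with the polynomial features of Exercise 12.
X = np.random.randn(50, 8)
for mu in [0.0, 0.1, 1.0, 10.0, 100.0]:
    print(f"mu={mu:>6}: df={effective_parameters(X, mu):.3f}")
```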
15. [C24] Use ridge regression with λ → 0 as an initialization of the iterative algorithm described in Exercise 3.1–24 for the computation of X̂⁺.
11. [17] Consider the problem of preprocessing in linear and linear-threshold machines. What is the role of α-scaling or of any linear map of the inputs? What happens when using ridge regression?
10. [20] Consider the perceptron Algorithm P, where we start from ŵ₀ = 0. Prove that the algorithm still halts with a separating solution and determine an upper bound to the number of mistakes.
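A runnable sketch of the setting above, assuming a generic perceptron update rule (not necessarily identical to the book's Algorithm P): starting from zero weights and cycling over a linearly-separable set, the loop halts after finitely many mistakes.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """X: (n, d) array of inputs, y: labels in {-1, +1}.
    Returns (w, b, number of mistakes made before convergence)."""
    w, b = np.zeros(X.shape[1]), 0.0
    mistakes = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w, b = w + yi * xi, b + yi
                mistakes += 1
                errors += 1
        if errors == 0:                  # a separating solution has been found
            break
    return w, b, mistakes

# Tiny linearly-separable example.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```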
9. [17] Consider the problem of pattern sorting. Discuss its solution using linear machines. In particular, discuss the condition y ∈ N(Qu⊥).
8. [M21] Consider online Algorithm G. Prove that, when considering the weight updating of any loop over the training set, the following properties hold true: (i) ∀κ, if |L| < ∞ and η → 0 then
7. [M25] The convergence of batch-mode gradient descent established under condition (3.4.69) is based on the assumption ℓ ≥ 1 + d. What if this condition is violated? Propose a convergence analysis
6. [20] Suppose we set the bias b of a = wx + b for an LTU using the Heaviside function to any b ∈ R \ {0}. Can we always determine the same solution as in the case in which b is a free parameter?
4. [15] Consider a finite learning set L that is linearly separable. Can you conclude that it is also robustly separable, i.e., that Eq. (3.4.72)(ii) holds?
5. [20] Consider the same question
2. [M15] How does the bound in Eq. (3.4.76) change if we assume that the examples xi live in a general Hilbert space H with the norm ‖v‖ = √(v, v) induced by a symmetric inner product?
3. [22]
1. [15] Consider Algorithm P (3.4.3) and discuss why in step P5 we have inserted the assignment ŵt ← (wt, bt/R). In particular, what happens if we directly return the weights ŵt as computed by the
6. [18] Given the same information source as in Exercise 5, let us consider the code based on the idea sketched below. Extend the above sketched idea into an algorithm for constructing the codes.
5. [19] Suppose we are given the information source I = {(a, 0.05), (b, 0.05), (c, 0.1), (d, 0.2), (e, 0.3), (f, 0.2), (g, 0.1)}. Let us consider the associated Huffman tree. Extend the above sketched
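A sketch of one way to build a Huffman code for this source in Python, using the probabilities listed above; this is the generic bottom-up merge of the two least probable subtrees, not necessarily the construction the exercise's own sketch has in mind.

```python
import heapq, itertools

source = {'a': 0.05, 'b': 0.05, 'c': 0.1, 'd': 0.2, 'e': 0.3, 'f': 0.2, 'g': 0.1}

def huffman_code(probs):
    """Bottom-up Huffman construction: repeatedly merge the two least
    probable subtrees, prefixing their codewords with 0 and 1."""
    tiebreak = itertools.count()  # keeps heapq from ever comparing dicts on ties
    heap = [(p, next(tiebreak), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code(source)
avg_len = sum(source[s] * len(code[s]) for s in source)
print(code, avg_len)  # the average length is close to the source entropy
```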
4. [16] Given the sequence x1 = 1, x2 = 2, x3 = 3, x4 = 4, . . . , which of the two following explanations is preferable? • x5 = 5, since ∀κ = 1, . . . , 4 : xκ = κ; • x5 = 29, since xκ = κ⁴ − 10κ³ + 35κ² − 49κ + 24.
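A two-line check of the second explanation (the quartic is the unique degree-4 polynomial through (1,1), (2,2), (3,3), (4,4), (5,29)):

```python
# Evaluate the quartic at k = 1..5; it reproduces 1, 2, 3, 4 and then jumps to 29.
p = lambda k: k**4 - 10*k**3 + 35*k**2 - 49*k + 24
print([p(k) for k in range(1, 6)])  # -> [1, 2, 3, 4, 29]
```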
3. [18] Given two probability distributions p(x) and q(x), their Kullback–Leibler distance is defined as D_L(p, q) := E(p log(p/q)). (2.3.105) Prove that D_L(p, q) ≥ 0. Suppose we define the symmetric
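For reference, the usual Jensen-inequality argument (a sketch, not necessarily the book's own proof): since log is concave,

```latex
-D_L(p, q) = \sum_x p(x)\,\log\frac{q(x)}{p(x)}
\;\le\; \log\sum_x p(x)\,\frac{q(x)}{p(x)}
\;=\; \log\sum_x q(x) \;=\; \log 1 \;=\; 0,
```

so D_L(p, q) ≥ 0, with equality if and only if p = q.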
2. [20] Consider a linear machine which performs supervised learning. Suppose we regard its weights as probabilities, so that b + Σ_{i=1}^{d} wi = 1. Formulate learning as MaxEnt, where the entropy is
1. [17] Consider the Berger's Burgers problem in the cases in which E → 1.00 $ or E → 3.00 $. Prove that β → +∞, that is, T = 1/β → 0. Provide an interpretation of this “cold
9. [25] Nowadays, the dominant approaches to machine learning and related disciplines, like computer vision and natural language processing, are based on benchmarks. However, as discussed in Section 1.3.1,
8. [19] To what extent do you think artificial neural networks, and particularly the multilayer networks presented in this chapter, represent a scientific model of the human brain?
7. [20] Write one sentence to explain why spatiotemporal learning tasks are essentially outside the common pattern recognition model, where patterns are represented by x ∈ R^d.
6. [40] After having played with commercial conversational agents, which you can access, provide a qualitative description of their adaptation qualities to different tasks. Why are they mostly
5. [26] After having played with commercial conversational agents, which you can access, make a list of relevant cognitive features that they are missing with respect to humans performing the same
4. [25] Provide a qualitative description of the difficulties connected with the problem of extracting information from optical documents (e.g., an invoice) and ordinary pictures, like those that you
3. [22] Future challenges in machine learning are to devise methods capable of directly learning the map χ = h ◦ f ◦ π without defining π and h in advance. Can you address the main problems that
2. [46] Future challenges in machine learning are to devise methods for pattern segmentation that make it possible to extract objects e ∈ E . Discuss qualitatively the main problems that we need to
1. [25] Segmentation techniques based on spatiotemporal locality are doomed to fail in most complex tasks of speech and vision understanding. However, they can be useful in restricted application
8.29 ( ) www Show that if the sum-product algorithm is run on a factor graph with a tree structure (no loops), then after a finite number of messages have been sent, there will be no pending messages.
8.28 ( ) www The concept of a pending message in the sum-product algorithm for a factor graph was defined in Section 8.4.7. Show that if the graph has one or more cycles, there will always be at
8.27 ( ) Consider two discrete variables x and y each having three possible states, for example x, y ∈ {0, 1, 2}. Construct a joint distribution p(x, y) over these variables having the property
8.26 () Consider a tree-structured factor graph over discrete variables, and suppose we wish to evaluate the joint distribution p(xa, xb) associated with two variables xa and xb that do not belong
8.25 ( ) In (8.86), we verified that the sum-product algorithm run on the graph in Figure 8.51 with node x3 designated as the root node gives the correct marginal for x2. Show that the correct
8.24 ( ) Show that the marginal distribution for the variables xs in a factor fs(xs) in a tree-structured factor graph, after running the sum-product message passing algorithm, can be written as the
8.23 ( ) www In Section 8.4.4, we showed that the marginal distribution p(xi) for a variable node xi in a factor graph is given by the product of the messages arriving at this node from neighbouring
8.22 () Consider a tree-structured factor graph, in which a given subset of the variable nodes form a connected subgraph (i.e., any variable node of the subset is connected to at least one of the
8.21 ( ) www Show that the marginal distributions p(xs) over the sets of variables xs associated with each of the factors fs(xs) in a factor graph can be found by first running the sum-product
8.20 () www Consider the message passing protocol for the sum-product algorithm on a tree-structured factor graph in which messages are first propagated from the leaves to an arbitrarily chosen root
8.19 ( ) Apply the sum-product algorithm derived in Section 8.4.4 to the chain-of-nodes model discussed in Section 8.4.1 and show that the results (8.54), (8.55), and (8.57) are recovered as a special
8.18 ( ) www Show that a distribution represented by a directed tree can trivially be written as an equivalent distribution over the corresponding undirected tree. Also show that a distribution
8.17 ( ) Consider a graph of the form shown in Figure 8.38 having N = 5 nodes, in which nodes x3 and x5 are observed. Use d-separation to show that x2 ⊥⊥ x5 | x3. Show that if the message passing
8.16 ( ) Consider the inference problem of evaluating p(xn|xN) for the graph shown in Figure 8.38, for all nodes n ∈ {1, . . . , N − 1}. Show that the message passing algorithm discussed in
8.15 ( ) www Show that the joint distribution p(xn−1, xn) for two neighbouring nodes in the graph shown in Figure 8.38 is given by an expression of the form (8.58).
8.14 () Consider a particular case of the energy function given by (8.42) in which the coefficients β = h = 0. Show that the most probable configuration of the latent variables is given by xi = yi
8.13 () Consider the use of iterated conditional modes (ICM) to minimize the energy function given by (8.42). Write down an expression for the difference in the values of the energy associated with
8.12 () www Show that there are 2^{M(M−1)/2} distinct undirected graphs over a set of M distinct random variables. Draw the 8 possibilities for the case of M = 3.
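A quick enumeration matching the count 2^{M(M−1)/2} for M = 3 (a sketch: each of the M(M−1)/2 possible edges is independently present or absent):

```python
from itertools import combinations, product

M = 3
edges = list(combinations(range(M), 2))            # the 3 possible edges for M = 3
graphs = list(product([0, 1], repeat=len(edges)))  # every on/off pattern of edges
print(len(graphs))  # -> 8 = 2**(M*(M-1)//2)
```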
8.11 ( ) Consider the example of the car fuel system shown in Figure 8.21, and suppose that instead of observing the state of the fuel gauge G directly, the gauge is seen by the driver D who reports
8.10 () Consider the directed graph shown in Figure 8.54 in which none of the variables is observed. Show that a ⊥⊥ b | ∅. Suppose we now observe the variable d. Show that in general a ⊥̸⊥ b | d.
8.9 () www Using the d-separation criterion, show that the conditional distribution for a node x in a directed graph, conditioned on all of the nodes in the Markov blanket, is independent of the
8.8 () www Show that a ⊥⊥ b, c | d implies a ⊥⊥ b | d.
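A short route (not necessarily the book's intended one): the premise gives the factorization p(a, b, c | d) = p(a | d) p(b, c | d), and marginalizing over c yields

```latex
p(a, b \mid d) = \sum_c p(a, b, c \mid d)
             = p(a \mid d)\sum_c p(b, c \mid d)
             = p(a \mid d)\, p(b \mid d),
```

which is exactly a ⊥⊥ b | d (for continuous variables, replace the sum by an integral).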
8.7 ( ) Using the recursion relations (8.15) and (8.16), show that the mean and covariance of the joint distribution for the graph shown in Figure 8.14 are given by (8.17) and (8.18), respectively.
8.6 () For the model shown in Figure 8.13, we have seen that the number of parameters required to specify the conditional distribution p(y|x1, . . . , xM), where xi ∈ {0, 1}, could be reduced from
8.5 () www Draw a directed probabilistic graphical model corresponding to the relevance vector machine described by (7.79) and (7.80).
8.4 ( ) Evaluate the distributions p(a), p(b|c), and p(c|a) corresponding to the joint distribution given in Table 8.2. Hence show by direct evaluation that p(a, b, c) = p(a)p(c|a)p(b|c). Draw the
8.3 ( ) Consider three binary variables a, b, c ∈ {0, 1} having the joint distribution given in Table 8.2. Show by direct evaluation that this distribution has the property that a and b are
8.2 () www Show that the property of there being no directed cycles in a directed graph follows from the statement that there exists an ordered numbering of the nodes such that for each node there
8.1 () www By marginalizing out the variables in order, show that the representation (8.5) for the joint distribution of a directed graph is correctly normalized, provided each of the conditional
7.19 ( ) Verify that maximization of the approximate log marginal likelihood function (7.114) for the classification relevance vector machine leads to the result (7.116) for re-estimation of the
7.18 () www Show that the gradient vector and Hessian matrix of the log posterior distribution (7.109) for the classification relevance vector machine are given by (7.110) and (7.111).
7.17 ( ) Using (7.83) and (7.86), together with the matrix identity (C.7), show that the quantities Sn and Qn defined by (7.102) and (7.103) can be written in the form (7.106) and (7.107).
7.16 () By taking the second derivative of the log marginal likelihood (7.97) for the regression RVM with respect to the hyperparameter αi, show that the stationary point given by (7.101) is a
7.15 ( ) www Using the results (7.94) and (7.95), show that the marginal likelihood (7.85) can be written in the form (7.96), where λ(αn) is defined by (7.97) and the sparsity and quality factors
7.14 ( ) Derive the result (7.90) for the predictive distribution in the relevance vector machine for regression. Show that the predictive variance is given by (7.91).
7.13 ( ) In the evidence framework for RVM regression, we obtained the re-estimation formulae (7.87) and (7.88) by maximizing the marginal likelihood given by (7.85). Extend this approach by inclusion
7.12 ( ) www Show that direct maximization of the log marginal likelihood (7.85) for the regression relevance vector machine leads to the re-estimation equations (7.87) and (7.88) where γi is defined
7.11 ( ) Repeat the above exercise, but this time make use of the general result (2.115).
7.10 ( ) www Derive the result (7.85) for the marginal likelihood function in the regression RVM, by performing the Gaussian integral over w in (7.84) using the technique of completing the square in
7.9 () Verify the results (7.82) and (7.83) for the mean and covariance of the posterior distribution over weights in the regression RVM.
7.8 () www For the regression support vector machine considered in Section 7.1.4, show that all training data points for which ξn > 0 will have an = C, and similarly all points for which ξ̂n > 0
7.7 () Consider the Lagrangian (7.56) for the regression support vector machine. By setting the derivatives of the Lagrangian with respect to w, b, ξn, and ξ̂n to zero and then back substituting to
7.6 () Consider the logistic regression model with a target variable t ∈ {−1, 1}. If we define p(t = 1|y) = σ(y) where y(x) is given by (7.1), show that the negative log likelihood, with the
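A useful intermediate step for the exercise above, offered as the standard route rather than the book's exact derivation: because σ(−a) = 1 − σ(a), defining p(t = 1|y) = σ(y) gives p(t|y) = σ(ty) for t ∈ {−1, 1}, so each data point contributes

```latex
-\ln p(t_n \mid y_n) = -\ln \sigma(t_n y_n) = \ln\!\left(1 + e^{-t_n y_n}\right)
```

to the negative log likelihood.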
7.5 ( ) Show that the values of ρ and {an} in the previous exercise also satisfy 1/ρ² = 2L̃(a) (7.124), where L̃(a) is defined by (7.10). Similarly, show that 1/ρ² = ‖w‖². (7.125)
7.4 ( ) www Show that the value ρ of the margin for the maximum-margin hyperplane is given by 1/ρ² = Σ_{n=1}^{N} an (7.123), where {an} are given by maximizing (7.10) subject to the constraints (7.11)
7.3 ( ) Show that, irrespective of the dimensionality of the data space, a data set consisting of just two data points, one from each class, is sufficient to determine the location of the
7.2 () Show that, if the 1 on the right-hand side of the constraint (7.5) is replaced by some arbitrary constant γ > 0, the solution for the maximum margin hyperplane is unchanged.
7.1 ( ) www Suppose we have a data set of input vectors {xn} with corresponding target values tn ∈ {−1, 1}, and suppose that we model the density of input vectors within each class separately
6.27 ( ) Derive the result (6.90) for the log likelihood function in the Laplace approximation framework for Gaussian process classification. Similarly, derive the results (6.91), (6.92), and (6.94)
6.26 () Using the result (2.115), derive the expressions (6.87) and (6.88) for the mean and variance of the posterior distribution p(aN+1|tN) in the Gaussian process classification model.
6.25 () www Using the Newton-Raphson formula (4.92), derive the iterative update formula (6.83) for finding the mode aN of the posterior distribution in the Gaussian process classification model.
6.24 () Show that a diagonal matrix W whose elements satisfy 0 < Wii < 1 is positive definite.
6.23 ( ) www Consider a Gaussian process regression model in which the target variable t has dimensionality D. Write down the conditional distribution of tN+1 for a test input vector xN+1, given a