Questions and Answers of Pattern Recognition and Machine Learning
Derive the update formulae in Eqs. (12.25), (12.26), and (12.27) for B in Gaussian mixture HMMs.
Derive the update formula in Eq. (12.23) for B in discrete HMMs.
Run the Viterbi algorithm on a left-to-right HMM, where the transitions only go from one state to itself or to a higher-indexed state. Use a diagram as in Figure 12.13 to show how the HMM topology
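A minimal sketch of running Viterbi on a toy left-to-right HMM; the transition, emission, and initial probabilities below are made-up illustrations, not the values from Figure 12.13:
```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Viterbi decoding in log space.
    log_A[i, j]: log transition prob from state i to state j.
    log_B[i, o]: log emission prob of symbol o in state i.
    log_pi[i]:   log initial state prob.
    obs:         observed symbol indices.
    Returns the most likely state sequence."""
    S, T = log_A.shape[0], len(obs)
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A      # (S, S): previous state -> next state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy left-to-right HMM: transitions only to the same or a higher-indexed state.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7],
              [0.5, 0.5]])
pi = np.array([1.0, 0.0, 0.0])
with np.errstate(divide="ignore"):                  # log(0) -> -inf is fine here
    print(viterbi(np.log(A), np.log(B), np.log(pi), [0, 1, 1, 0]))
```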
Prove that α_t(i) and β_t(i) in an HMM satisfy Eq. (12.18) for any t.
Assume a GMM is given as follows: If we partition the vector x into two parts as x = [x_a; x_b], then do the following: a. Show that the marginal distribution p(x_a) is also a GMM, and find
Consider an m-dimensional variable r, whose elements are nonnegative integers. Suppose its distribution is described by a mixture of multinomial distributions, where the parameter p_ki denotes the
The index m in finite mixture models, as in Eq. (12.1), can be extended to be a continuous variable y ∈ R: This is called an infinite mixture model if ∫ w(y) dy = 1 and ∫ p(x | θ, y) dx = 1 (∀ θ, y)
Prove that the auxiliary function Q(θ | θ^(n)) is concave, namely, −Q(θ | θ^(n)) is convex, if we choose all component models in a finite mixture model as one of the following e-family
Prove that the exponential family is closed under multiplication.
Determine whether the following generalized linear models belong to the exponential family: a. Logistic regression; b. Probit regression; c. Poisson regression; d. Log-linear models
Determine whether the following distributions belong to the exponential family: a. Dirichlet distribution; b. Poisson distribution; c. Inverse-Wishart distribution; d. von Mises–Fisher distribution. Derive
Derive the gradient-descent method for the MLE of all parameters of the log-linear models in Example 11.4.1.
Prove that the log-likelihood function of log-linear models in Eq. (11.15) is concave with a single global maximum.
Derive the gradient and Hessian matrix for the log-likelihood function of the Poisson regression in Eq. (11.13) and a learning algorithm for the MLE of its parameter w using (i) the gradient-descent
Derive the gradient for the log-likelihood function of the probit regression model in Eq. (11.12). Based on this, derive a learning algorithm for probit regression using the gradient-descent method.
Derive the MLE for first-order Markov chain models in Eq. (11.10).
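For intuition, the MLE of a first-order Markov chain reduces (under the usual derivation) to normalized transition counts; a small sketch with a made-up DNA-like toy sequence, not an example from the book:
```python
import numpy as np

def markov_mle(sequence, symbols):
    """MLE of a first-order Markov chain: A[i, j] = count(i -> j) / count(i -> anything)."""
    idx = {s: i for i, s in enumerate(symbols)}
    counts = np.zeros((len(symbols), len(symbols)))
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[idx[prev], idx[nxt]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no outgoing transitions are left as zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

print(markov_mle("ACGTACGGAAC", "ACGT"))
```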
Draw a graph representation similar to Figure 11.5 for a second-order Markov chain model of the DNA sequences.
Extend the MLE in Eq. (11.8) to a generic multinomial model involving M symbols.
Given x ∈ R^n and y ∈ {0, 1}, assume Pr(y = k) = π_k > 0 for k = 0, 1 with π_0 + π_1 = 1, and the conditional distribution of x given y is p(x | y) = N(x | μ_y, Σ_y), where μ_0, μ_1 ∈ R^n are two
Given K different classes (i.e., ω_1, ω_2, ..., ω_K), we assume each class ω_k (k = 1, 2, ..., K) is modeled by a multivariate Gaussian distribution with the mean vector μ_k and the covariance
Derive the MLE for multivariate Gaussian models with a diagonal covariance matrix, i.e., N(x | μ, Σ) with x, μ ∈ R^d and Σ diagonal. Show that the MLE of μ is the same as Eq. (11.3) and that of {σ_1, ..., σ_d} equals the
Determine the condition(s) under which a beta distribution is unimodal.
Given a set of training samples D_N = {x_1, x_2, ..., x_N}, the so-called empirical distribution corresponding to D_N is defined as follows: where δ(·) denotes Dirac's delta function. Show that the MLE is
Given a set of data samples {x_1, x_2, ..., x_n}, we assume the data follow an exponential distribution: p(x | θ) = θ e^(−θx) for x ≥ 0, and 0 otherwise. Derive the MLE for the parameter θ.
Assume that we are allowed to reject an input as unrecognizable in a pattern-classification task. For an input x belonging to class ω, we can define a new loss function for any decision rule g(x)
Suppose we have three classes in two dimensions with the following underlying distributions: Class ω_1: p(x|ω_1) = N(0, I); Class ω_2: p(x|ω_2) = N([1, 1]^T, I); Class ω_3: p(x|ω_3) = (1/2) N([0.5
In the generative model p(x, ω) in Figure 10.2, assume the feature vector x consists of two parts, x = [x_g; x_b], where x_b denotes some missing components that cannot be observed for some
Derive the gradient-tree-boosting procedure using Newton boosting for a twice-differentiable loss functional l(F(x), y). Assume that we use the L2 norm term and the penalty per node in Eq. (9.2)
In a classification problem of K classes (i.e., {ω_1, ω_2, ..., ω_K}), assume that we use an ensemble model for each class ω_k (for all k = 1, 2, ..., K) as follows: F_m(x; ω_k) = f_1(x; ω_k) + f_2(x; ω_k
Derive the gradient-tree-boosting procedure for regression problems when the following loss functionals are used: a. The least absolute deviation: l(F(x), y) = |y − F(x)|. b. The Huber loss:
Derive the logitBoost algorithm by replacing the exponential loss in AdaBoost with the logistic loss: l(F(x), y) = ln(1 + e^(−y F(x))).
In AdaBoost, we define the error for a base model f_m(x) as ε_m = Σ_{y_n ≠ f_m(x_n)} w_n^(m). We normally have ε_m < 1/2. We then reweight the training samples for the next round. Compute the error of the same
In the AdaBoost Algorithm 9.10, assume we have learned a base model f_m(x) at step m that performs worse than random guessing (i.e., its error ε_m > 1/2). If we simply flip it to f̄_m(x) = −
Compared to a transformer, the feed-forward sequential memory network (FSMN) [262] is a more efficient model to convert a context-independent sequence into a context-dependent one. An FSMN uses the
Suppose that we have a multihead transformer as shown in Figure 8.27, where A^(j), B^(j) ∈ R^(l×d), C^(j) ∈ R^(o×d) (j = 1, ..., J). a. Estimate the computational complexity of the forward pass of
Using the AD rules, derive the backward pass for the following layer connections: a. Time-delayed feedback in Figure 8.16; b. Tapped delay line in Figure 8.17; c. Attention in Figure 8.18
Following the derivation of batch normalization, derive the backward pass for layer normalization.
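A small numpy sketch of layer normalization with a backward pass obtained the same way as for batch normalization; gamma and beta denote the learned scale and shift (illustrative names, not necessarily the book's notation):
```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each sample (row) over its feature dimension.
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std
    return gamma * x_hat + beta, (x_hat, inv_std, gamma)

def layer_norm_backward(dy, cache):
    x_hat, inv_std, gamma = cache
    D = x_hat.shape[1]
    dgamma = (dy * x_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dx_hat = dy * gamma
    # Same algebra as the batch-norm backward pass, but reduced over features, not the batch.
    dx = inv_std / D * (D * dx_hat
                        - dx_hat.sum(axis=1, keepdims=True)
                        - x_hat * (dx_hat * x_hat).sum(axis=1, keepdims=True))
    return dx, dgamma, dbeta

x = np.random.randn(4, 8)
y, cache = layer_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
dx, dgamma, dbeta = layer_norm_backward(np.ones_like(y), cache)
```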
Use the AD rules to derive the backward-pass formulae in Eqs. (8.17) and (8.18) for multidimensional convolutions.
Unfold the following HORNN [228] into a feed-forward structure without using any feedback. [Diagram of the HORNN, with weights W_in, W_h1, W_h2, and W_out, not reproduced here.]
In object recognition, translating an image by a few pixels in some direction should not affect the category recognized. Suppose that we consider images with an object in the foreground on top of a
Consider a simple CNN consisting of two hidden layers, each of which is composed of convolution and ReLU. These two hidden layers are then followed by a max-pooling layer and a softmax output
If we use the fully connected deep neural network in Figure 8.19 for a pattern-classification task that involves some nonexclusive classes, show how to configure the output layer and formulate the CE
Full connection and convolution are closely related: a. Show that convolution can be viewed as a special case of full connection where W and b take a particular form. What is this particular form of
Run linear regression, ridge regression, and LASSO on a small data set (e.g., the Boston Housing Dataset; https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) to experimentally compare the
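One possible setup for this comparison with scikit-learn is sketched below; the diabetes dataset is used purely as a stand-in (the Boston data would have to be downloaded separately from the URL above), and the regularization strengths are arbitrary illustrative choices:
```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Stand-in regression data; swap in the Boston Housing data once downloaded.
X, y = load_diabetes(return_X_y=True)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>6}: mean R^2 = {scores.mean():.3f}")
```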
In addition to the alternating Algorithm 7.6, derive the SGD algorithm to solve matrix factorization for any sparse matrix X. Assume X is huge but very sparse.
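A rough sketch of such an SGD, factorizing a sparse X ≈ U Vᵀ by looping only over the observed entries; the rank, learning rate, and regularizer below are arbitrary illustrative choices:
```python
import numpy as np

def sgd_matrix_factorization(entries, n_rows, n_cols, rank=10,
                             lr=0.01, reg=0.1, epochs=20, seed=0):
    """entries: list of observed (i, j, x_ij) triples of a sparse matrix X ~= U @ V.T."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_rows, rank))
    V = 0.1 * rng.standard_normal((n_cols, rank))
    for _ in range(epochs):
        rng.shuffle(entries)
        for i, j, x in entries:
            err = x - U[i] @ V[j]
            # Update only the two factor rows touched by this observed entry.
            u_old = U[i].copy()
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
    return U, V

entries = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0)]
U, V = sgd_matrix_factorization(entries, n_rows=3, n_cols=3, rank=2)
```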
Derive the gradient descent methods to solve the ridge regression and LASSO.
The coordinate descent algorithm aims to optimize the objective function with respect to one free variable at a time. Derive the coordinate descent algorithm to solve LASSO.
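A compact sketch of coordinate descent for LASSO, cycling through one coordinate at a time with a soft-thresholding update; the objective is written here as ½‖y − Xw‖² + λ‖w‖₁, which may differ from the book's exact scaling:
```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1, one coordinate at a time."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)            # precomputed ||x_j||^2
    for _ in range(n_iters):
        for j in range(d):
            # Residual with coordinate j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10); w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.standard_normal(50)
print(np.round(lasso_coordinate_descent(X, y, lam=5.0), 2))
```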
Derive and compare the solutions to the ridge regression min_w Σ_{i=1}^N (w^T x_i − y_i)^2 for the following two variants: a. the constrained norm, subject to a norm constraint; b. the scaled norm, where λ > 0 is a preset constant.
Derive the closed-form solution to the ridge regression in Eq. (7.3).
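Assuming Eq. (7.3) is the usual penalized least-squares objective, the resulting closed-form solution w = (XᵀX + λI)⁻¹Xᵀy can be sanity-checked numerically; a minimal sketch:
```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """w = (X^T X + lam * I)^{-1} X^T y, computed via a linear solve instead of an explicit inverse."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
print(ridge_closed_form(X, y, lam=1.0))
```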
Explain why the loss function is the rectified linear loss H_0(x) in perceptron and the sigmoid loss l(x) in MCE.
In Problem SVM4, if we only optimize two multipliers α_i and α_j and keep all other multipliers constant, we can derive a closed-form solution to update α_i and α_j. This idea leads to the famous SMO for
Algorithm 6.5 is not optimal because it attempts to satisfy two constraints alternately in each iteration. A better way is to compute an optimal step size at each step, which satisfies both
Show the mapping function corresponding to the RBF kernel (i.e., k(x_i, x_j) = exp(−(1/2) ‖x_i − x_j‖²)).
Show that the second-order polynomial kernel (i.e., k(x_i, x_j) = (x_i^T x_j + 1)²) corresponds to the following mapping function h(x) from R^d to R^(d(d+1)): Then, consider the mapping function for a
Derive an efficient way to compute the matrix Q in the SVM formulation using the vectorization method (only involving vector/matrix operations without any loop or summation) for the following kernel
Based on the Lagrange dual function, show the procedure to derive dual problems for soft SVMs: a. Derive SVM4 from SVM3. b. Explain how to determine which training samples are support vectors in soft
Derive stochastic gradient descent algorithms to optimize the following linear models: a. Linear regression; b. Logistic regression; c. MCE; d. Linear SVMs (Problem SVM1); e. Soft SVMs (Problem SVM3)
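As one representative case from the list above (logistic regression with labels y ∈ {+1, −1}), a minimal SGD sketch; the step size and epoch count are arbitrary illustrative choices:
```python
import numpy as np

def sgd_logistic_regression(X, y, lr=0.1, epochs=50, seed=0):
    """Minimize sum_i ln(1 + exp(-y_i w^T x_i)) by SGD; labels y_i in {+1, -1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (w @ X[i])
            # Gradient of ln(1 + exp(-margin)) with respect to w.
            w -= lr * (-y[i] * X[i] / (1.0 + np.exp(margin)))
    return w

X = np.vstack([np.random.randn(20, 2) + 2, np.random.randn(20, 2) - 2])
y = np.array([1] * 20 + [-1] * 20)
print(sgd_logistic_regression(X, y))
```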
Extend the logistic regression method in Section 6.4 to deal with pattern-classification problems involving K > 2 classes.
Extend the MCE method in Section 6.3 to deal with pattern-classification problems involving K > 2 classes.
Given a training set D_N = {(x_i, y_i) | i = 1, 2, ..., N} with x_i ∈ R^n and y_i ∈ {+1, −1} for all i, assume we want to use a quadratic function y = x^T A x + b^T x + c, where A ∈ R^(n×n), b ∈ R^n, and c ∈ R,
Given a training set D with a separation margin γ > 0, the original perceptron algorithm makes a mistake when y (w^(n))^T x < 0. As we have discussed in Section 6.1, this algorithm converges to a linear
Extend the perceptron algorithm to an affine function y = w^T x + b; also, revise the proof of Theorem 6.1.1 to accommodate the bias term b.
In an ML problem as specified in Section 5.1, we use f to denote the model obtained from the ERM procedure in Eq. (5.6): f = arg min_{f ∈ H} R_emp(f | D_N), and we use f̂ to denote the best
Estimate the VC dimensions for the following simple model spaces: a. A model space of N distinct models, {A_1, A_2, ..., A_N}; b. An interval [a, b] on the real line with a < b; c. Two intervals [a,
Based on the concept of the VC dimension, explain why the memorization approach using an unbounded database in Example 5.2.1 is not learnable.
Q4.5 Derive the closed-form solutions for two error-minimization problems in LLE.
Q4.4 Use the method of Lagrange multipliers to derive the LDA solution.
Q4.3 Deriving the PCA under the minimum error formulation (II): Given a set of N vectors in an n-dimensional space, D = {x_1, x_2, ..., x_N} (x_i ∈ R^n), we search for a complete orthonormal set of basis
Q4.2 Deriving the PCA under the minimum error formulation (I): Formulate each distance e_i in Figure 4.7, and search for w to minimize the total error Σ_i e_i².
Q4.1 Use proof by induction to show that the m-dimensional PCA corresponds to the linear projection defined by the m eigenvectors of the sample covariance matrix S corresponding to the m largest
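The statement can be sanity-checked numerically: project onto the top-m eigenvectors of the sample covariance matrix S (an illustrative sketch; variable names are not the book's):
```python
import numpy as np

def pca_projection(X, m):
    """Return the m-dimensional PCA projection of the rows of X."""
    X_centered = X - X.mean(axis=0)
    S = np.cov(X_centered, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :m]                   # top-m eigenvectors as columns
    return X_centered @ W

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
print(pca_projection(X, m=2).shape)               # (200, 2)
```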
Derive compact and uncorrelated features to represent raw data.
Q2.16 Assume a differentiable objective function f(x) is Lipschitz continuous; namely, there exists a real constant L > 0 such that for any two points x_1 and x_2, |f(x_1) − f(x_2)| ≤ L ‖x_1 − x_2‖
Q2.15 Compute the distance of a point x_0 ∈ R^n to a. the surface of a unit ball: ‖x‖ = 1; b. a unit ball: ‖x‖ ≤ 1; c. an elliptic surface x^T A x = 1, where A ∈ R^(n×n) and A ≻ 0; and d. an ovoid x^T A x ≤ 1, where
Q2.14 Prove Theorems 2.4.1, 2.4.2, and 2.4.3.
Q2.13 Given two multivariate Gaussian distributions, N(x | μ_1, Σ_1) and N(x | μ_2, Σ_2), where μ_1 and μ_2 are the mean vectors, and Σ_1 and Σ_2 are the covariance matrices, derive the formula to
Q2.12 Assume a random vector x = [x_1, x_2]^T follows a bivariate Gaussian distribution N(x | μ, Σ), where μ = [μ_1, μ_2]^T is the mean vector and Σ = [σ_1², σ_12; σ_12, σ_2²] is the covariance matrix. Derive the
Q2.11 Show that mutual information satisfies the following: I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y).
Q2.10 Show that any two random variables X and Y are independent if and only if any one of the following equations holds: H(X, Y) = H(X) + H(Y); H(X|Y) = H(X); H(Y|X) = H(Y).
Q2.9 Assume a random vector x ∈ R^n follows a multivariate Gaussian distribution (i.e., p(x) = N(x | μ, Σ)). If we apply an invertible linear transformation to convert x into another random vector as
Q2.8 Assume n continuous random variables {X_1, X_2, ..., X_n} jointly follow a multivariate Gaussian distribution N(x | μ, Σ). a. For any random variable X_i (∀i), derive its marginal distribution
Q2.7 Assume m continuous random variables {X_1, X_2, ..., X_m} follow the Dirichlet distribution Dir(x_1, ..., x_m | α_1, ..., α_m) ∝ x_1^(α_1−1) ··· x_m^(α_m−1). Derive the following results:
Q2.6 Consider a multinomial distribution of m discrete random variables as follows: a. Prove that the multinomial distribution satisfies the sum-to-1 constraint Σ_{X_1, ..., X_m} Pr(X_1 = r_1, X_2 = r_2, ...
Q2.5 For any matrix A ∈ R^(n×n), if we use a_i (i = 1, 2, ..., n) to denote the i-th column of the matrix A and use g_ij = |cos θ_ij| = |a_i^T a_j| / (‖a_i‖ ‖a_j‖) to denote the absolute cosine of the
Q2.4 Given x ∈ R^n, z ∈ R^m, and A ∈ R^(m×n) (m < n): a. prove that z^T A x = tr(x z^T A); b. compute the derivative (∂/∂x) ‖z − Ax‖²; and c. compute the derivative (∂/∂A) ‖z − Ax‖².
Q2.3 Given two sets of m vectors, x_i ∈ R^n and y_i ∈ R^n for all i = 1, 2, ..., m, verify that the summations Σ_{i=1}^m x_i x_i^T and Σ_{i=1}^m x_i y_i^T can be vectorized as the following matrix multiplications:
Q2.2 For any two square matrices, X ∈ R^(n×n) and Y ∈ R^(n×n), show that a. tr(XY) = tr(YX), and b. tr(X^(−1)YX) = tr(Y) if X is invertible.
Q2.1 Given two matrices, A ∈ R^(m×n) and B ∈ R^(m×n), prove that tr(A^T B) = tr(A B^T) = tr(B A^T) = tr(B^T A) = Σ_{i=1}^m Σ_{j=1}^n a_ij b_ij, where a_ij and b_ij denote an element in the matrices A and B, respectively.
Q1.2 A real-valued function f(x) (x ∈ R) is said to be Lipschitz continuous if there exists a real constant L > 0 such that for any two points x_1 ∈ R and x_2 ∈ R, |f(x_1) − f(x_2)| ≤ L |x_1 − x_2|
Q1.1 Is the k-NN method parametric or nonparametric? Explain why.
1.1 What Is Machine Learning?
11. [CM18] Consider Algorithm Q for the simplest model of interaction based on asking the category of single patterns. The algorithm is based on the computation of X_w ← worst(X↓, n_w) and on
10. [HM50] Consider the dynamics arising from the day–night boundary conditions (6.5.162). What is the dependence on the initial values w_i? Under which conditions, after many days of life, does the agent
9. [M30] Discuss the dynamics for a developmental function in which φ(ẋ) = 1/(1 + ẋ²).
8. [M30] Discuss the dynamics for a developmental function in which φ(ẋ) = ẋ².
7. [M22] Write down the equations of the agent behavior for supervised learning in the case ρ(t) = e^(θt).
6. [HM50] Reformulate the cognitive action and the corresponding learning problem by considering a generic class of functions f ∈ F instead of a neural network.
4. [HM45] One could provide an interpretation of the cognitive action that very much resembles the regularized risk in machine learning. In this case, the kinetic energy can be regarded as the
3. [M20] Consider the dissipation function ρ(t) = e^(θt) with θ < 0. Discuss the dynamics in the special case of supervised learning.
2. [M20] Consider the trivial dissipation function which is identically constant, that is, ∀t ∈ [0..T]: ρ(t) = 1. Discuss the dynamics in the special case of supervised learning.
15. [23] Complete the proof on the asymptotic behavior of the cross-entropy penalty in the case x_κo → 1.
1. [M25] Write down the differential equations of learning the connection weights w_i
Showing 400–500 of 816