Questions and Answers of Pattern Recognition and Machine Learning
10. Propose a suitable test to compare the expected rewards of two reinforcement learning algorithms.
9. Propose a suitable test to compare the errors of two regression algorithms.
8. If we have two variants of algorithm A and three variants of algorithm B, how can we compare the overall accuracies of A and B taking all their variants into account?
7. Let us say we have three classification algorithms. How can we order these three from best to worst?
6. Use the normal approximation to the binomial for the sign test.
5. Show that the total sum of squares can be split into between-group sum of squares and within-group sum of squares as SST = SSb + SSw.
4. The K-fold cross-validated t test only tests for the equality of error rates. If the test rejects, we do not know which classification algorithm has the lower error rate. How can we test whether the first classification algorithm has a lower error rate than the second?
3. Assume that x^t ∼ N(μ, σ²) where σ² is known. How can we test for H0: μ ≥ μ0 vs. H1: μ < μ0?
2. We can simulate a classifier with error probability p by drawing samples from a Bernoulli distribution. Doing this, implement the binomial, approximate, and t tests for p0 ∈ (0, 1). Repeat these tests for several values of p0 and compare their rejection rates (a simulation sketch follows this list).
1. In a two-class problem, let us say we have the loss matrix where λ11 = λ22 = 0, λ21 = 1, and λ12 = α. Determine the threshold of decision as a function of α.
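A minimal simulation sketch for exercise 2 above, assuming a one-sided test of H0: p ≤ p0 at significance level α; the function name, defaults, and test details are illustrative, not taken from the text:

```python
import numpy as np
from scipy import stats

def rejection_rates(p, p0, n=100, alpha=0.05, trials=1000, seed=0):
    """Simulate a classifier with error probability p; test H0: p <= p0."""
    rng = np.random.default_rng(seed)
    reject = {"binomial": 0, "approximate": 0, "t": 0}
    for _ in range(trials):
        errors = rng.binomial(1, p, size=n)   # 1 = misclassification
        e = errors.sum()
        # Exact binomial test: P(X >= e) under Binomial(n, p0)
        if stats.binom.sf(e - 1, n, p0) < alpha:
            reject["binomial"] += 1
        # Normal approximation to the binomial
        z = (e - n * p0) / np.sqrt(n * p0 * (1 - p0))
        if stats.norm.sf(z) < alpha:
            reject["approximate"] += 1
        # One-sample t test on the 0/1 error indicators
        t = (errors.mean() - p0) / (errors.std(ddof=1) / np.sqrt(n))
        if stats.t.sf(t, df=n - 1) < alpha:
            reject["t"] += 1
    return {k: v / trials for k, v in reject.items()}

print(rejection_rates(p=0.4, p0=0.3))   # power when H0 is false
print(rejection_rates(p=0.3, p0=0.3))   # type I error rate when H0 holds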
10. Rework the tiger example using the following reward matrix:

r(A, Z)       Tiger left   Tiger right
Open left        −100          +10
Open right         20         −100
9. In the tiger example, show that as we get a more reliable sensor, the range where we need to sense once again decreases.
8. Give an example of a reinforcement learning application that can be modeled by a POMDP. Define the states, actions, observations, and reward.
7. Using equation 18.22, derive the weight update equations when a multilayer perceptron is used to estimate Q.
6. Assume we are estimating the value function for states, V(s), and that we want to use the TD(λ) algorithm. Derive the tabular value iteration update.
5. In exercise 1, assume that the reward on arrival at the goal state is normal distributed with mean 100 and variance 40. Assume also that the actions are stochastic in that when the robot takes an action, it moves in the intended direction with some probability and in another direction otherwise. How does this change the solution?
4. Instead of having γ < 1, we can have γ = 1 but with a negative reward of −c for all intermediate (nongoal) states. What is the difference?
3. In exercise 1, how does the optimal policy change if another goal state is added to the lower-right corner? What happens if a state of reward −100 (a very bad state) is defined in the grid?
2. With the same configuration given in exercise 1, use Q learning to learn the optimal policy (a Q learning sketch follows this list).
1. Given the grid world in figure 18.12, if the reward on reaching the goal is 100 and γ = 0.9, calculate manually Q∗(s, a), V∗(s), and the actions of the optimal policy.
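For exercises 1 and 2 above, a minimal tabular Q learning sketch. Figure 18.12 is not reproduced here, so the 3×4 grid, the goal in the upper-right corner, and the hyperparameters are all assumptions:

```python
import numpy as np

def q_learning_grid(rows=3, cols=4, goal=(0, 3), gamma=0.9, alpha=0.5,
                    epsilon=0.3, episodes=5000, seed=0):
    """Tabular Q learning; reaching the goal gives reward 100, all else 0."""
    rng = np.random.default_rng(seed)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    Q = np.zeros((rows, cols, 4))
    for _ in range(episodes):
        r, c = int(rng.integers(rows)), int(rng.integers(cols))
        while (r, c) != goal:
            a = int(rng.integers(4)) if rng.random() < epsilon else int(Q[r, c].argmax())
            nr = min(max(r + moves[a][0], 0), rows - 1)   # walls: stay in grid
            nc = min(max(c + moves[a][1], 0), cols - 1)
            reward = 100.0 if (nr, nc) == goal else 0.0
            target = reward if (nr, nc) == goal else reward + gamma * Q[nr, nc].max()
            Q[r, c, a] += alpha * (target - Q[r, c, a])   # Q <- Q + alpha*(target - Q)
            r, c = nr, nc
    return Q

Q = q_learning_grid()
print(Q.max(axis=2).round(1))    # estimates of V*(s)
print(Q.argmax(axis=2))          # greedy policy: 0=up, 1=down, 2=left, 3=right
```

With γ = 0.9, the learned V*(s) should approach 100 · 0.9^d for a state d steps from the goal, which is what the manual calculation in exercise 1 yields.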
10. In section 17.10, we discuss that if we use a decision tree as a combiner in stacking, it works both as a selector and a combiner. What are the other advantages and disadvantages?
9. How can we combine the results of multiple clustering solutions?
8. To be able to use cascading for regression, during testing, a regressor should be able to say whether it is confident of its output. How can we implement this?
7. In cascading, why do we require θj+1 ≥ θj?
6. What is the difference between voting and stacking using a linear perceptron as the combiner function?
5. Propose a dynamic regressor selection algorithm.
4. In the mixture of experts architecture, we can have different experts use different input representations. How can we design the gating network in such a case?
3. Propose an incremental algorithm for learning error-correcting output codes.
2. In bagging, to generate the L training sets, what would be the effect of using L-fold cross-validation instead of bootstrap?
1. If each base-learner is iid and correct with probability p > 1/2, what is the probability that a majority vote over L classifiers gives the correct answer? (A numeric check follows this list.)
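A quick numeric check for exercise 1: with L odd, the majority is correct with probability ∑ C(L, i) p^i (1 − p)^(L−i) over i > L/2, which increases toward 1 as L grows when p > 1/2 (the Condorcet jury theorem):

```python
from math import comb

def majority_correct(p, L):
    """P(majority of L iid classifiers is correct), each correct with prob p; L odd."""
    k = L // 2 + 1
    return sum(comb(L, i) * p**i * (1 - p)**(L - i) for i in range(k, L + 1))

for L in (1, 3, 5, 11, 21):
    print(L, round(majority_correct(0.6, L), 4))
```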
8. Let us say we have a set of documents where for each document, we have one copy in English and one in French. How can we extend latent Dirichlet allocation for this case?
7. Active learning is when the learner is able to generate x itself and ask a supervisor to provide the corresponding r value during learning, one by one, instead of passively being given a training set. How can we implement this?
6. Propose a filtering algorithm to choose a subset of the training set in Gaussian processes.
5. In figure 16.10, how does the fit change when we change s²?
4. What is Var(r) when the maximum likelihood estimator is used? Compare it with equation 16.25.
3. As above, except assume that p(q) ∼ N(μ0, σ0²). Also assume n is large so that you can use the central limit theorem and approximate the binomial by a Gaussian. Derive p(q|x).
2. Let us denote by x the number of spam emails I receive in a random sample of n. Assume that the prior for q, the proportion of spam emails, is uniform in [0, 1]. Find the posterior distribution p(q|x) (a numeric sketch follows this list).
1. For the setting of figure 16.3, observe how the posterior changes as we change N, σ², and σ0².
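For exercise 2, a numeric sketch: a uniform prior on q is Beta(1, 1), and by Beta-binomial conjugacy the posterior is Beta(1 + x, 1 + n − x). The counts below are illustrative, not from the book:

```python
from scipy import stats

n, x = 50, 12                            # 12 spam emails out of 50 (assumed counts)
posterior = stats.beta(1 + x, 1 + n - x) # q | x ~ Beta(1 + x, 1 + n - x)
print(posterior.mean())                  # posterior mean (x + 1) / (n + 2)
print(posterior.interval(0.95))          # central 95% credible interval
```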
10. How can we have an incremental HMM where we add new hidden states when necessary?
9. Let us say at any time we have two observations from two different alphabets; for example, let us say we are observing the values of two currencies every day. How can we implement this using an HMM?
8. Consider the urn-and-ball example where we draw without replacement. How will it be different?
7. If in equation 15.38 we have multivariate observations, what will the M-step equations be?
6. Generate training and validation sequences from an HMM of your choosing. Then train different HMMs by varying the number of hidden states on the same training set and calculate the validation likelihood for each.
5. Some researchers define a Markov model as generating an observation while traversing an arc, instead of on arrival at a state. Is this model any more powerful than what we have discussed?
4. Show that any second- (or higher-order) Markov model can be converted to a first-order Markov model.
3. Formalize a second-order Markov model. What are the parameters? How can we calculate the probability of a given state sequence? How can the parameters be learned for the case of an observable model?
2. Using the data generated by the previous exercise, estimate Π and A, and compare them with the parameters used to generate the data (a sampling-and-estimation sketch follows this list).
1. Given the observable Markov model with three states, S1, S2, S3, initial probabilities Π = [0.5, 0.2, 0.3]^T, and transition probabilities
A = [ 0.4 0.3 0.3
      0.2 0.6 0.2
      … ],
generate sample state sequences.
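A sketch for exercises 1 and 2: sample state sequences from the observable Markov model and re-estimate Π and A by counting. The third row of A is not given in the exercise text above, so the values used here are an assumed placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
Pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])   # third row assumed; not in the source

K, n_seq, T = 3, 100, 200
seqs = np.empty((n_seq, T), dtype=int)
for i in range(n_seq):
    seqs[i, 0] = rng.choice(K, p=Pi)
    for t in range(1, T):
        seqs[i, t] = rng.choice(K, p=A[seqs[i, t - 1]])

# Maximum likelihood estimates: Pi from first states, A from transition counts
Pi_hat = np.bincount(seqs[:, 0], minlength=K) / n_seq
counts = np.zeros((K, K))
for s in seqs:
    np.add.at(counts, (s[:-1], s[1:]), 1)
A_hat = counts / counts.sum(axis=1, keepdims=True)
print(Pi_hat)
print(A_hat.round(3))   # should be close to A
```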
9. Generally, in a newspaper, a reporter writes a series of articles on successive days related to the same topics as the story develops. How can we model this using a graphical model?
8. Propose a suitable goodness measure that can be used in learning graph structure as a state-space search. What are suitable operators?
7. Write down the graphical model for linear logistic regression for two classes in the manner of figure 14.7.
6. Draw the Necker cube as a graphical model defining links to indicate mutually reinforcing or inhibiting relations between different corner interpretations.
5. Show that in a directed graph where the joint distribution is written as in equation 14.12, ∑x p(x) = 1.
3. In figure 14.4, calculate P(R|W), P(R|W,S), and P(R|W,∼S) (a numeric sketch follows this list).
2. For a head-to-head node, show that equation 14.10 implies P(X,Y) = P(X) · P(Y).
1. With two independent inputs in a classification problem, that is, p(x1, x2|C) = p(x1|C)p(x2|C), how can we calculate p(x1|x2, C)? Derive the formula for p(xj|Ci) ∼ N(μij, σij²).
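For exercise 3, a sketch of exact inference by enumeration in the sprinkler network of figure 14.4 (Rain and Sprinkler as parents of WetGrass). The conditional probability values below are illustrative placeholders, not the book's:

```python
from itertools import product

P_R = 0.4   # P(Rain) -- assumed value
P_S = 0.2   # P(Sprinkler) -- assumed value
P_W = {(True, True): 0.95, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.10}  # P(W | R, S) -- assumed

def joint(r, s, w):
    pr = P_R if r else 1 - P_R
    ps = P_S if s else 1 - P_S
    pw = P_W[(r, s)] if w else 1 - P_W[(r, s)]
    return pr * ps * pw

def prob(query, evidence):
    """P(query | evidence): both are dicts over the variables 'R', 'S', 'W'."""
    num = den = 0.0
    for r, s, w in product([True, False], repeat=3):
        world = {"R": r, "S": s, "W": w}
        p = joint(r, s, w)
        if all(world[k] == v for k, v in evidence.items()):
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

print(prob({"R": True}, {"W": True}))               # P(R | W)
print(prob({"R": True}, {"W": True, "S": True}))    # P(R | W, S)
print(prob({"R": True}, {"W": True, "S": False}))   # P(R | W, ~S)
```

Note how knowing the sprinkler was on lowers P(R | W, S) relative to P(R | W): the sprinkler "explains away" the wet grass.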
10. Let us say we have two representations for the same object and associated with each, we have a different kernel. How can we use both to implement a joint dimensionality reduction using kernel PCA?
9. In a setting such as that in figure 13.12, use kernel PCA with a Gaussian kernel (a kernel PCA sketch follows this list).
8. How can we use one-class SVM for classification?
7. Derive the kernelized version of the primal, dual, and score functions for ranking.
6. In kernel regression, what is the effect of using different ε on bias and variance?
5. In kernel regression, what is the relation, if any, between ε and the noise variance?
4. In the localized multiple kernel of equation 13.40, propose a suitable model for ηi(x|θi) and discuss how it can be trained.
3. In the empirical kernel map, how can we choose the templates?
2. In equation 13.31, how can we estimate S?
1. Propose a filtering algorithm to find training instances that are very unlikely to be support vectors.
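For exercise 9, a self-contained kernel PCA sketch with a Gaussian kernel. Since figure 13.12 is not reproduced here, two concentric rings stand in for the data:

```python
import numpy as np

def kernel_pca(X, gamma=1.0, k=2):
    """Kernel PCA with Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    n = len(X)
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx] * np.sqrt(vals[idx])     # projections of training points

# Two concentric rings, not linearly separable in the input space
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
radius = np.repeat([1.0, 3.0], 100)
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
Z = kernel_pca(X, gamma=0.5)
print(Z[:3])    # inner-ring projections
print(Z[-3:])   # outer-ring projections separate along the leading component
```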
9. Formalize the hierarchical mixture of experts architecture with two levels. Derive the update equations using gradient descent for regression and classification.
8. Derive the update equations for the competitive mixture of experts for classification.
7. Derive the update equations for the cooperative mixture of experts for classification.
6. Formalize a mixture of experts architecture where the experts and the gating network are multilayer perceptrons. Derive the update equations for regression and classification.
5. Compare the number of parameters of a mixture of experts architecture with an RBF network.
4. Show how the system given in equation 12.22 can be trained.
3. Derive the update equations for the RBF network for classification (equations 12.20 and 12.21).
2. Write down the RBF network that uses elliptic units instead of radial units as in equation 12.13.
1. Show an RBF network that implements XOR (see the sketch below).
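For exercise 1, one hand-constructed solution: an RBF network with two Gaussian units centered on the (0,0) and (1,1) corners, with output thresholded at 0. The weights below are chosen by hand, not learned:

```python
import numpy as np

def rbf_xor(x, s=0.5):
    """Two Gaussian units at (0,0) and (1,1); y > 0 iff XOR(x1, x2) = 1."""
    centers = np.array([[0.0, 0.0], [1.0, 1.0]])
    h = np.exp(-((x - centers) ** 2).sum(axis=1) / (2 * s**2))
    w, w0 = np.array([-2.0, -2.0]), 1.0   # y = w . h + w0
    return w @ h + w0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = rbf_xor(np.array(x, dtype=float))
    print(x, round(y, 3), int(y > 0))
```

Inputs on either center activate one unit strongly (y ≈ −1), while (0,1) and (1,0) activate both units weakly so the positive bias dominates (y ≈ +0.46).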
14. For the MLP given in figure 11.22, derive the update equations for the unfolded network.
13. Incremental learning of the structure of an MLP can be viewed as a state space search. What are the operators? What is the goodness function? What type of search strategies are appropriate? Define a suitable search procedure in these terms.
12. In the autoencoder network, how can we decide on the number of hidden units?
11. Derive the update equations for soft weight sharing.
10. In section 11.6, we discuss how an MLP with two hidden layers can implement piecewise constant approximation. Show that if the weight in the last layer is not a constant but a linear function of the input, we get a piecewise linear approximation.
9. Derive the update equations for an MLP implementing Sammon mapping that minimizes Sammon stress (equation 11.40).
8. In cascade correlation, what are the advantages of freezing the previously existing weights?
7. Parity is cyclic shift invariant; for example, “0101” and “1010” have the same parity. Propose a multilayer perceptron to learn the parity function using this hint.
6. Consider an MLP architecture with one hidden layer where there are also direct weights from the inputs to the output units. Explain when such a structure would be helpful and how it can be trained.
5. Derive the update equations for an MLP with two hidden layers.
4. Derive the update equations when the hidden units use tanh instead of the sigmoid. Use the fact that tanh′ = (1 − tanh²).
3. Show the perceptron that calculates the parity of its three inputs.
2. Show the perceptron that calculates NAND of its two inputs.
1. Show the perceptron that calculates NOT of its input (a sketch for these perceptron exercises follows this list).
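For exercises 1 and 2, hand-chosen weights for single perceptrons of the form y = step(w · x + w0):

```python
def step(a):
    """Threshold unit: fires iff its net input is positive."""
    return int(a > 0)

def NOT(x):            # w = -1, w0 = 0.5
    return step(-x + 0.5)

def NAND(x1, x2):      # w = (-1, -1), w0 = 1.5
    return step(-x1 - x2 + 1.5)

print([NOT(x) for x in (0, 1)])                       # [1, 0]
print([NAND(a, b) for a in (0, 1) for b in (0, 1)])   # [1, 1, 1, 0]
```

Note that three-input parity (exercise 3) is not linearly separable, so it cannot be computed by a single threshold unit; it requires hidden units.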
10. For the sample data in figure 10.11, define ranks such that a linear model would not be able to learn them. Explain how the model can be generalized so that they can be learned.
9. Let us say that for univariate x, instances with x ∈ (2, 4) belong to C1 and those with x < 2 or x > 4 belong to C2. How can we separate the two classes using a linear discriminant?
8. In the univariate case for classification as in figure 10.7, what do w and w0 correspond to?
7. What is the implication of the use of a single η for all xj in gradient descent?
6. In using quadratic (or higher-order) discriminants as in equation 10.34, how can we keep variance under control?
5. How can we learn Wi in equation 10.34?
4. With K = 2, show that using two softmax outputs is equal to using one sigmoid output (a numeric check follows this list).
3. Show that the derivative of the softmax, yi = exp(ai) / ∑j exp(aj), is ∂yi/∂aj = yi(δij − yj), where δij is 1 if i = j and 0 otherwise.
2. For the two-dimensional case of figure 10.2, show equations 10.4 and 10.5.
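For exercises 3 and 4, a numeric check of the softmax Jacobian yi(δij − yj) and of the equivalence of two softmax outputs to one sigmoid of the difference a1 − a2:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())   # subtract max for numerical stability
    return e / e.sum()

a = np.array([0.3, -1.2, 2.0])
y = softmax(a)
jac_formula = np.diag(y) - np.outer(y, y)   # [i, j] = y_i (delta_ij - y_j)
eps = 1e-6
jac_numeric = np.array(
    [(softmax(a + eps * np.eye(3)[j]) - y) / eps for j in range(3)]
).T                                          # finite-difference Jacobian
print(np.allclose(jac_formula, jac_numeric, atol=1e-5))   # True

a2 = np.array([0.7, -0.4])
print(softmax(a2)[0], 1 / (1 + np.exp(-(a2[0] - a2[1]))))  # identical values
```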
10. In a multivariate tree, it is very probable that at each internal node we will not need all the input variables. How can we decrease dimensionality at a node?
9. Let us say that for a classification problem, we already have a trained decision tree. How can we use it, in addition to the training set, in constructing a k-nearest neighbor classifier? (One possible sketch follows.)
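One possible reading of exercise 9 (an assumption, not the book's answer): use the trained tree's leaf indices as an extra, heavily weighted feature, so that nearest neighbors are preferentially drawn from the same leaf region. A sketch with scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
leaf = tree.apply(X).reshape(-1, 1)      # leaf index per training instance
X_aug = np.hstack([X, 10.0 * leaf])      # scale so leaf id dominates distance
knn = KNeighborsClassifier(n_neighbors=5).fit(X_aug, y)
print(knn.score(X_aug, y))               # training accuracy of the combined model
```

The scaling factor 10.0 is arbitrary; the larger it is, the more the tree's partition of the input space constrains which neighbors the k-NN considers.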