Questions and Answers of Pattern Recognition And Machine Learning
8. In regression trees, how can we get rid of discontinuities at the leaf boundaries?
7. Propose a rule induction algorithm for regression.
6. In a regression tree, we discussed that in a leaf node, instead of calculating the mean, we can do a linear regression fit and make the response at the leaf dependent on the input. Propose a
5. Derive a learning algorithm for sphere trees (equation 9.21). Generalize to ellipsoid trees.
4. In generating a univariate tree, a discrete attribute with n possible values can be represented by n 0/1 dummy variables and then treated as n separate numeric attributes. What are the advantages
3. Propose a tree induction algorithm with backtracking.
2. For a numeric input, instead of a binary split, one can use a ternary split with two thresholds and three branches as xj < wma, wma ≤ xj < wmb, xj ≥ wmb Propose a modification of the tree
1. Generalize the Gini index (equation 9.5) and the misclassification error (equation 9.6) for K > 2 classes. Generalize misclassification error to risk, taking a loss function into account.
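One common way to generalize these impurity measures, written here as a sketch (the p_i notation for the class proportions at node m is mine, not necessarily the book's):

```latex
% Class proportions p_1, ..., p_K at node m:
\phi_{\mathrm{Gini}}(p_1,\dots,p_K) = \sum_{i \neq j} p_i p_j = 1 - \sum_{i=1}^{K} p_i^2
\phi_{\mathrm{err}}(p_1,\dots,p_K) = 1 - \max_i p_i
% With a loss matrix \lambda_{ik} (loss of choosing C_i when the true class is C_k),
% misclassification error generalizes to the risk of the best decision at the node:
\phi_{\mathrm{risk}}(p_1,\dots,p_K) = \min_i \sum_{k=1}^{K} \lambda_{ik}\, p_k
```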
12. In the running mean smoother, besides giving an estimate, can we also calculate a confidence interval indicating the variance (uncertainty) around the estimate at that point?
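One possible approach, sketched below in NumPy rather than taken from the textbook: keep the instances falling inside the bandwidth window, report their mean, and attach a normal-approximation interval mean ± z·s/√n (the bandwidth h and z = 1.96 are illustrative choices):

```python
import numpy as np

def running_mean_with_ci(x_train, r_train, x_query, h=1.0, z=1.96):
    """Running mean smoother that also returns an approximate confidence
    interval at each query point. Sketch only: assumes roughly Gaussian
    noise within each window and uses mean +/- z * s / sqrt(n)."""
    x_train = np.asarray(x_train, dtype=float)
    r_train = np.asarray(r_train, dtype=float)
    means, lowers, uppers = [], [], []
    for x0 in np.asarray(x_query, dtype=float):
        in_window = np.abs(x_train - x0) < h          # instances within bandwidth h
        r_win = r_train[in_window]
        n = len(r_win)
        if n == 0:                                    # no data near x0
            means.append(np.nan); lowers.append(np.nan); uppers.append(np.nan)
            continue
        m = r_win.mean()
        se = r_win.std(ddof=1) / np.sqrt(n) if n > 1 else 0.0
        means.append(m); lowers.append(m - z * se); uppers.append(m + z * se)
    return np.array(means), np.array(lowers), np.array(uppers)
```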
11. In the running smoother, we can fit a constant, a line, or a higher-degree polynomial at a test point. How can we choose among them?
10. Generalize kernel smoother to multivariate data.
9. Propose an incremental version of the running mean estimator, which, like the condensed nearest neighbor, stores instances only when necessary.
8. Write the error function for loess discussed in section 8.8.3.
7. In a regressogram, instead of averaging in a bin and doing a constant fit, we can use the instances falling in a bin and do a linear fit (see figure 8.14). Write the code and compare this with the
6. In condensed nearest neighbor, an instance previously added to Z may no longer be necessary after a later addition. How can we find such instances that are no longer necessary?
5. How does condensed nearest neighbor behave if k > 1?
4. How can we detect outliers after hierarchical clustering (section 7.8)?
3. Parametric regression (section 5.8) assumes Gaussian noise and hence is not robust to outliers; how can we make it more robust?
2. Show equation 8.16.
1. How can we have a smooth histogram?
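One standard answer is to replace hard bin counts with a kernel estimator. A minimal NumPy sketch, assuming a Gaussian kernel and a hand-picked bandwidth h (both my choices):

```python
import numpy as np

def kernel_density(x_query, sample, h=0.5):
    """Naive Gaussian kernel density estimate: a smooth alternative
    to the histogram."""
    x_query = np.asarray(x_query, dtype=float)
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    # For each query point, average Gaussian bumps centered at the sample points.
    u = (x_query[:, None] - sample[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return k.sum(axis=1) / (n * h)

# Example: estimate the density of a small synthetic sample on a grid.
xs = np.linspace(-4, 4, 200)
density = kernel_density(xs, sample=np.random.randn(100), h=0.4)
```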
11. Having generated a dendrogram, can we “prune” it?
10. How can we make k-means robust to outliers?
9. In hierarchical clustering, how can we have locally adaptive distances? What are the advantages and disadvantages of this?
8. What are the similarities and differences between average-link clustering and k-means?
7. How can we do hierarchical clustering with binary input vectors—for example, for text clustering using the bag of words representation?
6. Edit distance between two strings—for example, gene sequences—is the number of character operations (insertions, deletions, substitutions) it takes to convert one string into another. List the
5. In the mixture of mixtures approach for classification, how can we fine-tune ki, the number of components for class Ci?
4. Define a multivariate Bernoulli mixture where inputs are binary and derive the EM equations.
3. Derive the M-step equations for S in the case of shared arbitrary covariance matrix S (equation 7.15) and s² in the case of shared diagonal covariance matrix (equation 7.16).
2. We can do k-means clustering, partition the instances, and then calculate Si separately in each group. Why is this not a good idea?
1. In image compression, k-means can be used as follows: The image is divided into nonoverlapping c×c windows and these c²-dimensional vectors make up the sample. For a given k, which is generally a
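A rough sketch of the vector-quantization pipeline described above, using scikit-learn's KMeans as a stand-in for a hand-written implementation (the window size c and codebook size k below are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_image(img, c=4, k=64):
    """Vector-quantize a grayscale image with k-means. Sketch only: assumes
    the image height/width are multiples of the window size c."""
    H, W = img.shape
    # Cut the image into non-overlapping c x c windows -> c*c-dimensional vectors.
    blocks = (img.reshape(H // c, c, W // c, c)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, c * c))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(blocks)
    # Replace every window by its nearest codebook vector (cluster center).
    coded = km.cluster_centers_[km.labels_]
    return (coded.reshape(H // c, W // c, c, c)
                 .transpose(0, 2, 1, 3)
                 .reshape(H, W))
```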
11. Discuss an application where there are hidden factors (not necessarily linear) and where factor analysis would be expected to work well.
10. In factor analysis, how can we find the remaining ones if we already know some of the factors?
9. How can we incorporate class information into Isomap or LLE such that instances of the same class are mapped to nearby locations in the new space?
8. Multidimensional scaling can work as long as we have the pairwise distances between objects. We do not actually need to represent the objects as vectors at all as long as we have some measure of
7. In Isomap, instead of using Euclidean distance, we can also use Mahalanobis distance between neighboring points. What are the advantages and disadvantages of this approach, if any?
6. Redo exercise 3, this time using Isomap where two cities are connected only if there is a direct road between them that does not pass through any other city.
5. In figure 6.11, we see a synthetic two-dimensional data where LDA does a better job than PCA. Draw a similar dataset where PCA and LDA find the same good direction. Draw another where neither PCA
4. In Sammon mapping, if the mapping is linear, namely, g(x|W) = Wᵀx, how can W that minimizes the Sammon stress be calculated?
3. Plot the map of your state/country using MDS, given the road travel distances as input.
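A minimal sketch of how this could be set up with scikit-learn's metric MDS on a precomputed dissimilarity matrix; the city names and road distances below are made-up placeholders, not real data:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical road distances (km) between four cities; replace with real data.
cities = ["A", "B", "C", "D"]
D = np.array([[  0, 120, 300, 250],
              [120,   0, 210, 330],
              [300, 210,   0, 180],
              [250, 330, 180,   0]], dtype=float)

# Metric MDS on a precomputed dissimilarity matrix gives 2-D coordinates
# whose pairwise distances approximate the road distances.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
for name, (x, y) in zip(cities, coords):
    print(f"{name}: ({x:.1f}, {y:.1f})")
```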
2. Using Optdigits from the UCI repository, implement PCA. For various numbers of eigenvectors, reconstruct the digit images and calculate the reconstruction error (equation 6.12).
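A possible NumPy sketch of the experiment (eigendecomposition of the covariance matrix, then reconstruction from the top k eigenvectors); loading Optdigits itself is left out, and the list of k values is arbitrary:

```python
import numpy as np

def pca_reconstruction_error(X, ks=(2, 8, 16, 32, 64)):
    """For each number of eigenvectors k, project the data onto the top-k
    principal directions, reconstruct, and report the mean squared
    reconstruction error. X is any (N, d) data matrix, e.g. the Optdigits
    feature matrix loaded separately from the UCI repository."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    eigvecs = eigvecs[:, ::-1]               # largest-variance directions first
    errors = {}
    for k in ks:
        W = eigvecs[:, :k]                   # d x k projection matrix
        Z = Xc @ W                           # k-dimensional codes
        X_hat = Z @ W.T + mean               # reconstruction
        errors[k] = np.mean((X - X_hat) ** 2)
    return errors
```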
1. Assuming that the classes are normally distributed, in subset selection, when one variable is added or removed, how can the new discriminant be calculated quickly? For example, how can the new
9. In document clustering, ambiguity of words can be decreased by taking the context into account, for example, by considering pairs of words, as in “cocktail party” vs. “party elections.”
8. In regression we saw that fitting a quadratic is equivalent to fitting a linear model with an extra input corresponding to the square of the input. Can we also do this in classification?
7. Let us say we have two variables x1 and x2 and we want to make a quadratic fit using them, namely, f(x1, x2) = w0 + w1 x1 + w2 x2 + w3 x1 x2 + w4 (x1)² + w5 (x2)². How can we find wi, i = 0, . . . , 5,
6. Let us say in two dimensions, we have two classes with exactly the same mean. What type of boundaries can be defined?
5. Another possibility using Gaussian densities is to have them all diagonal but allow them to be different. Derive the discriminant for this case.
4. For a two-class problem, for the four cases of Gaussian densities in table 5.1, derive log(P(C1|x)/P(C2|x))
3. Generate samples from two multivariate normal densities N(μi ,Σi ), i = 1, 2, and calculate the Bayes’ optimal discriminant for the four cases in table 5.1.
2. Generate a sample from a multivariate normal density N(μ,Σ), calculate m and S, and compare them with μ and Σ. Check how your estimates change as the sample size changes.
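A short NumPy sketch of the experiment, with an arbitrary choice of μ and Σ and the ML (biased) covariance estimate:

```python
import numpy as np

# Sample from N(mu, Sigma) and compare the estimates m, S with the true values.
mu = np.array([0.0, 2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

for N in (10, 100, 1000, 10000):
    X = np.random.default_rng(0).multivariate_normal(mu, Sigma, size=N)
    m = X.mean(axis=0)                       # sample mean
    S = np.cov(X, rowvar=False, bias=True)   # ML covariance estimate (divides by N)
    print(N, np.abs(m - mu).max(), np.abs(S - Sigma).max())
```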
10. In equation 4.40, what is the effect of changing λ on bias and variance?
9. Let us say, given the samples Xi = {xi^t, ri^t}, we define gi(x) = ri^1, namely, our estimate for any x is the r value of the first instance in the (unordered) dataset Xi. What can you say about
8. When the training set is small, the contribution of variance to error may be more than that of bias and in such a case, we may prefer a simple model even though we know that it is too simple for
7. Assume a linear model and then add 0-mean Gaussian noise to generate a sample. Divide your sample into two halves as training and validation sets. Fit linear regression using the training half. Compute
6. For a two-class problem, generate normal samples for two classes with different variances, then use parametric classification to estimate the discriminant points. Compare these with the
3. Write the code that generates a normal sample with given μ and σ, and the code that calculates m and s from the sample. Do the same using the Bayes’ estimator, assuming a prior distribution for
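A possible sketch in NumPy: the ML estimates m and s, plus the Bayes estimate of the mean under a Gaussian prior on μ with known σ (the prior parameters μ0 and σ0 below are illustrative, not the textbook's):

```python
import numpy as np

def normal_ml_and_bayes(mu=2.0, sigma=1.0, N=20, mu0=0.0, sigma0=10.0, seed=0):
    """Generate a normal sample and estimate its mean two ways: the ML
    estimates (m, s) and the Bayes (posterior mean) estimate under a
    N(mu0, sigma0^2) prior on the mean, with sigma assumed known."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=N)
    m = x.mean()                     # ML estimate of the mean
    s = x.std(ddof=0)                # ML estimate of the standard deviation
    # Posterior mean of mu: precision-weighted average of the sample mean and the prior mean.
    w = (N / sigma**2) / (N / sigma**2 + 1 / sigma0**2)
    mu_bayes = w * m + (1 - w) * mu0
    return m, s, mu_bayes
```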
2. Write the log likelihood for a multinomial sample and show equation 4.6.
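A sketch of the derivation, in my own notation (xi^t = 1 if outcome i occurred on trial t, else 0):

```latex
\mathcal{L}(p_1,\dots,p_K \mid \mathcal{X})
  = \log \prod_t \prod_i p_i^{x_i^t}
  = \sum_t \sum_i x_i^t \log p_i
% Maximizing subject to \sum_i p_i = 1 (Lagrange multiplier) gives the MLE
\hat{p}_i = \frac{\sum_t x_i^t}{N}
```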
1. Write the code that generates a Bernoulli sample with given parameter p, and the code that calculates p̂ from the sample.
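A minimal NumPy sketch of both pieces of code:

```python
import numpy as np

def bernoulli_sample_and_estimate(p=0.3, N=1000, seed=0):
    """Draw a Bernoulli(p) sample and return the ML estimate
    p_hat = (number of 1s) / N."""
    rng = np.random.default_rng(seed)
    x = (rng.random(N) < p).astype(int)   # 1 with probability p, else 0
    p_hat = x.mean()
    return x, p_hat

# Example: p_hat should be close to 0.3 for large N.
_, p_hat = bernoulli_sample_and_estimate(p=0.3, N=10000)
print(p_hat)
```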
11. Show example transaction data where for the rule X → Y: (a) Both support and confidence are high. (b) Support is high and confidence is low. (c) Support is low and confidence is high. (d) Both
10. Associated with each item sold in basket analysis, if we also have a number indicating how much the customer enjoyed the product, for example, on a scale of 0 to 10, how can you use this extra
9. Show that as we move an item from the consequent to the antecedent, confidence can never decrease: confidence(ABC → D) ≥ confidence(AB → CD).
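A sketch of the argument, writing s(·) for support and assuming confidence(X → Y) = s(X ∪ Y)/s(X):

```latex
\mathrm{conf}(ABC \to D) = \frac{s(ABCD)}{s(ABC)}, \qquad
\mathrm{conf}(AB \to CD) = \frac{s(ABCD)}{s(AB)}
% Every transaction containing \{A,B,C\} also contains \{A,B\}, so
s(ABC) \le s(AB) \;\Longrightarrow\;
\frac{s(ABCD)}{s(ABC)} \ge \frac{s(ABCD)}{s(AB)}
```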
8. Generalize the confidence and support formulas for basket analysis to calculate k-dependencies, namely, P(Y|X1, . . . , Xk).
6. Somebody tosses a fair coin and if the result is heads, you get nothing; otherwise, you get $5. How much would you pay to play this game? What if the win is $500 instead of $5?
5. Propose a three-level cascade where when one level rejects, the next one is used as in equation 3.10. How can we fix the λ on different levels?
11. One source of noise is error in the labels. Can you propose a method to find data points that are highly likely to be mislabeled?
10. Assume as in exercise 8 that our hypothesis class is the set of lines. Write down an error function that not only minimizes the number of misclassifications but also maximizes the margin.
9. Show that the VC dimension of the triangle hypothesis class is 7 in two dimensions. (Hint: For best separation, it is best to place the seven points equidistant on a circle.)
8. Assume our hypothesis class is the set of lines, and we use a line to separate the positive and negative examples, instead of bounding the positive examples as in a rectangle, leaving the
7. Derive equation 2.17.
6. In equation 2.13, we summed up the squares of the differences between the actual value and the estimated value. This error function is the one most frequently used, but it is one of several
9. In estimating the price of a used car, it makes more sense to estimate the percent depreciation over the original price than to estimate the absolute price. Why?
7. If a face image is a 100 × 100 image, written in row-major, this is a 10,000-dimensional vector. If we shift the image one pixel to the right, this will be a very different vector in the
6. In a daily newspaper, find five sample news reports for each category of politics, sports, and the arts. Go over these reports and find words that are used frequently for each category, which may
5. In basket analysis, we want to find the dependence between two items X and Y. Given a database of customer transactions, how can we find these dependencies? How would we generalize this to more
4. Let us say we are given the task of building an automated taxi. Define the constraints. What are the inputs? What is the output? How can we communicate with the passenger? Do we need to communicate
3. Assume we are given the task of building a system to distinguish junk email. What is in a junk email that lets us know that it is junk? How can the computer detect junk through a syntactic
2. Let us say we are building an OCR and for each character, we store the bitmap of that character as a template that we match with the read character pixel by pixel. Explain when such a system would
1. Imagine we have two possibilities: We can scan and email the image, or we can use an optical character reader (OCR) and send the text file. Discuss the advantages and disadvantages of the two
16.7 k-reversible languages. A finite automaton A′ is said to be k-deterministic if it is deterministic modulo a lookahead k: if two distinct states p and q are both initial, or are both reached from
16.6 Algorithm for learning reversible languages. What is the DFA A returned by the algorithm for learning reversible languages when applied to the sample S = {ab, aaabb, aabbb, aabbbb}? Suppose we
16.5 Learning with unreliable query responses. Consider the problem where the learner must find an integer x selected by the oracle within [n], where n ≥ 1 is given. To do so, the learner can ask
16.4 Learning monotone DNF formulae with queries. Show that the class of monotone DNF formulae over n variables is efficiently exactly learnable using membership and equivalence queries. (Hint: a prime
16.3 PAC learning with membership queries. Give an example of a concept class C that is efficiently PAC-learnable with membership queries but that is not efficiently exactly learnable.
16.2 VC-dimension of finite automata. (a) What is the VC-dimension of the family of all finite automata? What does that imply for PAC-learning of finite automata? Does this result change if we restrict
16.1 Minimal DFA. Show that a minimal DFA A also has the minimal number of transitions among all other DFAs equivalent to A. Prove that a language L is regular iff Q = {SuL(u) : u ∈ Σ*} is finite. Show
15.6 Random projection, PCA, and nearest neighbors. (a) Download the MNIST test set of handwritten digits at: http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz. Create a data matrix X ∈ R^(N×m)
15.5 Expression for KLLE. Show the connection between LLE and KPCA by deriving the expression for KLLE.
15.4 Nyström method. Define the following block representation of a kernel matrix: K = [W, K21ᵀ; K21, K22] and C = [W; K21]. The Nyström method uses W ∈ R^(l×l) and C ∈ R^(m×l) to generate the approximation K̃ =
15.3 Laplacian eigenmaps. Assume k = 1 and we seek a one-dimensional representation y. Show that (15.7) is equivalent to y = argmin_{y′} y′ᵀLy′, where L is the graph Laplacian.
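A sketch of the usual expansion, assuming (15.7) is the weighted objective Σ_{i,j} W_ij (y_i − y_j)² with symmetric weights W and degree matrix D_ii = Σ_j W_ij (my reading of the objective, not a quote from the book):

```latex
\sum_{i,j} W_{ij}(y_i - y_j)^2
  = \sum_{i,j} W_{ij}\bigl(y_i^2 - 2 y_i y_j + y_j^2\bigr)
  = 2\,\mathbf{y}^\top D\,\mathbf{y} - 2\,\mathbf{y}^\top W\,\mathbf{y}
  = 2\,\mathbf{y}^\top (D - W)\,\mathbf{y}
  = 2\,\mathbf{y}^\top L\,\mathbf{y}
% so minimizing the objective (under the usual normalization on y)
% is exactly the stated argmin over y'.
```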
15.2 Double centering. In this problem we will prove the correctness of the double centering step in Isomap when working with Euclidean distances. Define X and x̄ as in exercise 15.1, and define X̄ as
15.1 PCA and maximal variance. Let X be an uncentered data matrix and let x̄ = (1/m) Σi xi be the sample mean of the columns of X. (a) Show that the variance of one-dimensional projections of the data
14.4 Kernel stability. Suppose an approximation of the kernel matrix K, denoted K′, is used to train the hypothesis h′ (and let h denote the non-approximate hypothesis). At test time, no approximation
14.3 Stability of linear regression. (a) How does the stability bound in corollary 14.6 for ridge regression (i.e. kernel ridge regression with a linear kernel) behave as λ → 0? (b) Can you show a
14.2 Quadratic hinge loss stability. Let L denote the quadratic hinge loss function defined for all y ∈ {+1, −1} and y′ ∈ R by L(y′, y) = 0 if 1 − y′y ≤ 0, and (1 − y′y)² otherwise. Assume that
14.1 Tighter stability bounds. (a) Assuming the conditions of theorem 14.2 hold, can one hope to guarantee generalization with slack better than O(1/√m) even if the algorithm is very stable, i.e. β →
13.4 Conditional Maxent with other marginal distributions: discuss and analyze conditional Maxent models when using a distribution Q over X instead of D̂₁. Prove that a duality theorem similar to
13.3 Maximum conditional Maxent. An alternative measure of closeness, instead of the conditional relative entropy, is the maximum relative entropy over all x ∈ X1. (a) Write the primal optimization