Questions and Answers of Nonparametric Statistical Inference
Consider the following method for using data to come up with a good classification rule: use the data to estimate the background probability distributions and then use these distributions to find a
Philosophers sometimes suggest that there is a sense in which you are justified in believing something only if you believe it as the result of a reliable process of belief formation. What do you
How can boosting turn a weak classifier into a better classifier?
If ht(x) is the rule produced by boosting at round t, write the expression for the final classifier H(x) after T rounds in terms of the ht(x) and the weights αt.
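For illustration, here is a minimal sketch of the standard AdaBoost-style weighted vote that this expression describes; the stump rules and weights below are hypothetical placeholders, not data from any particular exercise.

```python
def boosted_classifier(x, weak_rules, alphas):
    """Combine T weak rules h_t into the final classifier H(x).

    weak_rules: list of functions h_t(x) returning +1 or -1
    alphas:     list of weights alpha_t assigned at each boosting round
    Returns sign(sum_t alpha_t * h_t(x)), the usual weighted majority vote.
    """
    score = sum(a * h(x) for a, h in zip(alphas, weak_rules))
    return 1 if score >= 0 else -1

# Toy usage with three hypothetical decision stumps on a 1-D feature:
stumps = [lambda x: 1 if x > 0 else -1,
          lambda x: 1 if x > 2 else -1,
          lambda x: -1 if x > 5 else 1]
alphas = [0.7, 0.4, 0.2]
print(boosted_classifier(1.5, stumps, alphas))
```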
True or false: Boosting is best seen as a special case of a kernel method.
What is boosting? What is a “weak” learning rule? When might boosting be a useful method of improving the results of weak learning rules?
Suppose that in the transformed space the feature vectors and corresponding labels are given as follows: (−1, 0; +), (0, 1; +), (1, 0; +), (−1, 2; −), (−1, 3; −), (0, 3; −), (1, 5; −). (a)
Why are SVMs called support vector machines? What is a support vector and what role does it play in an SVM?
SVMs solve a classification problem by, in effect, transforming the feature space into another, typically higher-dimensional (sometimes infinite-dimensional) space and then finding a (possibly soft)
How does the construction of the previous problem affect the margin of the classifier?
Show that by increasing the dimension of the feature vector by 1, a hyperplane passing through the origin in the augmented space can represent a general hyperplane decision rule in the original space.
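A quick numerical check of this construction may help: append a constant 1 to the feature vector and absorb the offset b into the weight vector, so a hyperplane through the origin in the augmented space reproduces w·x + b in the original space. The particular vectors below are arbitrary example values.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # arbitrary weights in the original space
b = -0.3                          # offset of the general hyperplane
x = np.array([1.0, 4.0, 2.0])     # arbitrary feature vector

x_aug = np.append(x, 1.0)         # augmented feature vector (x, 1)
w_aug = np.append(w, b)           # augmented weights (w, b)

# A hyperplane through the origin in the augmented space gives the same rule:
assert np.isclose(np.dot(w_aug, x_aug), np.dot(w, x) + b)
```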
If we insist that the hyperplane pass through the origin, what will the form of the rule be?
What is the form of the decision rule for a linear SVM?
True or False: In SVMs, using linear rules in the transformed space can provide nonlinear rules in the original feature space.
It is sometimes suggested that in selecting a hypothesis on the basis of data, one should balance the empirical error of hypotheses against their simplicity. Does statistical learning theory provide
What is the instrumentalist conception of theories?
Is it reasonable for you to believe that you are not a brain in a vat being given the experiences of an external world? Explain your answer.
How should scientists measure the simplicity of hypotheses? Should a scientist favor simpler hypotheses over more complex hypotheses that fit the data equally well? Are there alternatives?
Is it reasonable to use simplicity to decide among hypotheses that could not be distinguished by any possible evidence?
Does the hypothesis that all emeralds are grue imply that emeralds will change color in 2050?
How could there be nondenumerably many hypotheses? Could a nondenumerable set of things be well ordered?
Critically assess the following argument. “The hypothesis that the world is just a dream is simpler than the hypothesis that there really are physical objects. The two hypotheses account equally
Suppose that competing hypotheses h1 and h2 fit the data equally well but h1 is simpler than h2. Does a policy of accepting h1 rather than h2 in such a case have to assume that the world is simple?
Consider the class F of all nondecreasing functions of a single variable x. That is, a function f (x) belongs to F if f (x2) ≥ f (x1) whenever x2 > x1. What is the pseudo-dimension of F ? Justify
(a) If we let F be the set of all piecewise constant functions (with no bound on the number of constant pieces and where the value of F may increase or decrease from one piece to the next), what is
Consider the class F consisting of all piecewise constant functions with two pieces. That is, f ∈ F if, for some constants α1, α2, β, f(x) is of the form f(x) = α1 for x ≤ β and f(x) = α2 for x > β. What is
What is the pseudo-dimension of all constant functions? That is, f ∈ F if f(x) is of the form f(x) = α for some constant α.
In function estimation (as opposed to classification), finiteness of which of the following is sufficient for PAC learning? VC-dimension; pseudo-dimension; Popper dimension; dimension-X; fractal
True or False: In the function estimation problem, if x is the observed feature vector and y is a real value that you wish to estimate, the optimal Bayes decision rule (for least squared error)
True or False: The Bayes error rate for estimation problems (as opposed to classification problems) can be greater than 1/2.
(a) Suppose we consider the function estimation problem (rather than the classification problem), but insist on using the probability of error as our success criterion (rather than squared error).
Discuss the quip “Even a stopped clock is right twice a day,” and contrast it with the addition that “and a working clock is almost never exactly right.”
If a loss function other than squared error is used, will the regression function (i.e., the conditional mean of y, given x) still always be the best estimator?
If one learning algorithm works with a class of rules that includes all the rules of a different algorithm and some others, does this mean that the former algorithm should give a strictly better
One might argue that every learning algorithm works with a fixed set of decision rules, namely, the set of all rules that the particular algorithm might possibly produce over all possible
If the correct classification of an item is completely determined by its observable features, what is the Bayes error rate for decision rules using those features? In this case, does it follow that C
(a) In PAC learning, suppose we let the learner choose the feature points he/she would like classified instead of providing randomly drawn examples. Suggest a way in which you might measure how much
In PAC learning, discuss the advantages and disadvantages of requiring that the same sample size m(ε, δ) work for every choice of prior probabilities and conditional densities. How might we modify
Consider the class C of all convex subsets of the plane. What is VCdim(C)? Justify your answer. (Hint: think about points on the circumference of a circle.)
Consider the class C of all orthogonal rectangles in the plane, that is, all rectangles whose sides are parallel to the coordinate axes. What is VCdim(C)? Justify your answer.
Is the class of rules representable by a perceptron PAC learnable?
What relations of the form X ≤ Y hold between R∗, R(ĥ), and R∗_C?
If C is a class of decision rules, what are R∗, R(ĥ), and R∗_C?
If C is a class of decision rules, what condition on VCdim(C) is needed for PAC learnability?
If VCdim(C) = v, what is the smallest number of rules that the class C could contain?
True or False: VCdim(C) is one plus the number of parameters needed to specify a particular rule from the class C.
True or False: VCdim(C) is the largest integer v such that every set of v points can be shattered by C.
True or False: A set of k points is shattered by a class of rules C if all 2^k labelings of the points can be generated using rules from C.
(a) Describe as precisely as you can what it means for C to be PAC learnable, explaining the roles of ε and δ and the requirements on the sample size. (b) Why do we settle for R∗_C instead of R∗
If one learning algorithm works with a class of rules that includes all the rules of a different algorithm and some others, does this mean that the former algorithm should give a strictly better
One might argue that every learning algorithm works with a fixed set of decision rules, that is to say, the set of all rules that the particular algorithm might possibly produce over all possible
Let C be a class of decision rules and let h be the hypothesis produced by a learning algorithm. Write the condition on the error rate R(h) for PAC learnability in terms of R∗_C, ε, and δ.
What is the class of rules that can be represented by a perceptron?
Is the process of reasoning toward reflective equilibrium analogous to the algorithm of gradient descent that is used to train a neural network?
Explain why the sort of learning rule discussed in this chapter is appropriately called “backpropagation.”
Why is the weighted input to a unit in a feedforward network passed through a sigmoid function rather than a simple threshold function?
What is “gradient descent” and what is a potential problem with it?
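For concreteness, a minimal gradient-descent sketch on a one-dimensional quadratic follows; the learning rate and starting point are arbitrary choices, and the potential problem the question alludes to (getting stuck in a poor local minimum, or oscillating with a badly chosen step size) is easy to reproduce by varying them.

```python
def gradient_descent(grad, w0, lr=0.1, steps=50):
    """Repeatedly step opposite the gradient of the error surface."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimize E(w) = (w - 3)^2, whose gradient is 2*(w - 3):
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_star, 4))  # approaches the minimizer w = 3
```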
True or false: The backpropagation learning method for feedforward neural networks will always find a set of weights that minimizes the error on the training data.
True or false: The backpropagation learning method requires that the units in the network have sharp thresholds.
Consider a classification problem in which each instance consists of d features x1,...,xd , each of which can only take on the values 0 or 1. A feature vector belongs to class 0 if x1 + x2 +···+
Explain the sense in which for any decision rule there is a three-layer network that approximates that rule. Sketch a proof of this.
What is a convex set?
Consider the XOR problem with two inputs x1 and x2. That is, the output is 1 if exactly one of x1, x2 is positive, and the output is 0 otherwise. Construct a simple three-layer network to solve this
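One possible hand-built solution, sketched here for 0/1 inputs (the particular weights and thresholds are just one assignment that works, not the unique answer): a hidden unit computing OR, a hidden unit computing AND, and an output unit that fires when OR is on but AND is off.

```python
def step(z, threshold):
    """Hard threshold unit: outputs 1 if the weighted input exceeds the threshold."""
    return 1 if z > threshold else 0

def xor_network(x1, x2):
    h_or  = step(x1 + x2, 0.5)      # hidden unit 1: x1 OR x2
    h_and = step(x1 + x2, 1.5)      # hidden unit 2: x1 AND x2
    return step(h_or - h_and, 0.5)  # output: OR and not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))  # prints 1 exactly when one input is 1
```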
How is a perceptron trained? What features of the perceptron typically change during training? Formulate a learning rule for making such changes and explain how it works. (a) What is a single
Consider the following network. There are four inputs with real values. Each input is connected to each of two perceptrons on the first layer that do not take thresholds but simply output the sum of
As in the previous problem, consider a classification problem in which each instance consists of d features x1,...,xd , each of which can take on only the values 0 or 1. A feature vector belongs to
Consider a classification problem in which each instance consists of d features x1,...,xd , each of which can take on only the values 0 or 1. Come up with a linear threshold unit (a single
Design a linear threshold unit with two inputs that outputs the value 1 if and only if the first input has a greater value than the second. (What are the weights on the inputs and what is the
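One design that satisfies this, offered as a sketch rather than the only answer: give the inputs weights +1 and −1 and use a threshold of 0, so the unit fires exactly when x1 − x2 > 0.

```python
def greater_than_unit(x1, x2, w1=1.0, w2=-1.0, threshold=0.0):
    """Linear threshold unit: output 1 iff w1*x1 + w2*x2 > threshold."""
    return 1 if w1 * x1 + w2 * x2 > threshold else 0

print(greater_than_unit(3.0, 2.0))  # 1, since the first input is larger
print(greater_than_unit(2.0, 3.0))  # 0
```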
Consider a linear threshold unit (perceptron) with three inputs and one output. The weights on the inputs are respectively 1, 2, and 3, and the threshold is 0.5. If the inputs are respectively 0.1,
(a) Sketch a diagram of a perceptron with three inputs x1, x2, x3 and weights w1, w2, w3. Label the inputs, weights, and output. (b) Write the expression for the output in terms of the inputs and the
Repeat parts (a), (b), and (c) of the previous problem for the triangular kernel. For part (d), find the value h for which the classification rule first decides 0 for all x.
Consider the special case where we have a 1-dimensional feature vector and are interested in using a kernel rule. Suppose we have the training data (0,0),(1,1), and (3,0), and we use the simple
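A small sketch of a kernel (window) classification rule on this training set, assuming the "simple" kernel referred to is the box kernel K(x) = I{|x| ≤ 1} from a later question; the smoothing factor h used below is an arbitrary placeholder.

```python
def box_kernel(u):
    """Simple kernel K(u) = 1 if |u| <= 1, else 0."""
    return 1.0 if abs(u) <= 1.0 else 0.0

def kernel_classify(x, data, h=0.5, kernel=box_kernel):
    """Vote for each class with weights K((x - x_i)/h); ties go to class 0 here."""
    votes = {0: 0.0, 1: 0.0}
    for xi, yi in data:
        votes[yi] += kernel((x - xi) / h)
    return 0 if votes[0] >= votes[1] else 1

train = [(0, 0), (1, 1), (3, 0)]
print(kernel_classify(0.8, train))  # whichever class wins the kernel-weighted vote
```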
True or False: There might be some set of labeled data (training examples) such that the 1-nearest neighbor method and the Kernel method (for some K(x) of your choice) can give exactly the same
True or False: The decision rules arising from using a kernel method with two different kernel functions, K1(x) and K2(x) = 2K1(x), are exactly the same.
What conditions are required on the smoothing parameter hn for a kernel rule to be universally consistent?
For the kernel function of the previous problem, smoothing factor h = 0.5 and xi = 3, sketch K((x − xi)/h).
For a one-dimensional feature x, write the equation for the triangular kernel function, K(x).
For a one-dimensional feature x, sketch the simple kernel function K(x) = I{|x| ≤ 1}.
Write the expression for the vote count for class 0, v_n^0(x), in terms of the data (x1, y1), . . . , (xn, yn), indicator functions, and the kernel K(x).
For the simplest kernel classification rule, what choices of the smoothing parameter h are analogous to selecting kn = 1 and kn = n, respectively, in the nearest neighbor rule? What happens to the
Briefly discuss the following position. Under appropriate conditions, the kn-NN rule is universally consistent, so the choice of features does not matter.
If we use a kn-NN rule with kn = n, what would be the resulting error rate in terms of P (0), P (1), P (x|0), and P (x|1)?
What conditions are required on kn for the kn-NN rule to be universally consistent?
Describe as precisely as you can the tradeoffs of having a small kn versus a large kn in the kn-nearest neighbor classifier. What happens in the extreme cases when kn = 1 and when kn = n?
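A minimal k-NN sketch may make the tradeoff concrete: with kn = 1 the rule copies the label of the single nearest example (low bias, high variance), while with kn = n it always predicts the overall majority class. The distance measure and training points below are illustrative assumptions only.

```python
def knn_classify(x, data, k):
    """k-nearest-neighbor majority vote on 1-D features."""
    neighbors = sorted(data, key=lambda pair: abs(pair[0] - x))[:k]
    ones = sum(label for _, label in neighbors)
    return 1 if ones * 2 > len(neighbors) else 0

train = [(0.0, 0), (0.5, 0), (1.0, 1), (3.0, 0), (3.5, 1)]
print(knn_classify(0.9, train, k=1))           # follows the single nearest point
print(knn_classify(0.9, train, k=len(train)))  # reduces to the overall majority class
```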
Come up with a case (i.e., give the prior probabilities and conditional densities)in which the error rate of the NN rule equals the Bayes error rate, and briefly explain why this happens in the case
Recall that for the 1-NN rule, the region associated with a feature vector xi is the set of all points that are closer to xi than to any of the other feature vectors xj for j ≠ i. These are the
(a) What is the NN rule and how does the expected error from the use of this rule compare with the Bayes error rate? (b) What conditions on kn are required for the kn-NN rule to have an asymptotic
Will the remaining training data be independent? Identically distributed according to the underlying distributions? (d) If we use method M but on the training data as modified in part (c) by throwing
Consider a pattern recognition problem with prior probabilities P (0) = 0.2 and P (1) = 0.8 and conditional distributions P (x|0) and P (x|1). Let R∗ denote the Bayes error rate for this problem.
What is inductive bias? Is it good or bad?
What is the curse of dimensionality?
When might a brute force approach to the learning problem be useful?
What does it mean for examples to be drawn iid?
What sorts of assumptions do we need to make about the training data?
What are training data for learning from examples?
Why do we need learning? Why can't we simply use a Bayes rule for pattern classification?
What is the general learning problem we are concerned with?
True or false? P (A|B) might have a definite value even if P (B) = 0.
Suppose the feature vector x can only take values −1 or +1, and the Bayes rule is as follows: decide 0 when x = −1 and decide 1 when x = +1. Suppose the cost of a correct decision is 0, but the
Suppose that 5% of women and 0.25% of men in a given population are colorblind. (a) Write Bayes' theorem. (b) If a colorblind person is chosen at random from a population containing an equal number of
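The text of part (b) is cut off above, but assuming the population is split evenly between men and women (that 50/50 split is an assumption, not part of the visible text), the posterior follows from Bayes' theorem as sketched below.

```python
# Assumed priors: equal numbers of men and women (the question text is truncated).
p_woman, p_man = 0.5, 0.5
p_cb_given_woman = 0.05     # 5% of women are colorblind
p_cb_given_man = 0.0025     # 0.25% of men are colorblind

# Total probability of colorblindness, then Bayes' theorem for P(woman | colorblind):
p_cb = p_cb_given_woman * p_woman + p_cb_given_man * p_man
p_woman_given_cb = p_cb_given_woman * p_woman / p_cb
print(round(p_woman_given_cb, 4))  # ≈ 0.9524 under the assumed 50/50 split
```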