The aim of this exercise is to prove and extend the table of Figure 7.5 (page 277).

Question:

(a) Prove the optimal predictions for the training data of Figure 7.5. To do this, find the prediction that minimizes the absolute error, the squared error, and the log loss, and the prediction that maximizes the likelihood. The maximum or minimum value is either at an end point or where the derivative is zero. [Hints: For squared error and log loss, take the derivative and set it to zero. For the absolute error, consider how the error changes when moving from one data point to the next (in order). It might help to consider first the case where the data points are at consecutive integers, and then generalize.]
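
A numerical check can help confirm a derivation before writing the proof. The following is a minimal sketch, not part of the original exercise: it assumes Boolean training data summarized by the counts n0 and n1, measures log loss in bits, and searches a grid over the open interval (0, 1) for the prediction that minimizes each error. The function names `training_errors` and `grid_minimizers` are hypothetical.

```python
import math


def training_errors(p, n0, n1):
    """Errors of predicting p for every example, given n0 false and n1 true examples.

    Returns (absolute error, squared error, log loss in bits).
    Assumes 0 < p < 1 so that both logarithms are defined.
    """
    absolute = n0 * p + n1 * (1 - p)
    squared = n0 * p ** 2 + n1 * (1 - p) ** 2
    log_loss = -(n0 * math.log2(1 - p) + n1 * math.log2(p))
    return absolute, squared, log_loss


def grid_minimizers(n0, n1, steps=1000):
    """Grid point in (0, 1) that minimizes each of the three errors."""
    grid = [i / steps for i in range(1, steps)]  # end points excluded to avoid log(0)
    return tuple(
        min(grid, key=lambda p, k=k: training_errors(p, n0, n1)[k])
        for k in range(3)
    )


if __name__ == "__main__":
    # Example counts chosen for illustration only.
    print(grid_minimizers(n0=3, n1=7))
```

Comparing the printed values with n1/n and with the end points of the grid for a few choices of n0 and n1 shows which case of the hint (end point or zero derivative) applies to each criterion.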

(b) To determine the best prediction for the test data, assume that the data cases are generated stochastically according to some true parameter p. [See the thought experiment (page 302).] Try the following for a number of different values for p selected uniformly in the range [0, 1]. Generate n training examples (try various values for n, some small, say 2, 3, or 5, and some large, say 1000) by sampling with probability p. From these training examples, let n0 be the number of false examples and n1 be the number of true examples (so n0 + n1 = n). Generate a test set of, say, 20 test cases using the same parameter p. Repeat this 1000 times to get a reasonable average. Which of the following gives the lowest error on the test set for each of the optimality criteria: (I) absolute error; (II) squared error; and (III) log loss?
(i) The mode.
(ii) n1/n.
(iii) If n1 = 0, use 0.001; if n0 = 0, use 0.999; otherwise use n1/n. Try different values in place of 0.001 and 0.999 for the cases where a count is zero.
(iv) (n1 + 1)/(n + 2)
(v) (n1 + α)/(n + 2α) for different values of α > 0.
(vi) Another predictor that is a function of n0 and n1.
(For the mathematically sophisticated, try to prove what the optimal predictor is for each criterion.)
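
The experiment in part (b) could be set up along the following lines. This is a sketch under stated assumptions rather than a reference solution: an example is taken to be true with probability p, log loss is measured in bits and treated as infinite when a predictor assigns probability 0 to the observed value, and the mode predicts 0.5 on a tie. The names `PREDICTORS`, `losses`, and `simulate`, and the defaults for `alpha` and `eps`, are hypothetical choices.

```python
import math
import random


def make_predictors(alpha=1.0, eps=0.001):
    """Predictors (i) to (v) as functions of the counts n0 and n1."""
    def mode(n0, n1):
        # Assumption: predict 0.5 when the counts are tied.
        return 1.0 if n1 > n0 else 0.0 if n0 > n1 else 0.5

    def empirical(n0, n1):
        return n1 / (n0 + n1)

    def clipped(n0, n1):
        if n1 == 0:
            return eps
        if n0 == 0:
            return 1 - eps
        return n1 / (n0 + n1)

    def laplace(n0, n1):
        return (n1 + 1) / (n0 + n1 + 2)

    def pseudo(n0, n1):
        return (n1 + alpha) / (n0 + n1 + 2 * alpha)

    return {"mode": mode, "n1/n": empirical, "clipped": clipped,
            "laplace": laplace, "pseudo": pseudo}


def losses(pred, y):
    """(absolute error, squared error, log loss in bits) for one test case."""
    q = pred if y else 1 - pred  # probability assigned to the observed value
    log_loss = -math.log2(q) if q > 0 else math.inf
    return 1 - q, (1 - q) ** 2, log_loss


def simulate(n, trials=1000, test_size=20, seed=0, alpha=1.0, eps=0.001):
    """Average test-set errors per test case for each predictor."""
    rng = random.Random(seed)
    predictors = make_predictors(alpha, eps)
    totals = {name: [0.0, 0.0, 0.0] for name in predictors}
    for _ in range(trials):
        p = rng.random()  # true parameter, uniform in [0, 1]
        n1 = sum(rng.random() < p for _ in range(n))
        n0 = n - n1
        test = [rng.random() < p for _ in range(test_size)]
        for name, predict in predictors.items():
            pred = predict(n0, n1)
            for y in test:
                for k, value in enumerate(losses(pred, y)):
                    totals[name][k] += value
    count = trials * test_size
    return {name: tuple(t / count for t in total)
            for name, total in totals.items()}


if __name__ == "__main__":
    for n in (2, 3, 5, 1000):
        print(n, simulate(n))
```

Each entry of the returned dictionary holds the average absolute error, squared error, and log loss, in that order. Varying `alpha` and `eps`, or adding a further function of n0 and n1 to the dictionary, covers items (iii), (v), and (vi).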
