Question

Effect of Constraints on Optimal Solutions

A key result is that optimal classifiers pick the most probable class; this is what defines Bayes optimality. One consequence is that optimal decisions are pure or crisp: they do not weight or mix decisions. But is this result always true? Here we show how constraints can make the answer NO, by understanding how they change the nature of an optimal solution.

One of the key ideas in the course is that the knowledge we have about the structure of our data also constitutes data that we can encode as constraints. A simple example is when a variable has upper and/or lower bounds. As a concrete example, we might want to choose the best product (e.g., pizza) for a discrete set of use cases (e.g., food restriction types like vegetarian = 1, gluten-free = 2, dairy-free = 3, etc.) at the lowest price, given a dataset with features x = [x_item, x_cust, x_price] labeled by the best use case y in {0, 1, 2, ...}.

Consider an N-class classification problem with features x in R^d and one-hot encoded labels y = e_k, where e_k is the unit vector with a 1 at component k and zeros elsewhere, and k in {1, ..., N}. Assume the data is drawn as (x, y) ~ D, where D is a fixed (but unknown) distribution on R^d x {e_1, ..., e_N}. Assume p(y = e_k) = alpha_k, with sum_k alpha_k = 1. Consider the classifier given by

The standard classification loss is the error rate, L(f) = P(f(x) != y) = E_{(x, y) ~ D}[1(f(x) != y)], where 1(.) is the indicator function (this is the 0-1 loss) and L(.) is the expected 0-1 loss, i.e., the true error rate. Professor Bayes claims that, for the proper choice of beta_jk, any other classifier g: R^d -> {e_1, ..., e_N} with g != f satisfies L(f) <= L(g); this is the definition of Bayes optimality. The result is standard and easy to find.
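
To make these definitions concrete, here is a minimal numerical sketch of the 0-1 risk and of the standard best-guess rule f(x) = argmax_k alpha_k p(x | y = e_k). The three-class priors, the 1-D Gaussian class-conditionals, and the sample size are illustrative assumptions, not part of the question.

    import numpy as np

    rng = np.random.default_rng(0)
    alphas = np.array([0.5, 0.3, 0.2])              # class priors p(y = e_k) = alpha_k (assumed)
    means, sigma = np.array([-2.0, 0.0, 2.0]), 1.0  # assumed 1-D Gaussian class-conditionals

    def cond_pdf(x, k):
        # p(x | y = e_k): Gaussian density, used only for this sketch
        return np.exp(-0.5 * ((x - means[k]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def f(x):
        # pick the class with the largest alpha_k * p(x | y = e_k)
        scores = np.stack([alphas[k] * cond_pdf(x, k) for k in range(3)])
        return np.argmax(scores, axis=0)

    # Monte Carlo estimate of the 0-1 risk L(f) = P(f(x) != y)
    y = rng.choice(3, size=50_000, p=alphas)
    x = rng.normal(means[y], sigma)
    print("estimated error rate L(f):", np.mean(f(x) != y))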

(a) Using the notation above and your own words, explain why Professor Bayes' claim is true and determine what beta_jk must be.

(b) For part (b), consider the following scenario. You have been given pizza data as described above and have trained the optimal classifier. However, you find that the classifier performs very poorly on test data. Because you are good with data, you go back to the raw data and find that it includes two pieces of information about the customers that were left out: the customer's daily income and expenses, which you encode into a new feature vector z_cust. These features were left out because they were found to be independent of the customer's food restriction type, and because the pizza option features (value, restriction type) are independent of the customer's income and expenses. In addition, you discover that the labeled data was generated by actual customers using a drop-down menu to select their food restrictions and then selecting the pizza option that was "the best value" from a list of options satisfying the indicated restrictions. Value information is given in terms of cost for a standardized slice. If you include the daily income and expense values in the feature vector, you find that the resulting classifier is significantly better, with near-perfect performance.

Explain what could have gone wrong with the original classifier to produce these results, and how we could modify the data collection task and data labeling to better reflect the problem domain. In your explanation, convert the information described in the scenario into conditional probabilities over the variables x_cust, x_price, z_cust, S_cust, y_cust_option, S_option, where S_cust is the set of restriction types needed by each customer, S_option is the set of restriction types satisfied by each pizza option, and y_cust_option is the restriction type of the option each customer selected as the best value for them.
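
For reference, here is a minimal sketch of the labeling rule as described in the scenario: the customer indicates a restriction set S_cust, the feasible options are those whose S_option covers S_cust, and the label comes from the cheapest feasible option per standardized slice. The option names, restriction sets, and prices are made up for illustration, and the role of z_cust (income and expenses) is deliberately left out, since explaining its effect is the point of the question.

    # Illustrative labeling rule; option names, restriction sets, and prices are assumed.
    pizza_options = [
        {"name": "margherita", "S_option": {"vegetarian"},                "cost_per_slice": 3.00},
        {"name": "gf_veggie",  "S_option": {"vegetarian", "gluten-free"}, "cost_per_slice": 4.50},
        {"name": "vegan_pie",  "S_option": {"vegetarian", "dairy-free"},  "cost_per_slice": 4.00},
    ]

    def label_for(S_cust):
        # y_cust_option: the option the customer selects as "the best value",
        # i.e., the cheapest standardized slice among options satisfying S_cust
        feasible = [o for o in pizza_options if S_cust <= o["S_option"]]
        return min(feasible, key=lambda o: o["cost_per_slice"])["name"]

    print(label_for({"vegetarian"}))                  # margherita
    print(label_for({"vegetarian", "dairy-free"}))    # vegan_pie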

(c) There is an apparent paradox between the result in part (a) and the scenario in part (b) (although hopefully your analysis has uncovered a resolution). To understand the paradox, assume that the classifier f is Bayes optimal and that g differs from f, so there are regions of feature space where g assigns a different class than f (these regions are made precise below). If the most probable true label in such a region is k, then we should always pick classifier f, which is justified by the logic of best guessing: observing a label at a point x in the feature space is like observing a biased random variable (like a coin or die) that comes up with value k with higher probability than the other options. The best guess (in terms of minimum error) is to always select k, which succeeds with probability rho_k. Each time we instead select a different option j != k, our guess succeeds with a lower probability rho_j. Show that choosing j for a fraction gamma of the total choices results in a success rate of gamma rho_j + (1 - gamma) rho_k, and use the fact that this is a convex combination to prove that the maximal success rate is rho_k.
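
One way to see the bound is the identity gamma rho_j + (1 - gamma) rho_k = rho_k - gamma (rho_k - rho_j) <= rho_k, which holds whenever rho_j <= rho_k and 0 <= gamma <= 1. A quick numerical check, with rho_j and rho_k assumed for illustration:

    rho_k, rho_j = 0.7, 0.2                        # assumed success probabilities, rho_j < rho_k
    for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
        rate = gamma * rho_j + (1 - gamma) * rho_k
        print(f"gamma = {gamma:.2f}  success rate = {rate:.3f}")
    # the rate never exceeds rho_k = 0.7 and is maximized at gamma = 0 (always guess k)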

The general case is that there must be some region of the feature space where g assigns class values different from f's. Using the notation R_f(k) = { x : f(x) = e_k } for the preimage of output k under the classifier f, the differences are Delta_jk R = R_f(k) ∩ R_g(j), and at least one Delta_jk R with j != k is nonempty. Illustrate the idea with an appropriate drawing.

(d) One of the key constraints in the problem could be that customers have budgets. Let's simplify the problem to see what goes wrong. Assume you have two pizza options in different classes, both of which satisfy the customer's food restrictions. One option has better value (lower cost per standardized slice), but the options come in a discrete set of unit sizes (say two) and have a discrete set of costs. If the customer has a budget constraint, in the sense of not exceeding a total cost threshold, show it is possible for the option with the worse value to be the better choice (assuming eating pizza is better than starving) when the unit sizes are smaller for the higher-cost pizza.
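
A minimal numerical sketch of the kind of counterexample being asked for (the budget, slice prices, and unit sizes are assumed values, not given in the problem): option A has the better value per standardized slice but is only sold in a two-slice unit, so its smallest purchasable unit exceeds the budget, while the worse-value option B is sold in a one-slice unit and fits.

    budget = 6.00                                         # assumed total-cost threshold
    options = {
        "A": {"cost_per_slice": 3.50, "unit_slices": 2},  # better value, larger unit size
        "B": {"cost_per_slice": 5.50, "unit_slices": 1},  # worse value, smaller unit size
    }
    for name, o in options.items():
        unit_cost = o["cost_per_slice"] * o["unit_slices"]
        status = "feasible" if unit_cost <= budget else "over budget"
        print(f"option {name}: unit cost ${unit_cost:.2f} ({status})")
    # A's cheapest unit costs $7.00 > $6.00, so under the budget constraint the
    # worse-value option B ($5.50) is the only feasible, hence better, choice.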
