We have so far been concerned with predicting numerical quantities, like salaries. Now suppose we want to predict which dining option on campus a UCSD student will select for lunch (e.g., Burger King, Tahini, Fan-Fan, Dirty Birds, Taco Villa, etc.). Predicting a discrete category (as opposed to a number) is an important machine learning task called classification. We can use empirical risk minimization to make classifications, too.

Suppose we have gathered a data set of $n$ lunch preferences of students over the past week:

Burger King, Dirty Birds, Burger King, Tahini, Taco Villa, Burger King, Dirty Birds, Taco Villa

The first step is to choose a number which will uniquely represent each option. For instance:

Dirty Birds: 1
Burger King: 2
Tahini: 3
Fan-Fan: 4
Taco Villa: 5

We then map each instance of a dining option to its corresponding number, giving us a new data set of numbers. For instance, the data above becomes:

2, 1, 2, 3, 5, 2, 1, 5

a) Now that we have converted the data to a list of numbers, we can make a prediction by minimizing the mean absolute loss. Explain why this is not a good idea.

b) Let's use the zero-one loss, which is defined as follows:

$$L_{01}(h, y) = \begin{cases} 0, & h = y \\ 1, & h \neq y \end{cases}$$

As usual, define the risk to be the average loss:

$$R_{01}(h) = \frac{1}{n} \sum_{i=1}^{n} L_{01}(h, y_i)$$

Notice that $R_{01}(h)$ can be interpreted as the misclassification rate. That is, if $R_{01}(h) = 0.7$, then predicting $h$ would result in the wrong answer for 70% of the data points.

Given the data set $\{4, 2, 4, 1, 3, 4, 4, 3, 2, 5\}$, plot the empirical risk $R_{01}(h)$ for $h \in [0, 5]$. Hint: the function should have point discontinuities.

c) Is gradient descent useful for minimizing the risk with zero-one loss? Why or why not? Make reference to your plot of the risk in your answer. Hint: the risk is indeed non-convex, but gradient descent can still be useful for minimizing non-convex functions. Is there some other reason it is not useful here?
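To build intuition for part (a), here is a minimal sketch (not part of the original exercise) showing that the minimizer of mean absolute loss depends on the arbitrary numeric encoding. Encoding A is the one given above; encoding B, which swaps Burger King and Taco Villa, is a hypothetical relabeling introduced here for illustration.

```python
import numpy as np

# The lunch data set from the exercise.
data = ["Burger King", "Dirty Birds", "Burger King", "Tahini",
        "Taco Villa", "Burger King", "Dirty Birds", "Taco Villa"]

# Encoding A is the mapping given in the exercise; encoding B is a
# hypothetical relabeling that swaps Burger King and Taco Villa.
enc_a = {"Dirty Birds": 1, "Burger King": 2, "Tahini": 3,
         "Fan-Fan": 4, "Taco Villa": 5}
enc_b = {"Dirty Birds": 1, "Taco Villa": 2, "Tahini": 3,
         "Fan-Fan": 4, "Burger King": 5}

for name, enc in [("A", enc_a), ("B", enc_b)]:
    y = np.array([enc[d] for d in data])
    # The minimizer of mean absolute loss is a median of the data.
    print(f"encoding {name}: median = {np.median(y)}")
```

Under encoding A the median is 2 (Burger King), while under encoding B it is 2.5, which does not correspond to any dining option at all: the prediction is an artifact of the encoding rather than a property of the data.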
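For part (b), the following is a minimal sketch of how one might compute and plot $R_{01}(h)$ in Python, assuming numpy and matplotlib are available; the variable names are illustrative, not prescribed by the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

# Data set from part (b).
y = np.array([4, 2, 4, 1, 3, 4, 4, 3, 2, 5])

def zero_one_risk(h, y):
    """Empirical risk under the zero-one loss: the fraction of
    data points y_i that the prediction h fails to match."""
    return np.mean(h != y)

# Evaluate the risk on a fine grid of candidate predictions h in [0, 5].
hs = np.linspace(0, 5, 1001)
risks = np.array([zero_one_risk(h, y) for h in hs])

# Plot points rather than a connected line, so the point
# discontinuities at the data values are visible.
plt.plot(hs, risks, ".", markersize=2)
plt.xlabel("$h$")
plt.ylabel("$R_{01}(h)$")
plt.title("Empirical risk under the zero-one loss")
plt.show()
```

The resulting plot is flat at 1 except at the five values that appear in the data, where it dips; the lowest dip is at the mode, $h = 4$, where $R_{01}(4) = 0.6$ since four of the ten data points equal 4.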
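For part (c), one way to see the issue numerically (again a sketch, not part of the original exercise) is to estimate the derivative of $R_{01}$ by finite differences at a few points drawn from $[0, 5]$.

```python
import numpy as np

y = np.array([4, 2, 4, 1, 3, 4, 4, 3, 2, 5])

def zero_one_risk(h, y):
    return np.mean(h != y)

# Central finite-difference estimate of dR/dh at several random points.
eps = 1e-6
rng = np.random.default_rng(0)
for h in rng.uniform(0, 5, size=5):
    slope = (zero_one_risk(h + eps, y) - zero_one_risk(h - eps, y)) / (2 * eps)
    print(f"h = {h:.3f}:  estimated dR/dh = {slope}")
# For essentially any h that is not exactly a data value, the estimate
# is 0.0: the risk is piecewise constant, so the gradient gives a
# descent step no direction to follow, and at the discontinuities
# themselves the derivative does not exist at all.
```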