Question

1 Approved Answer

Posted on Oct 11, 2024

Please help with this problem. Thanks.

2 Regret of Explore-then-Commit

In this problem, we analyze the regret of the Explore-then-Commit algorithm for the multi-armed bandit (MAB) problem. We consider a stochastic MAB problem with a set of $K = 2$ arms $\mathcal{A} = \{1, 2\}$. Recall from Lab 8 that each arm $A \in \mathcal{A}$ is associated with a reward distribution $X^A \sim \mathbb{P}_A$, with corresponding mean $\mu_A = \mathbb{E}[X^A]$. We will assume throughout this problem that the first arm has higher average reward, i.e. $\mu_1 > \mu_2$. At each round $t = 1, \dots, n$, our algorithm chooses an arm $A_t \in \mathcal{A}$ and receives a corresponding reward $X_t^{A_t} \sim \mathbb{P}_{A_t}$, independent of all previous rewards.

If we knew arm 1 has higher average reward, we would choose $A_t = 1$ each round in order to maximize the expected total reward. In practice, however, we do not know which arm is better since the means $\{\mu_1, \mu_2\}$ are unknown. The expected reward of our algorithm will always be less than $n\mu_1$, and we quantify the price we pay for not knowing the better arm via the regret

$$R_n = n\mu_1 - \mathbb{E}\left[\sum_{t=1}^{n} X_t^{A_t}\right].$$

In this problem we analyze the regret of the explore-then-commit (ETC) algorithm from problem 1(b). Recall from Algorithm 1 that ETC proceeds in two phases. In the exploration phase, each arm $A \in \mathcal{A}$ is pulled $c$ times in order to produce an estimate $\hat{\mu}_A = \frac{1}{c} \sum_{t \le cK :\, A_t = A} X_t^{A}$ of the mean reward for that arm. In the commit phase, i.e. for every $t > cK$, we choose $A_t = \hat{A}$, where $\hat{A} = \arg\max_{A \in \mathcal{A}} \hat{\mu}_A$ is the apparent best arm at the end of the exploration phase.

In the first part of our analysis, we evaluate the probability that we incorrectly identify arm 2 as the best arm, i.e. $\mathbb{P}(\hat{A} = 2)$.

(a) (2 points) Assume each reward is in the unit interval $[0, 1]$, i.e. $0 \le X^A \le 1$ for $A \in \{1, 2\}$. Show that [inequality illegible in the source; it bounds $\mathbb{P}(\hat{A} = 2)$ in terms of $c\Delta^2$], where $\Delta = \mu_1 - \mu_2$.
Hint: apply Hoeffding's inequality from Lecture 18.

In parts (b) and (c), we write the regret in terms of the probability $\mathbb{P}(\hat{A} = 2)$.

(b) (3 points) Let $m$ denote the number of times arm 2 has been pulled, up to and including time $n$. Show $R_n = \Delta\, \mathbb{E}[m]$.
Hint: Start from the following:

$$n\mu_1 - \mathbb{E}\left[\sum_{t=1}^{n} X_t^{A_t}\right] = \mathbb{E}\left[\sum_{t=1}^{n} \left(\mu_1 - X_t^{A_t}\right)\right] = \mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{A_t = 1\}\left(\mu_1 - X_t^{1}\right)\right] + \mathbb{E}\left[\sum_{t=1}^{n} \mathbb{1}\{A_t = 2\}\left(\mu_1 - X_t^{2}\right)\right].$$

Note also that for all $t$, $A_t$ is independent of $X_t^{A}$ for $A \in \{1, 2\}$.

(c) (1 point) Show that if $n > 2c$, then $\mathbb{E}[m] = c + (n - 2c)\, \mathbb{P}(\hat{A} = 2)$.
Hint: If $n > 2c$, both arms are pulled deterministically $c$ times during the first $2c$ rounds. Afterward, an arm is only pulled if it is the one we have committed to.

In parts (d) and (e), we finalize our bound on the regret $R_n$.

(d) (1 point) Show that [bound on $R_n$ illegible in the source].
Hint: combine parts (a)-(c).

(e) (3 points) Suppose you knew the sub-optimality gap $\Delta$. Solve for (and report) a value of $c$ which guarantees that: [condition missing in the source]. For this number of exploratory pulls $c$, what is the upper bound on the regret from part (d)? Your answer should be in terms of $n$ and $\Delta$. Does this bound grow linearly in $n$, or does it do better (i.e. is it sublinear)?
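To make the Explore-then-Commit procedure described in the problem concrete, here is a minimal simulation sketch. It is not the course's Algorithm 1. It assumes Bernoulli rewards (which lie in [0, 1]), a round-robin order for the exploration pulls, and 0-based arm indices, and the names `explore_then_commit` and `estimate_regret` are made up for illustration. The regret estimate follows the definition directly: n times the best mean, minus the average total reward over repeated runs.

```python
import random


def explore_then_commit(means, n, c, seed=0):
    """Run ETC once on Bernoulli arms.

    means: true mean of each arm (assumed Bernoulli, so rewards are 0 or 1).
    n: total number of rounds; c: exploration pulls per arm.
    Returns (committed_arm, pulls_per_arm, total_reward), arms 0-indexed.
    """
    rng = random.Random(seed)
    k = len(means)
    if n <= c * k:
        raise ValueError("need n > c * K so that a commit phase exists")

    pulls = [0] * k
    explore_sums = [0.0] * k
    total = 0.0

    def pull(arm):
        # Draw one Bernoulli reward from the chosen arm.
        return 1.0 if rng.random() < means[arm] else 0.0

    # Exploration phase: each arm is pulled c times (round-robin is an assumption).
    for _ in range(c):
        for arm in range(k):
            reward = pull(arm)
            pulls[arm] += 1
            explore_sums[arm] += reward
            total += reward

    # Commit to the arm with the largest empirical mean (ties go to the lowest index).
    estimates = [s / c for s in explore_sums]
    committed = max(range(k), key=lambda a: estimates[a])

    # Commit phase: pull the apparent best arm for all remaining rounds.
    for _ in range(n - c * k):
        total += pull(committed)
        pulls[committed] += 1

    return committed, pulls, total


def estimate_regret(means, n, c, runs=200):
    """Monte Carlo estimate of n * max(means) - E[total reward]."""
    avg_total = sum(
        explore_then_commit(means, n, c, seed=s)[2] for s in range(runs)
    ) / runs
    return n * max(means) - avg_total


if __name__ == "__main__":
    # Illustrative values only; they do not come from the problem.
    for c in (5, 25, 100):
        print(c, estimate_regret([0.6, 0.4], n=1000, c=c))
```

Running the script for several values of `c` gives a rough feel for the trade-off behind parts (d) and (e): too few exploration pulls make a wrong commitment more likely, while too many spend rounds on the worse arm.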