Question:
Consider a 2-armed bandit instance B in which the rewards from the arms come from uniform distributions (recall that the lectures assumed they came from Bernoulli distributions). The rewards of arm 1 are drawn uniformly at random from [a, b], and the rewards of arm 2 are drawn uniformly at random from [c, d], where 0 < a < c < b < d < 1. Observe that this means there is an overlap: both arms produce some rewards from the interval [c, b].

An algorithm L proceeds as follows. First it pulls arm 1; then it pulls arm 2; whichever of these arms produced the higher reward (or arm 1 in case of a tie) is then pulled a further 20 times. In other words, the algorithm performs round-robin exploration for 2 steps and greedily picks an arm for the subsequent exploitation phase, during which that arm is blindly pulled 20 times.

What is the expected cumulative regret of L on B after 22 pulls? (If you have worked out an answer but are not sure about it, consider writing a small program to simulate L and run it many times for fixed a, b, c, d. Is the average regret from these runs close to your answer? The program is for your own sake; there is no need to submit it or to explain it to us.)

Step by Step Solution
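A sketch of one way to derive the answer, assuming the standard notion of expected (pseudo-)regret, i.e., the expected number of pulls of the suboptimal arm times the gap between the arms' means:

Arm 1 has mean (a + b)/2 and arm 2 has mean (c + d)/2. Since a < c and b < d, arm 2 is the optimal arm, and every pull of arm 1 contributes Delta = ((c + d) - (a + b))/2 to the expected regret.

Of the two exploration pulls, the pull of arm 1 contributes Delta and the pull of arm 2 contributes nothing. All 20 exploitation pulls then go to arm 1 exactly when the first reward X1 is at least the second reward X2; because both distributions are continuous, ties occur with probability 0, and X1 > X2 is possible only in the overlap [c, b]:

P(X1 > X2) = integral over x in [c, b] of ((x - c)/(d - c)) * (1/(b - a)) dx = (b - c)^2 / (2(b - a)(d - c)).

Combining the two phases, the expected cumulative regret after 22 pulls is

Delta * (1 + 20 * P(X1 > X2)) = (((c + d) - (a + b))/2) * (1 + 10 * (b - c)^2 / ((b - a)(d - c))).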

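As the question itself suggests, a small program can sanity-check this. Below is a minimal simulation sketch in Python; the endpoints a = 0.1, b = 0.6, c = 0.3, d = 0.9 and the run count are illustrative choices only, not values given in the question.

import random

def run_once(a, b, c, d):
    # One run of L: one pull of each arm, then 20 greedy pulls.
    # Returns the pseudo-regret of the run (suboptimal pulls times the gap).
    delta = (c + d) / 2 - (a + b) / 2  # mean gap; arm 2 is optimal
    x1 = random.uniform(a, b)          # exploration pull of arm 1
    x2 = random.uniform(c, d)          # exploration pull of arm 2
    regret = delta                     # the single exploration pull of arm 1
    if x1 >= x2:                       # ties (probability 0) go to arm 1
        regret += 20 * delta           # all 20 exploitation pulls hit arm 1
    return regret                      # otherwise arm 2 adds no regret

def average_regret(a, b, c, d, runs=100_000):
    return sum(run_once(a, b, c, d) for _ in range(runs)) / runs

if __name__ == "__main__":
    print(average_regret(0.1, 0.6, 0.3, 0.9))

With these endpoints, Delta = 0.25 and P(X1 > X2) = 0.15, so the printed average should settle near 0.25 * (1 + 20 * 0.15) = 1.0.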