Question. A 2-armed bandit instance $I$ has mean rewards $p_1, p_2 \in [0, 1]$ for its arms, where $|p_1 - p_2| = \Delta > 0$. Both arms produce 0 and 1 rewards (that is, from Bernoulli distributions). Suppose we are given $\Delta$, but we do not know which arm has the higher mean reward. Our aim is to determine the optimal arm with probability at least $1 - \delta$. In order to do so, we pull each arm $N$ times, and declare as our answer the arm which registers the higher empirical mean (breaking ties uniformly at random). Show that it suffices to set $N = O\left(\frac{1}{\Delta^2} \log \frac{1}{\delta}\right)$ in order to indeed give the correct answer with probability at least $1 - \delta$.
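The claim can be sanity-checked numerically before attempting the proof. The sketch below is only an illustration, not the expected written solution. It assumes one explicit choice of constant, $N = \lceil (2/\Delta^2) \ln(1/\delta) \rceil$, which follows from applying Hoeffding's inequality to the per-pull reward differences (each lies in $[-1, 1]$ and has mean $\Delta$ when the better arm is listed first), giving an error probability of at most $\exp(-N\Delta^2/2)$. The function names `sample_size`, `pick_best_arm`, and `estimate_error_rate` are hypothetical and chosen for this example.

```python
import math
import random


def sample_size(gap, fail_prob):
    """Pulls per arm under the assumed bound N = ceil((2 / gap^2) * ln(1 / fail_prob))."""
    return math.ceil(2.0 / gap**2 * math.log(1.0 / fail_prob))


def pick_best_arm(p1, p2, n, rng):
    """Pull each Bernoulli arm n times; return the index (0 or 1) of the arm
    with the higher empirical mean, breaking ties uniformly at random."""
    wins1 = sum(rng.random() < p1 for _ in range(n))
    wins2 = sum(rng.random() < p2 for _ in range(n))
    if wins1 == wins2:
        return rng.choice([0, 1])
    return 0 if wins1 > wins2 else 1


def estimate_error_rate(p1, p2, fail_prob, trials=2000, seed=0):
    """Monte Carlo estimate of how often the procedure picks the wrong arm."""
    rng = random.Random(seed)
    n = sample_size(abs(p1 - p2), fail_prob)
    best = 0 if p1 > p2 else 1
    errors = sum(pick_best_arm(p1, p2, n, rng) != best for _ in range(trials))
    return n, errors / trials


if __name__ == "__main__":
    n, err = estimate_error_rate(0.6, 0.4, fail_prob=0.1)
    print(f"N = {n}, empirical error rate = {err:.4f} (target at most 0.1)")
```

The empirical error rate should fall well below $\delta$, since Hoeffding's inequality is a worst-case bound and is usually loose for any particular pair of means.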
