Question
(Thompson sampling always optimal?) Thompson sampling and UCB are two of the most popular algorithms for the multi-armed bandit problem. We have also seen evidence for their optimality, but only under the assumption that the arms' reward mean parameters are chosen independently: for example, pulling arm 1 does not yield any partial information about the other arms. In this problem, we will examine what happens if arms do reveal partial information about each other: is Thompson sampling still the best idea?

In particular, we will consider a 4-armed bandit instance and take θ to be an unknown parameter drawn uniformly at random from the set {1, 2, 3}. In our model, the parameter θ influences the rewards of all 4 arms. The first 3 arms yield Bernoulli rewards (i.e. in {0, 1}) with means (μ1, μ2, μ3) given by:

(2/3, 1/2, 1/3) if θ = 1,
(…, …, …) if θ = 2,
(…, …, …) if θ = 3.

Moreover, the fourth arm yields a deterministic reward equal to θ/4, i.e. there is no noise in the reward observation. While our algorithms will not know the value of θ, they will know that the reward of the 4th arm is deterministic. In this problem, we will evaluate: a) the pseudo-regret for each value of θ, and b) the Bayesian regret over the uniform distribution specified on θ.

(a) Report the identity of the optimal arm in each of the cases θ = 1, 2, 3.

(b) Consider the Thompson sampling algorithm run with the uniform prior over the instances θ = 1, 2, 3. Evaluate the probability distribution of the action taken at the first round, denoted by A1.

(c) Suppose that Nature gave us θ = 1 (so the true instance is μ1 = 2/3, μ2 = 1/2, μ3 = 1/3, μ4 = 1/4). Evaluate the probability distribution of the action taken at the second round, denoted by A2, in all six cases: a) A1 = 1 and observed reward equal to 1, b) A1 = 1 and observed reward equal to 0, c) A1 = 2 and observed reward equal to 1, d) A1 = 2 and observed reward equal to 0, e) A1 = 3 and observed reward equal to 1, f) A1 = 3 and observed reward equal to 0.
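Parts (b) and (c) boil down to Thompson sampling with a discrete posterior over the three instances: sample θ from the current posterior, play the arm that is optimal under the sampled θ, then do a Bayes update on the observed reward. The Python sketch below illustrates that computation under a few stated assumptions: the helper names (thompson_action_distribution, posterior_after_pull) are ours rather than from the problem, ties are broken uniformly at random as one possible convention, and since only the θ = 1 row of the mean table is legible above, the θ = 2 and θ = 3 rows in the demo are placeholders to be replaced with the values from the full problem statement.

```python
import numpy as np

# Sketch of Thompson sampling with a discrete prior over theta in {1, 2, 3}.
# Helper names here are illustrative, not from the problem statement.

def arm4_mean(theta):
    # Arm 4 pays the deterministic reward theta / 4, so pulling it reveals theta.
    return theta / 4.0

def thompson_action_distribution(posterior, means):
    """Return dist with dist[a-1] = P(next action = arm a) under Thompson sampling.

    posterior : dict theta -> probability
    means     : dict theta -> (mu1, mu2, mu3) for the three Bernoulli arms
    TS samples theta from the posterior and plays the arm that is optimal under
    the sampled theta; ties are split uniformly (one possible convention).
    """
    dist = np.zeros(4)
    for theta, p in posterior.items():
        mus = list(means[theta]) + [arm4_mean(theta)]
        best = np.flatnonzero(np.isclose(mus, max(mus)))
        dist[best] += p / len(best)
    return dist

def posterior_after_pull(posterior, means, arm, reward):
    """Bayes update of the posterior on theta after one Bernoulli observation
    from arm in {1, 2, 3} (arm 4 is deterministic and would reveal theta)."""
    unnormalized = {}
    for theta, p in posterior.items():
        mu = means[theta][arm - 1]
        unnormalized[theta] = p * (mu if reward == 1 else 1.0 - mu)
    z = sum(unnormalized.values())
    return {theta: w / z for theta, w in unnormalized.items()}

if __name__ == "__main__":
    prior = {1: 1/3, 2: 1/3, 3: 1/3}   # uniform prior over the instances
    means = {
        1: (2/3, 1/2, 1/3),            # row given in the problem for theta = 1
        2: (0.5, 0.5, 0.5),            # PLACEHOLDER: replace with the theta = 2 row
        3: (0.5, 0.5, 0.5),            # PLACEHOLDER: replace with the theta = 3 row
    }
    print("P(A1):", thompson_action_distribution(prior, means))        # part (b)
    post = posterior_after_pull(prior, means, arm=1, reward=1)         # part (c), case a)
    print("P(A2 | A1=1, r=1):", thompson_action_distribution(post, means))
```

Note that in the six cases of part (c) only arms 1-3 are pulled at round 1, so the Bernoulli likelihood update above covers them all; pulling arm 4 would instead reveal θ exactly and collapse the posterior to a point mass.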