Question

(Thompson sampling always optimal) Thompson sampling and UCB are two of the most popular algorithms for the multi-armed bandit problem. We have also seen evidence for their optimality, but only under the assumption that the arms' reward mean parameters are independently chosen: for example, pulling arm 1 does not yield any partial information about the other arms. In this problem, we will examine what happens if arms do reveal partial information about each other: is Thompson sampling still the best idea?

In particular, we will consider a 4-armed bandit instance, and take $\theta$ to be an unknown parameter that is drawn uniformly at random from the set $\{1, 2, 3\}$. In our model, the parameter $\theta$ will influence the rewards of all 4 arms. The first 3 arms will yield Bernoulli rewards (i.e. in $\{0, 1\}$) with means given by:

$(\mu_1, \mu_2, \mu_3) = (2/3, 1/2, 1/3)$ if $\theta = 1$,
$(\mu_1, \mu_2, \mu_3) =$ [values illegible in source] if $\theta = 2$,
$(\mu_1, \mu_2, \mu_3) =$ [values illegible in source] if $\theta = 3$.

Moreover, the fourth arm will yield a deterministic reward that depends on $\theta$ [expression illegible in source; it equals $1/4$ when $\theta = 1$], i.e. there is no noise in the reward observation. While our algorithms will not know the value of $\theta$, they will know that the reward of the 4th arm is deterministic. In this problem, we will evaluate: a) the pseudo-regret for each value of $\theta$, and b) the Bayesian regret over the uniform distribution that is specified on $\theta$.

(a) Report the identity of the optimal arm in each of the cases $\theta = 1, 2, 3$.

(b) Consider the Thompson sampling algorithm run with the uniform prior over the instances $\theta = 1, 2, 3$. Evaluate the probability distribution over the action taken at the first round, which is denoted by $A_1$.

(c) Suppose that Nature gave us $\theta = 1$ (therefore, the true instance is given by $\mu_1 = 2/3$, $\mu_2 = 1/2$, $\mu_3 = 1/3$, $\mu_4 = 1/4$). Evaluate the probability distribution over the action taken at the second round, which is denoted by $A_2$, in all six cases:
a) $A_1 = 1$ and observed reward equal to 1,
b) $A_1 = 1$ and observed reward equal to 0,
c) $A_1 = 2$ and observed reward equal to 1,
d) $A_1 = 2$ and observed reward equal to 0,
e) $A_1 = 3$ and observed reward equal to 1,
f) $A_1 = 3$ and observed reward equal to 0.
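
The mechanics behind parts (b) and (c) can be sketched in a few lines of Python. This is only an illustration, not the posted solution. Thompson sampling draws a value of theta from the current belief and plays the arm that is best under that draw, so the action distribution is the belief mass sitting on each arm's "optimal" set of theta values. In the table below, only the theta = 1 row comes from the problem statement. The theta = 2 and theta = 3 rows, including their arm-4 rewards, are hypothetical placeholders, because those values are illegible in the source. Replace them with the real values before reading anything into the output. All function names are made up for this sketch.

```python
from fractions import Fraction as F

# Rows: theta -> mean rewards of arms 1..4.
MEANS = {
    1: [F(2, 3), F(1, 2), F(1, 3), F(1, 4)],   # from the problem statement
    2: [F(1, 3), F(2, 3), F(1, 2), F(1, 8)],   # HYPOTHETICAL placeholder row
    3: [F(1, 2), F(1, 3), F(2, 3), F(1, 16)],  # HYPOTHETICAL placeholder row
}
DETERMINISTIC_ARM = 4


def likelihood(theta, arm, reward):
    """P(reward | theta, arm) for Bernoulli arms 1-3 and the noiseless arm 4."""
    mu = MEANS[theta][arm - 1]
    if arm == DETERMINISTIC_ARM:
        return F(1) if reward == mu else F(0)
    return mu if reward == 1 else 1 - mu


def posterior_update(belief, arm, reward):
    """Bayes' rule over theta after observing `reward` from `arm`."""
    weights = {t: p * likelihood(t, arm, reward) for t, p in belief.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observation has zero probability under every theta")
    return {t: w / total for t, w in weights.items()}


def action_distribution(belief):
    """Thompson sampling: draw theta from `belief`, play that theta's best arm."""
    dist = {arm: F(0) for arm in range(1, 5)}
    for theta, p in belief.items():
        row = MEANS[theta]
        best = row.index(max(row)) + 1  # ties go to the lowest-numbered arm
        dist[best] += p
    return dist


# Uniform prior over theta, as in part (b).
prior = {t: F(1, 3) for t in MEANS}
first_round = action_distribution(prior)

# One of the six cases in part (c): A_1 = 1 with observed reward 1.
after_win = posterior_update(prior, arm=1, reward=1)
second_round = action_distribution(after_win)
```

The other five cases of part (c) follow by changing `arm` and `reward` in the call to `posterior_update`. Exact fractions are used instead of floats so the resulting probabilities can be compared directly against a hand calculation.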
