Question: Figure 17.13 shows two MDPs: one, M, represents a two-armed bandit where one has the choice to continue with the first arm or to switch permanently to a second arm with fixed reward λ; the other, a restart MDP Ms, gives one a choice to continue with the first arm or restart the sequence. The figure illustrates the construction of Ms for the case where M has a deterministic reward sequence and just two arms (including the λ-arm). Explain how to construct Ms when M has k + 1 arms (including the λ-arm) and each arm is a general MRP. Show that the value of Ms equals the minimum value of λ such that one would be indifferent in M between pulling the best arm and switching to the λ-arm forever.
Step by Step Solution
To construct the MDP Ms, we add a new state to the original MDP M, representing the point at which we have chosen to restart the sequence. From ...
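As a rough numerical illustration of the restart idea in the question, here is a minimal Python sketch for a single arm modelled as an MRP. It is not taken from the answer above. The function names `restart_value` and `gittins_index`, the discount factor `gamma = 0.5`, and the example reward sequence are all assumptions chosen for illustration. The sketch also assumes the λ-arm pays λ on every step, so that switching forever is worth λ / (1 − γ); under that assumption the indifference value of λ is (1 − γ) times the restart MDP's value at the start state.

```python
import numpy as np

def restart_value(P, r, s0, gamma, iters=2000):
    """Value iteration on a restart MDP built from one arm (an MRP).

    P[s, s'] is the arm's transition matrix and r[s] the reward for
    pulling the arm in state s. In every state there are two actions:
    continue from the current state, or restart and act as if in s0.
    """
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    V = np.zeros(len(r))
    for _ in range(iters):
        cont = r + gamma * (P @ V)       # value of continuing in each state
        V = np.maximum(cont, cont[s0])   # restarting yields the value at s0
    return V

def gittins_index(P, r, s0, gamma):
    """Indifference lambda, assuming the lambda-arm pays lambda per step."""
    return (1.0 - gamma) * restart_value(P, r, s0, gamma)[s0]

# Hypothetical deterministic arm: rewards 0, 2, 0, 7.2, then 0 forever.
rewards = [0.0, 2.0, 0.0, 7.2, 0.0]
n = len(rewards)
P = np.zeros((n, n))
for s in range(n - 1):
    P[s, s + 1] = 1.0                    # move to the next reward
P[n - 1, n - 1] = 1.0                    # final zero-reward state absorbs

index = gittins_index(P, rewards, s0=0, gamma=0.5)
print(round(index, 4))                   # 1.0133
```

For this example the best stopping point is after the fourth pull, giving (0.5 · 2 + 0.125 · 7.2) / (1 + 0.5 + 0.25 + 0.125) = 1.9 / 1.875, which matches the printed value. With k + 1 arms, one plausible use of the same routine is to compute this quantity separately for each arm's current state and compare the results.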
