Question:

54. In a Markov decision problem, another criterion often used, different than the expected average return per unit time, is that of the expected discounted return. In this criterion we choose a number $\alpha$, $0 < \alpha < 1$, and try to choose a policy so as to maximize $E\left[\sum_{i=0}^{\infty} \alpha^i R(X_i, a_i)\right]$. (That is, rewards at time $n$ are discounted at rate $\alpha^n$.) Suppose that the initial state is chosen according to the probabilities $b_i$. That is,

$$P\{X_0 = i\} = b_i, \qquad i = 1, \ldots, n$$

For a given policy $\beta$, let $y_{ja}$ denote the expected discounted time that the process is in state $j$ and action $a$ is chosen. That is,

$$y_{ja} = E_\beta\left[\sum_{n=0}^{\infty} \alpha^n I_{\{X_n = j,\, a_n = a\}}\right]$$

where for any event $A$ the indicator variable $I_A$ is defined by

$$I_A = \begin{cases} 1, & \text{if } A \text{ occurs} \\ 0, & \text{otherwise} \end{cases}$$

(a) Show that

$$\sum_a y_{ja} = E\left[\sum_{n=0}^{\infty} \alpha^n I_{\{X_n = j\}}\right]$$

or, in other words, $\sum_a y_{ja}$ is the expected discounted time in state $j$ under $\beta$.
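As a concrete illustration of this definition (not part of the original exercise), the following Python sketch estimates $y_{ja}$ by Monte Carlo for a small made-up MDP; the transition probabilities $P$, policy $\beta$, initial distribution $b$, and discount factor $\alpha$ are all assumed example values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, i, j] = P_ij(a)
              [[0.5, 0.5], [0.6, 0.4]]])
beta = np.array([[0.3, 0.7], [0.5, 0.5]])  # beta[j, a] = prob. of action a in state j
b = np.array([0.5, 0.5])                   # initial distribution b_i
alpha = 0.9

def estimate_y(n_paths=5_000, horizon=100):
    """Monte Carlo estimate of y_ja = E_beta[sum_n alpha^n I{X_n = j, a_n = a}]."""
    y = np.zeros((2, 2))
    for _ in range(n_paths):
        x = rng.choice(2, p=b)
        disc = 1.0
        for _ in range(horizon):           # alpha^n is negligible past the horizon
            a = rng.choice(2, p=beta[x])
            y[x, a] += disc
            x = rng.choice(2, p=P[a, x])
            disc *= alpha
    return y / n_paths

y = estimate_y()
print(y)                 # estimated y[j, a]
print(y.sum(axis=1))     # part (a): expected discounted time in state j
```

Summing the estimate over actions gives the expected discounted time in state $j$, which is exactly the claim in part (a).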

(b) Show that

$$\sum_j \sum_a y_{ja} = \frac{1}{1 - \alpha}$$

$$\sum_a y_{ja} = b_j + \alpha \sum_i \sum_a y_{ia} P_{ij}(a)$$

Hint: For the second equation, use the identity

$$I_{\{X_{n+1} = j\}} = \sum_i \sum_a I_{\{X_n = i,\, a_n = a\}} I_{\{X_{n+1} = j\}}$$

Take expectations of the preceding to obtain

$$E\big[I_{\{X_{n+1} = j\}}\big] = \sum_i \sum_a E\big[I_{\{X_n = i,\, a_n = a\}}\big] P_{ij}(a)$$
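For reference, here is one way to pass from the hint to the second equation (the steps beyond the stated hint are mine): multiply the expectation identity by $\alpha^{n+1}$ and sum over $n \geq 0$,

$$\sum_{n=0}^{\infty} \alpha^{n+1} E\big[I_{\{X_{n+1}=j\}}\big] = \alpha \sum_i \sum_a \left( \sum_{n=0}^{\infty} \alpha^n E\big[I_{\{X_n=i,\,a_n=a\}}\big] \right) P_{ij}(a) = \alpha \sum_i \sum_a y_{ia} P_{ij}(a).$$

The left side is the expected discounted time in state $j$ with the $n = 0$ term removed, and that term has expectation $P\{X_0 = j\} = b_j$; hence $\sum_a y_{ja} - b_j = \alpha \sum_i \sum_a y_{ia} P_{ij}(a)$.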

(c) Let $\{y_{ja}\}$ be a set of numbers satisfying

$$\sum_j \sum_a y_{ja} = \frac{1}{1 - \alpha}, \qquad \sum_a y_{ja} = b_j + \alpha \sum_i \sum_a y_{ia} P_{ij}(a) \tag{4.27}$$

Argue that $y_{ja}$ can be interpreted as the expected discounted time that the process is in state $j$ and action $a$ is chosen when the initial state is chosen according to the probabilities $b_j$ and the policy $\beta$, given by

$$\beta_j(a) = \frac{y_{ja}}{\sum_a y_{ja}},$$

is employed.

Hint: Derive a set of equations for the expected discounted times when policy $\beta$ is used and show that they are equivalent to Equation (4.27).
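The interpretation in part (c) can also be checked numerically: for a fixed stationary policy $\beta$, the quantities $x_j = \sum_a y_{ja}$ solve the linear system $x_j = b_j + \alpha \sum_i x_i (P_\beta)_{ij}$, and $y_{ja} = x_j \beta_j(a)$ then satisfies Equation (4.27). A minimal sketch (my construction, reusing the made-up MDP from the earlier sketch):

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (same assumed values as above).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, i, j] = P_ij(a)
              [[0.5, 0.5], [0.6, 0.4]]])
beta = np.array([[0.3, 0.7], [0.5, 0.5]])  # beta[j, a]
b = np.array([0.5, 0.5])
alpha = 0.9

# Occupation measure x_j = sum_a y_ja solves x = b + alpha * P_beta^T x,
# where P_beta[j, i] = sum_a beta[j, a] * P[a, j, i].
P_beta = np.einsum('ja,aji->ji', beta, P)
x = np.linalg.solve(np.eye(2) - alpha * P_beta.T, b)
y = x[:, None] * beta                      # y[j, a] = x_j * beta_j(a)

# Check both equations of (4.27).
assert np.isclose(y.sum(), 1 / (1 - alpha))
assert np.allclose(y.sum(axis=1),
                   b + alpha * np.einsum('ia,aij->j', y, P))
```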

(d) Argue that an optimal policy with respect to the expected discounted return criterion can be obtained by first solving the linear program

maximize $\sum_j \sum_a y_{ja} R(j, a)$,

such that

$$\sum_j \sum_a y_{ja} = \frac{1}{1 - \alpha},$$
$$\sum_a y_{ja} = b_j + \alpha \sum_i \sum_a y_{ia} P_{ij}(a),$$
$$y_{ja} \geq 0, \quad \text{all } j, a;$$

and then defining the policy $\beta^*$ by

$$\beta_j^*(a) = \frac{y_{ja}^*}{\sum_a y_{ja}^*},$$

where the $y_{ja}^*$ are the solutions of the linear program.
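A minimal sketch of solving this linear program with scipy.optimize.linprog, again on made-up data (the numbers in $P$, $R$, and $b$ are assumptions, not from the text). Since summing the balance equations over $j$ already forces $\sum_j \sum_a y_{ja} = 1/(1-\alpha)$, the sketch passes only the balance equations to the solver:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative 2-state, 2-action data (assumed values, for demonstration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, i, j] = P_ij(a)
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])     # R[j, a] = R(j, a)
b = np.array([0.5, 0.5])
alpha = 0.9
nS, nA = 2, 2

# Variables y_ja, flattened as y[j * nA + a]; linprog minimizes, so negate R.
c = -R.reshape(-1)

# Balance equations: sum_a y_ja - alpha * sum_{i,a} y_ia * P_ij(a) = b_j.
# (Summing these over j yields sum y = 1/(1 - alpha) automatically.)
A_eq = np.zeros((nS, nS * nA))
for j in range(nS):
    for i in range(nS):
        for a in range(nA):
            A_eq[j, i * nA + a] = (i == j) - alpha * P[a, i, j]

res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
y_star = res.x.reshape(nS, nA)
policy = y_star / y_star.sum(axis=1, keepdims=True)  # beta*_j(a)
print(policy)
```

At a vertex of this feasible polytope there are at most $n$ positive variables (one per equality constraint), and every state with $b_j > 0$ needs at least one, so the recovered policy here comes out deterministic, as the theory of Markov decision processes leads one to expect.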
55. Consider an $N$-state Markov chain that is ergodic and let $\pi_i$, $i = 1, \ldots, N$, denote the limiting probabilities. Suppose that a reward $R(i)$ is earned whenever the process is in state $i$. Then $\sum_{j=0}^{n} R(X_j)$ is the reward earned during the first $n + 1$ time periods. (Of course, $X_j$ is the state of the Markov chain at time $j$.) Show that the average reward per unit time,

$$\lim_{n \to \infty} \frac{\sum_{j=0}^{n} R(X_j)}{n + 1},$$

equals $\sum_{i=1}^{N} \pi_i R(i)$ with probability 1.
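As a quick numerical illustration (an assumed example chain, not from the text), one can simulate a small ergodic chain and compare the long-run average of $R(X_j)$ with $\sum_i \pi_i R(i)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 3-state ergodic chain and rewards (made-up values).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4]])
R = np.array([1.0, -0.5, 2.0])

# Limiting probabilities: left eigenvector of P for eigenvalue 1, normalized.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

# Time average of R(X_j) over a long sample path.
n = 200_000
x, total = 0, 0.0
for _ in range(n):
    total += R[x]
    x = rng.choice(3, p=P[x])

print(total / n, pi @ R)   # the two values should be close
```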
