Question:

In a Markov decision problem, another criterion often used, different than the expected average return per unit time, is that of the expected discounted return.

In this criterion we choose a number α, 0 < α < 1, and try to choose a policy so as to maximize

$$E_\beta\Big[\sum_{i=0}^{\infty} \alpha^i R(X_i, a_i)\Big]$$

(that is, rewards at time n are discounted at rate α^n). Suppose that the initial state is chosen according to the probabilities b_i. That is,

$$P\{X_0 = i\} = b_i.$$
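As a quick numeric illustration (ours, not from the original problem): if every reward equals 1 and α = 0.9, the expected discounted return is the geometric series

$$E\Big[\sum_{i=0}^{\infty} \alpha^i \cdot 1\Big] = \frac{1}{1 - 0.9} = 10,$$

so values of α near 1 weight the distant future heavily, while values near 0 emphasize immediate rewards.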

For a given policy β, let y_{ja} denote the expected discounted time that the process is in state j and action a is chosen. That is,

$$y_{ja} = E_\beta\Big[\sum_{n=0}^{\infty} \alpha^n I_{\{X_n = j,\ a_n = a\}}\Big]$$

where for any event A the indicator variable I_A is defined by

$$I_A = \begin{cases} 1, & \text{if } A \text{ occurs} \\ 0, & \text{otherwise.} \end{cases}$$

(a) Show that

$$\sum_a y_{ja} = E_\beta\Big[\sum_{n=0}^{\infty} \alpha^n I_{\{X_n = j\}}\Big]$$

or, in other words, that ∑_a y_{ja} is the expected discounted time in state j under β.
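To make the definition concrete, here is a minimal Monte Carlo sketch (not part of the original problem; the two-state, two-action transition probabilities, policy, and discount factor are hypothetical) that estimates y_{ja} by simulating the chain and truncating the infinite sum:

```python
import numpy as np

# Monte Carlo estimate of y_{ja} for a hypothetical 2-state, 2-action MDP
# under a fixed randomized stationary policy beta, truncating the infinite
# discounted sum at `horizon`.

rng = np.random.default_rng(0)

alpha = 0.9                          # discount factor, 0 < alpha < 1
b = np.array([0.5, 0.5])             # initial-state probabilities b_i
P = np.array([[[0.7, 0.3],           # P[a, i, j] = P_ij(a), hypothetical
               [0.4, 0.6]],
              [[0.2, 0.8],
               [0.9, 0.1]]])
beta = np.array([[0.5, 0.5],         # beta[j, a] = prob. of action a in state j
                 [0.3, 0.7]])

def estimate_y(n_paths=5000, horizon=100):
    """Estimate y_{ja} = E_beta[ sum_n alpha^n I{X_n = j, a_n = a} ]."""
    y = np.zeros((2, 2))
    for _ in range(n_paths):
        x = rng.choice(2, p=b)               # X_0 drawn from {b_i}
        for n in range(horizon):             # alpha^100 is negligible
            a = rng.choice(2, p=beta[x])     # a_n drawn from beta_x(.)
            y[x, a] += alpha ** n
            x = rng.choice(2, p=P[a, x])     # X_{n+1} drawn from P_x.(a)
    return y / n_paths

y = estimate_y()
print(y)          # estimated y_{ja}
print(y.sum())    # total discounted time; close to 1/(1 - alpha) = 10
```

Summing the printed estimate over a gives the expected discounted time in each state, as in part (a), and the grand total is (up to truncation and sampling error) 1/(1 - α), anticipating part (b).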

(b) Show that

$$\sum_j \sum_a y_{ja} = \frac{1}{1-\alpha}, \qquad \sum_a y_{ja} = b_j + \alpha \sum_i \sum_a y_{ia} P_{ij}(a).$$
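One way to see the first identity (a supplementary sketch, not part of the original statement): at every time n exactly one state-action pair occurs, so the inner indicators sum to 1 and

$$\sum_j \sum_a y_{ja} = E_\beta\Big[\sum_{n=0}^{\infty} \alpha^n \sum_j \sum_a I_{\{X_n = j,\ a_n = a\}}\Big] = \sum_{n=0}^{\infty} \alpha^n = \frac{1}{1-\alpha}.$$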

(c) Let {y_{ja}} be a set of numbers satisfying

$$\sum_j \sum_a y_{ja} = \frac{1}{1-\alpha}, \qquad \sum_a y_{ja} = b_j + \alpha \sum_i \sum_a y_{ia} P_{ij}(a), \qquad y_{ja} \ge 0. \tag{4.38}$$

Argue that y_{ja} can be interpreted as the expected discounted time that the process is in state j and action a is chosen when the initial state is chosen according to the probabilities b_j and the policy β, given by

$$\beta_j(a) = \frac{y_{ja}}{\sum_a y_{ja}},$$

is employed.
Hint: Derive a set of equations for the expected discounted times when policy β is used and show that they are equivalent to Eq. (4.38).
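A small observation worth recording (our gloss, not part of the exercise): for any state j with ∑_a y_{ja} > 0, the nonnegativity in Eq. (4.38) makes β_j(·) a genuine probability distribution over actions,

$$\beta_j(a) \ge 0, \qquad \sum_a \beta_j(a) = \frac{\sum_a y_{ja}}{\sum_{a'} y_{ja'}} = 1,$$

so β is a legitimate randomized stationary policy.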

(d) Argue that an optimal policy with respect to the expected discounted return criterion can be obtained by first solving the linear program

$$\begin{aligned}
\text{maximize } & \sum_j \sum_a y_{ja} R(j, a) \\
\text{subject to } & \sum_j \sum_a y_{ja} = \frac{1}{1-\alpha}, \\
& \sum_a y_{ja} = b_j + \alpha \sum_i \sum_a y_{ia} P_{ij}(a) \quad \text{for all } j, \\
& y_{ja} \ge 0 \quad \text{for all } j, a;
\end{aligned}$$

and then defining the policy β* by

$$\beta^*_j(a) = \frac{y^*_{ja}}{\sum_a y^*_{ja}},$$

where the y*_{ja} are the solutions of the linear program.
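A minimal sketch of part (d) in code (not from the original text; the two-state MDP data are hypothetical), using scipy.optimize.linprog to solve the linear program and then normalizing y* into the policy β*:

```python
import numpy as np
from scipy.optimize import linprog

# Solve the LP of part (d) for a hypothetical 2-state, 2-action MDP and
# recover the optimal stationary policy beta*_j(a) = y*_ja / sum_a y*_ja.

alpha = 0.9
b = np.array([0.5, 0.5])                 # initial-state probabilities b_j
P = np.array([[[0.7, 0.3],               # P[a, i, j] = P_ij(a)
               [0.4, 0.6]],
              [[0.2, 0.8],
               [0.9, 0.1]]])
R = np.array([[1.0, 0.0],                # R[j, a]: reward in state j, action a
              [0.0, 2.0]])

n_states, n_actions = R.shape
n_vars = n_states * n_actions            # decision variables y_ja, flattened

# linprog minimizes, so negate the objective sum_j sum_a y_ja R(j, a).
c = -R.flatten()

# Balance constraints: sum_a y_ja - alpha sum_i sum_a y_ia P_ij(a) = b_j.
# Summing these over j already forces sum y_ja = 1/(1 - alpha), because each
# row of P_ij(a) sums to 1, so the normalization constraint is implied.
A_eq = np.zeros((n_states, n_vars))
for j in range(n_states):
    for i in range(n_states):
        for a in range(n_actions):
            A_eq[j, i * n_actions + a] = (i == j) - alpha * P[a, i, j]

res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
y_star = res.x.reshape(n_states, n_actions)
beta_star = y_star / y_star.sum(axis=1, keepdims=True)
print("y* =", y_star, sep="\n")          # y*.sum() is about 1/(1 - alpha) = 10
print("beta* =", beta_star, sep="\n")    # optimal stationary policy
```

Since the LP has one equality constraint per state, a basic optimal solution has at most one positive y*_{ja} in each state, so the recovered policy comes out deterministic.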
