Question:

In a Markov decision problem, another criterion often used, different than the expected average return per unit time, is that of the expected discounted return. In this criterion we choose a number $\alpha$, $0 < \alpha < 1$, and try to choose a policy so as to maximize $E\left[\sum_{i=0}^{\infty} \alpha^i R(X_i, a_i)\right]$. (That is, rewards at time $n$ are discounted at rate $\alpha^n$.) Suppose that the initial state is chosen according to the probabilities $b_i$. That is, $P\{X_0 = i\} = b_i$, $i = 1, \ldots, n$.
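
To make the criterion concrete, here is a minimal sketch of how the expected discounted return could be approximated numerically for one fixed stationary policy. It is not part of the exercise: the function name `expected_discounted_return`, the truncation length `n_terms`, and the two-state numbers are all hypothetical, and the sketch assumes the policy has already been folded into a transition matrix `P` and a per-state reward vector `r`. States are indexed from 0 in the code rather than from 1.

```python
def expected_discounted_return(P, r, b, alpha, n_terms=500):
    """Approximate E[sum over i >= 0 of alpha^i * R(X_i, a_i)] for a fixed policy.

    P[i][j]: probability of moving from state i to state j under the policy (assumed given).
    r[i]:    one-step reward earned in state i under the policy (assumed given).
    b[i]:    probability that the initial state is i.
    alpha:   discount factor, 0 < alpha < 1.
    n_terms: number of terms of the infinite sum to keep (truncation, an assumption).
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    n = len(b)
    dist = list(b)      # distribution of X_0
    discount = 1.0      # alpha^0
    total = 0.0
    for _ in range(n_terms):
        # Expected reward at the current time, weighted by the current discount.
        total += discount * sum(dist[i] * r[i] for i in range(n))
        # Advance the state distribution one step: dist <- dist * P.
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
        discount *= alpha
    return total


# Hypothetical two-state example: start in state 0, which pays reward 1.
P_example = [[0.5, 0.5], [0.5, 0.5]]
r_example = [1.0, 0.0]
b_example = [1.0, 0.0]
value = expected_discounted_return(P_example, r_example, b_example, alpha=0.5)
print(value)  # close to 1.5: reward 1 at time 0, then 0.5 * (0.5 + 0.25 + ...)
```

Because $0 < \alpha < 1$ and the rewards are bounded, the neglected tail of the sum shrinks geometrically, so a few hundred terms are more than enough for this example. Maximizing over policies would require comparing this quantity across policies, which the sketch does not attempt.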
