Question:

In Section 1 we introduced an example with two states and two actions in which the reward function was $r(A, 1)=4$, $r(B, 1)=3$, $r(A, 2)=2$, $r(B, 2)=5$ and the transition matrices were as shown below. For a finite-horizon stochastic dynamic programming problem with time horizon $T=6$ and terminal reward $R(A)=3$, $R(B)=5$, find the optimal policy.

[Image not transcribed: the transition matrices referred to in the question]
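A problem of this kind is usually solved by backward induction: set $V_T(s)=R(s)$, then for $t=T-1,\dots,0$ compute $V_t(s)=\max_a \{ r(s,a)+\sum_{s'} P_a(s,s')\,V_{t+1}(s') \}$ and record the maximizing action. The sketch below illustrates that recursion in Python. The rewards, terminal rewards and horizon are taken from the question, but the transition matrices are not available in the text, so the uniform matrices `P` here are placeholders only and must be replaced with the real ones before the result means anything. The sketch also assumes no discounting, decisions at $t=0,\dots,5$ with the terminal reward collected at $t=6$, and a row-stochastic convention (row = current state, column = next state); none of these is confirmed by the question.

```python
import numpy as np

STATES = ["A", "B"]
ACTIONS = [1, 2]

# Rewards from the question: r[s, j] with rows A, B and columns action 1, action 2.
r = np.array([[4.0, 2.0],   # r(A,1), r(A,2)
              [3.0, 5.0]])  # r(B,1), r(B,2)
R_terminal = np.array([3.0, 5.0])  # R(A), R(B)
T = 6

# PLACEHOLDER transition matrices (assumption, NOT from the question).
# Replace with the matrices given in the original problem.
P = {
    1: np.array([[0.5, 0.5],
                 [0.5, 0.5]]),
    2: np.array([[0.5, 0.5],
                 [0.5, 0.5]]),
}


def backward_induction(r, P, R_terminal, T):
    """Return value table V[t, s] and policy[t, s] for an undiscounted finite horizon."""
    n_states = r.shape[0]
    V = np.zeros((T + 1, n_states))
    policy = np.zeros((T, n_states), dtype=int)
    V[T] = R_terminal  # terminal condition V_T(s) = R(s)
    for t in range(T - 1, -1, -1):
        # Q[s, j] = r(s, a_j) + sum_{s'} P_{a_j}(s, s') * V_{t+1}(s')
        Q = np.column_stack(
            [r[:, j] + P[a] @ V[t + 1] for j, a in enumerate(ACTIONS)]
        )
        V[t] = Q.max(axis=1)
        policy[t] = np.array(ACTIONS)[Q.argmax(axis=1)]  # ties go to the lower action
    return V, policy


V, policy = backward_induction(r, P, R_terminal, T)
for t in range(T):
    choices = ", ".join(f"{s}: action {a}" for s, a in zip(STATES, policy[t]))
    print(f"t={t}: {choices}")
```

With the uniform placeholders the continuation value is the same for every action, so the policy reduces to picking the larger immediate reward in each state. With the real matrices the optimal action may change with $t$, which is why the policy is stored per time step.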