Question: A proper policy for an MDP is one that is guaranteed to reach a terminal state. Show that it is possible for a passive ADP agent to learn a transition model for which its policy π is improper even if π is proper for the true MDP; with such models, the POLICY-EVALUATION step may fail if γ = 1. Show that this problem cannot arise if POLICY-EVALUATION is applied to the learned model only at the end of a trial.
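A small numerical sketch may help make both parts concrete. Everything in it is an assumption made for illustration, not part of the exercise: a three-state MDP (A, B, and a terminal state T) in which the fixed policy moves A to B with probability 1 and B to A or T with probability 0.5 each, a single trial A, B, A, B, T, and the hypothetical helpers `learned_model` and `evaluation_is_solvable`. The agent is assumed to estimate transitions by maximum likelihood (observed frequencies), and policy evaluation is treated as solving the linear system (I - γP)U = R.

```python
import numpy as np

def learned_model(transitions, n_states):
    """Maximum-likelihood transition matrix from observed (s, s_next) pairs
    gathered while following the fixed policy."""
    counts = np.zeros((n_states, n_states))
    for s, s_next in transitions:
        counts[s, s_next] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # States with no observed outgoing transition (e.g. the terminal state)
    # keep an all-zero row.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def evaluation_is_solvable(P, gamma=1.0):
    """Policy evaluation solves (I - gamma * P) U = R.
    Returns True when that matrix is nonsingular."""
    n = P.shape[0]
    A = np.eye(n) - gamma * P
    return bool(np.linalg.matrix_rank(A) == n)

# Illustrative state indices: 0 = A, 1 = B, 2 = terminal T.
# One assumed trial under the policy: A -> B -> A -> B -> T.
trial = [(0, 1), (1, 0), (0, 1), (1, 2)]

# Mid-trial, only A -> B and B -> A have been seen, so the learned model
# loops between A and B forever: the policy is improper in that model.
mid_trial_ok = evaluation_is_solvable(learned_model(trial[:2], 3))

# At the end of the trial, B -> T has been observed with positive frequency.
end_of_trial_ok = evaluation_is_solvable(learned_model(trial, 3))

print(mid_trial_ok, end_of_trial_ok)
```

In the mid-trial model, A and B form a closed loop with no estimated probability of leaving, so with γ = 1 the matrix I - P is singular and the utilities are not determined. Once the trial has finished, every state visited so far lies on an observed path that ended in a terminal state, and each transition on that path has nonzero estimated probability; this suggests why evaluating only at the end of a trial avoids the problem.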
