Question:
We claimed that its demonstration crucially depended on the conjunction of three factors: off-policy updating, bootstrapping, and generalisation. In this question, we consider the effect of removing one of these factors: specifically, we replace the off-policy update with an on-policy update. Refer to the MDP in the counterexample on Slide 9 of the lecture. Suppose that episodes always start at state $s_1$. Since there is a deterministic transition to $s_2$, the number of time steps per episode in $s_1$ is exactly $T(s_1) = 1$. Similarly, what is $T(s_2)$, the expected number of time steps per episode in $s_2$? Naturally $T(s_2)$ must depend on $\varepsilon$; assume $\varepsilon \in (0, 1)$. We use the same linear architecture as described in the lecture. For $k \geq 0$, the new update rule we propose is
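One way to sanity-check a candidate answer for the expected number of time steps in the second state is a short Monte Carlo simulation. The sketch below rests on an assumption that is not stated in the text here, since the MDP is only given on the lecture slide: once the episode is in the second state, it terminates with probability `eps` at each step and otherwise remains in the second state. The function names and the parameter name `eps` are illustrative. Under that assumption the number of steps is geometrically distributed, so the estimate should be close to `1 / eps`.

```python
import random


def steps_in_second_state(eps, rng):
    """Count the time steps spent in the second state during one episode.

    ASSUMPTION (not given in the text): from the second state, the episode
    terminates with probability eps and otherwise stays in the second state.
    """
    steps = 1  # the first state transitions deterministically to the second
    while rng.random() >= eps:  # with probability 1 - eps, stay one more step
        steps += 1
    return steps


def estimate_expected_steps(eps, episodes=100_000, seed=0):
    """Monte Carlo estimate of the expected steps per episode in the second state."""
    rng = random.Random(seed)
    total = sum(steps_in_second_state(eps, rng) for _ in range(episodes))
    return total / episodes


if __name__ == "__main__":
    for eps in (0.1, 0.25, 0.5):
        # Print the estimate next to the geometric-distribution mean 1 / eps.
        print(eps, estimate_expected_steps(eps), 1 / eps)
```

If the slide's MDP differs from the assumed self-loop structure, only `steps_in_second_state` needs to change; the estimator itself is generic.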
