We claimed that its demonstration crucially depended on the conjunction of three factors: off-policy updating, bootstrapping, and generalisation. In this question, we consider the effect of removing one of these factors: specifically, we replace the off-policy update with an on-policy update. Refer to the MDP in the counterexample on Slide 9 of the lecture. Suppose that episodes always start at state $s_1$. Since there is a deterministic transition to $s_2$, the number of time steps per episode in $s_1$ is exactly $T(s_1) = 1$. Similarly, what is $T(s_2)$, the expected number of time steps per episode in $s_2$? Naturally $T(s_2)$ must depend on $\varepsilon$; assume $\varepsilon \in (0, 1)$. We use the same linear architecture as described in the lecture. For $k \geq 0$, the new update rule we propose is
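
As a rough sketch of the first part only (the expected number of time steps in the second state), the following derivation may help. It rests on an assumption that is not stated in the text above and should be checked against Slide 9: that from the second state the episode terminates with probability $\varepsilon$ on each step and otherwise stays in that state with probability $1 - \varepsilon$, where $\varepsilon$ is the parameter restricted to $(0, 1)$. Under that assumption the number of visits is geometrically distributed.

```latex
% Assumption (not from the source): from s_2 the episode terminates with
% probability \varepsilon and self-loops with probability 1 - \varepsilon.
% Every episode reaches s_2 exactly once from s_1, then stays for N steps in total.
\begin{align*}
  \Pr(N = n) &= (1 - \varepsilon)^{n-1}\,\varepsilon, \qquad n = 1, 2, \dots \\
  T(s_2) = \mathbb{E}[N]
    &= \sum_{n=1}^{\infty} n\,(1 - \varepsilon)^{n-1}\,\varepsilon
     = \frac{1}{\varepsilon}.
\end{align*}
% Equivalent one-step recursion under the same assumption:
%   T(s_2) = 1 + (1 - \varepsilon)\,T(s_2)  \implies  T(s_2) = 1/\varepsilon.
```

If the slide's transition structure differs from this assumption, the same one-step recursion can be rewritten with the slide's probabilities. Under the assumed dynamics, an on-policy state distribution would weight the second state $1/\varepsilon$ times as heavily as the first.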
