Question
We claimed that its demonstration crucially depended on the conjunction of three factors: off-policy updating, bootstrapping, and generalisation. In this question, we consider the effect of removing one of these factors: specifically, we replace the off-policy update with an on-policy update. Refer to the MDP in the counterexample on Slide 9 of the lecture. Suppose that episodes always start at state $s_1$. Since there is a deterministic transition to $s_2$, the number of time steps per episode in $s_1$ is exactly $T(s_1) = 1$. Similarly, what is $T(s_2)$, the expected number of time steps per episode in $s_2$? Naturally $T(s_2)$ must depend on $\epsilon$; assume $\epsilon \in (0, 1)$. We use the same linear architecture as described in the lecture. For $k \geq 0$, the new update rule we propose is
Step by Step Solution
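A minimal sketch of the $T(s_2)$ part, under an assumption the question text does not state: that in the Slide 9 MDP, each time step spent in $s_2$ ends the episode with probability $\epsilon$ and otherwise returns to $s_2$. If that holds, the number of time steps in $s_2$ is geometrically distributed and $T(s_2) = 1/\epsilon$. The function names below are hypothetical and only illustrate the check.

```python
import random


def expected_steps_closed_form(eps: float) -> float:
    """Mean of a geometric distribution with per-step stopping probability eps.

    ASSUMPTION: s2 terminates with probability eps on each step and
    loops back to itself otherwise (not stated in the question text).
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in the open interval (0, 1)")
    return 1.0 / eps


def simulate_steps_in_s2(eps: float, episodes: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the average number of time steps spent in s2."""
    rng = random.Random(seed)
    total = 0
    for _ in range(episodes):
        steps = 1  # the episode always reaches s2 once, coming from s1
        # With probability 1 - eps the agent stays in s2 for another step.
        while rng.random() >= eps:
            steps += 1
        total += steps
    return total / episodes


if __name__ == "__main__":
    eps = 0.25
    print("closed form:", expected_steps_closed_form(eps))
    print("simulated:  ", simulate_steps_in_s2(eps, episodes=100_000))
```

The simulated average should land close to the closed-form value for any $\epsilon \in (0, 1)$. If the slide's transition structure differs from the assumption above, the geometric argument does not apply as written.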