Question
Problem 1. (50 pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v^\pi$. For example, we can apply the temporal difference (TD) learning algorithm given by
$$v_{t+1}(s) = v_t(s) + \alpha\,\delta_t\,\mathbb{1}\{s = s_t\},$$
where $\delta_t := r_t + \gamma v_t(s_{t+1}) - v_t(s_t)$ is known as the TD error. Alternatively, we can apply the $n$-step TD learning algorithm given by
$$v_{t+1}(s) = v_t(s) + \alpha\,\big(G_t^{(n)} - v_t(s)\big)\,\mathbb{1}\{s = s_t\},$$
where $G_t^{(n)} := r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n v_t(s_{t+n})$ for $n \geq 1$. Note that $\delta_t = G_t^{(1)} - v_t(s_t)$.

The $n$-step TD algorithms for $n < \infty$ use bootstrapping. Therefore, they use a biased estimate of $v^\pi$. On the other hand, as $n \to \infty$, the $n$-step TD algorithm becomes a Monte Carlo method, where we use an unbiased estimate of $v^\pi$. However, these approaches delay the update for $n$ stages, and we update the value function estimate only for the current state. As an intermediate step to address these challenges, we first introduce the $\lambda$-return algorithm given by
$$v_{t+1}(s) = v_t(s) + \alpha\,\big(G_t^{\lambda} - v_t(s)\big)\,\mathbb{1}\{s = s_t\},$$
where, given $\lambda \in (0,1)$, we define $G_t^{\lambda} := (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$, taking a weighted average of the $G_t^{(n)}$'s.

(a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma\, G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship between $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.

(b) Show that the term $G_t^{\lambda} - v_t(s_t)$ in the $\lambda$-return update can be written as a sum of TD errors.

The TD algorithm, the Monte Carlo method, and the $\lambda$-return algorithm look forward to approximate $v^\pi$. Alternatively, we can look backward via the eligibility trace method. The TD($\lambda$) algorithm is given by
$$z_t(s) = \gamma\lambda\, z_{t-1}(s) + \mathbb{1}\{s = s_t\} \quad \forall s \in S; \qquad v_{t+1}(s) = v_t(s) + \alpha\,\delta_t\, z_t(s) \quad \forall s \in S,$$
where $z_t \in \mathbb{R}^{|S|}$ is called the eligibility vector and initially $z_{-1}(s) = 0$ for all $s$.

(c) In the TD($\lambda$) algorithm, $z_t$ is computed recursively. Express $z_t$ only in terms of the states visited in the past. This representation of the eligibility vector will show that eligibility vectors combine the frequency heuristic and the recency heuristic to address the credit assignment problem. For the rewards received, the frequency heuristic assigns higher credit to frequently visited states, while the recency heuristic assigns higher credit to recently visited states. The eligibility vector assigns higher credit to states visited both frequently and recently. Note that in the TD($\lambda$) algorithm, the value function estimate for every state gets updated, unlike the $n$-step TD algorithms, where only the estimate for the current state gets updated. If a state has not been visited recently and frequently, then the eligibility of that state (i.e., the associated entry of the eligibility vector) will be close to zero. Therefore, the updates via the TD error will take very small steps for such states.

Though the $\lambda$-return algorithm is forward-looking while TD($\lambda$) is backward-looking, they are equivalent, as you will show next for the finite-horizon problem with horizon length $T < \infty$.

(d) Assume that the initial value function estimates are zero, i.e., $v_0(s) = 0$ for all $s$. Then, the recursive update in the $\lambda$-return algorithm yields that $v_T(s)$ can be written as
$$v_T(s) = \sum_{t=0}^{T-1} \alpha\,\big(G_t^{\lambda} - v_t(s_t)\big)\,\mathbb{1}\{s_t = s\}.$$
Correspondingly, the recursive update in the TD($\lambda$) algorithm yields that $v_T(s)$ can be written as
$$v_T(s) = \sum_{t=0}^{T-1} \alpha\,\delta_t\, z_t(s).$$
Show that
$$\sum_{t=0}^{T-1} \alpha\,\delta_t\, z_t(s) = \sum_{t=0}^{T-1} \alpha\,\big(G_t^{\lambda} - v_t(s_t)\big)\,\mathbb{1}\{s_t = s\} \quad \forall s.$$
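For concreteness, below is a minimal sketch of the tabular TD($\lambda$) update with accumulating eligibility traces described in the problem, run on a hypothetical two-state Markov reward process. It is not part of the assignment and does not answer parts (a)-(d); the transition matrix `P`, reward vector `R`, step counts, and the function name `run_td_lambda` are all illustrative assumptions.

```python
# Minimal sketch (assumed setup, not from the problem statement): tabular TD(lambda)
# with accumulating eligibility traces on a hypothetical 2-state Markov reward process.
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # P[s, s'] = Pr(s_{t+1} = s' | s_t = s) under pi (assumed)
R = np.array([1.0, -0.5])           # r_t = R[s_t] (assumed rewards)
gamma, alpha, lam = 0.9, 0.1, 0.8   # discount, step size, trace decay (assumed)

rng = np.random.default_rng(0)

def run_td_lambda(num_steps=20_000):
    """TD(lambda): z_t(s) = gamma*lam*z_{t-1}(s) + 1{s = s_t},
       v_{t+1}(s) = v_t(s) + alpha * delta_t * z_t(s) for every s."""
    v = np.zeros(2)                  # value estimates, v_0(s) = 0
    z = np.zeros(2)                  # eligibility vector, z_{-1}(s) = 0
    s = 0
    for _ in range(num_steps):
        s_next = rng.choice(2, p=P[s])
        delta = R[s] + gamma * v[s_next] - v[s]   # TD error delta_t
        z = gamma * lam * z                        # decay all traces (recency heuristic)
        z[s] += 1.0                                # bump current state (frequency heuristic)
        v += alpha * delta * z                     # every state updated through its trace
        s = s_next
    return v

# Closed-form v^pi = (I - gamma * P)^{-1} R of the same MRP, for comparison.
v_exact = np.linalg.solve(np.eye(2) - gamma * P, R)
print("TD(lambda) estimate:", run_td_lambda())
print("exact v^pi:         ", v_exact)
```

With a constant step size the estimate only fluctuates around $v^\pi$ rather than converging exactly, but the sketch shows the mechanics: the trace decays geometrically for states not visited (recency) and accumulates on repeat visits (frequency), so every state's estimate moves by an amount proportional to its eligibility at each step.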