Question
Problem 1. (50 pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v^{\pi}$. For example, we can apply the temporal difference (TD) learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\,\delta_t\,\mathbb{I}\{s_t = s\},$$

where

$$\delta_t := r_t + \gamma\, v_t(s_{t+1}) - v_t(s_t)$$

is known as the TD error.
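To make the update concrete, here is a minimal tabular TD(0) sketch in Python. The environment interface (`env.reset()`, `env.step(a)`) and the callable `policy` are assumptions made for illustration; they are not part of the problem statement.

```python
import numpy as np

def td0_policy_evaluation(env, policy, num_states, alpha=0.1, gamma=0.99,
                          num_episodes=500):
    """Tabular TD(0) evaluation of a fixed policy.

    Assumes a Gym-like interface: env.reset() -> s and
    env.step(a) -> (s_next, r, done); policy(s) -> a.
    """
    v = np.zeros(num_states)          # current estimate v_t
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD error: delta_t = r_t + gamma * v_t(s_{t+1}) - v_t(s_t)
            # (the bootstrap term is dropped at a terminal state)
            target = r if done else r + gamma * v[s_next]
            delta = target - v[s]
            # Only the visited state is updated: the indicator I{s_t = s}
            v[s] += alpha * delta
            s = s_next
    return v
```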
Alternatively, we can apply the $n$-step TD learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\left(G_t^{(n)} - v_t(s)\right)\mathbb{I}\{s_t = s\},$$

where

$$G_t^{(n)} := r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n}\, v_t(s_{t+n})$$

for $n \geq 1$. Note that $\delta_t = G_t^{(1)} - v_t(s_t)$.
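Indeed, this last identity follows directly from the definitions above: setting $n = 1$ gives

$$G_t^{(1)} - v_t(s_t) = \bigl(r_t + \gamma\, v_t(s_{t+1})\bigr) - v_t(s_t) = \delta_t.$$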
The $n$-step TD algorithms for $n < \infty$ use bootstrapping. Therefore, they use a biased estimate of $v^{\pi}$. On the other hand, as $n \to \infty$, the $n$-step TD algorithm becomes a Monte Carlo method, where we use an unbiased estimate of $v^{\pi}$. However, these approaches delay the update for $n$ stages, and we update the value function estimate only for the current state. As an intermediate step to address these challenges, we first introduce the $\lambda$-return algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\left(G_t^{\lambda} - v_t(s)\right)\mathbb{I}\{s_t = s\},$$
where, given $\lambda \in (0,1)$, we define

$$G_t^{\lambda} := (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)},$$

taking a weighted average of the $G_t^{(n)}$'s.
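As a rough illustration, the following sketch computes $G_t^{(n)}$ and a truncated $\lambda$-return from one recorded trajectory. The truncation at `n_max` (the definition is an infinite sum) and the assumption that the trajectory is long enough for index $t+n$ are simplifications made here for illustration.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
                 + gamma^n * v(s_{t+n}).

    rewards[k] is r_k and values[k] is the current estimate v(s_k)
    along one recorded trajectory; index t + n is assumed to exist.
    """
    g = sum(gamma**k * rewards[t + k] for k in range(n))
    return g + gamma**n * values[t + n]

def lambda_return(rewards, values, t, gamma, lam, n_max):
    """Truncated lambda-return:
    G_t^lambda ~= (1 - lam) * sum_{n=1}^{n_max} lam^(n-1) * G_t^(n),
    keeping only the first n_max terms of the infinite sum.
    """
    weighted = sum(lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
                   for n in range(1, n_max + 1))
    return (1.0 - lam) * weighted
```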
(a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma\, G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship between $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.
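As a reference point for part (a), the stated recursion for the $n$-step return can be unpacked directly from its definition, treating the value estimate used for bootstrapping as fixed across the two time indices (as the problem's statement of the identity implicitly does):

$$G_t^{(n)} = r_t + \gamma\Bigl(\underbrace{r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-2} r_{t+n-1} + \gamma^{n-1} v_t(s_{t+n})}_{=\,G_{t+1}^{(n-1)}}\Bigr) = r_t + \gamma\, G_{t+1}^{(n-1)}.$$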
(b) Show that the term $G_t^{\lambda} - v_t(s_t)$ in the $\lambda$-return update can be written as a sum of TD errors.

The TD algorithm, the Monte Carlo method, and the $\lambda$-return algorithm look forward to approximate $v^{\pi}$. Alternatively, we can look backward via the eligibility trace method. The TD($\lambda$)
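As a rough illustration of the backward view mentioned above, here is a sketch of tabular TD($\lambda$) with accumulating eligibility traces. The choice of accumulating (rather than replacing) traces and the Gym-like environment interface are assumptions for illustration, not taken from the problem statement.

```python
import numpy as np

def td_lambda_policy_evaluation(env, policy, num_states, alpha=0.1,
                                gamma=0.99, lam=0.9, num_episodes=500):
    """Backward-view TD(lambda) with accumulating eligibility traces.

    Each visited state's trace is bumped and then decays by gamma*lam,
    so every TD error updates all recently visited states at once
    instead of waiting n stages as in the forward-view methods.
    """
    v = np.zeros(num_states)
    for _ in range(num_episodes):
        e = np.zeros(num_states)      # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * v[s_next]
            delta = target - v[s]     # TD error delta_t
            e[s] += 1.0               # accumulate trace for the current state
            v += alpha * delta * e    # update every state by its trace
            e *= gamma * lam          # decay all traces
            s = s_next
    return v
```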