Problem 1 (50 pt). Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v^{\pi}$. For example, we can apply the temporal difference (TD) learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\, \delta_t\, \mathbb{I}_{\{s_t = s\}},$$

where $\delta_t := r_t + \gamma v_t(s_{t+1}) - v_t(s_t)$ is known as the TD error. Alternatively, we can apply the $n$-step TD learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha \left( G_t^{(n)} - v_t(s) \right) \mathbb{I}_{\{s_t = s\}},$$

where

$$G_t^{(n)} := r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} v_t(s_{t+n})$$

for $n = 1, 2, \dots$ Note that $\delta_t = G_t^{(1)} - v_t(s_t)$.

The $n$-step TD algorithms for $n < \infty$ use bootstrapping. Therefore, they use a biased estimate of $v^{\pi}$. On the other hand, as $n \to \infty$, the $n$-step TD algorithm becomes a Monte Carlo method, where we use an unbiased estimate of $v^{\pi}$. However, these approaches delay the update for $n$ stages, and we update the value function estimate only for the current state.

As an intermediate step to address these challenges, we first introduce the $\lambda$-return algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha \left( G_t^{\lambda} - v_t(s) \right) \mathbb{I}_{\{s_t = s\}},$$

where, given $\lambda \in [0, 1]$, we define

$$G_t^{\lambda} := (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},$$

taking a weighted average of the $G_t^{(n)}$'s.

(a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship for $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.

(b) Show that the term $G_t^{\lambda} - v_t(s)$ in the $\lambda$-return update can be written as the sum of TD errors.

The TD algorithm, the Monte Carlo method, and the $\lambda$-return algorithm look forward to approximate $v^{\pi}$. Alternatively, we can look backward via the eligibility trace method. The TD($\lambda$) algorithm is given by

$$z_t(s) = \gamma \lambda\, z_{t-1}(s) + \mathbb{I}_{\{s = s_t\}}, \quad \forall s \in S,$$
$$v_{t+1}(s) = v_t(s) + \alpha\, \delta_t\, z_t(s), \quad \forall s \in S,$$

where $z_t \in \mathbb{R}^{|S|}$ is called the eligibility vector and the initial $z_{-1}(s) = 0$ for all $s$.
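As a numerical sanity check on the quantities defined in the problem (not a proof, and not the accepted answer), the sketch below builds the n-step return, the λ-return, and the TD error for one short episode, then compares the forward view (λ-return updates) with the backward view (eligibility traces). Everything concrete here is an assumption made for illustration: the episode, the reward values, the parameters `gamma`, `lam`, `alpha`, and the helper names are hypothetical. It also assumes a terminating episode with terminal value 0, a value estimate `v` held fixed during the episode (offline updating), and that the step size and discount factor are the usual constants.

```python
import numpy as np

# Hypothetical episode: visited states s_0..s_T (s_T terminal), rewards r_0..r_{T-1}.
states = [0, 1, 2, 1, 3]              # state 3 is terminal
rewards = [1.0, 0.0, -1.0, 2.0]
gamma, lam, alpha = 0.9, 0.7, 0.1     # assumed discount, trace decay, step size
v = np.array([0.5, -0.2, 0.3, 0.0])   # fixed value estimate; v[terminal] = 0
T = len(rewards)


def n_step_return(t, n):
    """G_t^(n): n discounted rewards plus a bootstrapped tail, clipped at episode end."""
    steps = min(n, T - t)
    g = sum(gamma**k * rewards[t + k] for k in range(steps))
    return g + gamma**steps * v[states[t + steps]]


def lambda_return(t):
    """G_t^lambda as a weighted average of n-step returns (episodic form).

    Every n-step return with n >= T - t equals the full Monte Carlo return,
    so their total weight lam**(T - t - 1) is placed on that single term.
    """
    if t >= T:
        return 0.0
    horizon = T - t
    g = (1 - lam) * sum(lam**(n - 1) * n_step_return(t, n) for n in range(1, horizon))
    return g + lam**(horizon - 1) * n_step_return(t, horizon)


def td_error(t):
    """delta_t = r_t + gamma * v(s_{t+1}) - v(s_t), with v held fixed."""
    return rewards[t] + gamma * v[states[t + 1]] - v[states[t]]


def forward_increments():
    """Accumulated lambda-return updates: only the visited state s_t is touched at time t."""
    inc = np.zeros_like(v)
    for t in range(T):
        inc[states[t]] += alpha * (lambda_return(t) - v[states[t]])
    return inc


def backward_increments():
    """Accumulated TD(lambda) updates using the eligibility vector z_t."""
    inc = np.zeros_like(v)
    z = np.zeros_like(v)              # z_{-1}(s) = 0 for all s
    for t in range(T):
        z *= gamma * lam              # decay every trace
        z[states[t]] += 1.0           # bump the trace of the visited state
        inc += alpha * td_error(t) * z
    return inc


if __name__ == "__main__":
    print("forward :", forward_increments())
    print("backward:", backward_increments())
```

Under these assumptions the two printed vectors should agree up to floating-point error. With online updating, where `v` changes inside the episode, the two views generally coincide only approximately.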



