Question
Problem 1. (50 pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v_\pi$. For example, we can apply the temporal difference (TD) learning algorithm given by

$v_{t+1}(s) = v_t(s) + \alpha\,\delta_t\,\mathbf{1}\{s = s_t\}$,

where $\delta_t := r_t + \gamma v_t(s_{t+1}) - v_t(s_t)$ is known as the TD error. Alternatively, we can apply the $n$-step TD learning algorithm given by

$v_{t+1}(s) = v_t(s) + \alpha\,\bigl(G_t^{(n)} - v_t(s)\bigr)\,\mathbf{1}\{s = s_t\}$,

where $G_t^{(n)} := r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n v_t(s_{t+n})$ for $n \geq 1$. Note that $\delta_t = G_t^{(1)} - v_t(s_t)$.

The $n$-step TD algorithms for $n < \infty$ use bootstrapping; therefore, they use a biased estimate of $v_\pi$. On the other hand, as $n \to \infty$, the $n$-step TD algorithm becomes a Monte Carlo method, where we use an unbiased estimate of $v_\pi$. However, these approaches delay the update for $n$ stages, and we update the value function estimate only for the current state. As an intermediate step to address these challenges, we first introduce the $\lambda$-return algorithm given by

$v_{t+1}(s) = v_t(s) + \alpha\,\bigl(G_t^{\lambda} - v_t(s)\bigr)\,\mathbf{1}\{s = s_t\}$,

where, given $\lambda \in (0,1)$, we define

$G_t^{\lambda} := (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$,

taking a weighted average of the $G_t^{(n)}$'s.

a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship between $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.

b) Show that the term $G_t^{\lambda} - v_t(s_t)$ in the $\lambda$-return update can be written as a sum of TD errors.

The TD algorithm, the Monte Carlo method, and the $\lambda$-return algorithm look forward to approximate $v_\pi$. Alternatively, we can look backward via the eligibility trace method. The TD($\lambda$)
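The problem defines three tabular update rules: TD(0), $n$-step TD, and the $\lambda$-return update. As a concrete reference, here is a minimal Python sketch of those updates under the stated definitions. It assumes a tabular value estimate indexed by integer states and a $\lambda$-return truncated at the end of the observed trajectory; the function names (`td0_update`, `n_step_return`, `lambda_return`) and the numbers in the usage example are illustrative, not part of the problem.

```python
import numpy as np

def td0_update(v, s, r, s_next, alpha, gamma):
    """One TD(0) update: v(s_t) <- v(s_t) + alpha * delta_t,
    with delta_t = r_t + gamma * v(s_{t+1}) - v(s_t)."""
    delta = r + gamma * v[s_next] - v[s]
    v[s] += alpha * delta
    return delta

def n_step_return(rewards, v, s_n, gamma):
    """G_t^(n) = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n v(s_{t+n}),
    where rewards = [r_t, ..., r_{t+n-1}] and s_n = s_{t+n}."""
    n = len(rewards)
    g = sum(gamma**k * rewards[k] for k in range(n))
    return g + gamma**n * v[s_n]

def lambda_return(rewards, states, v, gamma, lam):
    """Truncated lambda-return (1 - lam) * sum_n lam^(n-1) * G_t^(n), cut off at the
    end of the observed trajectory; the leftover weight lam^(N-1) goes to the last return."""
    N = len(rewards)                       # rewards r_t, ..., r_{t+N-1}; states s_t, ..., s_{t+N}
    g_lam = 0.0
    for n in range(1, N + 1):
        g_n = n_step_return(rewards[:n], v, states[n], gamma)
        w = (1.0 - lam) * lam**(n - 1) if n < N else lam**(N - 1)
        g_lam += w * g_n
    return g_lam

# Tiny usage example on a 3-state chain (hypothetical numbers).
v = np.zeros(3)
gamma, alpha, lam = 0.9, 0.1, 0.8
rewards = [1.0, 0.0, 2.0]                  # r_t, r_{t+1}, r_{t+2}
states = [0, 1, 2, 2]                      # s_t, s_{t+1}, s_{t+2}, s_{t+3}
g_lam = lambda_return(rewards, states, v, gamma, lam)
v[states[0]] += alpha * (g_lam - v[states[0]])   # lambda-return update for s_t only
```

For parts (a) and (b), the following LaTeX block sketches one standard route from the definitions above. It assumes the value estimate $v$ is held fixed over the episode, so that $\delta_k = r_k + \gamma v(s_{k+1}) - v(s_k)$; this is a sketch under that assumption, not a full worked solution.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% (a) Split r_t off every n-step return using G_t^{(n)} = r_t + gamma G_{t+1}^{(n-1)}
%     and G_t^{(1)} = r_t + gamma v(s_{t+1}); the weights (1-lambda) lambda^{n-1} sum to 1.
\begin{align*}
G_t^{\lambda}
  &= (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} G_t^{(n)}
   = r_t + \gamma(1-\lambda)\,v(s_{t+1})
         + \gamma\lambda\,(1-\lambda)\sum_{m=1}^{\infty}\lambda^{m-1} G_{t+1}^{(m)} \\
  &= r_t + \gamma\bigl[(1-\lambda)\,v(s_{t+1}) + \lambda\,G_{t+1}^{\lambda}\bigr].
\intertext{% (b) Subtract $v(s_t)$ and unroll the recursion from (a):}
G_t^{\lambda} - v(s_t)
  &= \underbrace{r_t + \gamma v(s_{t+1}) - v(s_t)}_{\delta_t}
     + \gamma\lambda\bigl(G_{t+1}^{\lambda} - v(s_{t+1})\bigr)
   = \sum_{k=t}^{\infty} (\gamma\lambda)^{k-t}\,\delta_k.
\end{align*}
\end{document}
```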