Question

Posted on Aug 20, 2024

Problem 1 (50 pt). Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v^{\pi}$. For example, we can apply the temporal difference (TD) learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\,\delta_t \cdot \mathbb{I}_{\{s_t = s\}},$$

where

$$\delta_t = r_t + \gamma\, v_t(s_{t+1}) - v_t(s_t)$$

is known as the TD error. Alternatively, we can apply the $n$-step TD learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha \left( G_t^{(n)} - v_t(s) \right) \cdot \mathbb{I}_{\{s_t = s\}},$$

where

$$G_t^{(n)} = r_t + \gamma\, r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} v_t(s_{t+n})$$

for $n = 1, 2, \dots$. Note that $\delta_t = G_t^{(1)} - v_t(s_t)$.
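As a rough illustration of the two updates, here is a minimal tabular sketch. It assumes a constant step size `alpha`, a discount factor `gamma`, and states encoded as list indices; the function names and the small trajectory are hypothetical and not part of the problem statement.

```python
def td_error(v, s, r, s_next, gamma):
    """One-step TD error: r_t + gamma * v(s_{t+1}) - v(s_t)."""
    return r + gamma * v[s_next] - v[s]


def n_step_return(v, rewards, states, t, n, gamma):
    """G_t^{(n)}: n discounted rewards plus a bootstrapped value at s_{t+n}."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * v[states[t + n]]


def n_step_td_update(v, rewards, states, t, n, gamma, alpha):
    """Return a copy of v in which only the entry for s_t has been updated."""
    new_v = list(v)
    s = states[t]
    new_v[s] += alpha * (n_step_return(v, rewards, states, t, n, gamma) - v[s])
    return new_v


# Hypothetical trajectory s_0, ..., s_3 with rewards r_0, r_1, r_2.
states = [0, 1, 2, 0]
rewards = [1.0, 0.0, 2.0]
v = [0.5, -0.2, 0.3]
gamma, alpha = 0.9, 0.1

delta_0 = td_error(v, states[0], rewards[0], states[1], gamma)
g1 = n_step_return(v, rewards, states, 0, 1, gamma)
v_next = n_step_td_update(v, rewards, states, 0, 2, gamma, alpha)
```

With `n = 1` the return minus `v[s_t]` coincides with the TD error, matching the note above; larger `n` needs `n` more observed transitions before the update can be made.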

The $n$-step TD algorithms for $n < \infty$ use bootstrapping. Therefore, they use a biased estimate of $v^{\pi}$. On the other hand, as $n \to \infty$, the $n$-step TD algorithm becomes a Monte Carlo method, where we use an unbiased estimate of $v^{\pi}$. However, these approaches delay the update for $n$ stages and we update the value function estimate only for the current state.

As an intermediate step to address these challenges, we first introduce the $\lambda$-return algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha \left( G_t^{\lambda} - v_t(s) \right) \cdot \mathbb{I}_{\{s_t = s\}},$$

where, given $\lambda \in [0, 1]$, we define

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},$$

taking a weighted average of the $G_t^{(n)}$'s.
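The infinite sum can be evaluated exactly in one special case, sketched below. The assumption (not stated in the problem) is that the trajectory is episodic: after the last recorded reward the process sits in an absorbing terminal state with zero reward and zero value. Then every $G_t^{(n)}$ with $n \ge T - t$ equals the full return, and those terms carry total weight $\lambda^{T-t-1}$. Function names and data are hypothetical.

```python
def n_step_return(v, rewards, states, t, n, gamma):
    """G_t^{(n)}; steps past the end of the episode contribute nothing."""
    T = len(rewards)                      # episode length (assumed finite)
    horizon = min(n, T - t)
    g = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    if t + n < T:                         # bootstrap only from non-terminal states
        g += gamma ** n * v[states[t + n]]
    return g


def lambda_return(v, rewards, states, t, gamma, lam):
    """(1 - lam) * sum_n lam^(n-1) * G_t^{(n)}, collapsed to a finite sum."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):
        weight = (1 - lam) * lam ** (n - 1)
        total += weight * n_step_return(v, rewards, states, t, n, gamma)
    # All n >= T - t give the full return; their weights sum to lam^(T-t-1).
    total += lam ** (T - t - 1) * n_step_return(v, rewards, states, t, T - t, gamma)
    return total


# Hypothetical 3-step episode: s_0, s_1, s_2, then termination.
states = [0, 1, 2]
rewards = [1.0, 0.0, 2.0]
v = [0.5, -0.2, 0.3]
gamma = 0.9

g_td = lambda_return(v, rewards, states, 0, gamma, lam=0.0)   # one-step return
g_mc = lambda_return(v, rewards, states, 0, gamma, lam=1.0)   # Monte Carlo return
```

Under this assumption, `lam = 0` recovers the one-step return $G_t^{(1)}$ and `lam = 1` recovers the Monte Carlo return, so $\lambda$ interpolates between the two extremes discussed above.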

(a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma\, G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship for $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.
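One possible route to such a recursion is sketched below, not as the official solution. It assumes $\lambda < 1$ so the geometric sums converge, and that the same value estimate $v$ is used in every return (that is, the change from $v_t$ to $v_{t+1}$ is ignored). The first term uses $G_t^{(1)} = r_t + \gamma\, v(s_{t+1})$, and the remaining terms use the $n$-step recursion with the index shift $m = n - 1$.

```latex
\begin{align*}
G_t^{\lambda}
&= (1-\lambda)\, G_t^{(1)}
   + (1-\lambda) \sum_{n=2}^{\infty} \lambda^{n-1}
     \left( r_t + \gamma\, G_{t+1}^{(n-1)} \right) \\
&= (1-\lambda) \left( r_t + \gamma\, v(s_{t+1}) \right)
   + \lambda\, r_t
   + \gamma \lambda\, (1-\lambda) \sum_{m=1}^{\infty} \lambda^{m-1} G_{t+1}^{(m)} \\
&= r_t + \gamma (1-\lambda)\, v(s_{t+1}) + \gamma \lambda\, G_{t+1}^{\lambda}.
\end{align*}
```

Setting $\lambda = 0$ in the last line gives back the one-step return, which is a quick sanity check on the result.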

(b) Show that the term $G_t^{\lambda} - v_t(s)$ in the $\lambda$-return update can be written as the sum of TD errors.
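A numerical sanity check can make the claim concrete before proving it. The sketch below compares $G_t^{\lambda} - v(s_t)$ with $\sum_{k \ge 0} (\gamma\lambda)^k \delta_{t+k}$ on a small hypothetical episode. It assumes a fixed value table `v` (no updates during the episode) and an absorbing terminal state with zero reward and zero value, so both sums are finite; neither assumption comes from the problem statement.

```python
def lambda_return(v, rewards, states, t, gamma, lam):
    """Forward-view G_t^lambda for an episode ending after the last reward."""
    T = len(rewards)

    def n_step(n):
        # Discounted rewards, then bootstrap unless s_{t+n} is terminal.
        g = sum(gamma ** k * rewards[t + k] for k in range(n))
        return g + (gamma ** n * v[states[t + n]] if t + n < T else 0.0)

    head = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    return head + lam ** (T - t - 1) * n_step(T - t)


def sum_of_td_errors(v, rewards, states, t, gamma, lam):
    """sum_k (gamma * lam)^k * delta_{t+k}, computed with a fixed table v."""
    T = len(rewards)
    total = 0.0
    for k in range(T - t):
        i = t + k
        v_next = v[states[i + 1]] if i + 1 < T else 0.0   # terminal value is 0
        delta = rewards[i] + gamma * v_next - v[states[i]]
        total += (gamma * lam) ** k * delta
    return total


# Hypothetical 4-step episode over 3 states.
states = [0, 1, 2, 1]
rewards = [1.0, 0.0, 2.0, -1.0]
v = [0.5, -0.2, 0.3]
gamma, lam = 0.9, 0.7

lhs = lambda_return(v, rewards, states, 0, gamma, lam) - v[states[0]]
rhs = sum_of_td_errors(v, rewards, states, 0, gamma, lam)
```

The two quantities agree up to floating-point error on this example, which is consistent with (but does not prove) the identity the problem asks for.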

The TD algorithm, Monte Carlo method and $\lambda$-return algorithm look forward to approximate $v^{\pi}$. Alternatively, we can look backward via the eligibility trace method. The TD($\lambda$) algorithm is given by

$$z_t(s) = \gamma\lambda\, z_{t-1}(s) + \mathbb{I}_{\{s = s_t\}}, \quad \forall s \in S,$$

$$v_{t+1}(s) = v_t(s) + \alpha\,\delta_t\, z_t(s), \quad \forall s \in S,$$

where $z_t \in \mathbb{R}^{|S|}$ is called the eligibility vector and the initial $z_{-1}(s) = 0$ for all $s$.
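The backward view can be sketched in a few lines. This is a minimal tabular version assuming a constant step size `alpha`, an accumulating trace decayed by `gamma * lam`, and states encoded as list indices; the function name and the two-transition trajectory are hypothetical.

```python
def td_lambda_step(v, z, s, r, s_next, gamma, lam, alpha):
    """One backward-view TD(lambda) step; returns updated (v, z) as new lists."""
    delta = r + gamma * v[s_next] - v[s]              # TD error at time t
    # Decay every trace by gamma * lam, then bump the trace of the visited state.
    z_new = [gamma * lam * z_i for z_i in z]
    z_new[s] += 1.0
    # Every state moves in proportion to its eligibility.
    v_new = [v_i + alpha * delta * z_i for v_i, z_i in zip(v, z_new)]
    return v_new, z_new


# Hypothetical trajectory on 3 states: 0 -> 1 -> 2.
v = [0.0, 0.0, 0.0]
z = [0.0, 0.0, 0.0]          # z_{-1}(s) = 0 for all s
gamma, lam, alpha = 0.9, 0.5, 0.1

v, z = td_lambda_step(v, z, s=0, r=1.0, s_next=1, gamma=gamma, lam=lam, alpha=alpha)
v, z = td_lambda_step(v, z, s=1, r=2.0, s_next=2, gamma=gamma, lam=lam, alpha=alpha)
```

Unlike the forward-view updates, each step here needs only the current transition, and the second TD error also adjusts the value of state 0 through its decayed trace.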
