Problem 3 (50 pt). Consider an infinite-horizon MDP characterized by $M = (S, A, r, p, \gamma)$ with $r : S \times A \to [0, 1]$. We would like to evaluate the value of a Markov stationary policy $\pi : S \to \Delta(A)$. However, we do not know the transition kernel $p$. Rather than applying a model-free approach, we decide to use a model-based approach: we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observing the triples $(s_k, a_k, s_{k+1}) \in S \times A \times S$ for $k = 0, 1, \dots$. Let $\widehat{p}$ be our estimate of $p$ based on the data collected. Now we can apply value iteration directly, as if the underlying MDP were $\widehat{M} = (S, A, r, \widehat{p}, \gamma)$, and obtain $\widehat{v}^{\pi}$.

Prove the simulation lemma bounding the difference between $\widehat{v}^{\pi}$ and the true value of the policy, denoted by $v^{\pi}$, by showing that

$$\left| v^{\pi}(s_0) - \widehat{v}^{\pi}(s_0) \right| \le \frac{\gamma}{(1-\gamma)^2} \, \mathbb{E}_{s \sim d^{\pi}_{s_0},\, a \sim \pi(s)} \left\| \widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \right\|_1,$$

where $s_0$ is the initial state and $d^{\pi}_{s_0}$ is the discounted state visitation distribution under policy $\pi$.

Note that the difference $\left| v^{\pi}(s_0) - \widehat{v}^{\pi}(s_0) \right|$ shrinks as the model approximation error $\left\| \widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \right\|_1$ shrinks. However, the impact of the model approximation error grows as $\gamma \to 1$, because the approximation error propagates across more stages.
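Before attempting the proof, it can help to sanity-check the claimed inequality numerically. The sketch below is not part of the original problem. It assumes a small tabular MDP stored as NumPy arrays, with the discounted visitation distribution normalized to sum to one and taken under the true kernel, and it evaluates the policy with an exact linear solve rather than value iteration. The function names (`policy_value`, `gap_and_bound`) are made up for illustration. The identity behind the bound is that the value difference equals gamma times (I - gamma P_pi)^{-1} (P_pi - P_hat_pi) v_hat, where v_hat lies in [0, 1/(1 - gamma)].

```python
import numpy as np


def policy_value(P, r, pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi.

    P:  (S, A, S) transition kernel, P[s, a, s'] = p(s' | s, a)
    r:  (S, A) rewards in [0, 1]
    pi: (S, A) stationary policy, pi[s, a] = pi(a | s)
    """
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, r)    # expected one-step reward under pi
    n = P.shape[0]
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return v, P_pi


def gap_and_bound(P, P_hat, r, pi, gamma, s0):
    """Return (|v(s0) - v_hat(s0)|, right-hand side of the bound, d)."""
    v, P_pi = policy_value(P, r, pi, gamma)
    v_hat, _ = policy_value(P_hat, r, pi, gamma)
    n = P.shape[0]

    # Discounted visitation from s0 under the TRUE kernel:
    # d^T = (1 - gamma) * e_{s0}^T (I - gamma * P_pi)^{-1}
    e = np.zeros(n)
    e[s0] = 1.0
    d = (1.0 - gamma) * np.linalg.solve((np.eye(n) - gamma * P_pi).T, e)

    # L1 model error per (s, a), averaged over s ~ d and a ~ pi(s)
    l1 = np.abs(P_hat - P).sum(axis=2)
    expected_l1 = float(np.sum(d[:, None] * pi * l1))

    gap = float(abs(v[s0] - v_hat[s0]))
    bound = gamma / (1.0 - gamma) ** 2 * expected_l1
    return gap, bound, d


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_s, n_a, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # true kernel
    Q = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # perturbation
    P_hat = 0.9 * P + 0.1 * Q                          # hypothetical estimate
    r = rng.uniform(size=(n_s, n_a))
    pi = rng.dirichlet(np.ones(n_a), size=n_s)
    gap, bound, _ = gap_and_bound(P, P_hat, r, pi, gamma, s0=0)
    print(f"gap = {gap:.6f}, bound = {bound:.6f}")
```

Rerunning with `gamma` closer to 1 shows the right-hand side growing through the 1/(1 - gamma)^2 factor, which matches the remark in the problem that model error propagates across more stages.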


Posted on Aug 01, 2024

