Question
Problem 3. (50 pt) Consider an infinite-horizon MDP characterized by $M = (S, A, r, p, \gamma)$ with reward function $r : S \times A \to [0, 1]$. We would like to evaluate the value of a Markov stationary policy $\pi : S \to \Delta(A)$. However, we do not know the transition kernel $p$. Rather than applying a model-free approach, we decide to use a model-based approach: we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observe the triples $(s_k, a_k, s_k') \in S \times A \times S$ for $k = 1, 2, \dots$. Let $\widehat{p}$ be our estimate of $p$ based on the data collected. Now we can apply value iteration directly, as if the underlying MDP were $\widehat{M} = (S, A, r, \widehat{p}, \gamma)$, and obtain $\widehat{v}^\pi$. Prove the simulation lemma bounding the difference between $\widehat{v}^\pi$ and the true value of the policy, denoted by $v^\pi$, by showing that

$$\bigl| v^\pi(s_0) - \widehat{v}^\pi(s_0) \bigr| \;\le\; \frac{\gamma}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d_{s_0}^\pi,\; a \sim \pi(\cdot \mid s)}\Bigl[ \bigl\| \widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \bigr\|_1 \Bigr],$$

where $s_0$ is the initial state and $d_{s_0}^\pi$ is the discounted state visitation distribution under policy $\pi$. Note that the difference $|v^\pi(s_0) - \widehat{v}^\pi(s_0)|$ gets smaller as the model approximation error $\|\widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a)\|_1$ gets smaller. However, the impact of the model approximation error grows with $\gamma$, since the approximation error propagates across more stages.
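For reference, one standard way the bound can be derived (a sketch, assuming rewards in $[0,1]$ as above, and writing $r^\pi$, $P^\pi$, and $\widehat{P}^\pi$ for the reward vector and state-to-state transition matrices induced by $\pi$ under $p$ and $\widehat{p}$): start from the two Bellman evaluation equations

$$v^\pi = r^\pi + \gamma P^\pi v^\pi, \qquad \widehat{v}^\pi = r^\pi + \gamma \widehat{P}^\pi \widehat{v}^\pi.$$

Subtracting them and rearranging gives

$$v^\pi - \widehat{v}^\pi = \gamma P^\pi (v^\pi - \widehat{v}^\pi) + \gamma (P^\pi - \widehat{P}^\pi)\,\widehat{v}^\pi \quad\Longrightarrow\quad v^\pi - \widehat{v}^\pi = \gamma\, (I - \gamma P^\pi)^{-1} (P^\pi - \widehat{P}^\pi)\,\widehat{v}^\pi.$$

The row of $(I - \gamma P^\pi)^{-1}$ corresponding to $s_0$ equals $\tfrac{1}{1-\gamma} d_{s_0}^\pi$, and by Hölder's inequality together with $\|\widehat{v}^\pi\|_\infty \le \tfrac{1}{1-\gamma}$ (rewards in $[0,1]$),

$$\Bigl| \bigl( (P^\pi - \widehat{P}^\pi)\,\widehat{v}^\pi \bigr)(s) \Bigr| \;\le\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Bigl[ \bigl\| p(\cdot \mid s, a) - \widehat{p}(\cdot \mid s, a) \bigr\|_1 \Bigr] \cdot \frac{1}{1-\gamma},$$

so evaluating the identity at $s_0$ and taking absolute values yields

$$\bigl| v^\pi(s_0) - \widehat{v}^\pi(s_0) \bigr| \;\le\; \frac{\gamma}{(1-\gamma)^2}\, \mathbb{E}_{s \sim d_{s_0}^\pi,\; a \sim \pi(\cdot \mid s)}\Bigl[ \bigl\| \widehat{p}(\cdot \mid s, a) - p(\cdot \mid s, a) \bigr\|_1 \Bigr],$$

which is the claimed simulation lemma.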
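To make the model-based procedure in the problem concrete, here is a minimal sketch of estimating $\widehat{p}$ from the observed triples and then evaluating $\pi$ on $\widehat{M}$. The count-based estimator, the uniform fallback for unvisited state-action pairs, and all function names are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

def estimate_kernel(transitions, n_states, n_actions):
    """Count-based estimate of p(s' | s, a) from observed (s, a, s') triples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform distribution (an arbitrary choice).
    p_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    return p_hat

def evaluate_policy(p_hat, r, pi, gamma, tol=1e-8):
    """Iterative policy evaluation on the estimated MDP (S, A, r, p_hat, gamma).

    r has shape (S, A); pi has shape (S, A) with rows summing to 1.
    Returns v_hat_pi, the value of pi in the estimated MDP.
    """
    n_states = r.shape[0]
    # Policy-induced reward vector and state-to-state transition matrix.
    r_pi = (pi * r).sum(axis=1)                # shape (S,)
    p_pi = np.einsum("sa,sat->st", pi, p_hat)  # shape (S, S)
    v = np.zeros(n_states)
    while True:
        v_new = r_pi + gamma * p_pi @ v        # Bellman expectation backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

When the true kernel $p$ is available (e.g., in a synthetic experiment), running `evaluate_policy` with both $p$ and $\widehat{p}$ gives $v^\pi$ and $\widehat{v}^\pi$, so the gap $|v^\pi(s_0) - \widehat{v}^\pi(s_0)|$ can be checked numerically against the bound above.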