Question
Problem 3. (50 pt) Consider an infinite-horizon MDP characterized by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition kernel, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. We would like to evaluate the value of a Markov stationary policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$. However, we do not know the transition kernel $P$. Rather than applying a model-free approach, we decide to use a model-based approach: we first estimate the underlying transition kernel by following some fully stochastic policy in the MDP (for good exploration) and observing the triples $(s_t, a_t, s_{t+1}) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ for $t = 0, 1, \ldots$. Let $\widehat{P}$ be our estimate of $P$ based on the data collected. Now, we can apply value iteration directly as if the underlying MDP were $\widehat{\mathcal{M}} = (\mathcal{S}, \mathcal{A}, \widehat{P}, r, \gamma)$ and obtain $\widehat{V}^{\pi}$.

Prove the simulation lemma bounding the difference between $\widehat{V}^{\pi}$ and the true value of the policy, denoted by $V^{\pi}$, by showing that

$$\widehat{V}^{\pi}(s_0) - V^{\pi}(s_0) = \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d_{s_0}^{\pi},\; a \sim \pi(\cdot \mid s)}\Big[\big(\widehat{P}(\cdot \mid s, a) - P(\cdot \mid s, a)\big)^{\top} \widehat{V}^{\pi}\Big],$$

where $s_0$ is the initial state and $d_{s_0}^{\pi}$ is the discounted state visitation distribution under policy $\pi$. Note that the difference $\widehat{V}^{\pi} - V^{\pi}$ gets smaller as the model approximation error $\widehat{P} - P$ gets smaller. However, the impact of the model approximation error grows with $\gamma$, as the approximation error propagates more across stages.
Step by Step Solution
There are three steps involved.
Step: 1
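A natural starting point is to write the Bellman consistency equations for the policy $\pi$ in both MDPs. The operator shorthand $P^{\pi}$, $r^{\pi}$ below is ours, not given in the problem statement; this is one standard route to the identity. Since $\mathcal{M}$ and $\widehat{\mathcal{M}}$ share the same reward function, the reward terms will cancel when we subtract.

$$V^{\pi} = r^{\pi} + \gamma P^{\pi} V^{\pi}, \qquad \widehat{V}^{\pi} = r^{\pi} + \gamma \widehat{P}^{\pi} \widehat{V}^{\pi},$$

where $r^{\pi}(s) = \sum_{a} \pi(a \mid s)\, r(s, a)$ and $(P^{\pi} V)(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\, V(s')$.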
Step: 2
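Next, subtract the two Bellman equations and add and subtract $\gamma P^{\pi} \widehat{V}^{\pi}$ to isolate the model error. Sketched in the Step 1 notation:

$$\widehat{V}^{\pi} - V^{\pi} = \gamma \big(\widehat{P}^{\pi} - P^{\pi}\big) \widehat{V}^{\pi} + \gamma P^{\pi} \big(\widehat{V}^{\pi} - V^{\pi}\big),$$

so that

$$\widehat{V}^{\pi} - V^{\pi} = \big(I - \gamma P^{\pi}\big)^{-1} \gamma \big(\widehat{P}^{\pi} - P^{\pi}\big) \widehat{V}^{\pi},$$

where the inverse $\big(I - \gamma P^{\pi}\big)^{-1} = \sum_{t \ge 0} \gamma^{t} (P^{\pi})^{t}$ exists because $\gamma < 1$ and $P^{\pi}$ is a stochastic matrix.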
Step: 3
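Finally, evaluate the identity at the initial state $s_0$ and recognize the Neumann series as the discounted state visitation distribution, taken under the true kernel $P$ (the convention assumed here):

$$d_{s_0}^{\pi}(s) = (1-\gamma) \sum_{t \ge 0} \gamma^{t}\, \Pr(s_t = s \mid s_0, \pi), \qquad e_{s_0}^{\top} \big(I - \gamma P^{\pi}\big)^{-1} = \frac{1}{1-\gamma} \big(d_{s_0}^{\pi}\big)^{\top}.$$

Substituting this into the Step 2 identity and expanding $\big(\widehat{P}^{\pi} - P^{\pi}\big) \widehat{V}^{\pi}$ over actions gives

$$\widehat{V}^{\pi}(s_0) - V^{\pi}(s_0) = \frac{\gamma}{1-\gamma}\, \mathbb{E}_{s \sim d_{s_0}^{\pi},\; a \sim \pi(\cdot \mid s)}\Big[\big(\widehat{P}(\cdot \mid s, a) - P(\cdot \mid s, a)\big)^{\top} \widehat{V}^{\pi}\Big],$$

which is the claimed simulation lemma. The prefactor $\frac{\gamma}{1-\gamma}$ makes the remark in the problem precise: the gap shrinks with the model error but blows up as $\gamma \to 1$.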
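As a sanity check, the identity can also be verified numerically. The Python sketch below is ours, not part of the original problem: it builds a small random MDP, perturbs the kernel to play the role of $\widehat{P}$, solves both policy-evaluation systems exactly, and compares the two sides of the lemma. All names and constants are illustrative.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random true kernel P, reward r, and a fixed stochastic policy pi.
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# Perturbed model P_hat, standing in for the estimated kernel.
P_hat = P + 0.05 * rng.random((nS, nA, nS))
P_hat /= P_hat.sum(axis=2, keepdims=True)

def policy_value(P, r, pi, gamma):
    """Solve V = r_pi + gamma * P_pi V exactly."""
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

V = policy_value(P, r, pi, gamma)          # true value V^pi
V_hat = policy_value(P_hat, r, pi, gamma)  # model-based value V_hat^pi

s0 = 0
# Discounted visitation: d^T = (1 - gamma) e_{s0}^T (I - gamma P_pi)^{-1}.
P_pi = np.einsum('sa,sat->st', pi, P)
d = (1 - gamma) * np.linalg.solve((np.eye(nS) - gamma * P_pi).T, np.eye(nS)[s0])

# Right-hand side of the simulation lemma.
model_err = np.einsum('sat,t->sa', P_hat - P, V_hat)   # (P_hat - P)^T V_hat
rhs = gamma / (1 - gamma) * np.einsum('s,sa,sa->', d, pi, model_err)

print(V_hat[s0] - V[s0], rhs)   # the two numbers should agree

Running this prints the left- and right-hand sides of the identity; they match up to floating-point error, as the lemma predicts.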