Question
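Problem 5. (40pt) As shown in class, given an MDP problem, the mirror descent update with $B(\pi, \tilde{\pi}) = \sum_{s \in \mathcal{S}} d_\rho^{\pi}(s)\, D_{\mathrm{KL}}\bigl(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\bigr)$ is given by

$$\pi_{k+1}(\cdot \mid s) = \arg\max_{\pi(\cdot \mid s) \in \Delta(\mathcal{A})} \Bigl\{ \sum_{a \in \mathcal{A}} Q^{\pi_k}(s, a)\bigl(\pi(a \mid s) - \pi_k(a \mid s)\bigr) - \frac{1}{\alpha_k} D_{\mathrm{KL}}\bigl(\pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\bigr) \Bigr\}$$

for all $s \in \mathcal{S}$, where $d_\rho^{\pi} \in \Delta(\mathcal{S})$ is the discounted state visitation distribution for the Markov stationary policy $\pi$, $\rho \in \Delta(\mathcal{S})$ is the initial state distribution, and $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the KL divergence. Show that this update is equivalent to

$$\pi_{k+1}(a \mid s) = \frac{\pi_k(a \mid s) \exp\bigl(\alpha_k Q^{\pi_k}(s, a)\bigr)}{\sum_{a' \in \mathcal{A}} \pi_k(a' \mid s) \exp\bigl(\alpha_k Q^{\pi_k}(s, a')\bigr)} \quad \forall (s, a).$$

Note that this is a constrained optimization problem due to $\pi(\cdot \mid s) \in \Delta(\mathcal{A})$. You can apply Lagrange multiplier methods with the first-order optimality conditions to solve it, as the KL divergence $D_{\mathrm{KL}}(\pi \,\|\, \tilde{\pi})$ is a convex function of $\pi$ when $\tilde{\pi}$ is fixed.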
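The claimed equivalence is easy to sanity-check numerically at a single state. The sketch below is only an illustration, not part of the original problem: it assumes the per-state objective is the linear term minus (1/alpha) times the KL divergence, and the names `md_objective`, `closed_form_update`, and `random_simplex_point`, the action count, and the step size are all hypothetical. It compares the softmax-style closed form against randomly sampled policies on the simplex.

```python
import math
import random

def md_objective(p, p_k, q, alpha):
    """Assumed per-state objective: <q, p - p_k> - (1/alpha) * KL(p || p_k)."""
    linear = sum(qa * (pa - pka) for qa, pa, pka in zip(q, p, p_k))
    kl = sum(pa * math.log(pa / pka) for pa, pka in zip(p, p_k) if pa > 0)
    return linear - kl / alpha

def closed_form_update(p_k, q, alpha):
    """Candidate maximizer: p(a) proportional to p_k(a) * exp(alpha * q(a))."""
    q_max = max(q)  # shift for numerical stability; it cancels after normalizing
    weights = [pka * math.exp(alpha * (qa - q_max)) for pka, qa in zip(p_k, q)]
    total = sum(weights)
    return [w / total for w in weights]

def random_simplex_point(n, rng):
    """Draw a point on the probability simplex by normalizing exponential draws."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
n_actions, alpha = 4, 0.7  # hypothetical problem size and step size
p_k = random_simplex_point(n_actions, rng)              # current policy at state s
q = [rng.uniform(-1.0, 1.0) for _ in range(n_actions)]  # stand-in for Q^{pi_k}(s, .)

p_star = closed_form_update(p_k, q, alpha)
best = md_objective(p_star, p_k, q, alpha)

# No randomly sampled feasible policy should beat the closed-form candidate.
gap = min(
    best - md_objective(random_simplex_point(n_actions, rng), p_k, q, alpha)
    for _ in range(2000)
)
print("sum of p_star:", sum(p_star))
print("smallest objective gap over samples:", gap)
```

A nonnegative gap is consistent with the closed form being the maximizer, but random sampling is only evidence, not a proof; the Lagrangian argument is what establishes the result.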
Step by Step Solution
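The page's own solution steps are not shown, so the following is only a sketch of how the Lagrangian argument suggested in the problem can go. It assumes the per-state objective is $\sum_{a} Q^{\pi_k}(s,a)\bigl(\pi(a \mid s) - \pi_k(a \mid s)\bigr) - \frac{1}{\alpha_k} D_{\mathrm{KL}}\bigl(\pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\bigr)$ and that $\pi_k(a \mid s) > 0$ for every action. The shorthand $p$, $p_k$, $Q$, and the multiplier $\lambda$ are notation introduced here for a fixed state $s$.

```latex
% Fix a state s and abbreviate
%   p(a) = \pi(a \mid s),  p_k(a) = \pi_k(a \mid s),  Q(a) = Q^{\pi_k}(s, a).
\begin{align*}
\mathcal{L}(p, \lambda)
  &= \sum_{a \in \mathcal{A}} Q(a)\bigl(p(a) - p_k(a)\bigr)
     - \frac{1}{\alpha_k} \sum_{a \in \mathcal{A}} p(a) \log\frac{p(a)}{p_k(a)}
     + \lambda \Bigl(1 - \sum_{a \in \mathcal{A}} p(a)\Bigr), \\
0 = \frac{\partial \mathcal{L}}{\partial p(a)}
  &= Q(a) - \frac{1}{\alpha_k}\Bigl(\log\frac{p(a)}{p_k(a)} + 1\Bigr) - \lambda
  \;\Longrightarrow\;
  p(a) = p_k(a) \exp\bigl(\alpha_k Q(a)\bigr)\, e^{-1 - \alpha_k \lambda}, \\
1 = \sum_{a' \in \mathcal{A}} p(a')
  &\;\Longrightarrow\;
  e^{-1 - \alpha_k \lambda}
  = \frac{1}{\sum_{a' \in \mathcal{A}} p_k(a') \exp\bigl(\alpha_k Q(a')\bigr)}, \\
\pi_{k+1}(a \mid s)
  &= \frac{\pi_k(a \mid s) \exp\bigl(\alpha_k Q^{\pi_k}(s, a)\bigr)}
          {\sum_{a' \in \mathcal{A}} \pi_k(a' \mid s) \exp\bigl(\alpha_k Q^{\pi_k}(s, a')\bigr)}.
\end{align*}
```

Two remarks complete the argument. The objective is a linear function of $p$ minus a positive multiple of the KL divergence, which is convex in $p$, so it is concave and the stationary point is a global maximizer. The nonnegativity constraints $p(a) \ge 0$ never bind, because the stationary point is strictly positive wherever $\pi_k(a \mid s) > 0$, so only the normalization constraint needs a multiplier.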