Question
Exercise 3. Consider a discounted dynamic programming problem with state space S = {0, 1}, where the set of admissible actions at any state x ∈ S is A(x) = {1, 2}. The cost function C(x, a) is given by

C(0, 1) = 1, C(1, 1) = 2, C(0, 2) = 0, C(1, 2) = 2.

The transition probabilities p(y | x, a) are fully determined by

p(0 | 0, 1) = 1/2, p(0 | 0, 2) = 1/4, p(0 | 1, 1) = 1/3, p(0 | 1, 2) = 2/3.

Let β = 1/2.

(a) Starting with W^(0)(x) = 0 for all x ∈ S, use the value iteration algorithm to approximate the value function W_β by W^(3) := TW^(2). Then what is the stationary policy obtained as the minimiser of TW^(3)? Determine with justifications whether it is an optimal policy. [40 marks]

(b) Now let f be the stationary policy that chooses action 1 in both states 0 and 1. Apply the policy iteration algorithm with the initial policy g^(0) = f to generate policies until you reach an optimal stationary policy. [10 marks]

(Hint: adjust the dynamic programming operator T_g as well as the value iteration and policy iteration algorithms accordingly, as we are dealing with a minimisation problem here.)
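As a check on the hand calculations, the following is a minimal Python sketch of the computations the exercise asks for. It assumes the transition probabilities read p(0|0,1) = 1/2, p(0|0,2) = 1/4, p(0|1,1) = 1/3, p(0|1,2) = 2/3 and β = 1/2 (one plausible reading of the problem data; adjust the numbers if your version of the exercise differs), and it uses exact fraction arithmetic so the iterates can be compared line by line with a hand computation.

```python
from fractions import Fraction as F

# Problem data. The fraction values below are an assumed reading of the
# transition probabilities in the question; change them if your notes differ.
states, actions = (0, 1), (1, 2)
beta = F(1, 2)                                    # discount factor beta = 1/2
C = {(0, 1): F(1), (1, 1): F(2), (0, 2): F(0), (1, 2): F(2)}
p0 = {(0, 1): F(1, 2), (0, 2): F(1, 4),           # p(0 | x, a); p(1 | x, a) = 1 - p(0 | x, a)
      (1, 1): F(1, 3), (1, 2): F(2, 3)}

def p(y, x, a):
    return p0[(x, a)] if y == 0 else 1 - p0[(x, a)]

def Q(W, x, a):
    """Cost of action a at state x plus discounted expected continuation value."""
    return C[(x, a)] + beta * sum(p(y, x, a) * W[y] for y in states)

def T(W):
    """Dynamic programming operator for the minimisation problem: (TW)(x) = min_a Q(W, x, a)."""
    return {x: min(Q(W, x, a) for a in actions) for x in states}

def greedy(W):
    """Stationary policy attaining the minimum in TW (ties broken toward action 1)."""
    return {x: min(actions, key=lambda a: Q(W, x, a)) for x in states}

# Part (a): three steps of value iteration from W^(0) = 0.
W = {0: F(0), 1: F(0)}
for k in range(1, 4):
    W = T(W)
    print(f"W^({k}) = {W}")
print("greedy policy from W^(3):", greedy(W))

# Part (b): policy iteration starting from the policy f with f(0) = f(1) = 1.
def evaluate(g):
    """Solve W_g = C_g + beta * P_g W_g exactly (2x2 linear system, Cramer's rule)."""
    c = [C[(0, g[0])], C[(1, g[1])]]
    q = [p0[(0, g[0])], p0[(1, g[1])]]
    A = [[1 - beta * q[0], -beta * (1 - q[0])],
         [-beta * q[1], 1 - beta * (1 - q[1])]]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return {0: (c[0] * A[1][1] - A[0][1] * c[1]) / det,
            1: (A[0][0] * c[1] - A[1][0] * c[0]) / det}

g = {0: 1, 1: 1}                                  # initial policy g^(0) = f
while True:
    Wg = evaluate(g)                              # policy evaluation
    g_next = greedy(Wg)                           # policy improvement
    print("policy", g, "has value", Wg, "-> improved policy", g_next)
    if g_next == g:                               # stable under improvement => optimal
        break
    g = g_next
print("optimal stationary policy:", g)
```

For part (a), one way to justify (or refute) optimality of the greedy policy obtained from W^(3) is to compare it with the stationary policy produced by policy iteration, which is optimal once it is stable under the improvement step; the sketch prints both so the two parts can be cross-checked.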