Question:
Consider the following deterministic MDP with 1-dimensional continuous states and actions and a finite task horizon:
State Space S: R
Action Space A: R
Reward Function: R(s, a, s') = −qs² − ra², where r > 0 and q ≥ 0
Deterministic Dynamics/Transition Function: s' = cs + da (i.e., the next state s' is a deterministic function of the current state s and the action a)
Task Horizon: T ∈ N
Discount Factor: γ = 1 (no discount factor)
Hence, we would like to maximize a quadratic reward function that favors small actions and states close to the origin. In this problem, we will design an optimal agent π*_t and also solve for the optimal agent's value function V*_t for all time steps.
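As a purely illustrative reading of this setup, a minimal Python sketch of the reward and the deterministic transition might look as follows; the numeric constants are placeholders, not values given in the problem.

```python
def reward(s, a, q=0.5, r=1.0):
    """R(s, a, s') = -q*s^2 - r*a^2 (independent of s'); assumes q >= 0, r > 0."""
    return -q * s**2 - r * a**2

def transition(s, a, c=1.1, d=0.9):
    """Deterministic dynamics: s' = c*s + d*a."""
    return c * s + d * a
```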
By induction, we will show that V*_t is quadratic in s. Observe that the base case t = 0 holds trivially because V*_0(s) = 0. For all parts below, assume that V*_t(s) = −p_t s² (inductive hypothesis).
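For reference (not part of the original problem statement), the recursion driving this induction is the standard finite-horizon Bellman optimality backup with γ = 1 and deterministic dynamics:

V*_{t+1}(s) = max_a [ R(s, a, s') + V*_t(s') ],   where s' = cs + da.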
a. (i) Write the equation for V*_{t+1}(s) as a function of s, q, r, a, c, d, and p_t. If your expression contains a max, you do not need to simplify it.
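As a sketch of what this equation looks like after substituting the reward, the dynamics s' = cs + da, and the inductive hypothesis V*_t(s') = −p_t s'²:

V*_{t+1}(s) = max_a [ −qs² − ra² − p_t (cs + da)² ].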
(ii) Now solve for π*_{t+1}(s). Recall that you can find local maxima of a function by computing its first derivative and setting it to 0.
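A sketch of that calculation, assuming p_t ≥ 0 so that the objective above is concave in a (its a² coefficient is −(r + p_t d²) < 0): setting the derivative with respect to a to zero,

−2ra − 2 p_t d (cs + da) = 0,

and solving for a gives

π*_{t+1}(s) = −( c d p_t / (r + d² p_t) ) s.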
b. Assume π*_{t+1}(s) = k_{t+1} s for some k_{t+1} ∈ R. Solve for p_{t+1} in V*_{t+1}(s) = −p_{t+1} s².
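Sketch, under the same assumptions: substituting a = k_{t+1} s into the backup above makes every term proportional to s², so

V*_{t+1}(s) = −[ q + r k_{t+1}² + p_t (c + d k_{t+1})² ] s²,   i.e.,   p_{t+1} = q + r k_{t+1}² + p_t (c + d k_{t+1})².

With k_{t+1} = −c d p_t / (r + d² p_t), this simplifies to p_{t+1} = q + c² r p_t / (r + d² p_t).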
Step by Step Answer:
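The textbook's worked answer is not reproduced here. Purely as an illustrative sanity check of the recursion sketched above, the following Python snippet compares the closed-form backup p_{t+1} = q + r k² + p_t (c + dk)², with k = −c d p_t / (r + d² p_t), against a brute-force maximization over a fine grid of actions; the constants q, r, c, d are arbitrary placeholders.

```python
import numpy as np

# Placeholder constants (not from the problem statement; chosen only for illustration).
q, r, c, d = 0.5, 1.0, 1.1, 0.9

def riccati_step(p_t):
    """One backup of the scalar recursion sketched above."""
    k = -c * d * p_t / (r + d**2 * p_t)
    p_next = q + r * k**2 + p_t * (c + d * k)**2
    return k, p_next

def brute_force_value(s, p_t, actions=np.linspace(-10, 10, 200001)):
    """Numerically maximize -q*s^2 - r*a^2 - p_t*(c*s + d*a)^2 over a grid of actions."""
    vals = -q * s**2 - r * actions**2 - p_t * (c * s + d * actions)**2
    return vals.max()

p = 0.0  # base case: V*_0(s) = 0, i.e., p_0 = 0
for t in range(3):
    k, p_next = riccati_step(p)
    for s in (0.3, -1.7, 2.5):
        # The grid maximum should match the quadratic value -p_{t+1} * s^2.
        assert np.isclose(brute_force_value(s, p), -p_next * s**2, atol=1e-6)
    p = p_next
print("Recursion matches brute-force backups; final p =", p)
```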
Source: Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 4th Edition, ISBN 9780134610993.