Question
Consider the following grid environment. Starting from any unshaded square, you can move up, down, left, or right. Actions are deterministic and always succeed (e.g., going left from state 1 goes to state 0) unless they would cause the agent to run into a wall. The thicker edges indicate walls, and attempting to move in the direction of a wall results in staying in the same square. Taking any action from the green target square (no. 5) earns a reward of +5 and ends the episode. Otherwise, each move is associated with some reward r ∈ {−1, 0, +1}. Assume the discount factor γ = 1 unless otherwise specified.

[Figure: a 5 × 5 grid with squares numbered 0–24 left to right, top to bottom; square 5 is the green target, and thicker edges mark walls.]
a. Define the reward r for all states (except state 5, whose reward is specified above) that would cause the optimal policy to return the shortest path to the green target square (no. 5).
b. Using r from part (a), find the optimal value function for each square (a value-iteration sketch for checking this appears after the question).
c. Does setting γ = 0.8 change the optimal policy?
d. Define the reward r(s) = 0 for all states (except state 5, whose reward is specified above). Assume γ = 0.8 as in part (c). How would the value function change? How would the policy change? Explain why.
e. All transitions are even better now: each transition earns an extra reward of +1 in addition to the reward you defined in part (a). Assume γ = 0.8 as in part (c). How would the value function change? How would the policy change? Explain why.
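Parts (b) through (e) can be checked numerically with value iteration. Below is a minimal sketch, assuming a 5 × 5 grid with states 0–24 in row-major order; the wall positions and shaded (blocked) squares come only from the figure and are not given in the text, so WALLS is an empty placeholder, and the names next_state and value_iteration are illustrative rather than part of the original problem.

```python
WALLS = set()          # placeholder: (state, action) pairs blocked by walls in the figure
N = 5                  # the grid is 5 x 5, states 0..24 in row-major order
TARGET = 5             # green target square
TARGET_REWARD = 5.0    # any action from the target earns +5 and ends the episode

ACTIONS = {"up": -N, "down": N, "left": -1, "right": 1}

def next_state(s, a):
    """Deterministic move; hitting a wall or the grid boundary keeps the agent in place."""
    if (s, a) in WALLS:
        return s
    row, col = divmod(s, N)
    if a == "up" and row == 0: return s
    if a == "down" and row == N - 1: return s
    if a == "left" and col == 0: return s
    if a == "right" and col == N - 1: return s
    return s + ACTIONS[a]

def value_iteration(step_reward, gamma, tol=1e-9):
    """Optimal state values under a constant per-step reward and discount gamma."""
    V = [0.0] * (N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s == TARGET:
                new_v = TARGET_REWARD  # the episode ends here, so no discounted continuation
            else:
                new_v = max(step_reward + gamma * V[next_state(s, a)] for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

# e.g., with r = -1 per step and gamma = 1, each square's value is 5 minus its
# shortest-path length to square 5; rerun with gamma = 0.8 and/or a different
# step_reward to see how parts (c)-(e) affect the values and the greedy policy.
print(value_iteration(step_reward=-1.0, gamma=1.0))
print(value_iteration(step_reward=0.0, gamma=0.8))
```

Rerunning the sketch with different step_reward and gamma values shows how the value function and the resulting greedy policy respond to the changes described in parts (c) through (e); any real check would also need the figure's walls and shaded squares filled in.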