
Question

1. Consider the following Markov decision process, with the gridworld and transition function as illustrated
below. The states are grid squares, identified by their row and column number (row first). The agent
always starts in state (1,1), marked with the letter S. There are two terminal goal states: (2,3) with reward
+5 and (1,3) with reward -5. Rewards are 0 in non-terminal states. (The reward for a state is received
as the agent moves into the state.) The transition function is such that the intended agent movement
(North, South, West, or East) happens with probability 0.8. With probability 0.1 each, the agent ends up
in one of the states perpendicular to the intended direction. If a collision with a wall happens, the agent
stays in the same state.
[Figure: (a) Gridworld MDP. (b) Transition function.]
(a) Draw the optimal policy for this grid.
(b) Suppose the agent knows the transition probabilities. Give the first two rounds of value iteration
updates for each state, with a discount of 0.9. (Assume V0 is 0 everywhere and compute Vi for
i = 1, 2.)
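For reference, here is a minimal Python sketch of how value iteration could be run on this gridworld. The 2-row by 3-column layout, the convention that "North" decreases the row index, and treating the terminal states as absorbing with value 0 are assumptions read off the description above, not details fixed by the question itself.

```python
# A minimal value-iteration sketch for the gridworld described above.
# Assumptions: the grid is 2 rows x 3 columns, "North" decreases the row
# index, and terminal states are absorbing with value 0 (their reward is
# collected on the transition into them).

GAMMA = 0.9
ROWS, COLS = 2, 3
REWARD_ON_ENTRY = {(1, 3): -5.0, (2, 3): +5.0}   # terminal rewards
ACTIONS = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
PERPENDICULAR = {"N": ("W", "E"), "S": ("W", "E"), "W": ("N", "S"), "E": ("N", "S")}

def move(state, action):
    """Deterministic displacement; a collision with a wall leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 1 <= nr <= ROWS and 1 <= nc <= COLS else state

def transitions(state, action):
    """(probability, next_state) pairs: 0.8 intended, 0.1 for each perpendicular move."""
    left, right = PERPENDICULAR[action]
    return [(0.8, move(state, action)),
            (0.1, move(state, left)),
            (0.1, move(state, right))]

def value_iteration(sweeps):
    V = {(r, c): 0.0 for r in range(1, ROWS + 1) for c in range(1, COLS + 1)}
    for _ in range(sweeps):
        V_new = dict(V)
        for s in V:
            if s in REWARD_ON_ENTRY:
                continue                     # terminal states keep value 0
            V_new[s] = max(
                sum(p * (REWARD_ON_ENTRY.get(s2, 0.0) + GAMMA * V[s2])
                    for p, s2 in transitions(s, a))
                for a in ACTIONS)
        V = V_new
    return V

print(value_iteration(sweeps=1))   # V1 for every state
print(value_iteration(sweeps=2))   # V2 for every state
```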
(c) Suppose the agent does not know the transition probabilities. What does it need (or must it have
available) in order to learn the optimal policy?
(d) The agent starts with the policy that always chooses to go right, and executes the following three
trials:
(1,1)-(1,2)-(1,3),
(1,1)-(1,2)-(2,2)-(2,3), and
(1,1)-(2,1)-(2,2)-(2,3).
What are the Monte Carlo (direct utility) estimates for states (1,1) and (2,2), given these traces? (4)
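A small sketch of how the direct-utility (Monte Carlo) estimates could be computed from these three traces. It assumes the -5/+5 rewards are collected on entering (1,3)/(2,3) and are 0 otherwise, and it reuses the discount of 0.9 from part (b), which the question does not restate here.

```python
# Monte Carlo (direct utility) estimates from the three recorded trials.
# Assumptions: rewards of -5 / +5 on entering (1,3) / (2,3), 0 otherwise,
# and the same discount of 0.9 as in part (b).

GAMMA = 0.9
REWARD_ON_ENTRY = {(1, 3): -5.0, (2, 3): +5.0}

trials = [
    [(1, 1), (1, 2), (1, 3)],
    [(1, 1), (1, 2), (2, 2), (2, 3)],
    [(1, 1), (2, 1), (2, 2), (2, 3)],
]

observed = {}                                  # state -> list of sampled returns
for trial in trials:
    for i, state in enumerate(trial[:-1]):     # the terminal itself gets no return sample
        g, discount = 0.0, 1.0
        for nxt in trial[i + 1:]:              # discounted reward-to-go from this state
            g += discount * REWARD_ON_ENTRY.get(nxt, 0.0)
            discount *= GAMMA
        observed.setdefault(state, []).append(g)

estimates = {s: sum(gs) / len(gs) for s, gs in observed.items()}
print(estimates[(1, 1)], estimates[(2, 2)])
```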
(e) Using a learning rate of 0.1 and assuming initial values of 0, what updates does the TD-learning agent
make after trials 1 and 2, above? First give the TD-learning update equation, and then provide the
updates after the two trials.
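A hedged sketch of the corresponding TD(0) processing of trials 1 and 2, using the update V(s) <- V(s) + α(r + γ·V(s') − V(s)) with α = 0.1, γ = 0.9, and all values initialised to 0; keeping the terminal states' values fixed at 0 is an assumption.

```python
# TD(0) updates applied along trials 1 and 2.
# Assumptions: alpha = 0.1, gamma = 0.9, all values start at 0, and the
# values of the terminal states (1,3) and (2,3) stay 0.

ALPHA, GAMMA = 0.1, 0.9
REWARD_ON_ENTRY = {(1, 3): -5.0, (2, 3): +5.0}
TERMINALS = set(REWARD_ON_ENTRY)

V = {}                                         # unseen states default to 0

def td_update(trial):
    for s, s_next in zip(trial, trial[1:]):
        r = REWARD_ON_ENTRY.get(s_next, 0.0)   # reward collected on entering s_next
        v_next = 0.0 if s_next in TERMINALS else V.get(s_next, 0.0)
        v = V.get(s, 0.0)
        V[s] = v + ALPHA * (r + GAMMA * v_next - v)

td_update([(1, 1), (1, 2), (1, 3)])            # trial 1
td_update([(1, 1), (1, 2), (2, 2), (2, 3)])    # trial 2
print(V)                                       # values after the two trials
```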
Consider the MDP below, in which there are two states, x and Y, two actions, right and left, and the
deterministic rewards on each transition are as indicated by the numbers. Note that if action right is
taken in state x, then the transition may be either to x with a reward of +2 or to Y with a reward of -2.
These two possibilities occur with probabilities 2/3 (for the transition to x) and 1/3 (for the transition
to state Y).
Consider two deterministic policies:
π1(x) = left, π1(Y) = right
π2(x) = right, π2(Y) = right
(a) Show a typical trajectory for policy 1 from state x.
(b) Show a typical trajectory for policy 2 from state x.
(c) Assuming γ = 0.5, the value of state Y under policy π1 is:
(d) Assuming γ = 0.5, the action-value of (x, left) under policy π1 is:
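As a starting point for parts (c) and (d), the relevant Bellman equations can be written out as below. This is only a sketch: the rewards for the left action and for right in state Y appear in the figure rather than in the text, so only the (x, right) transition is expanded using the stated +2/-2 rewards and 2/3, 1/3 probabilities.

```latex
% Bellman equation for a deterministic policy pi with gamma = 0.5, plus the
% one action-value the question text fully specifies (action right in state x).
\begin{align*}
  V^{\pi}(s) &= \sum_{s'} p(s' \mid s, \pi(s))\,\bigl[r(s, \pi(s), s') + \gamma\,V^{\pi}(s')\bigr],
  \qquad \gamma = 0.5 \\
  Q^{\pi}(x, \text{right}) &= \tfrac{2}{3}\bigl[\,2 + \gamma\,V^{\pi}(x)\bigr]
  + \tfrac{1}{3}\bigl[-2 + \gamma\,V^{\pi}(Y)\bigr]
\end{align*}
```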
The plot below shows the target value per time step as a grey line. Apply the equation below
to determine the estimates for time steps 2 to 13. Draw your answers on the graph below, where the first
value is provided in blue. Show all your calculations.
Qn+1 = Qn + α[Rn - Qn]
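A small sketch of how this exponential recency-weighted update would be applied step by step. The target values Rn come from the grey line in the plot, which is not reproduced here, so both the target sequence and the step size below are placeholders.

```python
# Iterating Q_{n+1} = Q_n + alpha * (R_n - Q_n) over a sequence of targets.
# Both the step size and the target sequence are placeholders; the real
# values come from the plot referred to in the question.

ALPHA = 0.1                                   # placeholder step size
targets = [1.0, 3.0, 2.0, 4.0, 3.0]           # placeholder R_n values read off the grey line

q = targets[0]                                # placeholder for the first estimate (shown in blue)
estimates = [q]
for r in targets[1:]:
    q = q + ALPHA * (r - q)                   # move a fraction alpha toward the new target
    estimates.append(q)
print(estimates)                              # estimate at each time step, from the given first value
```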
