Question
Let us define a gridworld MDP, depicted in Figure 2. The states are grid squares, identified
by their row and column number (row first). The agent always starts in the state marked
with the letter S. There are two terminal goal states, [...] with reward [...] and [...] with
reward [...]. Rewards are [...] in nonterminal states. The reward for a state is received as
the agent moves into the state. The transition function is such that the intended agent
movement (Up, Down, Left, or Right) happens with a probability of [...]. With a probability
of [...] each, the agent ends up in one of the states perpendicular to the intended direction.
If a collision with a wall happens, the agent stays in the same state.
Figure 2: Left: Gridworld MDP. Right: Transition function.
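The transition rule described above can be sketched in code. This is only an illustration under stated assumptions: the grid dimensions, the wall set, the convention that row 1 is the top row (so Up decreases the row index), and the idea that the two perpendicular outcomes split the remaining probability equally are all assumptions, since the figure and the numeric values are not available here. The names `move` and `transition` are hypothetical.

```python
# Hypothetical sketch of the noisy gridworld transition function.
# Assumptions (not from the source): row 1 is the top row, the grid is
# n_rows x n_cols, and the two perpendicular moves share the leftover
# probability equally.

DIRS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}
PERP = {
    "Up": ("Left", "Right"),
    "Down": ("Left", "Right"),
    "Left": ("Up", "Down"),
    "Right": ("Up", "Down"),
}


def move(state, direction, n_rows, n_cols, walls=frozenset()):
    """Apply one deterministic move; on a collision the agent stays put."""
    r, c = state
    dr, dc = DIRS[direction]
    nxt = (r + dr, c + dc)
    in_grid = 1 <= nxt[0] <= n_rows and 1 <= nxt[1] <= n_cols
    return nxt if in_grid and nxt not in walls else state


def transition(state, action, p_intended, n_rows, n_cols, walls=frozenset()):
    """Return a dict mapping each possible next state to its probability."""
    p_side = (1.0 - p_intended) / 2.0
    outcomes = [(action, p_intended)] + [(d, p_side) for d in PERP[action]]
    dist = {}
    for direction, p in outcomes:
        nxt = move(state, direction, n_rows, n_cols, walls)
        # Several outcomes can land in the same state (e.g. after collisions).
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist
```

Calling `transition(state, action, p_intended, n_rows, n_cols)` with the problem's actual probability and grid size would give the distribution over next states for one action; summing the returned probabilities is a quick sanity check.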
(a) Define the optimal policy for this gridworld MDP.
(b) Suppose the agent does not know the transition probabilities. What does
the agent need to do in order to learn the optimal policy?
(c) The agent starts with the policy that always chooses to go right, and executes
the following three trajectories: [...], [...], and [...]. What are the First-Visit
Monte Carlo estimates for states [...] and [...], given these trajectories? Suppose [...].
(d) Using a learning rate of [...] and assuming initial values of [...], what
updates does the TD-learning agent make after trials [...] and [...] above? For this part,
suppose [...].

Please answer all questions in detail. Thank you.
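For parts c and d, the two estimators named in the question can be sketched generically. This is a minimal sketch, not a solution to the specific trajectories, which are not recoverable from the text above. It assumes each trajectory is given as a list of `(state, reward, next_state)` steps, that the reward is the one received on entering `next_state`, and that terminal states have value 0. The function names, the `terminals` argument, and the default discount factor `gamma=1.0` are assumptions for illustration.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of V(s).

    episodes: list of trajectories, each a list of
    (state, reward, next_state) steps.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the discounted return from each time step, backwards.
        g = 0.0
        gs = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, r, _ = episode[t]
            g = r + gamma * g
            gs[t] = g
        # Record the return only at the first visit to each state.
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(gs[t])
    return {s: sum(v) / len(v) for s, v in returns.items()}


def td0(episodes, alpha, gamma=1.0, initial_value=0.0, terminals=frozenset()):
    """TD(0) updates: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = defaultdict(lambda: initial_value)
    for episode in episodes:
        for s, r, s_next in episode:
            # Terminal states are assumed to have value 0.
            v_next = 0.0 if s_next in terminals else V[s_next]
            V[s] += alpha * (r + gamma * v_next - V[s])
    return dict(V)
```

The difference worth noting is that `first_visit_mc` waits for a full trajectory and averages complete returns, while `td0` updates after every step using the current estimate of the next state's value, so its result depends on the order of the trials.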