Question
1. Consider the following Markov decision process, with the gridworld and transition function as illustrated below. The states are grid squares, identified by their row and column number (row first). The agent always starts in the state marked with the letter. There are two terminal goal states, one with reward and one with reward . Rewards are in non-terminal states; the reward for a state is received as the agent moves into that state. The transition function is such that the intended agent movement (North, South, West, or East) happens with probability , and with probability each, the agent ends up in one of the states perpendicular to the intended direction. If a collision with a wall happens, the agent stays in the same state.

[Figure: (a) Gridworld MDP. (b) Transition function.]
(a) Draw the optimal policy for this grid.

(b) Suppose the agent knows the transition probabilities. Give the first two rounds of value-iteration updates for each state, with a discount of . Assume is everywhere and compute for times .
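A round of value iteration as in part (b) can be sketched as follows. The specific grid layout, rewards, discount, and slip probabilities are assumptions for illustration (0.8 intended / 0.1 perpendicular, γ = 0.9), not values from the problem statement.

```python
# One synchronous round of value iteration for a gridworld MDP.
# ASSUMED constants (not from the problem statement): 0.8 intended move,
# 0.1 for each perpendicular slip, discount gamma = 0.9.
GAMMA = 0.9
P_INTENDED, P_PERP = 0.8, 0.1
ACTIONS = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
PERP = {"N": ("W", "E"), "S": ("W", "E"), "W": ("N", "S"), "E": ("N", "S")}

def step(state, move, rows, cols):
    """Deterministic move; a wall collision leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[move]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < rows and 0 <= nc < cols else (r, c)

def value_iteration_round(V, R, terminals, rows, cols):
    """One Bellman backup over all non-terminal states.

    R[s'] is the reward received on *entering* s', matching the
    convention in the problem statement.
    """
    V_new = dict(V)
    for s in V:
        if s in terminals:
            continue
        q_values = []
        for a in ACTIONS:
            outcomes = [(a, P_INTENDED), (PERP[a][0], P_PERP), (PERP[a][1], P_PERP)]
            q = sum(p * (R[step(s, m, rows, cols)] + GAMMA * V[step(s, m, rows, cols)])
                    for m, p in outcomes)
            q_values.append(q)
        V_new[s] = max(q_values)
    return V_new
```

Calling `value_iteration_round` twice starting from V = 0 everywhere mirrors the two requested rounds of updates.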
(c) Suppose the agent does not know the transition probabilities. What does it need, or what must it have available, in order to learn the optimal policy?

(d) The agent starts with the policy that always chooses to go right, and executes the following three trials: and . What are the Monte Carlo direct utility estimates for states and , given these traces?
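The direct-utility computation in part (d) amounts to averaging the observed returns from each state over the trials. A minimal sketch, assuming traces are given as (state, reward-on-entering) pairs; the example trials in the test are made up, since the problem's actual traces are not reproduced above:

```python
# Monte Carlo "direct utility estimation": the utility of a state is the
# average of the returns observed from that state onward across trials.
# Trace format (an assumption): a list of (state, reward) pairs, where the
# reward is received on entering that state.
def mc_utility_estimates(trials, gamma=1.0, first_visit=True):
    returns = {}  # state -> list of observed returns
    for trace in trials:
        seen = set()
        for i, (s, _) in enumerate(trace):
            if first_visit and s in seen:
                continue
            seen.add(s)
            # Return from step i: discounted sum of rewards received after
            # leaving s (i.e., rewards attached to the subsequent states).
            g = sum(gamma ** k * r for k, (_, r) in enumerate(trace[i + 1:]))
            returns.setdefault(s, []).append(g)
    return {s: sum(v) / len(v) for s, v in returns.items()}
```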
(e) Using a learning rate of and assuming initial values of , what updates does the TD-learning agent make after trials and above? First give the TD-learning update equation, and then provide the updates after the two trials.
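The TD(0) update asked for in part (e) is U(s) ← U(s) + α(r + γU(s′) − U(s)), applied once per observed transition. A minimal sketch; the learning rate α = 0.1 and γ = 1.0 below are placeholders, not the values from the problem:

```python
# TD(0) update for a single observed transition (s, r, s'):
#   U(s) <- U(s) + alpha * (r + gamma * U(s') - U(s))
# alpha = 0.1 and gamma = 1.0 are ASSUMED defaults for illustration.
def td0_update(U, s, r, s_next, alpha=0.1, gamma=1.0):
    U[s] = U[s] + alpha * (r + gamma * U[s_next] - U[s])
    return U
```

Replaying each trial transition by transition through `td0_update` yields the per-trial updates the question asks for.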
2. Consider the MDP below, in which there are two states and two actions, right and left, and the deterministic rewards on each transition are as indicated by the numbers. Note that if action right is taken in state , then the transition may be either to , with a reward of , or to , with a reward of . These two possibilities occur with probability for the transition to and for the transition to state .
Consider two deterministic policies:
left, right
right, right

(a) Show a typical trajectory for policy from state .
(b) Show a typical trajectory for policy from state .
(c) Assuming , the value of state under policy is:
(d) Assuming , the action-value of left under policy is:
3. The plot below shows the target value per time step as a grey line. Apply the equation below to determine the estimates for time steps to . Draw your answers on the graph below, where the first value is provided in blue. Show all your calculations.
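The referenced update equation is presumably the running-average form V ← V + α(target − V). A minimal sketch under that assumption; the target sequence, step size, and initial value in the test are illustrative, not the plot's actual values:

```python
# Track a per-step target with the incremental update
#   V <- V + alpha * (target - V)
# The step size alpha and initial value v0 are ASSUMED placeholders.
def track(targets, v0=0.0, alpha=0.5):
    estimates = [v0]
    for t in targets:
        estimates.append(estimates[-1] + alpha * (t - estimates[-1]))
    return estimates
```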