
Question


Question 6. (6 marks)
Consider the following MDP: the set of states is S = {s0, s1, s2, s3} and the set of actions available at each
state is A = {l, r}. Each episode of the MDP starts in s1 and terminates in s0.
You do not know the transition probabilities or the reward function of the MDP, so you are using Sarsa
to find the optimal policy. Suppose the current Q-values are:
Q(s0,l) = 0,    Q(s0,r) = 0
Q(s1,l) = 3.4,  Q(s1,r) = -1.8
Q(s2,l) = -0.8, Q(s2,r) = -0.7
Q(s3,l) = -0.5, Q(s3,r) = 7.5
Suppose the next episode is as follows:
s1, l, -1, s1, r, -1, s2, l, -1, s1, l, 10, s0.
(a) (4 marks) Do all the Sarsa updates to the Q-values that would result from this episode, using α = 0.25
and γ = 0.9. Show your working.
(b) (1 mark) Based on the updated Q-values, give the final policy determined by Q, i.e., give π(s1), π(s2),
and π(s3). Show your working.
(c) (1 mark) Give an ε-greedy policy based on the Q-values obtained in (a).
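
Part (a) relies on the standard Sarsa update rule, applied once per transition in the order the episode visits them:

Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') − Q(s,a)],

where (s, a, r, s', a') is one step of the episode and Q(s',a') is taken to be 0 when s' is the terminal state s0. A runnable sketch of the full computation follows the solution heading below.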

Step by Step Solution
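
Below is a minimal Python sketch of the computation, assuming the standard tabular Sarsa rule stated above: it replays the four transitions of the episode, applies the update at each step (so the third update already sees the updated Q(s1,l)), then reads off the greedy policy for (b) and an ε-greedy policy for (c). The exploration rate eps = 0.1 is an illustrative assumption; the question does not fix ε.

```python
import random

alpha, gamma, eps = 0.25, 0.9, 0.1  # eps is an illustrative choice; the question leaves it open

# Current Q-values from the question
Q = {
    ('s0', 'l'): 0.0,  ('s0', 'r'): 0.0,
    ('s1', 'l'): 3.4,  ('s1', 'r'): -1.8,
    ('s2', 'l'): -0.8, ('s2', 'r'): -0.7,
    ('s3', 'l'): -0.5, ('s3', 'r'): 7.5,
}

# The episode s1,l,-1,s1,r,-1,s2,l,-1,s1,l,10,s0 as (s, a, r, s', a') steps;
# a' is None on the final transition into the terminal state s0.
episode = [
    ('s1', 'l', -1, 's1', 'r'),
    ('s1', 'r', -1, 's2', 'l'),
    ('s2', 'l', -1, 's1', 'l'),
    ('s1', 'l', 10, 's0', None),
]

# (a) one Sarsa update per transition, in episode order
for s, a, r, s_next, a_next in episode:
    q_next = Q[(s_next, a_next)] if a_next is not None else 0.0  # terminal state has value 0
    Q[(s, a)] += alpha * (r + gamma * q_next - Q[(s, a)])
    print(f"Q({s},{a}) <- {Q[(s, a)]:.6g}")

# (b) greedy policy: pick the action with the larger Q-value in each state
for s in ('s1', 's2', 's3'):
    print(f"pi({s}) =", max(('l', 'r'), key=lambda a: Q[(s, a)]))

# (c) epsilon-greedy: greedy with probability 1 - eps, uniformly random otherwise
def eps_greedy(s):
    if random.random() < eps:
        return random.choice(('l', 'r'))
    return max(('l', 'r'), key=lambda a: Q[(s, a)])
```

Under this rule the updates come out as Q(s1,l) = 1.895 after the first transition, Q(s1,r) = -1.78, Q(s2,l) = -0.423625, and Q(s1,l) = 3.92125 after the terminal transition, giving the greedy policy π(s1) = l, π(s2) = l, π(s3) = r. The ε-greedy policy takes these actions with probability 1 − ε + ε/2 and the other action with probability ε/2.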

