Question
Consider training an MDP using the following sequence of states, actions, and rewards:
S1, reward 0, action 1
S2, reward 10, action 1
S2, reward 10, action 2
S1, reward 0, action 1
S2, reward 10, action 2
S1, reward 0, action 2
S3, reward 0, action 1
S3, reward 0, action 1
S4, reward 100, action 1
S4, reward 100, action 2
S2, reward 10
(a) Suppose you use certainty-equivalent learning to calculate the J* values. Fill in the table below, using discount factor γ = 0.5.
| State | S1 | S2 | S3 | S4 |
| --- | --- | --- | --- | --- |
| J* value |  |  |  |  |
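Leaving the answer cells blank, here is a minimal sketch of how part (a) could be computed, assuming the reward is attached to the state being left (as the trace suggests), the transition model is the maximum-likelihood estimate from the observed counts, and J* is then found by value iteration on that estimated model. Names like `transitions` and `P` are illustrative, not part of the question.

```python
from collections import defaultdict

# Observed trajectory: (state, reward, action, next_state).
# The final observation "S2, reward 10" has no action, so it only
# serves as the successor of the last full transition.
transitions = [
    ("S1", 0, 1, "S2"),
    ("S2", 10, 1, "S2"),
    ("S2", 10, 2, "S1"),
    ("S1", 0, 1, "S2"),
    ("S2", 10, 2, "S1"),
    ("S1", 0, 2, "S3"),
    ("S3", 0, 1, "S3"),
    ("S3", 0, 1, "S4"),
    ("S4", 100, 1, "S4"),
    ("S4", 100, 2, "S2"),
]

# Maximum-likelihood model: empirical transition frequencies, and the
# (apparently deterministic) reward attached to each state.
counts = defaultdict(lambda: defaultdict(int))
reward = {}
for s, r, a, s2 in transitions:
    counts[(s, a)][s2] += 1
    reward[s] = r

P = {
    sa: {s2: c / sum(nexts.values()) for s2, c in nexts.items()}
    for sa, nexts in counts.items()
}

# Value iteration on the estimated model:
#   J(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) * J(s')
gamma = 0.5
J = {s: 0.0 for s in reward}
for _ in range(100):  # plenty of sweeps to converge at gamma = 0.5
    J = {
        s: reward[s] + gamma * max(
            sum(p * J[s2] for s2, p in P[(s, a)].items())
            for a in (1, 2) if (s, a) in P  # only actions actually observed
        )
        for s in J
    }

print({s: round(v, 2) for s, v in sorted(J.items())})
```

Under these assumptions the fixed point is easy to sanity-check by hand: every observed transition out of S4 under action 1 returns to S4, so J*(S4) = 100 + 0.5·J*(S4), i.e. J*(S4) = 200, and the remaining entries follow the same way from the estimated model.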
(b) Suppose you instead use Q-learning. Assume that all Q-values are initialized to 0. Fill in the table below to show how the Q-values change after the first six transitions, using discount factor γ = 0.5 and learning rate α = 0.5.
| State, action pair | (S1, 1) | (S1, 2) | (S2, 1) | (S2, 2) | (S3, 1) | (S3, 2) | (S4, 1) | (S4, 2) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Q-value at start | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| After observing S1, reward 0, action 1 → S2 |  |  |  |  |  |  |  |  |
| After observing S2, reward 10, action 1 → S2 |  |  |  |  |  |  |  |  |
| After observing S2, reward 10, action 2 → S1 |  |  |  |  |  |  |  |  |
| After observing S1, reward 0, action 1 → S2 |  |  |  |  |  |  |  |  |
| After observing S2, reward 10, action 2 → S1 |  |  |  |  |  |  |  |  |
| After observing S1, reward 0, action 2 → S3 |  |  |  |  |  |  |  |  |
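A matching sketch for part (b), assuming the standard tabular update Q(s, a) ← (1 − α)·Q(s, a) + α·(r + γ·max_a′ Q(s′, a′)) with the same state-attached reward convention; again the names are illustrative.

```python
# Tabular Q-learning over the first six observed transitions, assuming
# the standard update
#   Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
# with the reward attached to the state being left, as in the trace above.
alpha, gamma = 0.5, 0.5
states, actions = ("S1", "S2", "S3", "S4"), (1, 2)
Q = {(s, a): 0.0 for s in states for a in actions}

first_six = [
    ("S1", 0, 1, "S2"),
    ("S2", 10, 1, "S2"),
    ("S2", 10, 2, "S1"),
    ("S1", 0, 1, "S2"),
    ("S2", 10, 2, "S1"),
    ("S1", 0, 2, "S3"),
]

for s, r, a, s2 in first_six:
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    print(f"after ({s}, a={a}) -> {s2}: Q({s},{a}) = {Q[(s, a)]}")
```

Each step changes only the entry for the pair just visited; the second transition, for example, gives Q(S2, 1) = 0.5·0 + 0.5·(10 + 0.5·0) = 5.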