Question: (15 points) Consider the following Markov Decision Process. Unlike most MDP models, where actions have many potential outcomes of varying probability, assume that the transitions are deterministic, i.e. each transition occurs with probability T(s, a, s') = 1, as shown in the graph below. The reward for each transition is displayed adjacent to each edge:
(1) Compute the following six quantities: V2(A), V*(A), V2(B), V*(B), V5(A), V5(B), assuming a discount factor of 1. Please also show your computation. (6 points) (A value-iteration sketch illustrating these quantities follows part (4) below.)
(2) Re-compute the two quantities V*(A) and V*(B), assuming a discount factor of 0.9. Please also show your computation. (4 points)
(3) Consider using feature-based Q-learning to learn the optimal policy. Suppose we use feature f1(s,a), equal to the smallest number of edges from s to E, and feature f2(s,a), equal to the smallest number of edges from a's target state to E. For example, f1(A,B) = 2 and f2(C,D) = 1. Let the current weights w for these features be w1 = 3 and w2 = 2. What is the current prediction for Q(C,D) given these weights and these two features? (2 points)
(4) Suppose we update w based on a single observed transition from C to D, assuming a learning rate of 0.5. What is w after updating on this observation? (3 points) (A sketch of this weight update also follows below.)
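
To make the computations in parts (1) and (2) concrete, here is a minimal value-iteration sketch for a deterministic MDP, in Python. The transition graph from the figure is not reproduced in this text, so the example graph, its rewards, and the printed values are illustrative placeholders rather than the answer; to use the sketch, fill in the transitions dictionary from the figure.

    def value_iteration(transitions, gamma, iterations, terminal=("E",)):
        """Return [V_0, V_1, ..., V_k]: the value estimates after each sweep,
        starting from V_0(s) = 0. transitions[s] maps each action available in s
        to a (next_state, reward) pair, since the MDP is deterministic."""
        states = set(transitions) | set(terminal)
        V = {s: 0.0 for s in states}
        history = [dict(V)]
        for _ in range(iterations):
            new_V = {}
            for s in states:
                if s in terminal or not transitions.get(s):
                    new_V[s] = 0.0  # terminal state: no outgoing transitions
                else:
                    # Deterministic transitions, so Q(s, a) = R(s, a, s') + gamma * V(s')
                    new_V[s] = max(r + gamma * V[s2] for (s2, r) in transitions[s].values())
            V = new_V
            history.append(dict(V))
        return history

    # Placeholder graph (NOT the one in the question), just to show the call pattern:
    example = {
        "A": {"toB": ("B", 2.0)},
        "B": {"toC": ("C", 1.0), "toE": ("E", 4.0)},
        "C": {"toD": ("D", 3.0)},
        "D": {"toE": ("E", 1.0)},
    }
    hist = value_iteration(example, gamma=1.0, iterations=5)
    print(hist[2]["A"], hist[5]["A"])  # V2(A) and V5(A) for the placeholder graph

For V*(A) and V*(B), keep sweeping until the values stop changing; if the graph is acyclic, they converge after a number of sweeps equal to the longest path length to E. Part (2) only changes the discount factor passed as gamma.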
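
For parts (3) and (4), the standard linear (feature-based) Q-learning formulation predicts Q(s,a) = w1*f1(s,a) + w2*f2(s,a) and, after observing a transition (s, a, r, s'), updates each weight by w_i <- w_i + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)] * f_i(s, a). A sketch of both steps follows; the observed reward for C -> D and some feature values depend on the graph, so the numbers marked as placeholders are assumptions, not the answer.

    def q_value(weights, features):
        # Linear Q approximation: Q(s, a) = sum_i w_i * f_i(s, a)
        return sum(w * f for w, f in zip(weights, features))

    def q_update(weights, features, reward, best_next_q, alpha, gamma=1.0):
        # One approximate Q-learning step:
        #   difference = (r + gamma * max_a' Q(s', a')) - Q(s, a)
        #   w_i       <- w_i + alpha * difference * f_i(s, a)
        difference = (reward + gamma * best_next_q) - q_value(weights, features)
        return [w + alpha * difference * f for w, f in zip(weights, features)]

    # Part (3): prediction for Q(C, D) with w = (3, 2). f2(C, D) = 1 is given in the
    # question; f1(C, D), the distance from C to E, must be read off the graph, so
    # the value 2 used here is a placeholder assumption.
    w = [3.0, 2.0]
    f_CD = [2.0, 1.0]
    print(q_value(w, f_CD))  # 3*2 + 2*1 = 8 under the placeholder f1(C, D) = 2

    # Part (4): one update after observing the C -> D transition with alpha = 0.5.
    # The observed reward and the features of the best action from D are placeholders.
    reward_CD = 1.0
    best_next = q_value(w, [1.0, 0.0])  # placeholder features for the best action at D
    print(q_update(w, f_CD, reward_CD, best_next, alpha=0.5))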