Question: (15 points) Consider the following Markov Decision Process. Unlike most MDP models, where actions have many potential outcomes of varying probability, assume that the transitions are deterministic, i.e. each transition occurs with probability T(s, a, s') = 1, as shown in the graph below. The reward for each transition is displayed adjacent to each edge:
(1) Compute the following six quantities: V2(A), V*(A), V2(B), V*(B), V5(A), V5(B), assuming a discount factor of 1. Please also show your computation. (6 points) (A value-iteration sketch illustrating these quantities follows part (4) below.)
(2) Re-compute the two quantities V*(A) and V*(B), assuming a discount factor of 0.9. Please also show your computation. (4 points)
(3) Consider using feature-based Q-learning to learn the optimal policy. Suppose we use feature f1(s,a), equal to the smallest number of edges from s to E, and feature f2(s,a), equal to the smallest number of edges from a's target state to E. For example, f1(A,B) = 2 and f2(C,D) = 1. Let the current weights w for these features be w1 = 3 and w2 = 2. What is the current prediction for Q(C,D) given these weights and these two features? (2 points)
(4) Suppose we update w based on a single observed transition from C to D, assuming a learning rate of 0.5. What is w after updating on this observation? (3 points) (A sketch of this weight update also follows below.)
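
To make the computations in parts (1) and (2) concrete, here is a minimal value-iteration sketch for a deterministic MDP, in Python. The transition graph from the figure is not reproduced in this text, so the example graph, its rewards, and the printed values are illustrative placeholders rather than the answer; to use the sketch, fill in the transitions dictionary from the figure.

    def value_iteration(transitions, gamma, iterations, terminal=("E",)):
        """Return [V_0, V_1, ..., V_k]: the value estimates after each sweep,
        starting from V_0(s) = 0. transitions[s] maps each action available in s
        to a (next_state, reward) pair, since the MDP is deterministic."""
        states = set(transitions) | set(terminal)
        V = {s: 0.0 for s in states}
        history = [dict(V)]
        for _ in range(iterations):
            new_V = {}
            for s in states:
                if s in terminal or not transitions.get(s):
                    new_V[s] = 0.0  # terminal state: no outgoing transitions
                else:
                    # Deterministic transitions, so Q(s, a) = R(s, a, s') + gamma * V(s')
                    new_V[s] = max(r + gamma * V[s2] for (s2, r) in transitions[s].values())
            V = new_V
            history.append(dict(V))
        return history

    # Placeholder graph (NOT the one in the question), just to show the call pattern:
    example = {
        "A": {"toB": ("B", 2.0)},
        "B": {"toC": ("C", 1.0), "toE": ("E", 4.0)},
        "C": {"toD": ("D", 3.0)},
        "D": {"toE": ("E", 1.0)},
    }
    hist = value_iteration(example, gamma=1.0, iterations=5)
    print(hist[2]["A"], hist[5]["A"])  # V2(A) and V5(A) for the placeholder graph

For V*(A) and V*(B), keep sweeping until the values stop changing; if the graph is acyclic, they converge after a number of sweeps equal to the longest path length to E. Part (2) only changes the discount factor passed as gamma.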
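
For parts (3) and (4), the standard linear (feature-based) Q-learning formulation predicts Q(s,a) = w1*f1(s,a) + w2*f2(s,a) and, after observing a transition (s, a, r, s'), updates each weight by w_i <- w_i + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)] * f_i(s, a). A sketch of both steps follows; the observed reward for C -> D and some feature values depend on the graph, so the numbers marked as placeholders are assumptions, not the answer.

    def q_value(weights, features):
        # Linear Q approximation: Q(s, a) = sum_i w_i * f_i(s, a)
        return sum(w * f for w, f in zip(weights, features))

    def q_update(weights, features, reward, best_next_q, alpha, gamma=1.0):
        # One approximate Q-learning step:
        #   difference = (r + gamma * max_a' Q(s', a')) - Q(s, a)
        #   w_i       <- w_i + alpha * difference * f_i(s, a)
        difference = (reward + gamma * best_next_q) - q_value(weights, features)
        return [w + alpha * difference * f for w, f in zip(weights, features)]

    # Part (3): prediction for Q(C, D) with w = (3, 2). f2(C, D) = 1 is given in the
    # question; f1(C, D), the distance from C to E, must be read off the graph, so
    # the value 2 used here is a placeholder assumption.
    w = [3.0, 2.0]
    f_CD = [2.0, 1.0]
    print(q_value(w, f_CD))  # 3*2 + 2*1 = 8 under the placeholder f1(C, D) = 2

    # Part (4): one update after observing the C -> D transition with alpha = 0.5.
    # The observed reward and the features of the best action from D are placeholders.
    reward_CD = 1.0
    best_next = q_value(w, [1.0, 0.0])  # placeholder features for the best action at D
    print(q_update(w, f_CD, reward_CD, best_next, alpha=0.5))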