
Question

1. Consider the following Markov decision process, with the gridworld and transition function as illustrated
below. The states are grid squares, identified by their row and column number (row first). The agent
always starts in state (1,1), marked with the letter S. There are two terminal goal states: (2,3) with reward
+5 and (1,3) with reward -5. Rewards are 0 in non-terminal states. (The reward for a state is received
as the agent moves into the state.) The transition function is such that the intended agent movement
(North, South, West, or East) happens with probability 0.8. With probability 0.1 each, the agent ends up
in one of the states perpendicular to the intended direction. If a collision with a wall happens, the agent
stays in the same state.
[Figure: (a) Gridworld MDP. (b) Transition function.]
(a) Draw the optimal policy for this grid.
(b) Suppose the agent knows the transition probabilities. Give the first two rounds of value iteration
updates for each state, with a discount of 0.9. (Assume V0 is 0 everywhere and compute Vi for
i = 1, 2.)
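For reference, here is a minimal Python sketch of how value iteration could be run on this gridworld. The 2-row by 3-column layout, the convention that "North" decreases the row index, and treating the terminal states as absorbing with value 0 are assumptions read off the description above, not details fixed by the question itself.

```python
# A minimal value-iteration sketch for the gridworld described above.
# Assumptions: the grid is 2 rows x 3 columns, "North" decreases the row
# index, and terminal states are absorbing with value 0 (their reward is
# collected on the transition into them).

GAMMA = 0.9
ROWS, COLS = 2, 3
REWARD_ON_ENTRY = {(1, 3): -5.0, (2, 3): +5.0}   # terminal rewards
ACTIONS = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
PERPENDICULAR = {"N": ("W", "E"), "S": ("W", "E"), "W": ("N", "S"), "E": ("N", "S")}

def move(state, action):
    """Deterministic displacement; a collision with a wall leaves the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 1 <= nr <= ROWS and 1 <= nc <= COLS else state

def transitions(state, action):
    """(probability, next_state) pairs: 0.8 intended, 0.1 for each perpendicular move."""
    left, right = PERPENDICULAR[action]
    return [(0.8, move(state, action)),
            (0.1, move(state, left)),
            (0.1, move(state, right))]

def value_iteration(sweeps):
    V = {(r, c): 0.0 for r in range(1, ROWS + 1) for c in range(1, COLS + 1)}
    for _ in range(sweeps):
        V_new = dict(V)
        for s in V:
            if s in REWARD_ON_ENTRY:
                continue                     # terminal states keep value 0
            V_new[s] = max(
                sum(p * (REWARD_ON_ENTRY.get(s2, 0.0) + GAMMA * V[s2])
                    for p, s2 in transitions(s, a))
                for a in ACTIONS)
        V = V_new
    return V

print(value_iteration(sweeps=1))   # V1 for every state
print(value_iteration(sweeps=2))   # V2 for every state
```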
(c) Suppose the agent does not know the transition probabilities. What does it need (or must it have
available) in order to learn the optimal policy?
(d) The agent starts with the policy that always chooses to go right, and executes the following three
trials:
(1,1)-(1,2)-(1,3),
(1,1)-(1,2)-(2,2)-(2,3), and
(1,1)-(2,1)-(2,2)-(2,3).
What are the Monte Carlo (direct utility) estimates for states (1,1) and (2,2), given these traces? (4)
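A small sketch of how the direct-utility (Monte Carlo) estimates could be computed from these three traces. It assumes the -5/+5 rewards are collected on entering (1,3)/(2,3) and are 0 otherwise, and it reuses the discount of 0.9 from part (b), which the question does not restate here.

```python
# Monte Carlo (direct utility) estimates from the three recorded trials.
# Assumptions: rewards of -5 / +5 on entering (1,3) / (2,3), 0 otherwise,
# and the same discount of 0.9 as in part (b).

GAMMA = 0.9
REWARD_ON_ENTRY = {(1, 3): -5.0, (2, 3): +5.0}

trials = [
    [(1, 1), (1, 2), (1, 3)],
    [(1, 1), (1, 2), (2, 2), (2, 3)],
    [(1, 1), (2, 1), (2, 2), (2, 3)],
]

observed = {}                                  # state -> list of sampled returns
for trial in trials:
    for i, state in enumerate(trial[:-1]):     # the terminal itself gets no return sample
        g, discount = 0.0, 1.0
        for nxt in trial[i + 1:]:              # discounted reward-to-go from this state
            g += discount * REWARD_ON_ENTRY.get(nxt, 0.0)
            discount *= GAMMA
        observed.setdefault(state, []).append(g)

estimates = {s: sum(gs) / len(gs) for s, gs in observed.items()}
print(estimates[(1, 1)], estimates[(2, 2)])
```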
(e) Using a learning rate of 0.1 and assuming initial values of 0, what updates does the TD-learning agent
make after trials 1 and 2, above? First give the TD-learning update equation, and then provide the
updates after the two trials.
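A hedged sketch of the corresponding TD(0) processing of trials 1 and 2, using the update V(s) <- V(s) + α(r + γ·V(s') − V(s)) with α = 0.1, γ = 0.9, and all values initialised to 0; keeping the terminal states' values fixed at 0 is an assumption.

```python
# TD(0) updates applied along trials 1 and 2.
# Assumptions: alpha = 0.1, gamma = 0.9, all values start at 0, and the
# values of the terminal states (1,3) and (2,3) stay 0.

ALPHA, GAMMA = 0.1, 0.9
REWARD_ON_ENTRY = {(1, 3): -5.0, (2, 3): +5.0}
TERMINALS = set(REWARD_ON_ENTRY)

V = {}                                         # unseen states default to 0

def td_update(trial):
    for s, s_next in zip(trial, trial[1:]):
        r = REWARD_ON_ENTRY.get(s_next, 0.0)   # reward collected on entering s_next
        v_next = 0.0 if s_next in TERMINALS else V.get(s_next, 0.0)
        v = V.get(s, 0.0)
        V[s] = v + ALPHA * (r + GAMMA * v_next - v)

td_update([(1, 1), (1, 2), (1, 3)])            # trial 1
td_update([(1, 1), (1, 2), (2, 2), (2, 3)])    # trial 2
print(V)                                       # values after the two trials
```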
Consider the MDP below, in which there are two states, x and Y, two actions, right and left, and the
deterministic rewards on each transition are as indicated by the numbers. Note that if action right is
taken in state x, then the transition may be either to x with a reward of +2 or to Y with a reward of -2.
These two possibilities occur with probabilities 2/3 (for the transition to x) and 1/3 (for the transition
to state Y).
Consider two deterministic policies:
π1(x) = left, π1(Y) = right
π2(x) = right, π2(Y) = right
(a) Show a typical trajectory for policy 1 from state x.
(b) Show a typical trajectory for policy 2 from state x.
(c) Assuming γ = 0.5, the value of state Y under policy π1 is:
(d) Assuming γ = 0.5, the action-value of (x, left) under policy π1 is:
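As a starting point for parts (c) and (d), the relevant Bellman equations can be written out as below. This is only a sketch: the rewards for the left action and for right in state Y appear in the figure rather than in the text, so only the (x, right) transition is expanded using the stated +2/-2 rewards and 2/3, 1/3 probabilities.

```latex
% Bellman equation for a deterministic policy pi with gamma = 0.5, plus the
% one action-value the question text fully specifies (action right in state x).
\begin{align*}
  V^{\pi}(s) &= \sum_{s'} p(s' \mid s, \pi(s))\,\bigl[r(s, \pi(s), s') + \gamma\,V^{\pi}(s')\bigr],
  \qquad \gamma = 0.5 \\
  Q^{\pi}(x, \text{right}) &= \tfrac{2}{3}\bigl[\,2 + \gamma\,V^{\pi}(x)\bigr]
  + \tfrac{1}{3}\bigl[-2 + \gamma\,V^{\pi}(Y)\bigr]
\end{align*}
```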
The plot below shows the target value per time step as a grey line. Apply the equation below
to determine the estimates for time steps 2 to 13. Draw your answers on the graph below, where the first
value is provided in blue. Show all your calculations.
Qn+1 = Qn + α[Rn - Qn]
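A small sketch of how this exponential recency-weighted update would be applied step by step. The target values Rn come from the grey line in the plot, which is not reproduced here, so both the target sequence and the step size below are placeholders.

```python
# Iterating Q_{n+1} = Q_n + alpha * (R_n - Q_n) over a sequence of targets.
# Both the step size and the target sequence are placeholders; the real
# values come from the plot referred to in the question.

ALPHA = 0.1                                   # placeholder step size
targets = [1.0, 3.0, 2.0, 4.0, 3.0]           # placeholder R_n values read off the grey line

q = targets[0]                                # placeholder for the first estimate (shown in blue)
estimates = [q]
for r in targets[1:]:
    q = q + ALPHA * (r - q)                   # move a fraction alpha toward the new target
    estimates.append(q)
print(estimates)                              # estimate at each time step, from the given first value
```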
