Answered step by step

Verified Expert Solution

Link Copied!

Question

1 Approved Answer

Posted on Sep 26, 2024

python help: Part 2 : Comparing Control Methods I In Part 2 , you will compare the performance of Monte Carlo control and Q -

python help:

Part

2

: Comparing Control Methods I

In Part

2,

you will compare the performance of Monte Carlo control and Q

-

learning by running both algorithms on a small environment and then analyzing their progress toward finding the optimal policy and optimal state value function. For the sake of comparison, we will use value iteration to find the optimal policy.

________________________________________

2 .

-

Value Iteration

Create a

3

3

instance of the FrozenPlatform environment with sp

_

range

= [0.2, 0.6],

a start position of

1 (

which is the default

),

no holes, and with random

_

state

= 1 .

Create an instance of the DPAgent class with gamma

= 1

and random

_

state

= 1

and use it to run value iteration to find the optimal policy for the environment.

Display the environment, setting fill to shade the the cells according to their value under the optimal policy, and setting contents to display the optimal policy.

________________________________________

[]

________________________________________

2 .

-

MC Control

Create an instance of MCAgent class for the environment created in Step

2 .

,

setting gamma

= 1

and random

_

state

= 1 .

Do NOT set a policy for the agent, instead allowing the initial policy to be randomly generated.

Run Monte Carlo control with

20, 000

episodes, setting epsilon

= 0.1

and alpha

= 0.001 .

Then calculate the mean absolute difference between the optimal state

-

value function found by value iteration and the current Monte Carlo estimate. Print the following message with the blank filled in with the appropriate value, rounded to

2

decimal places

The mean absolute difference in V is

____.

________________________________________

[]

________________________________________

2 .

-

Display the Policy

Display the environment from

2 .

,

setting fill to shade the the cells according to their value under the policy found by MC control, and set contents to display that policy.

________________________________________

[]

________________________________________

2 .

-

History Plot

Replace the first blank in the cell below the instance of MCAgent created in

2 .

.

Set the target parameter of the method to be equal to the state

-

value function for the optimal policy

(

as found by value iteration

)

and then run this cell to show the history plot for the MC estimate of the state value function.

________________________________________

[]

________________________________________

2 .

-

-

Learning

Create an instance of the TDAgent class for the environment created in Step

2 .

,

setting gamma

= 1

and random

_

state

= 1 .

Do NOT set a policy for the agent, instead allowing the initial policy to be randomly generated.

Run Q

-

learning with

20, 000

episodes, setting epsilon

= 0.1

and alpha

= 0.001 .

Then calculate the mean absolute difference between the optimal state

-

value function found by value iteration and the current Q

-

learning estimate. Print the following message with the blank filled in with the appropriate value, rounded to

2

decimal places.

The mean absolute difference in V is

____.

________________________________________

[]

________________________________________

2 .

-

Display the Policy

Display the environment from

2 .

,

setting fill to shade the the cells according to their value under the policy found by Q

-

Learning, and set contents to display that policy.

________________________________________

________________________________________

2 .

-

History Plot

Replace the first blank in the cell below the instance of TDAgent created in

2 .

.

Set the target parameter of the method to be equal to the state

-

value function for the optimal policy

(

as found by value iteration

)

and then run this cell to show the history plot for the Q

-

Learning estimate of the state value function.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

Step: 3

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Oracle Database 11g SQL

Authors: Jason Price

1st Edition

★★★★★

Provide two examples of a One-To-Many relationship.

Answered: 1 week ago

Previous Question Next Question