Python: Description of Part III of the Project
In Part III of the project, you will train a Q-learning agent to play Nim. The agent will be trained by playing thousands of games against a RandomPlayer agent, but will eventually be able to consistently defeat the stronger MinimaxPlayer agents.
Define Nim Class
Copy the definition for the Nim class from Part I of the project into the cell below.
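The Nim class itself comes from Part I and is not reproduced in this question. The skeleton below is only a hypothetical reminder of its constructor interface, inferred from the usage shown later in this notebook; copy your real Part I definition in its place.
# Hypothetical skeleton only; replace with your actual Part I definition.
class Nim:
    def __init__(self, piles=3, stones=9, limit=5):
        self.piles = piles    # number of piles
        self.stones = stones  # stones in each pile at the start
        self.limit = limit    # maximum stones removable in one action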
New Classes
You will work with two new classes in this notebook. These are named BotPlayerEnv and PolicyPlayer. These classes are described below.
The BotPlayerEnv class provides an interface that can be used with reinforcement learning algorithms to train agents to play games by having them compete against a "bot player" controlled by an adversarial search agent (such as RandomPlayer). An instance of BotPlayerEnv combines an instance of a game environment with an instance of an adversarial agent to create an environment that can be used with our RL algorithms. When an action is taken in this environment, BotPlayerEnv applies that action and then generates and applies an action for the bot player. The code block below demonstrates how to create an instance of BotPlayerEnv and how to use it with an instance of TDAgent.
nim = Nim(piles=3, stones=9, limit=5)            # 3 piles, 9 stones each, at most 5 per action
bot = RandomPlayer('Bot')                        # adversarial agent controlling the bot player
bot_env = BotPlayerEnv(game_env=nim, agent=bot)  # environment that plays the bot's moves automatically
td = TDAgent(bot_env, gamma=1, random_state=1)   # TD learning agent for this environment
An instance of the PolicyPlayer class represents a game-playing agent that follows a fixed policy mapping game states to actions. We will use this class to create agents that follow the policies learned by the Q-learning algorithm. Note that since PolicyPlayer agents perform a simple lookup rather than a search when selecting actions, they always move very quickly: running the Q-learning algorithm to learn the policy may take a significant amount of time, but once the policy is learned, the agent plays almost instantly.
The code block below demonstrates how to create an instance of PolicyPlayer.
p1 = PolicyPlayer('Policy Player', policy=some_policy)
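Here some_policy stands for a previously learned policy. For illustration only, it can be thought of as a dictionary mapping game states to actions; the state and action encodings below are assumptions for this sketch, not the course's actual representation, which is determined by the Nim class from Part I.
# Hypothetical policy: maps a state (a tuple of pile sizes) to an action
# (a (pile_index, stones_to_remove) pair). Both encodings are assumptions.
some_policy = {
    (9, 9, 9): (0, 5),  # from the opening state, take 5 stones from pile 0
    (4, 9, 9): (1, 5),  # ... one entry for each reachable state
}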
Part 1: Basic Q-Learning Agent
In Part 1, we will use Q-learning to learn a policy for playing Nim. The policy will be learned by having the Q-learning algorithm play many games against a RandomPlayer agent, and it will be tested by playing against RandomPlayer and MinimaxPlayer agents. Our eventual goal is to find a policy that consistently defeats a MinimaxPlayer agent with depth 4.
1.A - Training the Agent
Create the following objects:
An instance of Nim with 3 piles, 9 stones per pile, and with a limit of 5 stones per action.
An instance of RandomPlayer.
An instance of BotPlayerEnv using the Nim and RandomPlayer instances you created above.
An instance of TDAgent that uses your instance of BotPlayerEnv. Set gamma=1 and random_state=1.
After creating the objects above, use your TDAgent instance to apply Q-learning to learn a policy for Nim. Run 10,000 episodes of Q-learning with an exploration rate of 0.1 and a learning rate of 0.1. Also set track_history=False when calling the q_learning() method; this significantly reduces the memory requirements of the algorithm.
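A minimal sketch of the 1.A setup and training call, assuming the q_learning() method takes the episode count, exploration rate, and learning rate under the keyword names episodes, epsilon, and alpha; check the TDAgent documentation for the actual signature.
nim = Nim(piles=3, stones=9, limit=5)            # 3 piles, 9 stones each, limit 5
bot = RandomPlayer('Bot')                        # training opponent
bot_env = BotPlayerEnv(game_env=nim, agent=bot)  # RL environment wrapping game + bot
td = TDAgent(bot_env, gamma=1, random_state=1)
# Keyword names episodes/epsilon/alpha are assumptions about the signature.
td.q_learning(episodes=10000, epsilon=0.1, alpha=0.1, track_history=False)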
1.B - Create Agents
Create the following agents (a setup sketch follows this list):
A PolicyPlayer instance using the policy found by Q-learning.
A RandomPlayer instance.
A MinimaxPlayer instance with depth=2.
A MinimaxPlayer instance with depth=3.
A MinimaxPlayer instance with depth=4.
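A sketch of 1.B under two assumptions: that the learned policy is exposed as an attribute such as td.policy (how TDAgent actually stores it is not shown in this question), and that MinimaxPlayer takes a name and a depth keyword, consistent with the other agent constructors.
policy_agent = PolicyPlayer('Policy Player', policy=td.policy)  # td.policy is assumed
random_agent = RandomPlayer('Random')
mm2 = MinimaxPlayer('Minimax-2', depth=2)  # constructor arguments are assumptions
mm3 = MinimaxPlayer('Minimax-3', depth=3)
mm4 = MinimaxPlayer('Minimax-4', depth=4)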
1.C - Versus RandomPlayer
Run a 1000-round tournament between the PolicyPlayer agent and the RandomPlayer agent. Set random_state=1. When creating the agent list, please list the PolicyPlayer agent first.
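A sketch of 1.C using a hypothetical play_tournament helper, since the course's actual tournament function is not shown in this question; substitute the real one. Sections 1.D through 1.F follow the same pattern with mm2, mm3, and mm4 as the second agent.
# play_tournament is a hypothetical stand-in for the course's tournament
# function. The PolicyPlayer agent is listed first, as required.
results = play_tournament(game=nim, agents=[policy_agent, random_agent],
                          rounds=1000, random_state=1)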
1.D - Versus Minimax(2)
Run a 1000-round tournament between the PolicyPlayer agent and the MinimaxPlayer agent with depth=2. Set random_state=1. When creating the agent list, please list the PolicyPlayer agent first.
1.E - Versus Minimax(3)
Run a 1000-round tournament between the PolicyPlayer agent and the MinimaxPlayer agent with depth=3. Set random_state=1. When creating the agent list, please list the PolicyPlayer agent first.
1.F - Versus Minimax(4)
Run a 1000-round tournament between the PolicyPlayer agent and the MinimaxPlayer agent with depth=4. Set random_state=1. When creating the agent list, please list the PolicyPlayer agent first.
1.G - Summarizing Results
Indicate the win rates for the PolicyPlayer agent by filling in each of the blanks below. Provide your answers as percentages rounded to 1 decimal place.
The policy player won:
____% of games played against the RandomPlayer agent.
____% of games played against the MinimaxPlayer agent with depth 2.
____% of games played against the MinimaxPlayer agent with depth 3.
____% of games played against the MinimaxPlayer agent with depth 4.
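To compute the requested percentages, the small helper below is a self-contained sketch. It assumes tournament results can be reduced to a list of winner names, which is an assumption about the course API, and the example data is a placeholder, not an actual result.
def win_rate(winners, name):
    # Percentage of rounds won by `name`, rounded to 1 decimal place.
    return round(100 * winners.count(name) / len(winners), 1)

winners = ['Policy Player', 'Random', 'Policy Player', 'Policy Player']  # placeholder data
print(win_rate(winners, 'Policy Player'))  # 75.0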
