Python: Description of Part III of the Project
In Part III of the project, you will train a Q-learning agent to play Nim. The agent will be trained by playing thousands of games against a RandomPlayer agent, but will eventually be able to consistently defeat the stronger MinimaxPlayer agents.
Define Nim Class
Copy the definition for the Nim class from Part I of the project into the cell below.
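The Nim class itself comes from Part I and is not reproduced in this question. The skeleton below is only a hypothetical reminder of its constructor interface, inferred from the usage shown later in this notebook; copy your real Part I definition in its place.
# Hypothetical skeleton only; replace with your actual Part I definition.
class Nim:
    def __init__(self, piles=3, stones=9, limit=5):
        self.piles = piles    # number of piles
        self.stones = stones  # stones in each pile at the start
        self.limit = limit    # maximum stones removable in one action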
New Classes
You will work with two new classes in this notebook. These are named BotPlayerEnv and PolicyPlayer. These classes are described below.
The BotPlayerEnv class provides an interface that can be used with reinforcement learning algorithms to train agents to play games by having them compete against a "bot player" controlled by an adversarial search agent (such as RandomPlayer). An instance of BotPlayerEnv combines an instance of a game environment with an instance of an adversarial agent to create an environment that can be used with our RL algorithms. When an action is taken in this environment, BotPlayerEnv applies that action and then generates and applies an action for the bot player. The code block below demonstrates how to create an instance of BotPlayerEnv and how to use it with an instance of TDAgent.
nim = Nim(piles=3, stones=9, limit=5)            # 3 piles, 9 stones each, at most 5 per action
bot = RandomPlayer('Bot')                        # adversarial agent controlling the bot player
bot_env = BotPlayerEnv(game_env=nim, agent=bot)  # environment that plays the bot's moves automatically
td = TDAgent(bot_env, gamma=1, random_state=1)   # TD learning agent for this environment
An instance of the PolicyPlayer class represents a game-playing agent that follows a fixed policy mapping game states to actions. We will use this class to create agents that follow the policies learned by the Q-learning algorithm. Note that since PolicyPlayer agents perform a simple lookup rather than a search when selecting actions, they always move very quickly: running the Q-learning algorithm to learn the policy may take a significant amount of time, but once the policy is learned, the agent plays almost instantly.
The code block below demonstrates how to create an instance of PolicyPlayer.
p1 = PolicyPlayer('Policy Player', policy=some_policy)
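Here some_policy stands for a previously learned policy. For illustration only, it can be thought of as a dictionary mapping game states to actions; the state and action encodings below are assumptions for this sketch, not the course's actual representation, which is determined by the Nim class from Part I.
# Hypothetical policy: maps a state (a tuple of pile sizes) to an action
# (a (pile_index, stones_to_remove) pair). Both encodings are assumptions.
some_policy = {
    (9, 9, 9): (0, 5),  # from the opening state, take 5 stones from pile 0
    (4, 9, 9): (1, 5),  # ... one entry for each reachable state
}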
Part 1: Basic Q-Learning Agent
In Part 1, we will use Q-learning to learn a policy for playing Nim. The policy will be learned by having the Q-learning algorithm play many games against a RandomPlayer agent, and it will be tested by playing against RandomPlayer and MinimaxPlayer agents. Our eventual goal is to find a policy that consistently defeats a MinimaxPlayer agent with depth 4.
1.A - Training the Agent
Create the following objects:
An instance of Nim with 3 piles, 9 stones per pile, and with a limit of 5 stones per action.
An instance of RandomPlayer.
An instance of BotPlayerEnv using the Nim and RandomPlayer instances you created above.
An instance of TDAgent that uses your instance of BotPlayerEnv. Set gamma=1 and random_state=1.
After creating the objects above, use your TDAgent instance to apply Q-learning to learn a policy for Nim. Run 10,000 episodes of Q-learning with an exploration rate of 0.1 and a learning rate of 0.1. Also set track_history=False when calling the q_learning() method; this significantly reduces the memory requirements of the algorithm.
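A minimal sketch of the 1.A setup and training call, assuming the q_learning() method takes the episode count, exploration rate, and learning rate under the keyword names episodes, epsilon, and alpha; check the TDAgent documentation for the actual signature.
nim = Nim(piles=3, stones=9, limit=5)            # 3 piles, 9 stones each, limit 5
bot = RandomPlayer('Bot')                        # training opponent
bot_env = BotPlayerEnv(game_env=nim, agent=bot)  # RL environment wrapping game + bot
td = TDAgent(bot_env, gamma=1, random_state=1)
# Keyword names episodes/epsilon/alpha are assumptions about the signature.
td.q_learning(episodes=10000, epsilon=0.1, alpha=0.1, track_history=False)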
1.B - Create Agents
Create the following agents (a setup sketch follows this list):
A PolicyPlayer instance using the policy found by Q-learning.
A RandomPlayer instance.
A MinimaxPlayer instance with depth=2.
A MinimaxPlayer instance with depth=3.
A MinimaxPlayer instance with depth=4.
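A sketch of 1.B under two assumptions: that the learned policy is exposed as an attribute such as td.policy (how TDAgent actually stores it is not shown in this question), and that MinimaxPlayer takes a name and a depth keyword, consistent with the other agent constructors.
policy_agent = PolicyPlayer('Policy Player', policy=td.policy)  # td.policy is assumed
random_agent = RandomPlayer('Random')
mm2 = MinimaxPlayer('Minimax-2', depth=2)  # constructor arguments are assumptions
mm3 = MinimaxPlayer('Minimax-3', depth=3)
mm4 = MinimaxPlayer('Minimax-4', depth=4)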
1.C - Versus RandomPlayer
Run a 1000-round tournament between the PolicyPlayer agent and the RandomPlayer agent. Set random_state=1. When creating the agent list, please list the PolicyPlayer agent first.
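A sketch of 1.C using a hypothetical play_tournament helper, since the course's actual tournament function is not shown in this question; substitute the real one. Sections 1.D through 1.F follow the same pattern with mm2, mm3, and mm4 as the second agent.
# play_tournament is a hypothetical stand-in for the course's tournament
# function. The PolicyPlayer agent is listed first, as required.
results = play_tournament(game=nim, agents=[policy_agent, random_agent],
                          rounds=1000, random_state=1)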
1.D - Versus Minimax(2)
Run a 1000-round tournament between the PolicyPlayer agent and the MinimaxPlayer agent with depth=2. Set random_state=1. When creating the agent list, please list the PolicyPlayer agent first.
1.E - Versus Minimax(3)
Run a 1000-round tournament between the PolicyPlayer agent and the MinimaxPlayer agent with depth=3. Set random_state=1. When creating the agent list, please list the PolicyPlayer agent first.
1.F - Versus Minimax(4)
Run a 1000-round tournament between the PolicyPlayer agent and the MinimaxPlayer agent with depth=4. Set random_state=1. When creating the agent list, please list the PolicyPlayer agent first.
1.G - Summarizing Results
Indicate the win rates for the PolicyPlayer agent by filling in each of the blanks below. Provide your answers as percentages rounded to 1 decimal place.
The policy player won:
____% of games played against the RandomPlayer agent.
____% of games played against the MinimaxPlayer agent with depth 2.
____% of games played against the MinimaxPlayer agent with depth 3.
____% of games played against the MinimaxPlayer agent with depth 4.
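To compute the requested percentages, the small helper below is a self-contained sketch. It assumes tournament results can be reduced to a list of winner names, which is an assumption about the course API, and the example data is a placeholder, not an actual result.
def win_rate(winners, name):
    # Percentage of rounds won by `name`, rounded to 1 decimal place.
    return round(100 * winners.count(name) / len(winners), 1)

winners = ['Policy Player', 'Random', 'Policy Player', 'Policy Player']  # placeholder data
print(win_rate(winners, 'Policy Player'))  # 75.0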
