
Question


Problem Statement
Develop a reinforcement learning agent using dynamic programming methods to solve the Dice game optimally. The agent will learn the optimal policy by iteratively evaluating and improving its strategy based on the state-value function and the Bellman equations.
Scenario:
A player rolls a 6-sided die with the objective of reaching a score of exactly 100. On each turn, the player can choose to stop and keep their current score or continue rolling the die. If the player rolls a 1, they lose all points accumulated in that turn and the turn ends. If the player rolls any other number (2-6), that number is added to their score for that turn. The game ends when the player decides to stop and keep their score OR when the player's score reaches 100. The player wins if they reach a score of exactly 100, and loses if they roll a 1 when their score is below 100.
Environment Details
The environment consists of a player who can choose to either roll a 6-sided die or stop at any point. The player starts with an initial score (e.g., 0) and aims to reach a score of exactly 100. If the player rolls a 1, they lose all points accumulated in that turn and the turn ends. If they roll any other number (2-6), that number is added to their score for that turn. The goal is to accumulate a total of exactly 100 points to win, or to stop the game before reaching 100 points.
States
State s: Represents the current score of the player, ranging from 0 to 100.
Terminal States:
State s = 100: Represents the player winning the game by reaching the goal of 100 points.
State s = 0: Represents the player losing all points accumulated in the turn due to rolling a 1.
Actions
Action a: Represents the decision to either "roll" the die or "stop" the game at the current score.
The possible actions in any state s are either "roll" or "stop".
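In code, the state and action spaces above could be represented as follows (the names are placeholders, not part of the assignment):

```python
GOAL = 100
STATES = list(range(GOAL + 1))     # current score: 0..100
ACTIONS = ["roll", "stop"]
TERMINAL_STATES = {0, GOAL}        # 0 = bust/loss, 100 = win
```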
Outcomes:
1. Use dynamic programming methods (value iteration, policy evaluation, and policy improvement) to find the optimal policy for the Dice Game.
2. Implement an epsilon-greedy policy for action selection during training to balance exploration and exploitation.
3. Evaluate the agent's performance in terms of the probability of reaching exactly 100 points after learning the optimal policy.
4. Use the agent's policy as the best strategy for different betting scenarios within the problem.
5. Following is the comment given for Value-Iteration Function -
# Iterate over all states except terminal states until convergence.
# Calculate expected returns V(s) for the current policy by considering all possible actions.
# If action is 'stop':
#   Calculate the reward for stopping and append it to rewards.
# If action is 'roll':
#   For each possible roll outcome (1 to 6), determine next_s based on the roll.
#   Update V(s) using the Bellman equation.
# Determine max_reward from rewards.
# With probability epsilon, randomly choose a reward from rewards instead.
# Check convergence: stop when delta is less than a small threshold.
#-----write your code below this line---------
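One possible way to fill in these comments is sketched below. The reward scheme is an assumption (stopping keeps the current score, rolling a 1 moves to the losing terminal state 0, and rolls past 100 are capped at 100), and all names (`GOAL`, `THETA`, `value_iteration`) are placeholders rather than given specifications:

```python
import random

GOAL = 100          # winning terminal state
GAMMA = 1.0         # undiscounted episodic task (assumption)
THETA = 1e-6        # convergence threshold

def value_iteration(epsilon=0.0):
    V = [0.0] * (GOAL + 1)
    V[GOAL] = float(GOAL)                  # value of the winning state
    while True:
        delta = 0.0
        # Iterate over all states except terminal states until convergence
        for s in range(1, GOAL):
            rewards = []
            # If action is 'stop': the reward is keeping the current score
            rewards.append(float(s))
            # If action is 'roll': expectation over the 6 roll outcomes
            expected = 0.0
            for roll in range(1, 7):
                # Determine next_s based on roll (a 1 busts to state 0)
                next_s = 0 if roll == 1 else min(s + roll, GOAL)
                expected += (1.0 / 6.0) * GAMMA * V[next_s]
            rewards.append(expected)
            # With probability epsilon, randomly choose a reward (per the
            # comments); epsilon=0 gives standard value iteration
            if epsilon > 0 and random.random() < epsilon:
                new_v = random.choice(rewards)
            else:
                new_v = max(rewards)       # determine max_reward from rewards
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v                   # Bellman update
        # Check convergence against the threshold
        if delta < THETA:
            return V
```

Because rolling always either increases the score or busts to a terminal state, the values propagate backward from 100 and the sweep converges in a bounded number of iterations.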
6. Following is the comment given for Policy Iteration -
# For each state, store old_policy of state s.
# Determine best_action based on the maximum expected reward; update policy[s] to best_action.
# Return stable when old_policy == policy[s] for every state.
#-----write your code below this line---------
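A minimal sketch of policy iteration matching these comments, under the same assumed reward scheme as above (stopping keeps the score, rolling a 1 busts to state 0, rolls are capped at 100); `q_value` and the constants are placeholder names:

```python
GOAL = 100
THETA = 1e-6

def q_value(s, a, V):
    """Expected return of action a in state s (assumed reward scheme)."""
    if a == "stop":
        return float(s)
    # 'roll': average over the six equally likely outcomes
    expected = 0.0
    for roll in range(1, 7):
        next_s = 0 if roll == 1 else min(s + roll, GOAL)
        expected += V[next_s] / 6.0
    return expected

def policy_iteration():
    # Terminal values are fixed: losing state 0, winning state 100
    V = [0.0] * (GOAL + 1)
    V[GOAL] = float(GOAL)
    policy = {s: "stop" for s in range(1, GOAL)}
    stable = False
    while not stable:
        # Policy evaluation: iterate until the value function settles
        while True:
            delta = 0.0
            for s in range(1, GOAL):
                v = q_value(s, policy[s], V)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < THETA:
                break
        # Policy improvement: store old_policy, update to best_action
        stable = True
        for s in range(1, GOAL):
            old_policy = policy[s]
            best_action = max(("roll", "stop"),
                              key=lambda a: q_value(s, a, V))
            policy[s] = best_action
            if old_policy != policy[s]:
                stable = False           # keep iterating until stable
    return policy, V
```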
7. Comment given for Execute Policy Iteration & Value Iteration :
# Simulate the game for 100 episodes, using the learned policy to select actions.
# When the action is 'roll', randomly generate a number to determine the reward.
# When the action is 'stop', collect the corresponding reward.
# Determine the total cumulative reward.
#-----write your code below this line---------
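These comments could be filled in roughly as follows. The policy passed in maps a score to "roll" or "stop"; the threshold policy at the bottom is purely hypothetical, used only to illustrate the call, and the bust-loses-everything rule is the same assumption as in the earlier sketches:

```python
import random

GOAL = 100

def simulate(policy, episodes=100, seed=0):
    """Run episodes under a state -> action policy; return the average
    reward and the fraction of episodes that reach exactly 100 (a win)."""
    rng = random.Random(seed)
    total = 0.0
    wins = 0
    for _ in range(episodes):
        s = 0
        while True:
            if s >= GOAL:                  # reached the goal: win
                total += GOAL
                wins += 1
                break
            # Use the learned policy to get the action
            if policy.get(s, "stop") == "stop":
                total += s                 # keep the current score
                break
            roll = rng.randint(1, 6)       # 'roll': randomly generate a number
            if roll == 1:                  # bust: lose the accumulated points
                break
            s = min(s + roll, GOAL)
    return total / episodes, wins / episodes

# Hypothetical threshold policy for illustration only:
# roll while the score is below 80, then stop.
threshold_policy = {s: ("roll" if s < 80 else "stop") for s in range(GOAL + 1)}
avg_reward, win_rate = simulate(threshold_policy)
```

A policy learned by value or policy iteration can be plugged in the same way to estimate its probability of reaching exactly 100.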
8. Need to design a DiceGame Environment:
# Code for dataset loading; print dataset statistics along with the reward function
#-----write your code below this line---------
class DiceGameEnvironment:
9. Reward Function -
# Calculate the reward for the 'stop' and 'roll' actions
#-----write your code below this line---------
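A sketch of the environment class covering both items 8 and 9. Note the dice game has no external dataset to load, so the "statistics" printed here describe the state and action spaces instead; the reward scheme and all method names are assumptions:

```python
class DiceGameEnvironment:
    """Dice-game environment sketch. There is no external dataset for
    this game, so statistics summarize the state/action space."""
    GOAL = 100

    def __init__(self):
        self.states = list(range(self.GOAL + 1))   # scores 0..100
        self.actions = ["roll", "stop"]

    def reward(self, state, action, roll=None):
        """Assumed reward scheme for the 'stop' and 'roll' actions."""
        # 'stop': keep the current score
        if action == "stop":
            return state
        # 'roll' showing a 1: forfeit the accumulated points
        if roll == 1:
            return -state
        # 'roll' reaching exactly 100: award the remaining points
        if min(state + roll, self.GOAL) == self.GOAL:
            return self.GOAL - state
        # ordinary roll: the rolled number is added to the score
        return roll

    def print_statistics(self):
        print(f"States: {len(self.states)} (scores 0..{self.GOAL})")
        print(f"Actions: {self.actions}")
        print(f"Terminal states: 0 (loss), {self.GOAL} (win)")
```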


