Question


Posted on Aug 27, 2024


4 Markov Decision Processes

Consider the following game. In each turn you have a choice of rolling a special die, or stopping the game. The die is biased: every time you roll, it produces 1, 3, 5 or 6 with equal probability. No other values are possible. (It's a tetrahedral die.) At any point of time, you can either roll or stop if the total "score" (obtained by adding the values on the die from every rolling) is less than 7. If the "score" reaches or exceeds 7, you "go bust" and go to the final state, accruing zero reward. When in any state other than the final state, you are allowed to take the stop action. When you stop, you reach the final state and your reward is the total "score" if it is less than 7.

Note: there is no direct reward from rolling the dice (or we could say that there is a reward but it's always 0). The only non-zero reward comes from explicitly taking the stop action. Discounting or not should not matter in the MDP for this game, but for the record, we assume no discounting (i.e., γ = 1).

Figure 1: The value of a tetrahedral die like this, after a roll, is at the top, here 5, which should show equally well on any of the three faces that touch the top vertex.

(a) (6 points) Write down the states (in any order) and actions for this MDP. (Hint: there are 8 states in total and each should correspond to a numeric value except the initial and final states.)

(b) (10 points) Give the full transition function T(s, a, s'). Here s is a current state, a is an action, and s' is a possible next state when a is performed in s. Assuming your states are s0, s1, s2, s3 etc., and actions are a0, a1 etc., some examples of how you should write the function are as follows:

T(s, a, s') = (value); s = s0, s' ∈ {s1, s2, s3, ...}
T(s0, a1, s1) = (value)

(c) (2 points) Give the full reward function R(s, a, s').

(d) (2 points) What is the optimal policy? There is no need to perform value iteration or use any fancy math; just write your answer in words.
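One way to check a hand-written answer is to encode the game rules directly and evaluate them. The following is a minimal sketch, not an official solution. It assumes the initial state can be treated as a score of 0, and that the eight states are therefore the scores 0 through 6 plus one terminal state. The names FINAL, transitions, reward and solve are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

DIE_FACES = (1, 3, 5, 6)   # equally likely faces, from the problem statement
BUST_AT = 7                # a total of 7 or more ends the game with reward 0
FINAL = "final"            # hypothetical label for the terminal state
SCORES = range(BUST_AT)    # non-terminal states 0..6 (assumption: 0 is the initial state)
ACTIONS = ("roll", "stop")


def transitions(score, action):
    """T(s, a, .): map each possible next state to its probability."""
    if action == "stop":
        return {FINAL: Fraction(1)}
    dist = {}
    for face in DIE_FACES:
        total = score + face
        nxt = total if total < BUST_AT else FINAL  # going bust leads to FINAL
        dist[nxt] = dist.get(nxt, Fraction(0)) + Fraction(1, len(DIE_FACES))
    return dist


def reward(score, action, next_state):
    """R(s, a, s'): only stopping pays out, and it pays the current score."""
    return score if action == "stop" else 0


def solve():
    """Evaluate both actions in every state with no discounting (gamma = 1).

    Rolling always increases the score, so one backward pass from 6 down
    to 0 is enough; no iteration to convergence is needed.
    """
    values = {FINAL: Fraction(0)}
    q = {}
    for score in reversed(SCORES):
        for action in ACTIONS:
            q[score, action] = sum(
                p * (reward(score, action, nxt) + values[nxt])
                for nxt, p in transitions(score, action).items()
            )
        values[score] = max(q[score, a] for a in ACTIONS)
    return values, q


values, q = solve()
for score in SCORES:
    best = [a for a in ACTIONS if q[score, a] == values[score]]
    print(score, float(values[score]), " or ".join(best))
```

Under these assumptions, the pass reports rolling as the better action at scores 0 and 1, a tie between rolling and stopping at score 2 (both are worth 2), and stopping as better from 3 upward. Exact fractions are used instead of floats so that the tie at score 2 shows up as a true equality.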