Question

Objectives: To implement a reinforcement learning algorithm that can learn a policy for a given task based on task-based rewards, and to take a continuous environment and discretize it so that it is suitable for a reinforcement learning task.

This is the CartPole task. The idea is to balance a pole using a one-dimensional robot (it can only move left and right). The robot's state has 4 components:

x: the location of the robot (0 is the center, -2.4 is the leftmost part of the board, 2.4 is the rightmost part of the board)
xdot: the velocity of the robot (technically, this can range from -inf to inf)
theta: the angle of the pole (0 is straight up; -12 degrees or less means the pole falls to the left, 12 degrees or more means the pole falls to the right)
thetadot: the change in angle per second

The robot can choose between two actions:

0: move left
1: move right

Success is balancing for 500 ticks; failure is the pole falling more than 12 degrees from vertical, or the robot moving more than 2.4 meters from the center. Your first task is to make a robot learn this task using reinforcement learning (Q-learning).

OpenAI Gym: You do not have to implement the problem domain yourself; there is a resource called OpenAI Gym which provides a set of common training environments. Gym can be installed with the following command:

> sudo pip3 install gym

After running this command, you may also be asked to install some additional packages for video encoding. You'll see an error message with instructions to follow.

State Discretization: We will discretize the state space in order to simplify the reinforcement learning algorithm. One example discretization is as follows:

x: three buckets (one for x < -0.08, one for -0.08 < x < 0.08, one for x > 0.08)
xdot: three buckets (one for xdot < -0.5, one for -0.5 < xdot < 0.5, one for xdot > 0.5)
theta: six buckets, separated by -6 deg, -1 deg, 0, 1 deg, 6 deg
thetadot: three buckets, separated by -50 deg/s and 50 deg/s

These state components are combined into a single integer state (0 ... 161 for this example, since 3 * 3 * 6 * 3 = 162 combinations). This is done for you in the function discretize_state in the provided code.

Your Task (part 1): You need to implement the Q-learning part of this task (right now the template code will not do any learning). You need to implement the following equation from the lecture slides (Q-values can be learned directly from reward feedback):

Q(a, i) <- Q(a, i) + alpha * (R(i) + gamma * max_a' Q(a', j) - Q(a, i))

where alpha is the learning rate and gamma is the discount factor. The magic happens (or right now does not happen) on line 112 of cart.py. In this case, the discretized current state (s) is i in the equation, and the next discretized state (sprime) is j. The reward is stored in the variable reward, and the learning rate alpha is in the variable alpha, which is set initially in the code. The predicted value of the next state, max_a' Q(a', j), is already computed and stored in the variable predicted_value.

Exploration vs Exploitation: Often the Q-learning algorithm converges to a local optimum rather than the global optimum. To overcome this issue, an epsilon-greedy strategy that balances exploration and exploitation is employed, so that the algorithm explores the state-action space before committing to a route that converges to a local optimum. Exploration allows an agent to improve its current knowledge of actions, resulting in a policy that offers long-term benefits. Exploitation, on the other hand, follows the greedy policy and chooses the action with the maximum Q-value.
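To make the two pieces above concrete, here is a minimal sketch, not the provided cart.py, of the discretization described above and of the update that belongs on line 112. discretize_state is the name used in the template, but this body is only an illustration of the example buckets; q_update is a purely hypothetical helper, and the [action, state] layout of the Q-table and the explicit gamma argument are assumptions:

import math
import numpy as np

def discretize_state(state):
    # state = (x, xdot, theta, thetadot); Gym reports angles in radians.
    x, xdot, theta, thetadot = state

    # x: three buckets (< -0.08, between, > 0.08)
    x_bucket = 0 if x < -0.08 else (1 if x <= 0.08 else 2)

    # xdot: three buckets (< -0.5, between, > 0.5)
    xdot_bucket = 0 if xdot < -0.5 else (1 if xdot <= 0.5 else 2)

    # theta: six buckets separated by -6, -1, 0, 1, 6 degrees
    theta_bucket = int(np.digitize(math.degrees(theta), [-6.0, -1.0, 0.0, 1.0, 6.0]))

    # thetadot: three buckets separated by -50 and 50 deg/s
    thetadot_bucket = int(np.digitize(math.degrees(thetadot), [-50.0, 50.0]))

    # Combine the four bucket indices into a single integer 0..161
    # (3 * 3 * 6 * 3 = 162 discrete states).
    return ((x_bucket * 3 + xdot_bucket) * 6 + theta_bucket) * 3 + thetadot_bucket

def q_update(Q, action, s, sprime, reward, alpha, gamma):
    # Q(a, i) <- Q(a, i) + alpha * (R(i) + gamma * max_a' Q(a', j) - Q(a, i))
    predicted_value = np.max(Q[:, sprime])          # max_a' Q(a', j)
    Q[action, s] += alpha * (reward + gamma * predicted_value - Q[action, s])

In the template, predicted_value is said to be computed already, so the actual line 112 likely reduces to the single assignment at the end, with the discount either applied explicitly or already folded into predicted_value, depending on how cart.py is written.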
During the learning phase we want the agent to simultaneously explore and exploit the current knowledge it has about the environment. Here the epsilon-greedy strategy comes into play.

Epsilon-greedy action selection: This is a simple method to balance exploration and exploitation by choosing between a random action and the optimal action (based on the current policy). The pseudocode is as follows:

p = random()
if p < epsilon:
    choose a random action                          (explore)
else:
    choose the action with the highest Q-value      (exploit)

Grading is done with unit tests. Only the final lines of the first test script are included in the question; they check the highest attained position against a success threshold of 0.5:

        print("Your highest attained position = {}".format(max_[0]))
        print("Position threshold for success >= {}".format(0.5))
        self.assertEqual(result, True)

unittest.main()

The second test script runs the CartPole task and checks the cart-position and pole-angle limits:

from cart import CartPole
import unittest
import numpy as np

class TestTicTacToe(unittest.TestCase):
    # def test_init_board(self):
    #     ttt = TicTacToe3D()
    #     # brd, winner = ttt.play_game()
    #     self.assertEqual(ttt.board.shape, (3, 3, 3))

    def test_1(self):
        player_first = 1
        expected_winner = 1
        env_id = 'CartPole-v1'
        cartpole = CartPole(env_id, False, True, 'cart.npy')
        all_states = cartpole.run()
        max_ = np.max(all_states, axis=0)
        print("max = {}".format(max_))
        result_1 = max_[0] <= 2.4
        result_2 = max_[2] <= 0.226893
        result = result_1 and result_2
        print("Your max cart position = {}".format(max_[0]))
        print("Your max pole angle = {}".format(max_[2]))
        print("Cart position for success <= {}".format(2.4))
        print("Pole angle for success <= {} radians".format(0.226893))
        self.assertEqual(result, True)

unittest.main()
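For reference, a short sketch of the epsilon-greedy selection in the pseudocode above; the function name select_action and the [action, state] Q-table layout are assumptions, not the template's actual interface:

import random
import numpy as np

def select_action(Q, s, epsilon):
    # With probability epsilon explore (pick a random action);
    # otherwise exploit (pick the action with the highest Q-value in state s).
    p = random.random()
    if p < epsilon:
        return random.randrange(Q.shape[0])     # explore: 0 = left, 1 = right
    return int(np.argmax(Q[:, s]))              # exploit: greedy action

A common refinement is to decay epsilon over episodes, so the agent explores early in training and mostly exploits once the Q-table has settled.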
