Question

In this section you will train two DQN models of Architecture type 1, i.e. the DQN model should accept the state and the action as input, and its output should be the Q-value of the state-action pair given in the input. The first DQN model should be trained without the data collected in step 3, and the second one should use that data.
Deliverables (75 marks): You are given a Python script training.py. This script contains the bare-bones skeleton of the DQN training code along with a function that loads the data collected in step 3. You must NOT change the overall structure of the skeleton. There are two functions in training.py: DQN_training and plot_reward. Your task is to write the code for these two functions. A few additional instructions:
1. The DQN_training function MUST train a DQN of Architecture 1 (the DQN model should accept the state and the action as input, and its output should be the Q-value of the state-action pair given in the input) for the Lunar Lander environment; a sketch of such a model is given after this list. The outputs of the function are the final trained model and a NumPy array containing the total reward per episode.
VERY IMPORTANT: If you code a DQN model of Architecture type 2 (i.e. a DQN model that accepts the state as input and outputs the Q-values of all state-action pairs), you will get a ZERO for this section. There will be NO MERCY in this regard.
2. The function has an argument called use_offline_data. If this argument is True, then the data collected in step 3 should be used for training; otherwise it should not be.
If use_offline_data=False, then it is business as usual, i.e. your code will be similar to that in
the resources folder.
If use_offline_data=True, then initialize the replay buffer with the data collected offline; see the replay-buffer sketch after this list. For the first N episodes, you will NOT append any data collected from the interaction with the environment onto the replay buffer. After N episodes, the data collected from the interaction with the environment should be appended to the replay buffer. N should not be greater than 100. The exact value of N is a hyperparameter; in many ways it depends on how good the data collected in step 3 is. If you got high total rewards in step 3, N can be high. In this regard, note that one of the arguments of the function load_offline_data is min_score. Only those episodes/plays will be loaded whose total reward is >= min_score. So, the higher the min_score, the higher the quality of the data but the smaller the amount of data. By default, min_score is set to -∞ and hence all the episodes are loaded.
3. The final trained model should be saved and submitted. You should submit only one model, either for use_offline_data=False or use_offline_data=True. Submit the one that is performing better. The size of your model should not be greater than 2 MB.
4. The function plot_reward should plot the following in the same graph: (i) the total reward per episode, and (ii) the moving average of the total reward; a sketch of this function is given after this list. The plots for both use_offline_data=False and use_offline_data=True should be included in report.pdf.
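To make the Architecture 1 requirement concrete, here is a minimal sketch of such a model, assuming PyTorch (the class and helper names QNetworkArch1 and select_greedy_action are illustrative, not part of the given skeleton). The state is concatenated with a one-hot encoding of the action and the network returns a single Q-value, so greedy action selection has to query the network once per action:

```python
import torch
import torch.nn as nn


class QNetworkArch1(nn.Module):
    """Architecture type 1: input = (state, one-hot action), output = a single Q(s, a)."""

    def __init__(self, state_dim=8, n_actions=4, hidden=64):
        super().__init__()
        # Lunar Lander has an 8-dimensional state and 4 discrete actions.
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one scalar Q-value for the given (state, action) pair
        )

    def forward(self, state, action_onehot):
        # Concatenate the state with the one-hot encoded action along the feature axis.
        return self.net(torch.cat([state, action_onehot], dim=-1))


def select_greedy_action(model, state, n_actions=4):
    """Pick argmax_a Q(s, a) by querying the network once per action.

    With Architecture 1 the model returns one Q-value at a time, so every
    action has to be evaluated separately (here in a single batched call).
    """
    with torch.no_grad():
        s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        states = s.repeat(n_actions, 1)   # (n_actions, state_dim)
        actions = torch.eye(n_actions)    # one-hot vectors for all actions
        q_values = model(states, actions).squeeze(-1)
        return int(torch.argmax(q_values).item())
```

With two hidden layers of 64 units, the saved model stays far below the 2 MB limit mentioned in point 3.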
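The use_offline_data handling from point 2 could be organized roughly as follows. This is only a sketch under assumed names: build_replay_buffer, store_transition, and warmup_episodes are placeholders, and load_offline_data (passed in here as load_fn) is the loader provided in training.py, assumed to return (state, action, reward, next_state, done) transitions.

```python
from collections import deque


def build_replay_buffer(use_offline_data, load_fn, capacity=100_000):
    """Create the replay buffer; pre-fill it with the step-3 data when requested.

    load_fn stands in for the load_offline_data function given in training.py.
    """
    replay_buffer = deque(maxlen=capacity)
    if use_offline_data:
        replay_buffer.extend(load_fn())
    return replay_buffer


def store_transition(replay_buffer, transition, episode,
                     use_offline_data, warmup_episodes=50):
    """Append an environment transition to the buffer.

    When offline data is used, environment transitions are appended only
    after the first `warmup_episodes` episodes. warmup_episodes plays the
    role of N above: a hyperparameter that should not exceed 100.
    """
    if (not use_offline_data) or (episode >= warmup_episodes):
        replay_buffer.append(transition)
```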
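For plot_reward (point 4), a simple matplotlib sketch could look like the following; the moving-average window of 100 episodes is an arbitrary choice, so use whatever window makes the trend readable:

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_reward(rewards, window=100):
    """Plot the total reward per episode together with its moving average."""
    rewards = np.asarray(rewards, dtype=float)
    # Moving average over the last `window` episodes (shorter at the start).
    moving_avg = np.array([rewards[max(0, i - window + 1):i + 1].mean()
                           for i in range(len(rewards))])

    plt.figure(figsize=(10, 5))
    plt.plot(rewards, alpha=0.4, label="Total reward per episode")
    plt.plot(moving_avg, linewidth=2, label=f"Moving average ({window} episodes)")
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.legend()
    plt.grid(True)
    plt.show()
```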
The following points may also be of significant help:
1. Don't forget to save the model periodically using automated code; a small checkpoint helper is sketched after this list. This is because your laptop or Kaggle/Colab notebook can switch off (or go to sleep) after prolonged inactivity, which is often the case when you are training a model for a long time. If you save your model, you can load it and continue from where your progress stopped.
2. Don't use a complicated neural network model; it will take a lot of time to train. A neural network with 2-3 hidden layers and not more than 128 neurons per hidden layer is more than enough. In fact, 128 neurons is too many, and so is the 2 MB size limit mentioned above. The size of my model is less than 100 KB.
3. A GPU will NOT increase the training speed significantly. This is one of the curses of Deep RL (unless you are using advanced techniques like multi-agent RL).
4. Remember that the action is one of the inputs to the model. Think about whether this input should be ordinal or one-hot encoded.
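For the periodic-saving point above, a small helper along these lines is usually enough; the save interval and file name are arbitrary choices, and torch.save assumes a PyTorch model.

```python
import torch


def maybe_checkpoint(model, episode, save_every=50, path="dqn_arch1_checkpoint.pt"):
    """Save the model every `save_every` episodes so training can resume
    after a Colab/Kaggle disconnect or laptop sleep."""
    if episode % save_every == 0:
        torch.save(model.state_dict(), path)
```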
Give me the entire code for this in the Lunar Lander environment, specifically with the architecture mentioned above (DQN Architecture 1), and make sure the reward increases as the episodes increase.
