Question

In this section you will train two DQN models of Architecture type 1, i.e. the DQN model should accept the state and the action as input, and its output should be the Q-value of the state-action pair given in the input. The first DQN model should be trained without the data collected in step 3, and the second one should use that data.
Deliverables (75 marks): You are given a Python script training.py. This script contains the bare-bones skeleton of the DQN training code along with a function that loads the data collected in step 3. You must NOT change the overall structure of the skeleton. There are two functions in training.py: DQN_training and plot_reward. Your task is to write the code for these two functions. A few additional instructions:
1. The DQN_training function MUST train a DQN of Architecture 1 (the DQN model should accept the state and the action as input, and its output should be the Q-value of the state-action pair given in the input) for the Lunar Lander environment; a sketch of such a model is given after this list. The outputs of the function are the final trained model and a NumPy array containing the total reward per episode.
VERY IMPORTANT: If you code a DQN model of Architecture type 2 (i.e. a DQN model that accepts the state as input and outputs the Q-values of all state-action pairs), you will get a ZERO for this section. There will be NO MERCY in this regard.
2. The function has an argument called use_offline_data. If this argument is True, then the data collected in step 3 should be used for training; otherwise it should not be.
If use_offline_data=False, then it is business as usual, i.e. your code will be similar to that in
the resources folder.
If use_offline_data=True, then initialize the replay buffer with the data collected offline; see the replay-buffer sketch after this list. For the first N episodes, you will NOT append any data collected from the interaction with the environment onto the replay buffer. After N episodes, the data collected from the interaction with the environment should be appended to the replay buffer. N should not be greater than 100. The exact value of N is a hyperparameter; in many ways it depends on how good the data collected in step 3 is. If you got high total rewards in step 3, N can be high. In this regard, note that one of the arguments of the function load_offline_data is min_score. Only those episodes/plays will be loaded whose total reward is >= min_score. So, the higher the min_score, the higher the quality of the data but the smaller the amount of data. By default, min_score is set to -∞ and hence all the episodes are loaded.
3. The final trained model should be saved and submitted. You should submit only one model, either for use_offline_data=False or use_offline_data=True. Submit the one that is performing better. The size of your model should not be greater than 2 MB.
4. The function plot_reward should plot the following in the same graph: (i) the total reward per episode, and (ii) the moving average of the total reward; a sketch of this function is given after this list. The plots for both use_offline_data=False and use_offline_data=True should be included in report.pdf.
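To make the Architecture 1 requirement concrete, here is a minimal sketch of such a model, assuming PyTorch (the class and helper names QNetworkArch1 and select_greedy_action are illustrative, not part of the given skeleton). The state is concatenated with a one-hot encoding of the action and the network returns a single Q-value, so greedy action selection has to query the network once per action:

```python
import torch
import torch.nn as nn


class QNetworkArch1(nn.Module):
    """Architecture type 1: input = (state, one-hot action), output = a single Q(s, a)."""

    def __init__(self, state_dim=8, n_actions=4, hidden=64):
        super().__init__()
        # Lunar Lander has an 8-dimensional state and 4 discrete actions.
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one scalar Q-value for the given (state, action) pair
        )

    def forward(self, state, action_onehot):
        # Concatenate the state with the one-hot encoded action along the feature axis.
        return self.net(torch.cat([state, action_onehot], dim=-1))


def select_greedy_action(model, state, n_actions=4):
    """Pick argmax_a Q(s, a) by querying the network once per action.

    With Architecture 1 the model returns one Q-value at a time, so every
    action has to be evaluated separately (here in a single batched call).
    """
    with torch.no_grad():
        s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        states = s.repeat(n_actions, 1)   # (n_actions, state_dim)
        actions = torch.eye(n_actions)    # one-hot vectors for all actions
        q_values = model(states, actions).squeeze(-1)
        return int(torch.argmax(q_values).item())
```

With two hidden layers of 64 units, the saved model stays far below the 2 MB limit mentioned in point 3.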
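The use_offline_data handling from point 2 could be organized roughly as follows. This is only a sketch under assumed names: build_replay_buffer, store_transition, and warmup_episodes are placeholders, and load_offline_data (passed in here as load_fn) is the loader provided in training.py, assumed to return (state, action, reward, next_state, done) transitions.

```python
from collections import deque


def build_replay_buffer(use_offline_data, load_fn, capacity=100_000):
    """Create the replay buffer; pre-fill it with the step-3 data when requested.

    load_fn stands in for the load_offline_data function given in training.py.
    """
    replay_buffer = deque(maxlen=capacity)
    if use_offline_data:
        replay_buffer.extend(load_fn())
    return replay_buffer


def store_transition(replay_buffer, transition, episode,
                     use_offline_data, warmup_episodes=50):
    """Append an environment transition to the buffer.

    When offline data is used, environment transitions are appended only
    after the first `warmup_episodes` episodes. warmup_episodes plays the
    role of N above: a hyperparameter that should not exceed 100.
    """
    if (not use_offline_data) or (episode >= warmup_episodes):
        replay_buffer.append(transition)
```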
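For plot_reward (point 4), a simple matplotlib sketch could look like the following; the moving-average window of 100 episodes is an arbitrary choice, so use whatever window makes the trend readable:

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_reward(rewards, window=100):
    """Plot the total reward per episode together with its moving average."""
    rewards = np.asarray(rewards, dtype=float)
    # Moving average over the last `window` episodes (shorter at the start).
    moving_avg = np.array([rewards[max(0, i - window + 1):i + 1].mean()
                           for i in range(len(rewards))])

    plt.figure(figsize=(10, 5))
    plt.plot(rewards, alpha=0.4, label="Total reward per episode")
    plt.plot(moving_avg, linewidth=2, label=f"Moving average ({window} episodes)")
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.legend()
    plt.grid(True)
    plt.show()
```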
The following points may also be of significant help:
1. Don't forget to save the model periodically using automated code; a small checkpoint helper is sketched after this list. This is because your laptop or Kaggle/Colab notebook can switch off (or go to sleep) after prolonged inactivity, which is often the case when you are training a model for a long time. If you save your model, you can load it and continue from where your progress stopped.
2. Don't use a complicated neural network model; it will take a lot of time to train. A neural network with 2-3 hidden layers and not more than 128 neurons per hidden layer is more than enough. In fact, 128 neurons is too many, and so is the 2 MB size limit mentioned above. The size of my model is less than 100 KB.
3. A GPU will NOT increase the training speed significantly. This is one of the curses of Deep RL (unless you are using advanced techniques like multi-agent RL).
4. Remember that the action is one of the inputs to the model. Think about whether this input should be ordinal or one-hot encoded.
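For the periodic-saving point above, a small helper along these lines is usually enough; the save interval and file name are arbitrary choices, and torch.save assumes a PyTorch model.

```python
import torch


def maybe_checkpoint(model, episode, save_every=50, path="dqn_arch1_checkpoint.pt"):
    """Save the model every `save_every` episodes so training can resume
    after a Colab/Kaggle disconnect or laptop sleep."""
    if episode % save_every == 0:
        torch.save(model.state_dict(), path)
```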
Give me the entire code for this in the Lunar Lander environment, specifically with the architecture mentioned above (DQN Architecture 1), and make sure the reward increases as the episodes increase.
