Question
In this section you will train two DQN models of Architecture type 1, i.e. the DQN model should accept the state and the action as input, and the output of the model should be the Q-value of the state-action pair given in the input. The first DQN model should be trained without the data collected in the earlier step, and the second one should use that data.
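For concreteness, here is a minimal sketch of what an Architecture type 1 network could look like. It assumes PyTorch and the Gymnasium Lunar Lander environment (8-dimensional state, 4 discrete actions); the class name StateActionQNetwork and the layer sizes are illustrative choices, not part of the provided skeleton.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateActionQNetwork(nn.Module):
    """Architecture type 1: input = state + one-hot action, output = a single Q-value."""

    def __init__(self, state_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                    # one scalar: Q(s, a)
        )

    def forward(self, states, actions):
        # states: (batch, state_dim) float tensor; actions: (batch,) int64 action indices
        onehot = F.one_hot(actions, self.n_actions).float()
        return self.net(torch.cat([states, onehot], dim=1)).squeeze(-1)
```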
Deliverables: You are given a Python script training.py. This script contains the bare-basic skeleton of the DQN training code along with a function that loads the data collected in the earlier step. You must NOT change the overall structure of the skeleton. There are two functions in training.py: DQN_training and plot_reward. Your task is to write the code for these two functions. A few additional instructions:
The function DQN_training MUST train a DQN of Architecture type 1, i.e. the DQN model should accept the state and the action as input, and the output of the model should be the Q-value of the state-action pair given in the input, for the Lunar Lander environment. The output of the function is the final trained model and a NumPy array containing the total reward per episode.
VERY IMPORTANT: If you code a DQN model of the other architecture type, i.e. a DQN model that accepts only the state as input and outputs the Q-values of all state-action pairs, you will get a ZERO for this section. There will be NO MERCY in this regard.
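One practical consequence of Architecture type 1 is that computing max over actions of Q(s, a), which is needed both for greedy action selection and for the TD target, requires evaluating the network once per action. A rough sketch of such a helper, assuming the hypothetical StateActionQNetwork above (the function names are illustrative):

```python
import torch

def q_all_actions(model, states, n_actions):
    """Q(s, a) for every action of every state in a batch -> shape (batch, n_actions)."""
    batch = states.shape[0]
    rep_states = states.repeat_interleave(n_actions, dim=0)   # each state repeated n_actions times
    rep_actions = torch.arange(n_actions).repeat(batch)       # actions 0..n_actions-1 for each state
    return model(rep_states, rep_actions).view(batch, n_actions)

def greedy_action(model, state, n_actions):
    """argmax_a Q(s, a) for a single state, e.g. for epsilon-greedy exploration."""
    with torch.no_grad():
        s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        return int(q_all_actions(model, s, n_actions).argmax(dim=1).item())
```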
The function DQN_training has an argument called use_offline_data. If this argument is True, then the data collected in the earlier step should be used for training; otherwise it should not be.
If use_offline_data=False, then it is business as usual, i.e. your code will be similar to the code provided in the resources folder.
If use_offline_data=True, then initialize the replay buffer with the data collected offline. For the first N episodes, you will NOT append any data collected from interaction with the environment to the replay buffer; after those N episodes, the data collected from interaction with the environment should be appended to the replay buffer. N must not exceed the limit stated in the assignment, and its exact value is a hyperparameter. In many ways it depends on how good the offline data is: if you got high total rewards while collecting it, N can be high. In this regard, note that one of the arguments of the function load_offline_data is min_score. Only those episodes/plays will be loaded whose total reward is at least min_score. So the higher the min_score, the better the quality of the data, but the smaller the amount of data. By default, min_score is set to -∞ and hence all the episodes are loaded.
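A small sketch of how the offline warm-up could be wired in. It assumes load_offline_data from the provided skeleton returns a list of (state, action, reward, next_state, done) transitions; that return format, the callable-argument style, and the names warmup_episodes and buffer_size are assumptions made for illustration.

```python
from collections import deque

def init_replay_buffer(use_offline_data, load_offline_data, min_score=float("-inf"),
                       buffer_size=100_000):
    """Build the replay buffer, pre-filled with offline transitions when requested.

    load_offline_data is passed in as a callable so this sketch does not have to
    guess the skeleton's import structure; it is assumed to return a list of
    (state, action, reward, next_state, done) tuples filtered by min_score.
    """
    buffer = deque(maxlen=buffer_size)
    if use_offline_data:
        buffer.extend(load_offline_data(min_score=min_score))
    return buffer

# Later, inside the episode loop of DQN_training (sketch):
#   if (not use_offline_data) or episode >= warmup_episodes:   # N offline-only episodes
#       buffer.append((state, action, reward, next_state, done))
```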
The final trained model should be saved and submitted. You should submit only one model, either for use_offline_data=False or use_offline_data=True; submit the one that performs better. The size of your model should not exceed the MB limit given in the assignment.
The function plot_reward should plot the following in the same graph: (i) the total reward per episode, and (ii) a moving average of the total reward. The plots for both use_offline_data=False and use_offline_data=True should be included in report.pdf.
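A possible implementation of plot_reward, assuming Matplotlib; the window size, figure size, and output filename are illustrative, and the real skeleton's signature may differ.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_reward(rewards, window=50):
    """Plot the per-episode reward and its moving average on the same axes."""
    rewards = np.asarray(rewards, dtype=float)
    moving_avg = np.convolve(rewards, np.ones(window) / window, mode="valid")

    plt.figure(figsize=(10, 5))
    plt.plot(rewards, alpha=0.4, label="Total reward per episode")
    plt.plot(np.arange(window - 1, len(rewards)), moving_avg,
             label=f"{window}-episode moving average")
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.legend()
    plt.savefig("reward_plot.png")            # paste this figure into report.pdf
    plt.show()
```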
The following points may also be of significant help:
Don't forget to save the model periodically using automated code. This is because your laptop or Kaggle/Colab notebook can switch off or go to sleep if there is prolonged inactivity, which is often the case when you are training a model for a long time. If you save your model, you can load it and resume from where your progress stopped.
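A minimal checkpointing sketch, assuming PyTorch; the filename and the 25-episode interval are arbitrary choices.

```python
import torch

CHECKPOINT_PATH = "dqn_checkpoint.pt"        # illustrative filename

def save_checkpoint(model, optimizer, episode, rewards):
    """Save everything needed to resume training after an interruption."""
    torch.save({
        "episode": episode,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "rewards": rewards,
    }, CHECKPOINT_PATH)

# Inside the training loop, e.g. every 25 episodes:
#   if episode % 25 == 0:
#       save_checkpoint(model, optimizer, episode, rewards_per_episode)
```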
Don't use any complicated neural network model; it will take a lot of time to train. A neural network model with a couple of hidden layers and a modest number of neurons per hidden layer is more than enough. In fact, even that is generous, and so is the MB limit mentioned above: the size of my model is measured in KB, not MB.
A GPU will NOT increase the training speed significantly. This is one of the curses of Deep RL, unless you are using advanced techniques like multi-agent RL.
Remember that the action is one of the inputs to the model. Think about whether this input should be ordinal or one-hot encoded.
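For a discrete action space like Lunar Lander's, one-hot encoding is usually the safer choice, since an ordinal encoding imposes an artificial ordering on the actions. A tiny illustration, assuming PyTorch:

```python
import torch
import torch.nn.functional as F

# Ordinal: feed the raw index (0-3) as a single number; the network must then treat
# "action 3 minus action 1" as meaningful, which it is not for discrete actions.
# One-hot: each action gets its own input dimension.
action = torch.tensor([2])                              # one of the 4 Lunar Lander actions
print(F.one_hot(action, num_classes=4).float())         # tensor([[0., 0., 1., 0.]])
```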
Give me the entire code for this in the Lunar Lander environment, specifically with the architecture mentioned above (the state-action DQN), and make sure the reward increases as the episodes increase.
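Below is a sketch of what DQN_training could look like for Architecture type 1, reusing the StateActionQNetwork class and the q_all_actions helper from the earlier sketches. It assumes Gymnasium's LunarLander-v2 (use LunarLander-v3 on newer Gymnasium releases, and make sure Box2D is installed) and PyTorch. Every hyperparameter value, the target-network sync interval, and the extra offline_transitions/warmup_episodes arguments are illustrative assumptions rather than part of the provided skeleton, and some tuning (epsilon schedule, learning rate, N) is normally needed before the reward curve trends upward.

```python
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.nn.functional as F

def DQN_training(use_offline_data=False, offline_transitions=None, episodes=600,
                 gamma=0.99, lr=1e-3, batch_size=64, buffer_size=100_000,
                 eps_start=1.0, eps_end=0.05, eps_decay=0.995, warmup_episodes=0):
    env = gym.make("LunarLander-v2")
    state_dim = env.observation_space.shape[0]            # 8 for Lunar Lander
    n_actions = env.action_space.n                         # 4 discrete actions

    model = StateActionQNetwork(state_dim, n_actions)      # from the earlier sketch
    target = StateActionQNetwork(state_dim, n_actions)
    target.load_state_dict(model.state_dict())
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    buffer = deque(maxlen=buffer_size)
    if use_offline_data and offline_transitions:           # e.g. from load_offline_data(...)
        buffer.extend(offline_transitions)

    eps, rewards_per_episode = eps_start, []
    for episode in range(episodes):
        state, _ = env.reset()
        total, done = 0.0, False
        while not done:
            # Epsilon-greedy action selection using the state-action network.
            if random.random() < eps:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
                    action = int(q_all_actions(model, s, n_actions).argmax(dim=1).item())

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            total += reward
            # With offline data, hold off on online transitions for the first warmup_episodes.
            if (not use_offline_data) or episode >= warmup_episodes:
                buffer.append((state, action, reward, next_state, float(done)))
            state = next_state

            if len(buffer) >= batch_size:
                s, a, r, s2, d = map(np.array, zip(*random.sample(buffer, batch_size)))
                s = torch.as_tensor(s, dtype=torch.float32)
                a = torch.as_tensor(a, dtype=torch.int64)
                r = torch.as_tensor(r, dtype=torch.float32)
                s2 = torch.as_tensor(s2, dtype=torch.float32)
                d = torch.as_tensor(d, dtype=torch.float32)
                with torch.no_grad():
                    max_next_q = q_all_actions(target, s2, n_actions).max(dim=1).values
                    td_target = r + gamma * (1.0 - d) * max_next_q
                loss = F.mse_loss(model(s, a), td_target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        eps = max(eps_end, eps * eps_decay)
        if episode % 10 == 0:                              # periodic target-network sync
            target.load_state_dict(model.state_dict())
        rewards_per_episode.append(total)

    env.close()
    return model, np.array(rewards_per_episode)
```

A typical call would then be `model, rewards = DQN_training(use_offline_data=False)` followed by `plot_reward(rewards)`. Note that the real skeleton's DQN_training may take fewer arguments, in which case the extra ones here would become constants inside the function.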