Question
Below I have attached the training.py code. Give the DQN Architecture 1 code, which takes the state and the action as input and returns only one Q-value, and integrate it with training.py. Make sure you get an increasing reward trend as the episodes increase, and do not use the offline data, as mentioned in the image. The other guidelines are in the image.
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Lambda
from keras.optimizers import Adam
import keras.backend as K
from collections import deque
import random
# Constants (typical values; tune as needed)
BATCH_SIZE = 64
GAMMA = 0.99
EPSILON_START = 1.0
EPSILON_MIN = 0.01
EPSILON_DECAY = 0.995
LEARNING_RATE = 0.001
# Dueling DQN Model Architecture
def create_dueling_dqn_model(input_shape, action_space):
    state_input = Input(shape=(input_shape,))
    x = Dense(64, activation='relu')(state_input)  # hidden sizes are illustrative
    x = Dense(64, activation='relu')(x)
    x = Dense(64, activation='relu')(x)

    # State-value stream V(s), broadcast across all actions
    state_value = Dense(1)(x)
    state_value = Lambda(lambda s: K.expand_dims(s[:, 0], -1), output_shape=(action_space,))(state_value)

    # Advantage stream A(s, a), centred by subtracting its mean
    action_advantage = Dense(action_space)(x)
    action_advantage = Lambda(lambda a: a[:, :] - K.mean(a[:, :], keepdims=True), output_shape=(action_space,))(action_advantage)

    # Combine the two streams into Q-values
    q_values = Lambda(lambda w: w[0] + w[1], output_shape=(action_space,))([state_value, action_advantage])

    model = Model(inputs=state_input, outputs=q_values)
    model.compile(loss='mse', optimizer=Adam(lr=LEARNING_RATE))
    return model
# DQN Training Function
def DQN_training(env, offline_data, use_offline_data=False):
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    model = create_dueling_dqn_model(state_size, action_size)
    replay_buffer = deque(maxlen=2000)  # buffer size is illustrative
    epsilon = EPSILON_START
    total_reward_per_episode = []

    for episode in range(500):  # number of episodes (illustrative)
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        total_reward = 0

        for timestep in range(1000):  # max steps in an episode (illustrative)
            if np.random.rand() <= epsilon:
                action = env.action_space.sample()  # explore action space
            else:
                q_values = model.predict(state)
                action = np.argmax(q_values[0])  # exploit learned values

            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, state_size])
            total_reward += reward

            if not use_offline_data:  # only save and learn if not using offline data
                replay_buffer.append((state, action, reward, next_state, done))
                if len(replay_buffer) > BATCH_SIZE:
                    minibatch = random.sample(replay_buffer, BATCH_SIZE)
                    for s, a, r, ns, d in minibatch:
                        target = r
                        if not d:
                            target = r + GAMMA * np.amax(model.predict(ns)[0])
                        target_f = model.predict(s)
                        target_f[0][a] = target
                        model.fit(s, target_f, epochs=1, verbose=0)

            state = next_state
            if done:
                break

        total_reward_per_episode.append(total_reward)
        # Update epsilon
        epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)

    return model, np.array(total_reward_per_episode)
# Replace this line with any initialization of the environment required before training
# env = gym.make('LunarLander-v2')

# Do not load offline data
use_offline_data = False

# Now you would call DQN_training like this:
# final_model, total_reward_per_episode = DQN_training(env, None, use_offline_data)

# After training, you'd save your model and plot the rewards.
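A minimal sketch of the reward plot mentioned above, assuming matplotlib is available and that the array returned by DQN_training is passed in (the smoothing window is an illustrative choice):

import matplotlib.pyplot as plt
import numpy as np

def plot_reward(total_reward_per_episode):
    # Plot the raw per-episode reward plus a moving average so the trend is visible
    episodes = np.arange(len(total_reward_per_episode))
    plt.plot(episodes, total_reward_per_episode, alpha=0.4, label='episode reward')
    window = 20  # illustrative smoothing window
    if len(total_reward_per_episode) >= window:
        moving_avg = np.convolve(total_reward_per_episode, np.ones(window) / window, mode='valid')
        plt.plot(episodes[window - 1:], moving_avg, label=f'{window}-episode moving average')
    plt.xlabel('Episode')
    plt.ylabel('Total reward per episode')
    plt.legend()
    plt.savefig('reward_trend.png')
    plt.show()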
Section: Train DQN Model

In this section you will train two DQN models of the Architecture 1 type, i.e., the DQN model should accept the state and the action as input, and the output of the model should be the Q-value of the state-action pair given in the input. The first DQN model should be trained without the data collected in the earlier step, and the second one uses that data.

VERY IMPORTANT: If you code a DQN model of the Architecture 2 type, i.e., a DQN model that accepts the state as input and outputs the Q-values of all state-action pairs, you will get a ZERO for this section. There will be NO MERCY in this regard.

Deliverables: You are given a Python script, training.py. This script contains the bare basic skeleton of the DQN training code along with a function that loads the data collected in the earlier step. You must NOT change the overall structure of the skeleton. There are two functions in training.py: DQN_training and plot_reward. Your task is to write the code for these two functions. A few additional instructions: this function MUST train a DQN of Architecture 1, i.e., the DQN model should accept the state and the action as input, and the output of the model should be the Q-value of the state-action pair given in the input.
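For reference, a minimal sketch of an Architecture 1 network in Keras follows. The helper names create_architecture1_dqn and best_action, the hidden-layer sizes, and the one-hot action encoding are illustrative assumptions, not part of the provided skeleton. The state and a one-hot encoded action enter as two inputs and the model returns a single Q(s, a); greedy action selection then evaluates the model once per discrete action and takes the argmax.

import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Concatenate
from keras.optimizers import Adam

def create_architecture1_dqn(state_size, action_size, learning_rate=0.001):
    # Two inputs: the state vector and a one-hot encoding of the action
    state_input = Input(shape=(state_size,))
    action_input = Input(shape=(action_size,))
    x = Concatenate()([state_input, action_input])
    x = Dense(64, activation='relu')(x)  # hidden sizes are illustrative
    x = Dense(64, activation='relu')(x)
    q_value = Dense(1, activation='linear')(x)  # single Q(s, a) output
    model = Model(inputs=[state_input, action_input], outputs=q_value)
    model.compile(loss='mse', optimizer=Adam(lr=learning_rate))
    return model

def best_action(model, state, action_size):
    # Evaluate Q(s, a) for every discrete action and return the argmax
    states = np.repeat(np.reshape(state, (1, -1)), action_size, axis=0)
    actions = np.eye(action_size)
    q_values = model.predict([states, actions], verbose=0)
    return int(np.argmax(q_values))

Inside DQN_training, this per-action evaluation would replace the single model.predict(state) call used by an Architecture 2 network, both when choosing the greedy action and when computing the max over Q(s', a') for the TD target.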