Question

How can I structure two DQN models that take state and action as inputs and output Q-values for those state-action pairs?

I have a skeleton of the code here:

import numpy as np
import pandas as pd
import gymnasium as gym

def load_offline_data(path, min_score):
    state_data = []
    action_data = []
    reward_data = []
    next_state_data = []
    terminated_data = []
    dataset = pd.read_csv(path)
    dataset_group = dataset.groupby('Play #')
    for play_no, df in dataset_group:
        # States are stored as strings like "[x1 x2 ...]"; strip the brackets
        # and parse the whitespace-separated values (sep=' ', not sep='').
        state = np.array(df.iloc[:, 1])
        state = np.array([np.fromstring(row[1:-1], dtype=np.float32, sep=' ') for row in state])
        action = np.array(df.iloc[:, 2]).astype(int)
        reward = np.array(df.iloc[:, 3]).astype(np.float32)
        next_state = np.array(df.iloc[:, 4])
        next_state = np.array([np.fromstring(row[1:-1], dtype=np.float32, sep=' ') for row in next_state])
        terminated = np.array(df.iloc[:, 5]).astype(int)
        # Keep only episodes whose total reward reaches the minimum score.
        total_reward = np.sum(reward)
        if total_reward >= min_score:
            state_data.append(state)
            action_data.append(action)
            reward_data.append(reward)
            next_state_data.append(next_state)
            terminated_data.append(terminated)
    state_data = np.concatenate(state_data)
    action_data = np.concatenate(action_data)
    reward_data = np.concatenate(reward_data)
    next_state_data = np.concatenate(next_state_data)
    terminated_data = np.concatenate(terminated_data)
    return state_data, action_data, reward_data, next_state_data, terminated_data

def plot_reward(total_reward_per_episode, window_length):
    # This function should display:
    # (i) the total reward per episode, and
    # (ii) the moving average of the total reward, with the window sliding
    #      by one episode at a time.
    pass

def DQN_training(env, offline_data, use_offline_data):
    # This function should return the final trained DQN model and the total
    # reward of every episode.
    pass

# Initialize the lunar lander environment.
# NO RENDERING. It will slow the training process.
env = gym.make('LunarLander-v2')

# Load the offline data collected in step 3 and process the dataset.
path = 'lunar_dataset.csv'  # Path to the collected dataset.
min_score = -np.inf  # Minimum total reward an episode must have to be used for training.
offline_data = load_offline_data(path, min_score)

# Train a DQN model of architecture type 1.
use_offline_data = True  # If True, the offline data will be used; otherwise it will not.
final_model, total_reward_per_episode = DQN_training(env, offline_data, use_offline_data)

# Save the final model.
final_model.save('lunar_lander_model.h5')  # This line is for Keras. Replace with the appropriate code for your framework.

# Plot the reward per episode and the moving-average reward.
window_length = 50  # Window length for the moving-average reward.
plot_reward(total_reward_per_episode, window_length)
env.close()
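One way to structure the two DQN models (the online network and its target copy) so that each takes a state and an action as inputs and outputs the Q-value of that state-action pair is a functional model with two input branches that are concatenated and passed through dense layers. The sketch below uses Keras to match the final_model.save('lunar_lander_model.h5') line in the skeleton; the helper name build_q_model, the two 128-unit hidden layers, the Adam/MSE compile settings, and the one-hot action encoding are illustrative assumptions, not requirements of the assignment. LunarLander-v2 itself has an 8-dimensional observation and 4 discrete actions.

from tensorflow import keras
from tensorflow.keras import layers

def build_q_model(state_dim=8, n_actions=4):
    # State branch: the 8-dimensional LunarLander observation.
    state_in = keras.Input(shape=(state_dim,), name='state')
    # Action branch: the action as a one-hot vector (an assumed encoding).
    action_in = keras.Input(shape=(n_actions,), name='action')
    x = layers.Concatenate()([state_in, action_in])
    x = layers.Dense(128, activation='relu')(x)
    x = layers.Dense(128, activation='relu')(x)
    # Single linear output: the scalar Q(s, a) for this state-action pair.
    q_value = layers.Dense(1, activation='linear')(x)
    model = keras.Model(inputs=[state_in, action_in], outputs=q_value)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss='mse')
    return model

# The two DQN models: the online network trained every step, and the
# target network that is periodically synced to it.
q_model = build_q_model()
target_model = build_q_model()
target_model.set_weights(q_model.get_weights())

The trade-off of this layout: computing max_a Q(s', a) for the Bellman target needs one network evaluation per action, whereas the more common DQN head takes only the state and returns one Q-value per action in a single forward pass.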

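Inside DQN_training, the Bellman target then has to enumerate the actions explicitly, because each forward pass scores a single state-action pair. Below is a minimal sketch of one batch update under that architecture; the names dqn_update and one_hot are hypothetical helpers, and the replay batch, the build_q_model helper above, and the discount factor of 0.99 are all assumptions for illustration.

import numpy as np

def one_hot(actions, n_actions=4):
    # Integer actions -> one-hot rows for the action input branch.
    return np.eye(n_actions, dtype=np.float32)[actions]

def dqn_update(q_model, target_model, batch, gamma=0.99, n_actions=4):
    states, actions, rewards, next_states, terminated = batch
    batch_size = len(states)

    # Evaluate the target network once per action: repeat every next state
    # n_actions times, pair each copy with one one-hot action, then reshape
    # the flat predictions back to (batch_size, n_actions) and take the max.
    tiled_next = np.repeat(next_states, n_actions, axis=0)
    all_actions = one_hot(np.tile(np.arange(n_actions), batch_size))
    next_q = target_model.predict([tiled_next, all_actions], verbose=0)
    max_next_q = next_q.reshape(batch_size, n_actions).max(axis=1)

    # Standard DQN target, cut off at terminal transitions.
    targets = rewards + gamma * (1.0 - terminated) * max_next_q

    # One gradient step on the online network, fitting only the Q-values
    # of the actions that were actually taken.
    return q_model.train_on_batch([states, one_hot(actions)],
                                  targets.astype(np.float32).reshape(-1, 1))

The target network would then be refreshed periodically, e.g. target_model.set_weights(q_model.get_weights()) every fixed number of updates.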
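For plot_reward, the window that slides one episode at a time is just a length-window_length moving average, which np.convolve gives directly. A minimal sketch, assuming matplotlib is available for plotting:

import numpy as np
import matplotlib.pyplot as plt

def plot_reward(total_reward_per_episode, window_length):
    rewards = np.asarray(total_reward_per_episode, dtype=np.float32)
    episodes = np.arange(1, len(rewards) + 1)

    # Moving average over window_length episodes; 'valid' mode slides the
    # window by one episode and drops the incomplete leading windows.
    kernel = np.ones(window_length) / window_length
    moving_avg = np.convolve(rewards, kernel, mode='valid')

    plt.figure()
    plt.plot(episodes, rewards, alpha=0.4, label='Total reward per episode')
    plt.plot(episodes[window_length - 1:], moving_avg,
             label=f'{window_length}-episode moving average')
    plt.xlabel('Episode')
    plt.ylabel('Total reward')
    plt.legend()
    plt.show()

Because of 'valid' mode, the moving-average curve starts at episode window_length, once a full window of episodes is available.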
