Question

The goal of this assignment is to implement the Q-learning method on the Taxi-v3 environment in the OpenAI Gym framework.
Your task in this environment is to pick up the passenger at one location and drop him off at another; both are among 4 possible locations (labeled by different letters). In the example given below, you are expected to pick him up at Y and drop him off at G. You receive +20 points for a successful drop-off and lose 1 point for every timestep it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.
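To size the Q-table it helps to know how the environment numbers its states. The sketch below assumes the layout described in the Gymnasium documentation for Taxi-v3: a 5x5 grid, 5 passenger locations (four letters plus "in the taxi"), and 4 destinations, giving 500 discrete states, with 6 actions. The helper names `encode_state` and `decode_state` are hypothetical and only illustrate the arithmetic.

```python
# Assumed Taxi-v3 layout: 25 taxi cells x 5 passenger locations x 4 destinations.
LOCATIONS = {"R": 0, "G": 1, "Y": 2, "B": 3, "in_taxi": 4}


def encode_state(taxi_row, taxi_col, pass_loc, dest):
    """Pack the four state components into a single integer index."""
    return ((taxi_row * 5 + taxi_col) * 5 + pass_loc) * 4 + dest


def decode_state(index):
    """Invert encode_state: recover (taxi_row, taxi_col, pass_loc, dest)."""
    index, dest = divmod(index, 4)
    index, pass_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, pass_loc, dest


# Taxi in the top-left cell, passenger waiting at Y, destination G.
state = encode_state(0, 0, LOCATIONS["Y"], LOCATIONS["G"])
print(state, decode_state(state))
```

Under this assumption the largest index is 499, so a table of shape (500, 6) covers every state-action pair.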
Note that the dynamics of the model are assumed to be unknown.
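Because the dynamics are unknown, the agent can only learn from sampled transitions $(s, a, r, s')$. For reference, the standard tabular Q-learning update, written with the step size $\alpha$ and discount factor $\gamma$ used in the original code, is:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

SARSA differs only in the target: it uses $Q(s', a')$ for the action $a'$ the agent actually takes next, instead of the maximum over all actions.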
Below is the original code; implement the Q-learning method accordingly.
import gymnasium as gym
import time
import numpy as np
import os
import random
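def qLearning(env):
    nS = env.observation_space.n
    nA = env.action_space.n
    # Float table: an integer dtype would truncate the fractional updates.
    Q = np.zeros([nS, nA], dtype=np.float64)
    alpha = 0.8    # learning rate
    gamma = 0.9    # discount factor
    epsilon = 1    # exploration rate; 1 means every action is random
    num_iter = 10000
    for i in range(num_iter):
        s, info = env.reset()  # Gymnasium returns (observation, info)
        for step in range(100):
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[s]))
            # Gymnasium's step() returns five values, not four.
            sp, reward, done, truncated, info = env.step(action)
            Q[s, action] = Q[s, action] + alpha * (reward + gamma * np.max(Q[sp, :]) - Q[s, action])
            s = sp
            if done or truncated:
                break
        if i % 1000 == 0:
            print(f"Episode {i}")
    return Q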
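def SARSA(env):
    nS = env.observation_space.n
    nA = env.action_space.n
    # Float table: an integer dtype would truncate the fractional updates.
    Q = np.zeros([nS, nA], dtype=np.float64)
    alpha = 0.8
    gamma = 0.9
    epsilon = 1    # defined but unused: the next action below is chosen greedily
    num_iter = 1000
    for i in range(num_iter):
        S, info = env.reset()  # Gymnasium returns (observation, info)
        a = env.action_space.sample()
        for step in range(100):
            sp, reward, done, truncated, info = env.step(a)
            # Greedy next action, so the target equals the Q-learning target.
            ap = int(np.argmax(Q[sp]))
            Q[S, a] = Q[S, a] + alpha * (reward + gamma * Q[sp, ap] - Q[S, a])
            S = sp
            a = ap
            if done or truncated:
                break
        if i % 1000 == 0:
            print(f"Episode {i}")
    return Q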
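# Train without rendering (a "human" window would slow every training step),
# then open a rendered environment to watch the greedy policy.
train_env = gym.make('Taxi-v3')
Q = SARSA(train_env)  # swap in qLearning(train_env) to use Q-learning
train_env.close()

env = gym.make('Taxi-v3', render_mode="human")
observation, info = env.reset()  # in "human" mode frames are drawn automatically
done = False
truncated = False
sumreward = 0
while not (done or truncated):
    action = int(np.argmax(Q[observation]))
    observation, reward, done, truncated, info = env.step(action)
    sumreward += reward
    time.sleep(0.5)
# Report the accumulated return, not just the last step's reward.
print('done with reward:', sumreward)
env.close()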
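One way the requested Q-learning routine could look is sketched below. It is not the assignment's reference solution. To keep it runnable without Gymnasium installed, it trains on `ChainEnv`, a hypothetical five-state corridor that only mimics the shape of the Gymnasium API (`reset()` returning `(obs, info)`, `step()` returning five values) and reuses the -1 per step / +20 at the goal reward pattern. The decaying epsilon schedule (`eps_decay`, `eps_min`) is an assumption added for illustration; the original code keeps epsilon fixed at 1.

```python
import random


class ChainEnv:
    """Hypothetical 5-state corridor; stands in for Taxi-v3 in this sketch."""
    n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right

    def reset(self):
        self.s = 0
        return self.s, {}

    def step(self, action):
        self.s = max(0, self.s - 1) if action == 0 else self.s + 1
        terminated = self.s == self.n_states - 1
        reward = 20 if terminated else -1
        return self.s, reward, terminated, False, {}


def q_learning(env, n_states, n_actions, episodes=500, max_steps=100,
               alpha=0.8, gamma=0.9, epsilon=1.0, eps_decay=0.99,
               eps_min=0.05, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, _info = env.reset()
        for _ in range(max_steps):
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            sp, r, terminated, truncated, _info = env.step(a)
            # Off-policy target: best next value, or just r at a terminal state.
            target = r if terminated else r + gamma * max(Q[sp])
            Q[s][a] += alpha * (target - Q[s][a])
            s = sp
            if terminated or truncated:
                break
        # Shift gradually from exploration to exploitation.
        epsilon = max(eps_min, epsilon * eps_decay)
    return Q


Q = q_learning(ChainEnv(), ChainEnv.n_states, ChainEnv.n_actions)
greedy = [max(range(ChainEnv.n_actions), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy)  # greedy action per non-terminal state
```

With Gymnasium available, the same function should work on the real task by passing `gym.make("Taxi-v3")` together with `env.observation_space.n` and `env.action_space.n`, since it relies only on the `reset()` and `step()` return shapes.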
