
Question



The goal of this assignment is to implement the Q-learning method on the Taxi-v3 environment in the OpenAI Gym framework.
Your task in this environment is to pick up the passenger at one location and drop him off at another; both are among 4 possible locations (labeled by different letters). In the example given below, you are expected to pick him up at Y and drop him off at G. You receive +20 points for a successful drop-off and lose 1 point for every timestep it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.
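To make the scoring concrete, here is a small sketch that adds up the reward of one episode. The helper name `episode_return` is hypothetical, and it rests on an assumption the question does not spell out: that the final drop-off earns +20 instead of the usual -1, and that each illegal pick-up or drop-off earns -10 instead of -1.

```python
def episode_return(total_actions, illegal_actions=0):
    """Undiscounted return of an episode ending in a successful drop-off.

    Assumption (not stated in the question): the last action earns +20,
    each illegal pick-up/drop-off earns -10, every other action earns -1.
    """
    if total_actions < 1 or not 0 <= illegal_actions < total_actions:
        raise ValueError("need at least one action, and fewer illegal actions than actions")
    ordinary = total_actions - 1 - illegal_actions  # plain -1 timesteps
    return 20 - ordinary - 10 * illegal_actions


print(episode_return(13))                     # 20 - 12 = 8
print(episode_return(13, illegal_actions=2))  # 20 - 10 - 20 = -10
```

Under this assumption, shorter routes and fewer illegal actions both raise the return, which is what the learned policy should converge toward.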
Note that the dynamics of the model are assumed to be unknown.
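Because the dynamics are unknown, the agent cannot plan with transition probabilities and has to learn from sampled transitions $(s, a, r, s')$ alone. For reference, the tabular Q-learning update, written with the same $\alpha$ (learning rate) and $\gamma$ (discount factor) that the code uses, is sketched below, followed by the SARSA update for comparison. In SARSA, $a'$ is the action the agent actually takes next, whereas Q-learning takes the maximum over all next actions.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
\qquad \text{(Q-learning)}

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]
\qquad \text{(SARSA)}
```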
Below is the original code; implement the Q-learning method accordingly:
import gymnasium as gym
import time
import numpy as np
import os
import random

def qLearning(env):
    nS = env.observation_space.n
    nA = env.action_space.n
    Q = np.zeros([nS, nA], dtype=np.int32)
    alpha = 0.8
    gamma = 0.9
    epsilon = 1
    num_iter = 10000
    for i in range(num_iter):
        s, info = env.reset()  # reset() returns (observation, info)
        for step in range(100):
            action = env.action_space.sample()
            #action = np.argmax(Q[s])
            # Gymnasium's step() returns five values, as in SARSA below
            sp, reward, done, truncated, info = env.step(action)
            Q[s, action] = Q[s, action] + alpha * (reward + gamma * np.max(Q[sp, :]) - Q[s, action])
            s = sp  # move to the next state
        if i % 1000 == 0:
            print(f"Episode {i}")
    return Q

def SARSA(env):
    nS = env.observation_space.n
    nA = env.action_space.n
    Q = np.zeros([nS, nA], dtype=np.int32)
    alpha = 0.8
    gamma = 0.9
    epsilon = 1
    num_iter = 1000
    for i in range(num_iter):
        S, info = env.reset()
        a = env.action_space.sample()
        for step in range(100):
            sp, reward, done, truncated, info = env.step(a)
            ap = np.argmax(Q[sp])
            Q[S, a] = Q[S, a] + alpha * (reward + gamma * Q[sp, ap] - Q[S, a])
            S = sp
            a = ap
        if i % 1000 == 0:
            print(f"Episode {i}")
    return Q

env = gym.make('Taxi-v3', render_mode="human")
observation, info = env.reset()
Q = SARSA(env)
observation, info = env.reset()  # keep only the state, not the (state, info) tuple
done = False
sumreward = 0
while not done:
    os.system('cls')
    env.render()
    action = np.argmax(Q[observation])
    observation, reward, done, truncated, info = env.step(action)
    sumreward += reward
    time.sleep(0.5)
    if done:
        observation, info = env.reset()
        print('done with reward:', reward)
env.close()
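One possible way to fill in the Q-learning method is sketched below. It is not the assignment's reference solution. It keeps `alpha`, `gamma`, the episode count and the 100-step cap from the starter code, and it departs from the starter in three ways, all of them assumptions on my part: the Q-table is stored as floats (an `int32` table truncates every update), actions are chosen epsilon-greedily with `epsilon=0.1` (the starter defines `epsilon` but never uses it), and an episode stops as soon as the environment reports termination or truncation. The names `q_learning`, `max_steps` and `seed` are illustrative.

```python
import numpy as np


def q_learning(env, num_episodes=10000, max_steps=100,
               alpha=0.8, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning for an environment with discrete states and actions.

    Assumes the Gymnasium API: reset() -> (obs, info) and
    step(a) -> (obs, reward, terminated, truncated, info).
    """
    rng = np.random.default_rng(seed)
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions), dtype=np.float64)

    for _ in range(num_episodes):
        s, _info = env.reset()
        for _ in range(max_steps):
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            sp, reward, terminated, truncated, _info = env.step(a)
            # Do not bootstrap from a terminal state.
            target = reward + (0.0 if terminated else gamma * np.max(Q[sp]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = sp
            if terminated or truncated:
                break
    return Q


# Usage (requires gymnasium to be installed):
#   import gymnasium as gym
#   Q = q_learning(gym.make("Taxi-v3"))
```

Training without `render_mode="human"` and creating a separate rendering environment only for the final greedy rollout is likely to be much faster, since the human render mode draws every step.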
