Question

Posted on Aug 13, 2024

The goal of this assignment is to implement the Q-learning method on the Taxi-v3 environment in the OpenAI Gym framework.

Your task in this environment is to pick up the passenger at one location and drop him off at another; each is one of 4 possible locations (labeled by different letters). In the example given below, you are expected to pick him up at Y and drop him off at G. You receive +20 points for a successful drop-off and lose 1 point for every timestep it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.
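To make the scoring concrete, the total reward of one episode can be tallied from the three rules above. The helper below is a hypothetical illustration, not part of the assignment code. It assumes, as in Gymnasium's Taxi-v3, that every step yields exactly one reward: +20 for the successful drop-off, -10 for an illegal pick-up or drop-off, and -1 otherwise.

```python
def episode_return(total_steps: int, illegal_actions: int, delivered: bool) -> int:
    """Undiscounted total reward of one Taxi episode (hypothetical helper).

    Assumption: each step yields exactly one reward, so the +20 and the -10
    replace the usual -1 on the steps where they apply.
    """
    special_steps = illegal_actions + (1 if delivered else 0)
    if illegal_actions < 0 or special_steps > total_steps:
        raise ValueError("step counts are inconsistent")
    ordinary_steps = total_steps - special_steps
    return (20 if delivered else 0) - 10 * illegal_actions - ordinary_steps


print(episode_return(13, 0, True))    # 12 ordinary steps, then the drop-off -> 8
print(episode_return(15, 2, True))    # two illegal actions along the way -> -12
print(episode_return(200, 0, False))  # passenger never delivered -> -200
```

Under these rules the shortest legal route gives the highest return, which is the behavior the learned policy should approach.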

Note that the dynamics of the model are assumed to be unknown.
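Because the transition model is unknown, the agent can only learn from sampled transitions $(s, a, r, s')$. For reference, the standard tabular Q-learning update, which is what the update line in the code below appears to implement, can be written as follows. Here $\alpha$ is the learning rate and $\gamma$ the discount factor, corresponding to `alpha` and `gamma` in the code; the symbols $r$ and $s'$ correspond to `reward` and `sp`.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

The maximum over next actions is what makes the method off-policy: the target uses the greedy next action even when the action actually taken was exploratory.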

Below is the original code; implement the Q-learning method accordingly:

import gymnasium as gym
import time
import numpy as np
import os
import random

def qLearning(env):
    nS = env.observation_space.n
    nA = env.action_space.n
    # Float table: an integer dtype would truncate the alpha-weighted updates.
    Q = np.zeros([nS, nA], dtype=np.float64)
    alpha = 0.8
    gamma = 0.9
    epsilon = 1
    num_iter = 10000
    for i in range(num_iter):
        # Gymnasium's reset() returns (observation, info).
        s, info = env.reset()
        for step in range(100):
            action = env.action_space.sample()
            #action = np.argmax(Q[s])
            # Gymnasium's step() returns five values, including `truncated`.
            sp, reward, done, truncated, info = env.step(action)
            Q[s, action] = Q[s, action] + alpha * (reward + gamma * np.max(Q[sp, :]) - Q[s, action])
            s = sp
            # Stop stepping once the episode has ended.
            if done or truncated:
                break
        if i % 1000 == 0:
            print(f"Episode {i}")
    return Q

def SARSA(env):
    nS = env.observation_space.n
    nA = env.action_space.n
    # Float table: an integer dtype would truncate the alpha-weighted updates.
    Q = np.zeros([nS, nA], dtype=np.float64)
    alpha = 0.8
    gamma = 0.9
    epsilon = 1
    num_iter = 1000
    for i in range(num_iter):
        # Gymnasium's reset() returns (observation, info).
        S, info = env.reset()
        a = env.action_space.sample()
        for step in range(100):
            sp, reward, done, truncated, info = env.step(a)
            # Next action is chosen greedily from the current table.
            ap = np.argmax(Q[sp])
            Q[S, a] = Q[S, a] + alpha * (reward + gamma * Q[sp, ap] - Q[S, a])
            S = sp
            a = ap
            # Stop stepping once the episode has ended.
            if done or truncated:
                break
        if i % 1000 == 0:
            print(f"Episode {i}")
    return Q

env = gym.make('Taxi-v3', render_mode="human")
observation, info = env.reset()
Q = SARSA(env)
# Gymnasium's reset() returns (observation, info), so unpack both.
observation, info = env.reset()
done = False
sumreward = 0
while not done:
    os.system('cls')  # clears the console on Windows
    env.render()
    action = np.argmax(Q[observation])
    observation, reward, terminated, truncated, info = env.step(action)
    # Also stop when the time limit truncates the episode.
    done = terminated or truncated
    sumreward += reward
    time.sleep(0.5)
    if done:
        observation, info = env.reset()
        # Report the accumulated episode reward rather than the last step's reward.
        print('done with reward:', sumreward)
env.close()
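One way the requested Q-learning method could be completed is sketched below, as an illustration under stated assumptions rather than the expected solution. It adds epsilon-greedy action selection with a decaying `epsilon`, which the original code declares but never uses. So that the snippet runs without Gymnasium, a tiny hypothetical `Corridor` environment stands in for Taxi-v3; it only mimics the `reset()`/`step()` return shapes and a Taxi-like reward scheme. The names `Corridor`, `q_learning`, `epsilon_min`, and `decay`, and the chosen decay values, are assumptions made for this example.

```python
import random


class Corridor:
    """Hypothetical stand-in environment: states 0..n-1 on a line, goal at the right end.

    Mimics the Gymnasium reset()/step() return shapes; rewards echo Taxi's
    scheme (-1 per step, +20 on reaching the goal).
    """

    def __init__(self, n=6):
        self.n = n
        self.s = 0

    def reset(self):
        self.s = 0
        return self.s, {}

    def step(self, action):  # action 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.s = min(max(self.s + move, 0), self.n - 1)
        terminated = self.s == self.n - 1
        reward = 20 if terminated else -1
        return self.s, reward, terminated, False, {}


def q_learning(env, n_states, n_actions, episodes=500, alpha=0.8, gamma=0.9,
               epsilon=1.0, epsilon_min=0.05, decay=0.99, max_steps=100, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, _info = env.reset()
        for _ in range(max_steps):
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            sp, r, terminated, truncated, _info = env.step(a)
            # Off-policy target: bootstrap from the best next action,
            # but do not bootstrap past a terminal state.
            target = r if terminated else r + gamma * max(Q[sp])
            Q[s][a] += alpha * (target - Q[s][a])
            s = sp
            if terminated or truncated:
                break
        # Shift gradually from exploration to exploitation.
        epsilon = max(epsilon_min, epsilon * decay)
    return Q


Q = q_learning(Corridor(), n_states=6, n_actions=2)
greedy_policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(5)]
print(greedy_policy)  # greedy action per non-terminal state (1 = move right)
```

With Gymnasium installed, the same function should also accept `gym.make('Taxi-v3')` with `n_states=env.observation_space.n` and `n_actions=env.action_space.n`, since it relies only on `reset()` returning `(observation, info)` and `step()` returning five values. Training without `render_mode="human"` is likely to be much faster.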
