Question

What is the learned Q-table for the following code? Please run the code and show the output.

import numpy as np
import matplotlib.pyplot as plt

# Grid world size
WORLD_SIZE = 10
# Percentage of cells occupied by obstacles
OBSTACLE_DENSITY = 0.15
# Learning parameters
ALPHA = 0.5
GAMMA = 0.9
EPSILON = 0.1

def initialize_world():
    # Create empty grid
    world = np.zeros((WORLD_SIZE, WORLD_SIZE))
    # Start cell
    world[0, 0] = 2
    # Goal cell
    world[-1, -1] = 3
    # Add random obstacles, never on the start or goal cell
    num_obstacles = int(OBSTACLE_DENSITY * WORLD_SIZE**2)
    free_cells = list(range(1, WORLD_SIZE**2 - 1))
    obstacle_indices = np.random.choice(free_cells, size=num_obstacles, replace=False)
    for i in obstacle_indices:
        x = i // WORLD_SIZE
        y = i % WORLD_SIZE
        world[x, y] = 1
    return world

def initialize_q_values():
    # Q(s, a) initialized to 0 for all s, a
    q_values = {}
    for x in range(WORLD_SIZE):
        for y in range(WORLD_SIZE):
            for a in range(4):  # up, down, left, right
                q_values[(x, y, a)] = 0.0
    return q_values

def epsilon_greedy(state, q_values, epsilon):
    # With probability epsilon take a random action,
    # otherwise take the greedy action under the current Q values
    if np.random.rand() < epsilon:
        action = np.random.randint(4)
    else:
        values = [q_values[(state[0], state[1], a)] for a in range(4)]
        action = np.argmax(values)
    return action

def update_q_value(state, action, reward, next_state, q_values, alpha, gamma):
    # Q-learning update rule:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    max_q_next = max([q_values[(next_state[0], next_state[1], a)] for a in range(4)])
    q_values[(state[0], state[1], action)] += alpha * (
        reward + gamma * max_q_next - q_values[(state[0], state[1], action)])
    return q_values

def check_goal(state):
    return state == (WORLD_SIZE - 1, WORLD_SIZE - 1)

if __name__ == "__main__":
    # Create world
    world = initialize_world()
    # Initialize Q values
    q_values = initialize_q_values()
    # Track metrics
    steps_per_episode = []
    sse = []
    for episode in range(1000):
        # Reset agent to start position
        state = (0, 0)
        step = 0
        episode_sse = 0
        while not check_goal(state):
            # Choose action using epsilon-greedy
            action = epsilon_greedy(state, q_values, EPSILON)
            # Take action and get reward/next state
            if action == 0:    # up
                next_state = (state[0] - 1, state[1])
            elif action == 1:  # down
                next_state = (state[0] + 1, state[1])
            elif action == 2:  # left
                next_state = (state[0], state[1] - 1)
            else:              # right
                next_state = (state[0], state[1] + 1)
            reward = -0.1
            # Keep the agent on the grid: a move off the edge leaves it in place
            if not (0 <= next_state[0] < WORLD_SIZE and 0 <= next_state[1] < WORLD_SIZE):
                next_state = state
            elif world[next_state] == 1:  # Hit obstacle
                reward = -1
                next_state = state  # Stay in current state
            if check_goal(next_state):
                reward = 10
            # Update Q value
            q_values = update_q_value(state, action, reward, next_state, q_values, ALPHA, GAMMA)
            # Accumulate squared TD error for this step
            episode_sse += (reward
                            + GAMMA * max([q_values[(next_state[0], next_state[1], a)] for a in range(4)])
                            - q_values[(state[0], state[1], action)])**2
            # Update state
            state = next_state
            step += 1
        steps_per_episode.append(step)
        sse.append(episode_sse)

    # Plot results
    plt.figure()
    plt.plot(steps_per_episode)
    plt.xlabel('Episode')
    plt.ylabel('Steps per episode')
    plt.savefig('steps.png')
    # Use a separate figure so the SSE curve is not drawn over the steps plot
    plt.figure()
    plt.plot(sse)
    plt.xlabel('Episode')
    plt.ylabel('Sum squared error')
    plt.savefig('sse.png')

    # Print learned policy
    policy = {}
    for x in range(WORLD_SIZE):
        for y in range(WORLD_SIZE):
            values = [q_values[(x, y, a)] for a in range(4)]
            policy[(x, y)] = np.argmax(values)
    print("Learned Optimal Policy:")
    print(policy)

    # Print the learned Q-table
    print("Learned Q-table:")
    print(q_values)
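
Note that the learned Q-table is not a single fixed answer: the obstacle layout and the epsilon-greedy exploration are random, so the final Q values differ on every run unless the random seed is fixed. As a rough check on the magnitudes, the very first update from the start cell into a free neighbour gives Q = 0 + 0.5 * (-0.1 + 0.9 * 0 - 0) = -0.05, and the value of the action that steps into the goal approaches the +10 reward as training progresses. Below is a minimal sketch of how one might make a run reproducible and print the Q-table in a more compact form; the seed value and the per-cell "best action value" grid are illustrative choices, not part of the original script, and the snippet assumes it is appended to the end of the __main__ block above so that WORLD_SIZE and q_values are in scope.

    # Fix the seed (place np.random.seed(0) before initialize_world() for it
    # to affect the obstacle layout and exploration of the run)

    # Collapse the Q-table to one number per cell: the value of the best action
    best_q = np.zeros((WORLD_SIZE, WORLD_SIZE))
    for x in range(WORLD_SIZE):
        for y in range(WORLD_SIZE):
            best_q[x, y] = max(q_values[(x, y, a)] for a in range(4))

    np.set_printoptions(precision=2, suppress=True)
    print("Max Q value per cell:")
    print(best_q)

With a fixed seed, re-running the script prints the same 400-entry q_values dictionary (100 states x 4 actions) and the same 10x10 grid each time, which makes the "show the output" part of the question well defined.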
