Show your goal-searching process with a step-to-go curve, sum of squared error, and/or theoretical value table, with diagrams, graphs, and tables, for the Q-learning grid-world code below


Question


Posted on Sep 23, 2024

Show your goal-searching process with a step-to-go curve, sum of squared error, and/or theoretical value table, with diagrams, graphs, and tables, for the following code:

import numpy as np
import random

# Define the grid world
GRID_SIZE = (4, 5)
START_STATE = (0, 0)
GOAL_STATE = (3, 4)
OBSTACLES = [(1, 1), (2, 2), (1, 3)]

# Q-learning parameters
LEARNING_RATE = 0.1
DISCOUNT_FACTOR = 0.9
EPISODES = 500

# Initialize Q-table
q_table = np.zeros((GRID_SIZE[0], GRID_SIZE[1], 4))  # 4 actions: up, down, left, right

# Define actions
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]

# Function to choose an action using epsilon-greedy strategy
def choose_action(state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return random.choice(range(4))  # choose a random action
    else:
        return np.argmax(q_table[state[0], state[1]])

# Function to perform Q-learning
def q_learning():
    for episode in range(EPISODES):
        state = START_STATE
        while state != GOAL_STATE:
            action = choose_action(state, epsilon=0.1)
            next_state = take_action(state, action)
            reward = calculate_reward(next_state)
            update_q_table(state, action, reward, next_state)
            state = next_state

# Function to take an action and return the next state
def take_action(state, action):
    if action == 0:  # UP
        return (max(0, state[0] - 1), state[1])
    elif action == 1:  # DOWN
        return (min(GRID_SIZE[0] - 1, state[0] + 1), state[1])
    elif action == 2:  # LEFT
        return (state[0], max(0, state[1] - 1))
    elif action == 3:  # RIGHT
        return (state[0], min(GRID_SIZE[1] - 1, state[1] + 1))

# Function to calculate the reward for a given state
def calculate_reward(state):
    if state == GOAL_STATE:
        return 1
    elif state in OBSTACLES:
        return -1
    else:
        return 0

# Function to update the Q-table based on the Q-learning update rule
def update_q_table(state, action, reward, next_state):
    best_future_value = np.max(q_table[next_state[0], next_state[1]])
    current_value = q_table[state[0], state[1], action]
    new_value = (1 - LEARNING_RATE) * current_value + LEARNING_RATE * (reward + DISCOUNT_FACTOR * best_future_value)
    q_table[state[0], state[1], action] = new_value

# Run Q-learning algorithm
q_learning()

# Print the learned Q-table
print("Learned Q-table:")
print(q_table)
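One possible way to produce the three requested artifacts is sketched below. It is not an official solution, only an illustration under stated assumptions: the goal is treated as terminal (its Q-values stay 0, as in the code above, which never updates them), obstacles can be entered at a reward of -1, and the theoretical table is taken to be the optimal Q* obtained by value iteration. The names `step`, `optimal_q`, `sse`, and `train` are hypothetical helpers, and plain lists replace the NumPy array so the sketch runs with the standard library alone.

```python
import random

GRID_SIZE = (4, 5)
START_STATE = (0, 0)
GOAL_STATE = (3, 4)
OBSTACLES = [(1, 1), (2, 2), (1, 3)]
ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.9, 0.1, 500
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
STATES = [(r, c) for r in range(GRID_SIZE[0]) for c in range(GRID_SIZE[1])]


def step(state, action):
    """Deterministic move clipped to the grid, plus the reward for the cell entered."""
    d_row, d_col = MOVES[action]
    row = min(max(state[0] + d_row, 0), GRID_SIZE[0] - 1)
    col = min(max(state[1] + d_col, 0), GRID_SIZE[1] - 1)
    nxt = (row, col)
    reward = 1 if nxt == GOAL_STATE else -1 if nxt in OBSTACLES else 0
    return nxt, reward


def optimal_q(tol=1e-12):
    """Theoretical Q* by value iteration (assumption: the goal is terminal)."""
    q = {s: [0.0] * 4 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s == GOAL_STATE:
                continue  # terminal state keeps Q = 0
            for a in range(4):
                nxt, reward = step(s, a)
                target = reward + GAMMA * max(q[nxt])
                delta = max(delta, abs(target - q[s][a]))
                q[s][a] = target
        if delta < tol:
            return q


def sse(q, q_star):
    """Sum of squared errors between a learned table and the theoretical one."""
    return sum((q[s][a] - q_star[s][a]) ** 2 for s in STATES for a in range(4))


def train(q_star, seed=0):
    """Same update rule as the question's code, logging steps and SSE per episode."""
    rng = random.Random(seed)
    q = {s: [0.0] * 4 for s in STATES}
    steps_log, sse_log = [], []
    for _ in range(EPISODES):
        state, steps = START_STATE, 0
        while state != GOAL_STATE:
            if rng.random() < EPSILON:
                action = rng.randrange(4)  # explore
            else:
                action = q[state].index(max(q[state]))  # first maximum, like np.argmax
            nxt, reward = step(state, action)
            q[state][action] += ALPHA * (reward + GAMMA * max(q[nxt]) - q[state][action])
            state, steps = nxt, steps + 1
        steps_log.append(steps)          # data for the step-to-go curve
        sse_log.append(sse(q, q_star))   # data for the SSE curve
    return q, steps_log, sse_log


q_star = optimal_q()
q, steps_log, sse_log = train(q_star)

print("Theoretical state values V*(s) = max_a Q*(s, a):")
for r in range(GRID_SIZE[0]):
    print(" ".join(f"{max(q_star[(r, c)]):6.3f}" for c in range(GRID_SIZE[1])))
print("Steps in first 5 episodes:", steps_log[:5])
print("Steps in last 5 episodes:", steps_log[-5:])
print("SSE after first and last episode:", sse_log[0], sse_log[-1])
```

The shortest obstacle-free path from (0, 0) to (3, 4) takes 7 moves and only the last one is rewarded, so the theoretical start value is 0.9^6 ≈ 0.531. `steps_log` and `sse_log` are plain lists indexed by episode and can be passed to any plotting library to draw the two curves; no plotting call is included because the original code does not import one. State-action pairs that are never visited keep a value of 0, so the SSE may not reach zero within 500 episodes.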
