
Question

1.2 Reward Functions (20 pts)

For this problem, consider the MDP shown in Figure 1. The numbers in each square represent the reward the agent receives for entering the square. In the event the agent bumps into a wall and stays put, this also counts as entering the square, and the agent receives the reward. When the agent enters the terminal state it will then, no matter what action is taken, transition to the absorbing state and receive a reward of 0 every time-step afterward, regardless of the reward function.

[Figure 1: Gridworld MDP. Each square is labeled with a reward of -.04; one square is marked "Terminal" (it leads to the absorbing state) and another is marked "START". The action diagram shows an 80% probability of taking the desired action and a 10% probability of "strafing" left (or right) instead. If you bump into a wall, you just stay where you were.]

1.3 Choosing a reward function

For the MDP in Figure 1, consider replacing the reward function with one of the following:

1. Reward 0 everywhere
2. Reward 1 everywhere
3. Reward 0 everywhere except (0, 3), where the reward is 1
4. Reward -1 everywhere except (0, 3), where the reward is 1
5. Reward 0 everywhere except (0, 3), where the reward is 1000, and (3, 1), where the reward is 999
6. Reward 1 everywhere except (3, 1), where the reward is -1

1.3.1 Behavior of Reward Functions (15 pts)

For each of the above reward functions, assume the agent has an optimal policy. Describe the behavior of an agent that follows an optimal policy under each reward function. Some things you might want to consider: Does the agent head to a terminal state? Does it try to get somewhere as fast as possible? Will it avoid terminal states? Recall that the discount factor is γ = 1 and an optimal policy is any policy that maximizes the expected discounted sum of rewards.

1. Answer:
2. Answer:
3. Answer:
4. Answer:
5. Answer:
6. Answer:

1.3.2 Creating Reward Functions (5 pts)

Designing reward functions can be tricky and needs to be done carefully so the agent doesn't learn a policy that has undesired behaviour. Create a reward function that incentivizes the agent to navigate to state (3, 1) but avoid the state (2, 2). Keep in mind that (3, 0) is still a terminal state.

Answer:
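For 1.3.1, and for checking any reward function you design in 1.3.2, the induced behavior can be sanity-checked numerically. Below is a minimal sketch, not an official solution, assuming a 4x4 grid with (row, col) coordinates and no interior walls (the actual layout of Figure 1 may differ); the terminal state (3, 0), the 80/10/10 transition model, the reward-on-entering convention, and γ = 1 are taken from the problem statement.

```python
import numpy as np

# Minimal value-iteration sketch for probing the candidate reward functions.
# ASSUMPTIONS for illustration: a 4x4 grid, (row, col) coordinates, and no
# interior walls. The terminal state (3, 0), the 80/10/10 transition model,
# rewards received on *entering* a square, and gamma = 1 follow the problem.

ROWS, COLS = 4, 4
TERMINAL = (3, 0)
GAMMA = 1.0
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def strafes(a):
    """Perpendicular 'strafe' moves for an intended action."""
    return [(a[1], a[0]), (-a[1], -a[0])]

def step(state, move):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    r, c = state[0] + move[0], state[1] + move[1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state

def backups(reward, sweeps=50):
    """Finite-horizon Bellman backups; returns the state-value estimates."""
    V = np.zeros((ROWS, COLS))
    for _ in range(sweeps):
        new_V = np.zeros_like(V)
        for r in range(ROWS):
            for c in range(COLS):
                if (r, c) == TERMINAL:
                    continue  # absorbing afterwards: value stays 0
                qs = []
                for a in ACTIONS:
                    outcomes = [(a, 0.8)] + [(s, 0.1) for s in strafes(a)]
                    q = sum(p * (reward[step((r, c), m)] + GAMMA * V[step((r, c), m)])
                            for m, p in outcomes)
                    qs.append(q)
                new_V[r, c] = max(qs)
        V = new_V
    return V

# Example: candidate reward function 4 (-1 everywhere, +1 at (0, 3)).
R = np.full((ROWS, COLS), -1.0)
R[0, 3] = 1.0
print(np.round(backups(R), 2))
```

A fixed number of backups is used rather than iterating to convergence because, with γ = 1, several of the candidate reward functions make the state values grow without bound; noticing which ones do is itself a useful clue for 1.3.1.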
