Python help: Question 3
Policy Iteration vs Value Iteration
In Part 3, you will compare the convergence time of policy iteration and value iteration. Both techniques are guaranteed to converge to the optimal policy in a finite number of steps, but we will see that the number of steps required can be considerably different.
3.A - Create Environment
Create a 30x30 instance of the FrozenPlatform environment with sp_range=[0,0.2], a start position of 1 (which is the default), no holes, and with random_state=1. You do not need to display the environment.
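The constructor call for this step can be sketched as follows, mirroring the argument names that appear in the 3.D starter code. FrozenPlatform itself comes from the course materials; a minimal stand-in class is defined here only so the snippet runs on its own, and the start position is left at its default of 1, as the instructions allow.

```python
# Minimal stand-in for the course's FrozenPlatform class, defined only
# so this sketch is self-contained. In the notebook, use the real class.
class FrozenPlatform:
    def __init__(self, rows, cols, sp_range, holes, random_state):
        self.rows, self.cols = rows, cols
        self.sp_range, self.holes = sp_range, holes
        self.random_state = random_state

# 3.A: 30x30 environment, sp_range=[0, 0.2], no holes, random_state=1.
# The start position is not set explicitly because 1 is the default.
fp = FrozenPlatform(rows=30, cols=30, sp_range=[0, 0.2],
                    holes=0, random_state=1)
```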
3.B - Policy Iteration
Create an instance of the DPAgent class for the environment created in Step 3.A. Set gamma=1 and random_state=1. Run policy iteration with the default parameters.
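A sketch of this step, using the DPAgent call signature shown in the 3.D starter code. DPAgent and the 3.A environment come from the course materials; minimal stand-ins are defined here only so the snippet runs on its own.

```python
# Minimal stand-in for the course's DPAgent class, defined only so this
# sketch is self-contained. In the notebook, use the real class.
class DPAgent:
    def __init__(self, env, gamma, random_state):
        self.env, self.gamma, self.random_state = env, gamma, random_state

    def policy_iteration(self, report=True):
        pass  # the real method alternates policy evaluation and improvement

fp = object()  # stand-in for the environment created in Step 3.A

# 3.B: policy iteration with gamma=1, random_state=1, default parameters
pi_agent = DPAgent(env=fp, gamma=1, random_state=1)
pi_agent.policy_iteration()
```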
3.C - Value Iteration
Create another instance of the DPAgent class for the environment created in Step 3.A. Set gamma=1 and random_state=1. Run value iteration with the default parameters.
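This step parallels 3.B, calling value_iteration instead. As above, DPAgent and the 3.A environment come from the course materials; minimal stand-ins are defined here only so the snippet runs on its own.

```python
# Minimal stand-in for the course's DPAgent class, defined only so this
# sketch is self-contained. In the notebook, use the real class.
class DPAgent:
    def __init__(self, env, gamma, random_state):
        self.env, self.gamma, self.random_state = env, gamma, random_state

    def value_iteration(self, report=True):
        pass  # the real method repeatedly applies the Bellman optimality update

fp = object()  # stand-in for the environment created in Step 3.A

# 3.C: value iteration with gamma=1, random_state=1, default parameters
vi_agent = DPAgent(env=fp, gamma=1, random_state=1)
vi_agent.value_iteration()
```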
3.D - Algorithm Comparison
In the previous steps, you should have noticed that policy iteration had a considerably longer runtime than value iteration. You will now explore how these runtimes depend on environment size.
Starter code has been provided for this step. The code uses a loop to create FrozenPlatform environments of size 2x2, 3x3, 4x4, and so on up to 25x25. Both policy iteration and value iteration will be applied to each environment. The time() function from the time module will be used to measure the runtime of each algorithm, storing the results in two separate lists.
After the loop is complete, the cell should output the following two messages, with the blanks filled in with the appropriate values, rounded to 4 decimal places.
%%time
# 3.D
rng = range(______, ______)
pol_iter_times = []
val_iter_times = []
#np.random.seed(1)
for i in tqdm(rng):
    temp_fp = FrozenPlatform(
        rows=______, cols=______, sp_range=[0.1, 0.4], holes=0, random_state=i
    )

    t0 = time.time()
    temp_dp = DPAgent(env=temp_fp, gamma=1, random_state=i)
    temp_dp.policy_iteration(report=False)
    delta_t = time.time() - t0
    ______.append(delta_t)

    t0 = time.time()
    temp_dp = DPAgent(env=temp_fp, gamma=1, random_state=i)
    temp_dp.value_iteration(report=False)
    delta_t = time.time() - t0
    ______.append(delta_t)

print(f'Average time for policy iteration: {np.mean(______):.4f}')
print(f'Average time for value iteration: {np.mean(______):.4f}')
Visualizing Results
Use Matplotlib to create a figure with two line plots on a single axis. The y-values for each line plot should come from the runtime lists created in the previous cell. The x-values should be the associated environment sizes (2 through 25). Create the figure according to the following specifications.
Set the figsize to [6,3].
The title should read "Runtime Comparison".
The x and y axes should be labeled "Environment Size" and "Runtime (in seconds)", respectively.
Add a legend with labels "Policy Iteration" and "Value Iteration" to explain which line corresponds to which algorithm.
Add a grid to your plot.
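The figure described above can be sketched as follows. The pol_iter_times and val_iter_times lists would come from the 3.D loop; short placeholder lists are used here so the snippet runs on its own.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Placeholder runtimes; replace with the lists produced by the 3.D loop.
sizes = range(2, 26)  # environment sizes 2 through 25
pol_iter_times = [0.10 * n for n in sizes]  # placeholder data
val_iter_times = [0.02 * n for n in sizes]  # placeholder data

plt.figure(figsize=[6, 3])
plt.plot(sizes, pol_iter_times, label='Policy Iteration')
plt.plot(sizes, val_iter_times, label='Value Iteration')
plt.title('Runtime Comparison')
plt.xlabel('Environment Size')
plt.ylabel('Runtime (in seconds)')
plt.legend()
plt.grid()
plt.show()
```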
