Question

The parameters of the model capture the probability distributions of the state transitions and rewards. In this section, you will compute the optimal policy for Problem 3 assuming that these parameters are known.
Deliverables:
1. (15 marks) Implement value iteration to solve the Bellman optimality equation for Problem 3. You
MUST implement this code in the function value_iteration() in the Python script planning.py that
is provided to you. The function MUST return the optimal value function and the optimal policy.
You MUST NOT include anything in report.docx for this part. (A generic sketch of the value-iteration loop appears after this list.)
2. (15 marks) Implement policy iteration to solve the Bellman optimality equation for Problem 3. You
MUST implement this code in the function policy_iteration() in the Python script planning.py that
is provided to you. The function MUST return the optimal value function and the optimal policy.
You MUST NOT include anything in report.docx for this part. (A generic sketch of policy iteration appears after the notes below.)
3. Using the code for value and policy iteration, answer the following questions. For all these
questions, your answers MUST be in report.docx ONLY. Any code that you write to answer these
questions MUST NOT appear in planning.py or report.docx.
a. (2 marks) Compute the average computation time of value iteration and policy iteration over
5 runs. For each run, you must use a different set of system parameters. You can import the
time library and use the time.time() function for this question. Which approach is faster?
(A sketch of such a timing loop appears after this list.)
b. (2 marks) It seems intuitive that the agent should save more non-renewable energy in the
battery when the non-renewable energy production is high compared to when it is low.
Document the necessary analysis (maybe plots) in report.docx that either validates this
intuition or negates it. Use your imagination to decide what analysis you want to do.
c. (3 marks) In Problem 3, the power plant has an additional constraint compared to Problem
2. Hence, it seems intuitive that the discounted cost for Problem 3 should be higher than for
Problem 2. Document the necessary analysis (maybe plots) in report.docx that either validates
this intuition or negates it. Use your imagination to decide what analysis you want to do.
d. (3 marks) It seems intuitive that the battery becomes more important when the renewable
energy is highly fluctuating (higher standard deviation). This is because the job of the
battery is to smooth out the fluctuations in renewable energy production (as mentioned
in the introduction section). Hence, the difference between the discounted costs of
Problems 1 and 2 should increase as the standard deviation of the renewable energy production
increases. Document the necessary analysis (maybe plots) in report.docx that either validates this intuition or negates
it. Use your imagination to decide what analysis you want to do.
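
Since the skeleton in planning.py is not reproduced here, the snippet below is only a minimal, generic value-iteration sketch under my own assumptions: a transition tensor P of shape (S, A, S) with P[s, a, s'] = Pr(s' | s, a), a one-step cost table C of shape (S, A), a discount factor gamma, and every action feasible in every state. The actual value_iteration() must use the arguments of the provided skeleton and respect the state-dependent action spaces described in the notes further down.

```python
import numpy as np

def value_iteration_sketch(P, C, gamma, tol=1e-8, max_iter=10_000):
    """Generic value iteration for a finite cost-minimising MDP (illustrative only)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # Q[s, a] = C[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s'];
        # the matrix product below computes this for all (s, a) pairs at once.
        Q = C + gamma * (P @ V)
        V_new = Q.min(axis=1)                  # Bellman optimality backup (costs -> min)
        if np.max(np.abs(V_new - V)) < tol:    # sup-norm stopping rule
            V = V_new
            break
        V = V_new
    # Greedy policy extracted from the converged value function.
    Q = C + gamma * (P @ V)
    return V, Q.argmin(axis=1)
```

Convergence is declared when the sup-norm change in V falls below tol, after which the greedy policy is read off the final Q-table.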
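For question 3a, a minimal timing pattern with time.time() might look like the following. The random MDP generator here is purely illustrative and stands in for however you draw a fresh set of system parameters for each of the 5 runs; it reuses value_iteration_sketch from the block above, and the same pattern applies to policy iteration.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 10, 4, 0.95                      # illustrative sizes only

def random_mdp():
    """Draws a random transition tensor and cost table (stand-in for real parameters)."""
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)          # normalise each row into a distribution
    C = rng.random((S, A))
    return P, C

vi_times = []
for run in range(5):
    P, C = random_mdp()
    t0 = time.time()
    value_iteration_sketch(P, C, gamma)        # defined in the sketch above
    vi_times.append(time.time() - t0)
print("mean value-iteration time over 5 runs: %.4f s" % np.mean(vi_times))
```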
Read these points before attempting the deliverables:
1. For all the above deliverables, use the default parameter values of 10, 2, and 0.95 unless mentioned
otherwise. The standard deviation should be varied between 0.25 and 4.
2. In the folder titled data, you have eight transition probability matrices that you can use to test
your code. Instructions for loading these matrices are given in planning.py.
3. The function prob_vector_generator() can be used to obtain a probability distribution with
a pre-specified mean and standard deviation. Instructions for using this function are given in
planning.py.
4. For a given set of system parameters, the optimal value functions obtained using value and policy
iteration should be almost the same. Otherwise, there is something wrong with your code.
5. planning.py is just skeleton code. You have to add the necessary arguments to both
value_iteration() and policy_iteration(). You may also add extra functions to planning.py if that
helps improve the code.
6. While coding the Q-function, you may want to use NumPy's broadcasting to speed up
your computation. On my laptop, one run of policy iteration took no more than 5 minutes.
You can use this information to benchmark your code.
7. To answer 3c and 3d, you do not have to code value iteration and policy iteration for Problems 1
and 2 separately. By setting one parameter to 0 and another equal to a second parameter, Problem 3 can be converted to Problem 1.
Similarly, by setting the corresponding pair of parameters equal, Problem 3 can be converted to Problem 2.
8. You must take into consideration that different states may have different action spaces. This means
a few things. First, while implementing value/policy iteration, the maximization/minimization should be taken over
the action space of the corresponding state. Second, for policy iteration, the policy should be
initialized such that the action assigned to each state lies in that state's action space.
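
As a companion to the notes above (especially 6 and 8), here is a minimal policy-iteration sketch that builds the full Q-table with NumPy broadcasting and masks out infeasible actions. The feasible argument, as well as the shapes of P and C, are my own assumptions about how the data might be represented; the actual policy_iteration() must follow the skeleton provided in planning.py.

```python
import numpy as np

def policy_iteration_sketch(P, C, gamma, feasible, max_iter=1000):
    """Generic policy iteration with state-dependent action spaces (illustrative only)."""
    S, A, _ = P.shape

    # Infeasible (s, a) pairs get +inf cost so they can never be selected (note 8).
    mask = np.full((S, A), np.inf)
    for s in range(S):
        mask[s, feasible[s]] = 0.0

    # Initialise the policy with a feasible action for every state (note 8).
    policy = np.array([feasible[s][0] for s in range(S)])

    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma * P_pi) V = C_pi exactly.
        P_pi = P[np.arange(S), policy]         # (S, S) transition matrix under the policy
        C_pi = C[np.arange(S), policy]         # (S,) cost vector under the policy
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, C_pi)

        # Policy improvement: broadcast the Q-table in one shot (note 6),
        # restricting the minimisation to each state's feasible actions.
        Q = C + mask + gamma * (P @ V)
        new_policy = Q.argmin(axis=1)
        if np.array_equal(new_policy, policy):
            break                              # policy is stable
        policy = new_policy
    return V, policy
```

Policy evaluation here solves the linear system exactly; an iterative evaluation would also work and may be preferable for larger state spaces.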
