The parameters capture the probability distributions of the state transitions and rewards. In this section, you will compute the optimal policy for the problem, assuming that these parameters are known.
Deliverables:
[… marks] Implement value iteration to solve the Bellman optimality equation for the problem. You MUST implement this code in the function valueiteration in the Python script planning.py that is provided to you. The function MUST return the optimal value function and the optimal policy. You MUST NOT include anything in report.docx for this part.
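For orientation only, here is a minimal sketch of value iteration under assumed data layouts: P is an (S, A, S) array of transition probabilities, C is an (S, A) array of expected stage costs, and gamma is the discount factor. The actual signature of valueiteration and the data layout in planning.py may differ, and this sketch ignores state-dependent action spaces (see the points after the deliverables).

    import numpy as np

    def value_iteration_sketch(P, C, gamma, tol=1e-8, max_iter=10_000):
        # P: (S, A, S) transition probabilities, C: (S, A) expected stage costs.
        S, A, _ = P.shape
        V = np.zeros(S)
        for _ in range(max_iter):
            # Q[s, a] = C[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
            Q = C + gamma * (P @ V)        # matmul broadcasting: result has shape (S, A)
            V_new = Q.min(axis=1)          # minimise the discounted cost over actions
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        policy = (C + gamma * (P @ V)).argmin(axis=1)   # greedy policy w.r.t. the final V
        return V, policy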
[… marks] Implement policy iteration to solve the Bellman optimality equation for the problem. You MUST implement this code in the function policyiteration in the Python script planning.py that is provided to you. The function MUST return the optimal value function and the optimal policy. You MUST NOT include anything in report.docx for this part.
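Similarly, a minimal sketch of policy iteration under the same assumed (S, A, S) / (S, A) layout. The evaluation step (direct linear solve vs. iterative evaluation) and the initial policy are choices left to you; the initialisation below assumes action 0 is admissible in every state, which no longer holds once state-dependent action spaces are taken into account.

    import numpy as np

    def policy_iteration_sketch(P, C, gamma):
        # P: (S, A, S) transition probabilities, C: (S, A) expected stage costs.
        S, A, _ = P.shape
        policy = np.zeros(S, dtype=int)        # assumes action 0 is valid everywhere
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) V = C_pi exactly.
            P_pi = P[np.arange(S), policy]     # (S, S) transitions under the current policy
            C_pi = C[np.arange(S), policy]     # (S,)  costs under the current policy
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, C_pi)
            # Policy improvement: act greedily with respect to V.
            Q = C + gamma * (P @ V)
            new_policy = Q.argmin(axis=1)
            if np.array_equal(new_policy, policy):
                return V, new_policy
            policy = new_policy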
Using the code for value and policy iteration, answer the following questions. For all of these questions, your answer MUST be in report.docx ONLY. Any code that you write to answer these questions MUST NOT be included in planning.py or report.docx.
(a) [… marks] Compute the average computation time of value iteration and policy iteration over […] runs. For each run, you must use different values of […]. You can import the time library and use the time.time function for this question. Which approach is faster?
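To illustrate what (a) is asking for, a timing helper along the following lines could be used. The solver and its per-run argument tuples are placeholders for however you end up calling valueiteration and policyiteration with different parameter values each run.

    import time

    def average_runtime(solver, param_sets):
        # param_sets: one tuple of arguments per run, each with different parameter values.
        elapsed = []
        for args in param_sets:
            start = time.time()
            solver(*args)
            elapsed.append(time.time() - start)
        return sum(elapsed) / len(elapsed)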
(b) [… marks] It seems intuitive that the agent should save more nonrenewable energy in the battery when the nonrenewable energy production is high compared to when it is low. Document the necessary analysis (maybe plots) in report.docx that either validates this intuition or negates it. Use your imagination to decide what analysis you want to do.
(c) [… marks] In Problem […], the power plant has an additional constraint compared to Problem […]. Hence, it seems intuitive that the discounted cost for Problem […] should be more than for Problem […]. Document the necessary analysis (maybe plots) in report.docx that either validates this intuition or negates it. Use your imagination to decide what analysis you want to do.
(d) [… marks] It seems intuitive that the battery becomes more important when the renewable energy is highly fluctuating (higher standard deviation). This is because the job of the battery is to smooth out the fluctuations in renewable energy production, as mentioned in the introduction section. Hence, the difference between the discounted costs of Problems […] and […] should increase as the standard deviation of […] increases. Document the necessary analysis (maybe plots) in report.docx that either validates this intuition or negates it. Use your imagination to decide what analysis you want to do.
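For questions (b)-(d), the analysis typically boils down to sweeping one parameter, re-solving the MDP, and plotting a summary quantity. A generic sketch is below; solve_mdp and summary are hypothetical callables standing in for whatever setup you build around planning.py.

    import matplotlib.pyplot as plt

    def sweep_and_plot(param_values, solve_mdp, summary, xlabel, ylabel):
        # solve_mdp(p) returns (V, policy) for parameter value p;
        # summary(V, policy) reduces that solution to one number worth plotting.
        ys = [summary(*solve_mdp(p)) for p in param_values]
        plt.plot(param_values, ys, marker="o")
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.grid(True)
        plt.show()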
Read these points before attempting the deliverables:
For all the above deliverables, set […] and […] unless mentioned otherwise. […] should be varied between […] and […].
In the folder titled data, you have eight transition probability matrices that you can use to test your code. The instructions to load these matrices are in planning.py.
The function probvectorgenerator can be used to get a probability distribution that has a prespecified mean and standard deviation. The instructions to use this function are in planning.py.
For a given set of system parameters, the optimal value functions obtained using value and policy iteration should be almost the same. Otherwise, there is something wrong with your code.
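A quick way to carry out this sanity check, assuming both of your functions return the value function as a NumPy array:

    import numpy as np

    def values_agree(V_from_value_iteration, V_from_policy_iteration, atol=1e-4):
        # The two optimal value functions should match up to numerical tolerance.
        return np.allclose(V_from_value_iteration, V_from_policy_iteration, atol=atol)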
planning.py is just skeleton code. You have to add the necessary arguments to both valueiteration and policyiteration. You may also add additional functions in planning.py if it helps improve the code.
While coding the Q-function, you may want to use the broadcasting operations of NumPy to speed up your computation. On my laptop, one run of policy iteration was not taking more than […] minutes. You can use this information to benchmark your code.
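As an illustration of the broadcasting idea (the (S, A, S) / (S, A) array shapes are an assumption about how you store the data), the whole Q-function can be computed in one vectorised step:

    import numpy as np

    def q_from_value(P, C, gamma, V):
        # Q[s, a] = C[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s'] for all (s, a) at once.
        return C + gamma * np.einsum("sat,t->sa", P, V)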
To answer (c) and (d), you do not have to code value iteration and policy iteration for Problems […] and […] separately. By setting […] and […], we can convert Problem […] to Problem […]. Similarly, by setting […], we can convert Problem […] to Problem […].
You must take into consideration that different states may have different action spaces. This means a few things. First, while implementing value/policy iteration, the maxima/minima should be taken over the action space corresponding to the state. Second, for policy iteration, the policy should be initialized such that the action chosen for each state lies in the action space of that state.
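One common way to handle state-dependent action spaces is to keep the (S, A) arrays full-sized and mask invalid actions, for example as below. The boolean valid mask is a hypothetical input whose construction depends on how the action spaces are defined in your problem.

    import numpy as np

    def masked_greedy(Q, valid):
        # Q: (S, A) action costs; valid: (S, A) boolean mask of admissible actions.
        Q_masked = np.where(valid, Q, np.inf)   # invalid actions can never be chosen
        return Q_masked.min(axis=1), Q_masked.argmin(axis=1)

    def initial_valid_policy(valid):
        # argmax over a boolean row returns the index of its first True entry,
        # i.e. the first admissible action in each state (assumes every state
        # has at least one admissible action).
        return valid.argmax(axis=1)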