Question: We consider the use of a single-hidden-layer neural network for representing a stochastic policy or a value function in RL.

We consider the use of a single-hidden-layer neural network for representing a stochastic policy or a value function in RL. The inputs to the neural network are features of the state. The weights of the neural network are the parameters to be updated. There is one output corresponding to each action in the control setting, and a single output in the prediction setting. For reference, a pictorial representation of the neural network is shown below, and the notation is explained thereafter.

[Figure: a fully connected network with an input layer (node $i$, input $x(s, i)$), a hidden layer (node $j$, weights $w_{ij}$), and an output layer (node $k$, output $\mathrm{out}_K(s, k)$).]

Let there be $I$, $J$, and $K$ nodes in the input, hidden, and output layers, respectively. For convenience, define $[I] = \{0, 1, \dots, I-1\}$, $[J] = \{0, 1, \dots, J-1\}$, and $[K] = \{0, 1, \dots, K-1\}$. In the figure, $I = 3$, $J = 4$, $K = 3$.

For state $s$ and $i \in [I]$, let $x(s, i)$ denote the $i$-th feature of state $s$. The input layer merely passes on its input to its output. Hence, for state $s$ and $i \in [I]$, we have $\mathrm{out}_I(s, i) = x(s, i)$.

Every $j$-th node in the hidden layer linearly combines the outputs of the input layer, and passes the weighted sum through the sigmoid function $\sigma(\beta) = \frac{1}{1 + e^{-\beta}}$ for $\beta \in \mathbb{R}$. Thus, for state $s$ and $j \in [J]$,

$$\mathrm{out}_J(s, j) = \sigma\Big(\sum_{i \in [I]} w_{ij}\, \mathrm{out}_I(s, i)\Big).$$

Observe that there is a weight $w_{ij}$ connecting each input node $i \in [I]$ with each hidden node $j \in [J]$.

Every $k$-th node in the output layer corresponds to an action (hence take the set of actions as $[K]$). Node $k \in [K]$ linearly combines the outputs of the hidden layer. Thus, for state $s$ and $k \in [K]$,

$$\mathrm{out}_K(s, k) = \sum_{j \in [J]} w_{jk}\, \mathrm{out}_J(s, j).$$

Again, there is a weight $w_{jk}$ connecting each hidden node $j \in [J]$ with each output node $k \in [K]$. For uniformity of notation, think of the prediction task as having a single action (implemented by taking $K = 1$).
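The forward pass described above can be sketched in Python as follows. This is only an illustrative sketch, not part of the original question: it assumes the features of a state are passed in as a plain list `x` (rather than through the function $x(s, i)$), and that the weights are stored in nested lists indexed as `wIJ[i][j]` and `wJK[j][k]`. The function names `out_j` and `out_k` are hypothetical stand-ins for $\mathrm{out}_J$ and $\mathrm{out}_K$.

```python
import math

def sigmoid(beta):
    """Sigmoid function: 1 / (1 + e^(-beta))."""
    return 1.0 / (1.0 + math.exp(-beta))

def out_j(x, wIJ):
    """Hidden-layer outputs: outJ(s, j) = sigmoid(sum_i wIJ[i][j] * x[i]).

    Assumption: x is the list of features [x(s, 0), ..., x(s, I-1)].
    """
    I, J = len(wIJ), len(wIJ[0])
    return [sigmoid(sum(wIJ[i][j] * x[i] for i in range(I))) for j in range(J)]

def out_k(x, wIJ, wJK):
    """Output-layer values: outK(s, k) = sum_j wJK[j][k] * outJ(s, j)."""
    h = out_j(x, wIJ)
    J, K = len(wJK), len(wJK[0])
    return [sum(wJK[j][k] * h[j] for j in range(J)) for k in range(K)]

# Sizes from the figure: I = 3, J = 4, K = 3 (weight values chosen arbitrarily).
x = [1.0, 2.0, 3.0]
wIJ = [[0.0] * 4 for _ in range(3)]   # all-zero first layer, so every hidden output is 0.5
wJK = [[1.0] * 3 for _ in range(4)]
outputs = out_k(x, wIJ, wJK)          # each output is 4 * 0.5
```

With all first-layer weights at zero, every hidden node outputs $\sigma(0) = 0.5$, which makes the example easy to check by hand.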
The parameters of the representation are the weights $w_{ij}$ for $i \in [I]$, $j \in [J]$ and the weights $w_{jk}$ for $j \in [J]$, $k \in [K]$; in the pseudocode you are asked to provide for this question (see below), store these parameters in 2-d arrays wIJ[][] and wJK[][], respectively. You can also assume that the functions x(., .), outI(., .), outJ(., .), outK(., .), and $\sigma(\cdot)$ are already implemented.

6a. A stochastic policy is implemented using the "soft-max" operator on the outputs of the neural network. Thus, for state $s$, the probability of selecting action $k \in [K]$ is

$$\frac{e^{\mathrm{out}_K(s, k)}}{\sum_{k' \in [K]} e^{\mathrm{out}_K(s, k')}}.$$

Suppose the current policy $\pi$ is parameterised by weights wIJ[][] and wJK[][]. Say an episode s[0], a[0], r[0], s[1], a[1], r[1], s[2], ..., s[T] is generated by following $\pi$, where s[T] is a terminal state. Write down pseudocode to perform a REINFORCE update with step size $\alpha$. You can assume wIJ[][], wJK[][], T, s[], a[], r[], and $\alpha$ are already populated. Your code must terminate with wIJ[][] and wJK[][] updated as per the REINFORCE rule at the end of the episode. Since you will need to compute a gradient for making the update, show the steps to work out the actual form of the gradient before presenting the pseudocode for the update. [6 marks]

6b. Suppose the same neural network is used to approximate a value function for a prediction task. In this case we have a single output (that is, $K = 1$). For each state $s$, $\mathrm{out}_K(s, 0)$ is interpreted as $V(s)$. The aim is to drive $V$ towards $V^{\pi}$ by making TD(0) updates, where $\pi$ is the policy being followed. Suppose the current approximation of $V$ uses weights wIJ[][] and wJK[][]. Say the transition $s, r, s'$ is observed. Write down pseudocode for the TD(0) update performed upon reaching $s'$, with learning rate $\alpha$ and discount factor $\gamma$. Assume wIJ[][], wJK[][], $s$, $r$, $s'$, $\alpha$, and $\gamma$ are already populated. Your code must terminate with wIJ[][] and wJK[][] updated correctly. Here, too, show the steps to obtain the form of the gradient before using it in your pseudocode as a part of the learning update. Since $K = 1$, you can use wJK[] instead of wJK[][] if you would like, but the 2-dimensional variant is also okay, keeping the second index 0. [3 marks]
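One possible way to approach both parts is sketched below; it is not the expert answer from this page, only a hedged illustration under stated assumptions. Writing $h_j = \mathrm{out}_J(s, j)$, the chain rule gives $\partial\, \mathrm{out}_K(s, k) / \partial w_{jk} = h_j$ and $\partial\, \mathrm{out}_K(s, k) / \partial w_{ij} = w_{jk}\, h_j (1 - h_j)\, x(s, i)$, using $\sigma'(\beta) = \sigma(\beta)(1 - \sigma(\beta))$. For the soft-max policy, $\partial \ln \pi(s, a) / \partial\, \mathrm{out}_K(s, k) = \mathbb{1}[k = a] - \pi(s, k)$, so the REINFORCE gradient is a weighted sum of the output gradients. For TD(0), the semi-gradient update uses $\delta = r + \gamma V(s') - V(s)$ times the gradient of $\mathrm{out}_K(s, 0)$. The sketch assumes: feature vectors are passed as lists instead of calling $x(s, i)$; returns in 6a are undiscounted, since no discount factor is given there; the REINFORCE step accumulates gradients over the episode and applies them once at the end; and $s'$ in 6b is non-terminal. All helper names (`forward`, `backprop`, `apply_step`, and so on) are hypothetical.

```python
import math

def sigmoid(beta):
    return 1.0 / (1.0 + math.exp(-beta))

def forward(x, wIJ, wJK):
    """Return (h, o): hidden outputs outJ(s, .) and network outputs outK(s, .)."""
    I, J, K = len(wIJ), len(wJK), len(wJK[0])
    h = [sigmoid(sum(wIJ[i][j] * x[i] for i in range(I))) for j in range(J)]
    o = [sum(wJK[j][k] * h[j] for j in range(J)) for k in range(K)]
    return h, o

def softmax(o):
    m = max(o)  # subtracting the max leaves the probabilities unchanged
    e = [math.exp(v - m) for v in o]
    z = sum(e)
    return [v / z for v in e]

def backprop(x, h, wJK, d):
    """Gradients of sum_k d[k] * outK(s, k) with respect to wIJ and wJK."""
    I, J, K = len(x), len(wJK), len(wJK[0])
    gJK = [[d[k] * h[j] for k in range(K)] for j in range(J)]
    back = [sum(d[k] * wJK[j][k] for k in range(K)) * h[j] * (1.0 - h[j])
            for j in range(J)]
    gIJ = [[back[j] * x[i] for j in range(J)] for i in range(I)]
    return gIJ, gJK

def apply_step(w, g, scale):
    """In place: w += scale * g, element by element."""
    for row_w, row_g in zip(w, g):
        for c in range(len(row_w)):
            row_w[c] += scale * row_g[c]

def reinforce_update(wIJ, wJK, feats, a, r, alpha):
    """6a sketch. Assumption: feats[t] is the feature list of s[t], t = 0..T-1."""
    T = len(r)
    I, J, K = len(wIJ), len(wJK), len(wJK[0])
    accIJ = [[0.0] * J for _ in range(I)]
    accJK = [[0.0] * K for _ in range(J)]
    G = 0.0
    for t in reversed(range(T)):
        G += r[t]  # assumption: undiscounted return from step t onwards
        h, o = forward(feats[t], wIJ, wJK)
        pi = softmax(o)
        # d[k] = d ln pi(s_t, a_t) / d outK(s_t, k)
        d = [(1.0 if k == a[t] else 0.0) - pi[k] for k in range(K)]
        gIJ, gJK = backprop(feats[t], h, wJK, d)
        apply_step(accIJ, gIJ, G)
        apply_step(accJK, gJK, G)
    # Weights change only here, so all gradients use the episode's policy.
    apply_step(wIJ, accIJ, alpha)
    apply_step(wJK, accJK, alpha)

def td0_update(wIJ, wJK, x_s, r, x_next, alpha, gamma):
    """6b sketch with K = 1. Assumption: s' is non-terminal."""
    h, o = forward(x_s, wIJ, wJK)
    _, o_next = forward(x_next, wIJ, wJK)
    delta = r + gamma * o_next[0] - o[0]       # TD error
    gIJ, gJK = backprop(x_s, h, wJK, [1.0])    # gradient of V(s) = outK(s, 0)
    apply_step(wIJ, gIJ, alpha * delta)
    apply_step(wJK, gJK, alpha * delta)
```

Both update functions compute every gradient from the old weights before modifying either array, which matters because the gradient with respect to wIJ depends on wJK.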

