Question
Let us again consider a single hidden layer MLP with $M$ hidden units. Suppose we now have $B$ input vectors packed as a matrix such that each row is a sample, i.e., $X \in \mathbb{R}^{B \times N}$. The hidden activations $H \in \mathbb{R}^{B \times M}$ are computed as follows,

$$Y = XW + b, \qquad H = \sigma(Y), \tag{5}$$

where the weight matrix $W \in \mathbb{R}^{N \times M}$, the bias vector $b \in \mathbb{R}^{1 \times M}$, and $\sigma$ is the nonlinear activation function. $XW + b$ leverages broadcasting addition. In particular, the shapes of $XW$ and $b$ are $B \times M$ and $1 \times M$, respectively. You can imagine that the broadcasting addition first duplicates $b$ for $B$ times to create a matrix of shape $B \times M$ and then performs element-wise addition. Note that in the actual implementation, one does not perform the duplication, since it is a waste of memory; this is just a helpful way to understand at a high level how broadcasting operations work.

Batch normalization (BN) [2] is a method that makes training of deep neural networks faster and more stable by normalizing the inputs of individual layers via re-centering and re-scaling. In particular, if we apply BN to the pre-activations $Y$ in Eq. (5), we have

$$m[j] = \frac{1}{B} \sum_{i=1}^{B} Y[i, j], \qquad v[j] = \frac{1}{B} \sum_{i=1}^{B} \left( Y[i, j] - m[j] \right)^2, \qquad \hat{Y}[i, j] = \gamma[j] \, \frac{Y[i, j] - m[j]}{\sqrt{v[j] + \epsilon}} + \beta[j],$$

where the normalized input $\hat{Y} \in \mathbb{R}^{B \times M}$, and $m \in \mathbb{R}^{1 \times M}$ and $v \in \mathbb{R}^{1 \times M}$ are the mean and the (dimension-wise) variance vectors. $\gamma \in \mathbb{R}^{1 \times M}$ and $\beta \in \mathbb{R}^{1 \times M}$ are learnable parameters associated with the BN layer. $\epsilon$ is a hyperparameter.

2.1 [5pts] Explain why we need the hyperparameter $\epsilon$.

2.2 [10pts] Derive the mean and the variance of $\hat{Y}[i, j]$ (you can ignore the effect of $\epsilon$).

2.3 [20pts] Suppose the nonlinear activation function $\sigma$ is ReLU and the gradient of the training loss with respect to (w.r.t.) the normalized hidden activations $H$ is $\nabla_H$. Derive the gradients of the training loss w.r.t. the learnable parameters $\gamma$ and $\beta$, the mean $m$, the variance $v$, and the input $Y$.

2.4 [Bonus 20pts] BN only normalizes the hidden units in a dimension-wise fashion, e.g., the variance is computed per dimension. Derive the equations of BN where the covariance matrix of the normalized input $\hat{Y}$ is normalized to an identity matrix.
Compare this version of BN to the original BN and discuss the pros and cons.
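As a sanity check on the shapes, the broadcasting addition, and the dimension-wise BN statistics described above, here is a minimal NumPy sketch. The toy sizes for $B$, $N$, $M$ and the random data are illustrative choices, not part of the original problem:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, M = 4, 3, 5           # batch size, input dim, hidden dim (illustrative)
X = rng.normal(size=(B, N))
W = rng.normal(size=(N, M))
b = rng.normal(size=(1, M))
gamma = np.ones((1, M))     # BN scale, learnable
beta = np.zeros((1, M))     # BN shift, learnable
eps = 1e-5                  # the hyperparameter epsilon

Y = X @ W + b               # broadcasting adds b to every row of XW
m = Y.mean(axis=0, keepdims=True)   # per-dimension mean, shape (1, M)
v = Y.var(axis=0, keepdims=True)    # per-dimension (biased) variance, shape (1, M)
Y_hat = gamma * (Y - m) / np.sqrt(v + eps) + beta
H = np.maximum(Y_hat, 0)    # ReLU activation

# With gamma = 1 and beta = 0, each column of Y_hat has (near-)zero mean
# and (near-)unit variance; the small gap from exactly 1 is due to eps.
print(Y_hat.mean(axis=0))
print(Y_hat.var(axis=0))
```

Computing `X @ W + np.tile(b, (B, 1))` gives the same `Y`, which matches the "duplicate $b$ for $B$ times" intuition for broadcasting without materializing the duplicated matrix.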
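Question 2.3 asks for the gradients symbolically; as an illustrative cross-check (not the requested derivation), the forward and backward passes of BN followed by ReLU can be sketched in NumPy. The function names, the cache structure, and the random demo inputs are my own choices, and the gradient expressions follow the standard chain-rule decomposition through $m$ and $v$:

```python
import numpy as np

def bn_relu_forward(Y, gamma, beta, eps=1e-5):
    """Batch norm followed by ReLU; returns H and a cache for the backward pass."""
    m = Y.mean(axis=0, keepdims=True)        # per-dimension mean, (1, M)
    v = Y.var(axis=0, keepdims=True)         # per-dimension (biased) variance
    X_hat = (Y - m) / np.sqrt(v + eps)       # standardized pre-activations
    Y_hat = gamma * X_hat + beta
    H = np.maximum(Y_hat, 0.0)               # ReLU
    return H, (Y, m, v, X_hat, Y_hat, gamma, eps)

def bn_relu_backward(dH, cache):
    """Given dL/dH, return dL/dgamma, dL/dbeta, dL/dm, dL/dv, dL/dY."""
    Y, m, v, X_hat, Y_hat, gamma, eps = cache
    B = Y.shape[0]
    inv_std = 1.0 / np.sqrt(v + eps)
    dY_hat = dH * (Y_hat > 0)                # backprop through ReLU
    dbeta = dY_hat.sum(axis=0, keepdims=True)
    dgamma = (dY_hat * X_hat).sum(axis=0, keepdims=True)
    dX_hat = dY_hat * gamma
    # dL/dv: v enters through the (v + eps)^{-1/2} factor of every sample
    dv = (dX_hat * (Y - m)).sum(axis=0, keepdims=True) * (-0.5) * inv_std**3
    # dL/dm: m enters directly in (Y - m) and indirectly through v
    dm = -(dX_hat.sum(axis=0, keepdims=True)) * inv_std \
         + dv * (-2.0 / B) * (Y - m).sum(axis=0, keepdims=True)
    # dL/dY: direct path plus the paths through v and m
    dY = dX_hat * inv_std + dv * 2.0 * (Y - m) / B + dm / B
    return dgamma, dbeta, dm, dv, dY

rng = np.random.default_rng(1)
B, M = 6, 4
Y = rng.normal(size=(B, M))
gamma = rng.normal(size=(1, M))
beta = rng.normal(size=(1, M))
H, cache = bn_relu_forward(Y, gamma, beta)
dH = rng.normal(size=(B, M))                 # stand-in for nabla_H from the loss
dgamma, dbeta, dm, dv, dY = bn_relu_backward(dH, cache)
```

A useful property to verify in a derivation for 2.3: the per-dimension gradients w.r.t. $Y$ sum to zero over the batch, because shifting all samples in a dimension by a constant leaves $\hat{Y}$ unchanged.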