
Question


Let us again consider a single hidden layer MLP with M hidden units. Suppose we now have B input vectors packed as a matrix such that each row is a sample, i.e., X ∈ R^{B×N}. The hidden activations H ∈ R^{B×M} are computed as follows,

    Y = XW^⊤ + b^⊤,    H = σ(Y),    (5)

where the weight matrix W ∈ R^{M×N}, the bias vector b ∈ R^{M×1}, and σ is the nonlinear activation function. XW^⊤ + b^⊤ leverages broadcasting addition. In particular, the shapes of XW^⊤ and b^⊤ are B×M and 1×M, respectively. You can imagine that the broadcasting addition first duplicates b^⊤ B times to create a matrix of shape B×M and then performs element-wise addition. Note that an actual implementation does not perform this duplication, since it would waste memory; this is just a helpful way to understand, at a high level, how broadcasting operations work.

Batch normalization (BN) [2] is a method that makes the training of deep neural networks faster and more stable by normalizing the inputs of individual layers, i.e., re-centering and re-scaling them. In particular, if we apply BN to the pre-activations Y in Eq. (5), we have

    m = (1/B) Σ_{i=1}^{B} Y[i, :],
    v[j] = (1/B) Σ_{i=1}^{B} (Y[i, j] − m[j])²,
    Ŷ[i, j] = γ[j] · (Y[i, j] − m[j]) / √(v[j] + ε) + β[j],

where Ŷ ∈ R^{B×M} is the normalized input, and m ∈ R^{M×1} and v ∈ R^{M×1} are the mean and the (dimension-wise) variance vectors. γ ∈ R^{M×1} and β ∈ R^{M×1} are learnable parameters associated with the BN layer. ε is a hyperparameter.

2.1 [5pts] Explain why we need the hyperparameter ε.

2.2 [10pts] Derive the mean and the variance of Ŷ[i, j] (you can ignore the effect of ε).

2.3 [20pts] Suppose the nonlinear activation function σ is ReLU and the gradient of the training loss with respect to (w.r.t.) the normalized hidden activations H is ∇_H. Derive the gradients of the training loss w.r.t. the learnable parameters γ and β, the mean m, the variance v, and the input Y.

2.4 [Bonus 20pts] BN only normalizes the hidden units in a dimension-wise fashion, e.g., the variance is computed per-dimension. Derive the equations of BN where the covariance matrix of the normalized input Ŷ is normalized to an identity. Compare this version of BN to the original BN and discuss the pros and cons.
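The forward computation described in the question (broadcast bias addition, per-dimension BN, then ReLU) can be sketched numerically. This is a minimal NumPy illustration, not a solution to the derivation questions; the sizes B, N, M and the choices γ = 1, β = 0 are arbitrary assumptions for the demo.

```python
import numpy as np

# Hypothetical sizes: B samples, N input features, M hidden units.
B, N, M = 4, 3, 5
rng = np.random.default_rng(0)

X = rng.normal(size=(B, N))   # inputs, one sample per row
W = rng.normal(size=(M, N))   # weight matrix W in R^{M x N}
b = rng.normal(size=(1, M))   # bias (as a 1 x M row for broadcasting)

# Pre-activations: X @ W.T has shape (B, M); adding b broadcasts the
# (1, M) bias across all B rows without materializing B copies.
Y = X @ W.T + b

# Batch normalization of the pre-activations. eps keeps the
# denominator away from zero when a dimension's variance is tiny.
eps = 1e-5
gamma = np.ones(M)            # learnable scale (identity for the demo)
beta = np.zeros(M)            # learnable shift (zero for the demo)

m = Y.mean(axis=0)            # per-dimension batch mean, shape (M,)
v = Y.var(axis=0)             # per-dimension batch variance, shape (M,)
Y_hat = gamma * (Y - m) / np.sqrt(v + eps) + beta

H = np.maximum(Y_hat, 0.0)    # ReLU activation

# With gamma = 1 and beta = 0, each column of Y_hat has (near-)zero
# mean and (near-)unit variance, which is what BN enforces per-dimension.
print(Y.shape, Y_hat.mean(axis=0).round(6), Y_hat.var(axis=0).round(4))
```

Note that the per-column check at the end is exactly the dimension-wise property that part 2.4 asks you to go beyond: BN as written leaves cross-dimension covariances of Ŷ untouched.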


