Question
Let us again consider a single hidden layer MLP with $M$ hidden units. Suppose we now have $B$ input vectors packed as a matrix such that each row is a sample, i.e., $X \in \mathbb{R}^{B \times N}$. The hidden activations $H \in \mathbb{R}^{B \times M}$ are computed as follows,

$$Y = XW + b, \qquad H = \sigma(Y), \tag{5}$$

where the weight matrix $W \in \mathbb{R}^{N \times M}$, the bias vector $b \in \mathbb{R}^{1 \times M}$, and $\sigma$ is the nonlinear activation function. $XW + b$ leverages broadcasting addition. In particular, the shapes of $XW$ and $b$ are $B \times M$ and $1 \times M$, respectively. You can imagine that the broadcasting addition first duplicates $b$ for $B$ times to create a matrix of shape $B \times M$ and then performs element-wise addition. Note that in the actual implementation, one does not perform the duplication, since it is a waste of memory; this is just a helpful way to understand at a high level how broadcasting operations work.

Batch normalization (BN) [2] is a method that makes training of deep neural networks faster and more stable by normalizing the inputs of individual layers via re-centering and re-scaling. In particular, if we apply BN to the pre-activations $Y$ in Eq. (5), we have

$$m[j] = \frac{1}{B} \sum_{i=1}^{B} Y[i, j], \qquad v[j] = \frac{1}{B} \sum_{i=1}^{B} \left( Y[i, j] - m[j] \right)^2, \qquad \hat{Y}[i, j] = \gamma[j] \, \frac{Y[i, j] - m[j]}{\sqrt{v[j] + \epsilon}} + \beta[j],$$

where the normalized input $\hat{Y} \in \mathbb{R}^{B \times M}$, and $m \in \mathbb{R}^{1 \times M}$ and $v \in \mathbb{R}^{1 \times M}$ are the mean and the (dimension-wise) variance vectors. $\gamma \in \mathbb{R}^{1 \times M}$ and $\beta \in \mathbb{R}^{1 \times M}$ are learnable parameters associated with the BN layer. $\epsilon$ is a hyperparameter.

2.1 [5pts] Explain why we need the hyperparameter $\epsilon$.

2.2 [10pts] Derive the mean and the variance of $\hat{Y}[i, j]$ (you can ignore the effect of $\epsilon$).

2.3 [20pts] Suppose the nonlinear activation function $\sigma$ is ReLU and the gradient of the training loss with respect to (w.r.t.) the normalized hidden activations $H$ is $\nabla_H$. Derive the gradients of the training loss w.r.t. the learnable parameters $\gamma$ and $\beta$, the mean $m$, the variance $v$, and the input $Y$.

2.4 [Bonus 20pts] BN only normalizes the hidden units in a dimension-wise fashion, e.g., the variance is computed per dimension. Derive the equations of BN where the covariance matrix of the normalized input $\hat{Y}$ is normalized to an identity matrix.
Compare this version of BN to the original BN and discuss the pros and cons.
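As a sanity check on the shapes, the broadcasting addition, and the dimension-wise BN statistics described above, here is a minimal NumPy sketch. The toy sizes for $B$, $N$, $M$ and the random data are illustrative choices, not part of the original problem:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, M = 4, 3, 5           # batch size, input dim, hidden dim (illustrative)
X = rng.normal(size=(B, N))
W = rng.normal(size=(N, M))
b = rng.normal(size=(1, M))
gamma = np.ones((1, M))     # BN scale, learnable
beta = np.zeros((1, M))     # BN shift, learnable
eps = 1e-5                  # the hyperparameter epsilon

Y = X @ W + b               # broadcasting adds b to every row of XW
m = Y.mean(axis=0, keepdims=True)   # per-dimension mean, shape (1, M)
v = Y.var(axis=0, keepdims=True)    # per-dimension (biased) variance, shape (1, M)
Y_hat = gamma * (Y - m) / np.sqrt(v + eps) + beta
H = np.maximum(Y_hat, 0)    # ReLU activation

# With gamma = 1 and beta = 0, each column of Y_hat has (near-)zero mean
# and (near-)unit variance; the small gap from exactly 1 is due to eps.
print(Y_hat.mean(axis=0))
print(Y_hat.var(axis=0))
```

Computing `X @ W + np.tile(b, (B, 1))` gives the same `Y`, which matches the "duplicate $b$ for $B$ times" intuition for broadcasting without materializing the duplicated matrix.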
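Question 2.3 asks for the gradients symbolically; as an illustrative cross-check (not the requested derivation), the forward and backward passes of BN followed by ReLU can be sketched in NumPy. The function names, the cache structure, and the random demo inputs are my own choices, and the gradient expressions follow the standard chain-rule decomposition through $m$ and $v$:

```python
import numpy as np

def bn_relu_forward(Y, gamma, beta, eps=1e-5):
    """Batch norm followed by ReLU; returns H and a cache for the backward pass."""
    m = Y.mean(axis=0, keepdims=True)        # per-dimension mean, (1, M)
    v = Y.var(axis=0, keepdims=True)         # per-dimension (biased) variance
    X_hat = (Y - m) / np.sqrt(v + eps)       # standardized pre-activations
    Y_hat = gamma * X_hat + beta
    H = np.maximum(Y_hat, 0.0)               # ReLU
    return H, (Y, m, v, X_hat, Y_hat, gamma, eps)

def bn_relu_backward(dH, cache):
    """Given dL/dH, return dL/dgamma, dL/dbeta, dL/dm, dL/dv, dL/dY."""
    Y, m, v, X_hat, Y_hat, gamma, eps = cache
    B = Y.shape[0]
    inv_std = 1.0 / np.sqrt(v + eps)
    dY_hat = dH * (Y_hat > 0)                # backprop through ReLU
    dbeta = dY_hat.sum(axis=0, keepdims=True)
    dgamma = (dY_hat * X_hat).sum(axis=0, keepdims=True)
    dX_hat = dY_hat * gamma
    # dL/dv: v enters through the (v + eps)^{-1/2} factor of every sample
    dv = (dX_hat * (Y - m)).sum(axis=0, keepdims=True) * (-0.5) * inv_std**3
    # dL/dm: m enters directly in (Y - m) and indirectly through v
    dm = -(dX_hat.sum(axis=0, keepdims=True)) * inv_std \
         + dv * (-2.0 / B) * (Y - m).sum(axis=0, keepdims=True)
    # dL/dY: direct path plus the paths through v and m
    dY = dX_hat * inv_std + dv * 2.0 * (Y - m) / B + dm / B
    return dgamma, dbeta, dm, dv, dY

rng = np.random.default_rng(1)
B, M = 6, 4
Y = rng.normal(size=(B, M))
gamma = rng.normal(size=(1, M))
beta = rng.normal(size=(1, M))
H, cache = bn_relu_forward(Y, gamma, beta)
dH = rng.normal(size=(B, M))                 # stand-in for nabla_H from the loss
dgamma, dbeta, dm, dv, dY = bn_relu_backward(dH, cache)
```

A useful property to verify in a derivation for 2.3: the per-dimension gradients w.r.t. $Y$ sum to zero over the batch, because shifting all samples in a dimension by a constant leaves $\hat{Y}$ unchanged.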