Question:
Adam (page 340) was described as a combination of momentum and RMS-Prop. Using AIPython (aipython.org), Keras, or PyTorch (see Appendix B.2), find two datasets and compare the following:
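For reference, the standard Adam update (Kingma and Ba's formulation; page 340 of the text may use slightly different symbol names) is:

```latex
\begin{align*}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t/(1-\beta_1^t), \qquad \hat{v}_t = v_t/(1-\beta_2^t) \\
w_t &= w_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{align*}
```

Here $g_t$ is the gradient at step $t$, $\eta$ is the step size, and $\hat{m}_t$, $\hat{v}_t$ are the bias-corrected moment estimates. Setting $\beta_1$ or $\beta_2$ to 0 in these equations is what parts (a)-(c) ask you to work through.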
(a) How does Adam with β1 = β2 = 0 differ from plain stochastic gradient descent without momentum? [Hint: How does setting β1 = β2 = 0 simplify Adam, considering first the case where g?] Which works better on the datasets selected?
(b) How does Adam with β2 = 0 differ from stochastic gradient descent, when the α momentum parameter is equal to β1 in Adam? [Hint: How does setting β2 = 0 simplify Adam, considering first the case where g?] Which works better on the datasets selected?
(c) How does Adam with β1 = 0 differ from RMS-Prop, where the ρ parameter in RMS-Prop is equal to β2 in Adam? Which works better on the datasets selected?
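As a starting point for all three parts, each special case can be checked numerically. The following is a minimal, self-contained Python sketch of a single Adam step (not AIPython/Keras/PyTorch code; the function name `adam_step` and the scalar setup are illustrative), used to see how the update simplifies in cases (a)-(c):

```python
import math

def adam_step(g, m, v, t, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    g: current gradient; m, v: running first/second moment estimates;
    t: step count (1-based, used for bias correction).
    Returns (delta, m, v), where the parameter update is w -= delta.
    """
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    return eta * m_hat / (math.sqrt(v_hat) + eps), m, v

g, eta = 0.5, 0.1

# (a) beta1 = beta2 = 0: m = g and v = g*g, so the step becomes
#     eta * g / (|g| + eps), i.e. approximately eta * sign(g).
#     Plain SGD instead takes eta * g, whose size scales with |g|.
delta_a, _, _ = adam_step(g, 0.0, 0.0, t=1, eta=eta, beta1=0.0, beta2=0.0)
print("Adam(b1=b2=0):", delta_a, "  SGD:", eta * g)

# (b) beta2 = 0: the numerator is the momentum-style running average of
#     gradients (with momentum parameter beta1), but it is still divided
#     by |g| + eps, unlike SGD with momentum, which uses the average alone.

# (c) beta1 = 0: the step is eta * g / (sqrt(v_hat) + eps), which matches
#     RMS-Prop with rho = beta2, apart from Adam's bias correction of v
#     (and possibly the placement of eps, which varies between formulations).
```

Running the same comparison on the two chosen datasets (e.g. via the optimizer hyperparameters exposed by Keras or PyTorch) then lets you answer which variant works better empirically.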
Source: David L. Poole and Alan K. Mackworth, Artificial Intelligence: Foundations of Computational Agents, 3rd Edition. ISBN 9781009258197.