
Question:

5.25 (www) Consider a quadratic error function of the form

E = E_0 + \frac{1}{2} (w - w^*)^T H (w - w^*)    (5.195)

where w^* represents the minimum, and the Hessian matrix H is positive definite and constant. Suppose the initial weight vector w^{(0)} is chosen to be at the origin and is updated using simple gradient descent

w^{(\tau)} = w^{(\tau - 1)} - \rho \nabla E    (5.196)

where \tau denotes the step number, and \rho is the learning rate (which is assumed to be small).
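As a quick numerical sanity check of this setup (not part of the exercise itself), the update (5.196) can be iterated directly. The sketch below assumes a hypothetical 2x2 positive definite Hessian, minimum w^*, and learning rate; all numeric values are invented for illustration.

```python
import numpy as np

# Hypothetical quadratic error E = E0 + 0.5 (w - w*)^T H (w - w*),
# whose gradient is H (w - w*).  All numeric values below are invented.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # positive definite, constant Hessian
w_star = np.array([1.0, -2.0])    # the minimum w*
rho = 0.05                        # small learning rate

w = np.zeros(2)                   # w^(0) chosen at the origin
for tau in range(500):
    w = w - rho * H @ (w - w_star)    # gradient descent update (5.196)

print(w)    # -> approximately [ 1. -2.], i.e. w^(tau) has converged to w*
```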

Show that, after \tau steps, the components of the weight vector parallel to the eigenvectors of H can be written

w_j^{(\tau)} = \{1 - (1 - \rho\eta_j)^{\tau}\} w_j^*    (5.197)

where w_j = w^T u_j, and u_j and \eta_j are the eigenvectors and eigenvalues, respectively, of H, so that

H u_j = \eta_j u_j.    (5.198)
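One standard route to (5.197) (a sketch, not necessarily the book's worked solution) is to substitute \nabla E = H(w - w^*) into (5.196) and project onto u_j, which decouples the update into independent scalar recursions:

```latex
\begin{align*}
w_j^{(\tau)} &= w_j^{(\tau-1)} - \rho\eta_j \bigl(w_j^{(\tau-1)} - w_j^*\bigr)
              = (1 - \rho\eta_j)\, w_j^{(\tau-1)} + \rho\eta_j\, w_j^* \\
% subtract w_j^* from both sides and iterate the geometric recursion:
w_j^{(\tau)} - w_j^* &= (1 - \rho\eta_j)\bigl(w_j^{(\tau-1)} - w_j^*\bigr)
                      = (1 - \rho\eta_j)^{\tau}\bigl(w_j^{(0)} - w_j^*\bigr) \\
% with w_j^{(0)} = 0 (the origin), this is exactly (5.197):
w_j^{(\tau)} &= \bigl\{1 - (1 - \rho\eta_j)^{\tau}\bigr\}\, w_j^*
\end{align*}
```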

Show that as \tau \to \infty, this gives w^{(\tau)} \to w^* as expected, provided |1 - \rho\eta_j| < 1.
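For completeness, this limit follows from the geometric factor in (5.197); since \eta_j > 0, the condition amounts to an upper bound on the learning rate:

```latex
\begin{equation*}
|1 - \rho\eta_j| < 1 \iff 0 < \rho\eta_j < 2, \qquad
(1 - \rho\eta_j)^{\tau} \to 0, \qquad
w_j^{(\tau)} \to w_j^* \ \text{for all } j .
\end{equation*}
```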

Now suppose that training is halted after a finite number \tau of steps. Show that the components of the weight vector parallel to the eigenvectors of the Hessian satisfy

w_j^{(\tau)} \simeq w_j^*    when \eta_j \gg (\rho\tau)^{-1}    (5.199)

|w_j^{(\tau)}| \ll |w_j^*|    when \eta_j \ll (\rho\tau)^{-1}.    (5.200)
Compare this result with the discussion in Section 3.5.3 of regularization with simple weight decay, and hence show that (\rho\tau)^{-1} is analogous to the regularization parameter \lambda. The above results also show that the effective number of parameters in the network, as defined by (3.91), grows as the training progresses.
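The analogy can be illustrated numerically. The sketch below uses hypothetical eigenvalues (not from the book) to compare the early-stopping shrinkage factor \{1 - (1 - \rho\eta_j)^{\tau}\} from (5.197) with the weight-decay factor \eta_j / (\eta_j + \lambda) at \lambda = (\rho\tau)^{-1}, and evaluates \gamma = \sum_j \eta_j / (\eta_j + \lambda) in the spirit of (3.91).

```python
import numpy as np

eta = np.array([0.01, 0.1, 1.0, 10.0])  # hypothetical eigenvalues eta_j of H
rho = 0.01                              # small learning rate

for tau in [10, 100, 1000, 10000]:
    lam = 1.0 / (rho * tau)                        # lambda = (rho * tau)^{-1}
    early_stopping = 1 - (1 - rho * eta) ** tau    # shrinkage factor from (5.197)
    weight_decay = eta / (eta + lam)               # shrinkage factor under weight decay
    gamma = np.sum(weight_decay)                   # effective parameters, cf. (3.91)
    print(f"tau={tau:6d}  early-stop={np.round(early_stopping, 3)}  "
          f"decay={np.round(weight_decay, 3)}  gamma={gamma:.2f}")
```

In both schemes, directions with \eta_j \gg \lambda are left essentially unshrunk while directions with \eta_j \ll \lambda stay near zero, and \gamma grows toward the total number of parameters as \tau increases.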
