Question:
(Exercise 18 continued.) Consider again Example 2.10 with \(\mathbf{D}=\operatorname{diag}\left(\lambda_{1}, \ldots, \lambda_{p}\right)\) for some nonnegative model-selection parameter \(\boldsymbol{\lambda} \in \mathbb{R}^{p}\). A Bayesian choice for \(\boldsymbol{\lambda}\) is the maximizer of the marginal likelihood \(g(\boldsymbol{y} \mid \boldsymbol{\lambda})\); that is,
\[ \boldsymbol{\lambda}^{*}=\underset{\boldsymbol{\lambda} \geqslant \mathbf{0}}{\operatorname{argmax}} \iint g\left(\boldsymbol{\beta}, \sigma^{2}, \boldsymbol{y} \mid \boldsymbol{\lambda}\right) \mathrm{d} \boldsymbol{\beta} \mathrm{d} \sigma^{2} \]
where
\[ \ln g\left(\boldsymbol{\beta}, \sigma^{2}, \boldsymbol{y} \mid \boldsymbol{\lambda}\right)=-\frac{\|\boldsymbol{y}-\mathbf{X} \boldsymbol{\beta}\|^{2}+\boldsymbol{\beta}^{\top} \mathbf{D}^{-1} \boldsymbol{\beta}}{2 \sigma^{2}}-\frac{1}{2} \ln |\mathbf{D}|-\frac{n+p}{2} \ln \left(2 \pi \sigma^{2}\right)-\ln \sigma^{2} \]
To maximize \(g(\boldsymbol{y} \mid \boldsymbol{\lambda})\), one can use the EM algorithm with \(\boldsymbol{\beta}\) and \(\sigma^{2}\) acting as latent variables in the complete-data log-likelihood \(\ln g\left(\boldsymbol{\beta}, \sigma^{2}, \boldsymbol{y} \mid \boldsymbol{\lambda}\right)\). Define
\[ \begin{align*} & \boldsymbol{\Sigma}:=\left(\mathbf{D}^{-1}+\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \tag{6.42}\\ & \overline{\boldsymbol{\beta}}:=\boldsymbol{\Sigma} \mathbf{X}^{\top} \mathbf{y} \\ & \widehat{\sigma}^{2}:=\left(\|\mathbf{y}\|^{2}-\mathbf{y}^{\top} \mathbf{X} \overline{\boldsymbol{\beta}}\right) / n \end{align*} \]
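For concreteness, here is a minimal sketch (not from the book) of how the quantities in (6.42) could be computed with NumPy; the names X, y, lam, and e_quantities are illustrative placeholders:

```python
import numpy as np

def e_quantities(X, y, lam):
    """Compute Sigma, beta-bar and sigma-hat^2 from (6.42) for a given lambda > 0."""
    n = len(y)
    Sig = np.linalg.inv(np.diag(1.0 / lam) + X.T @ X)  # Sigma = (D^{-1} + X'X)^{-1}
    b_bar = Sig @ X.T @ y                              # beta-bar = Sigma X'y
    sig2 = (y @ y - y @ X @ b_bar) / n                 # sigma-hat^2 = (||y||^2 - y'X beta-bar)/n
    return Sig, b_bar, sig2
```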
(a) Show that the conditional density of the latent variables \(\boldsymbol{\beta}\) and \(\sigma^{2}\) is such that
\[ \begin{aligned} & \left(\sigma^{-2} \mid \boldsymbol{\lambda}, \boldsymbol{y}\right) \sim \operatorname{Gamma}\left(\frac{n}{2}, \frac{n}{2} \widehat{\sigma}^{2}\right) \\ & \left(\boldsymbol{\beta} \mid \boldsymbol{\lambda}, \sigma^{2}, \boldsymbol{y}\right) \sim \mathscr{N}\left(\overline{\boldsymbol{\beta}}, \sigma^{2} \boldsymbol{\Sigma}\right) \end{aligned} \]
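One way to see this is via the completing-the-square step (using the definitions in (6.42)):
\[ \|\boldsymbol{y}-\mathbf{X} \boldsymbol{\beta}\|^{2}+\boldsymbol{\beta}^{\top} \mathbf{D}^{-1} \boldsymbol{\beta}=(\boldsymbol{\beta}-\overline{\boldsymbol{\beta}})^{\top} \boldsymbol{\Sigma}^{-1}(\boldsymbol{\beta}-\overline{\boldsymbol{\beta}})+n \widehat{\sigma}^{2}, \]
so that, for fixed \(\sigma^{2}\), the exponent in \(g(\boldsymbol{\beta}, \sigma^{2}, \boldsymbol{y} \mid \boldsymbol{\lambda})\) is the kernel of a \(\mathscr{N}(\overline{\boldsymbol{\beta}}, \sigma^{2} \boldsymbol{\Sigma})\) density in \(\boldsymbol{\beta}\), while integrating out \(\boldsymbol{\beta}\) leaves a factor proportional to \((\sigma^{2})^{-n/2-1} \exp\left(-n \widehat{\sigma}^{2} /(2 \sigma^{2})\right)\), which is the stated Gamma density for \(\sigma^{-2}\).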
(b) Use Theorem C.2 to show that the expected complete-data log-likelihood is
\[ -\frac{\overline{\boldsymbol{\beta}}^{\top} \mathbf{D}^{-1} \overline{\boldsymbol{\beta}}}{2 \widehat{\sigma}^{2}}-\frac{\operatorname{tr}\left(\mathbf{D}^{-1} \boldsymbol{\Sigma}\right)+\ln |\mathbf{D}|}{2}+c_{1} \]
where \(c_{1}\) is a constant that does not depend on \(\boldsymbol{\lambda}\).
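A sketch of the expectation behind this expression: the Gamma distribution in (a) has mean \(\mathbb{E}\left[\sigma^{-2} \mid \boldsymbol{\lambda}, \boldsymbol{y}\right]=1 / \widehat{\sigma}^{2}\), and \(\mathbb{E}\left[\boldsymbol{\beta} \boldsymbol{\beta}^{\top} \mid \boldsymbol{\lambda}, \sigma^{2}, \boldsymbol{y}\right]=\overline{\boldsymbol{\beta}}\, \overline{\boldsymbol{\beta}}^{\top}+\sigma^{2} \boldsymbol{\Sigma}\), so
\[ \mathbb{E}\left[\frac{\boldsymbol{\beta}^{\top} \mathbf{D}^{-1} \boldsymbol{\beta}}{2 \sigma^{2}} \,\Big|\, \boldsymbol{\lambda}, \boldsymbol{y}\right]=\frac{\overline{\boldsymbol{\beta}}^{\top} \mathbf{D}^{-1} \overline{\boldsymbol{\beta}}}{2 \widehat{\sigma}^{2}}+\frac{\operatorname{tr}\left(\mathbf{D}^{-1} \boldsymbol{\Sigma}\right)}{2}, \]
and the only other \(\boldsymbol{\lambda}\)-dependent term in \(\ln g\left(\boldsymbol{\beta}, \sigma^{2}, \boldsymbol{y} \mid \boldsymbol{\lambda}\right)\) is \(-\frac{1}{2} \ln |\mathbf{D}|\).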
(c) Use Theorem A.2 to simplify the expected complete-data log-likelihood and to show that it is maximized at \(\lambda_{i}=\boldsymbol{\Sigma}_{i i}+\left(\bar{\beta}_{i} / \widehat{\sigma}\right)^{2}\) for \(i=1, \ldots, p\). Hence, deduce the following E and M steps in the EM algorithm:
E-step. Given \(\boldsymbol{\lambda}\), update \(\boldsymbol{\Sigma}, \overline{\boldsymbol{\beta}}, \widehat{\sigma}^{2}\) via the formulas (6.42).
M-step. Given \(\boldsymbol{\Sigma}, \overline{\boldsymbol{\beta}}, \widehat{\sigma}^{2}\), update \(\boldsymbol{\lambda}\) via \(\lambda_{i}=\boldsymbol{\Sigma}_{i i}+\left(\bar{\beta}_{i} / \widehat{\sigma}\right)^{2}, i=1, \ldots, p\).
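A sketch of the per-coordinate maximization behind the M-step: with \(\mathbf{D}=\operatorname{diag}\left(\lambda_{1}, \ldots, \lambda_{p}\right)\), the expected complete-data log-likelihood separates into
\[ -\frac{1}{2} \sum_{i=1}^{p}\left(\frac{\boldsymbol{\Sigma}_{i i}+\left(\bar{\beta}_{i} / \widehat{\sigma}\right)^{2}}{\lambda_{i}}+\ln \lambda_{i}\right)+c_{1}, \]
and setting the derivative with respect to each \(\lambda_{i}\) to zero gives the stated maximizer \(\lambda_{i}=\boldsymbol{\Sigma}_{i i}+\left(\bar{\beta}_{i} / \widehat{\sigma}\right)^{2}\).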
(d) Write Python code to compute \(\boldsymbol{\lambda}^{*}\) via the EM algorithm, and use it to select the best polynomial model in Example 2.10. A possible stopping criterion is to terminate the EM iterations when
\[ \ln g\left(\boldsymbol{y} \mid \boldsymbol{\lambda}_{t+1}\right)-\ln g\left(\boldsymbol{y} \mid \boldsymbol{\lambda}_{t}\right)<\varepsilon \]
for some small \(\varepsilon>0\), where the marginal log-likelihood is
\[ \ln g(\boldsymbol{y} \mid \boldsymbol{\lambda})=-\frac{n}{2} \ln \left(n \pi \widehat{\sigma}^{2}\right)-\frac{1}{2} \ln |\mathbf{D}|+\frac{1}{2} \ln |\boldsymbol{\Sigma}|+\ln \Gamma(n / 2) \]
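Below is one possible sketch of such an EM routine in NumPy/SciPy. It is not the book's reference implementation: the names em_lambda and marginal_loglik, the starting value \(\boldsymbol{\lambda}_{0}=\mathbf{1}\), and the defaults eps and max_iter are illustrative choices.

```python
import numpy as np
from scipy.special import gammaln

def marginal_loglik(lam, X, y):
    """ln g(y | lambda) as in the formula above."""
    n = len(y)
    Sig = np.linalg.inv(np.diag(1.0 / lam) + X.T @ X)   # Sigma in (6.42)
    b_bar = Sig @ X.T @ y                                # beta-bar
    sig2 = (y @ y - y @ X @ b_bar) / n                   # sigma-hat^2
    return (-n / 2 * np.log(n * np.pi * sig2)
            - 0.5 * np.sum(np.log(lam))                  # -(1/2) ln|D|
            + 0.5 * np.linalg.slogdet(Sig)[1]            # +(1/2) ln|Sigma|
            + gammaln(n / 2))

def em_lambda(X, y, eps=1e-8, max_iter=10_000):
    """EM iterations for lambda*, stopped when the marginal log-likelihood gain < eps."""
    n, p = X.shape
    lam = np.ones(p)              # illustrative starting value for lambda
    ll_old = -np.inf
    for _ in range(max_iter):
        # E-step: update Sigma, beta-bar, sigma-hat^2 via (6.42)
        Sig = np.linalg.inv(np.diag(1.0 / lam) + X.T @ X)
        b_bar = Sig @ X.T @ y
        sig2 = (y @ y - y @ X @ b_bar) / n
        # M-step: lambda_i = Sigma_ii + (beta-bar_i / sigma-hat)^2
        lam = np.diag(Sig) + b_bar**2 / sig2
        # stopping criterion based on the gain in ln g(y | lambda)
        ll = marginal_loglik(lam, X, y)
        if ll - ll_old < eps:
            break
        ll_old = ll
    return lam, ll
```

To select among the polynomial models of Example 2.10, one would presumably build a design matrix X for each candidate degree (columns \(1, u, u^{2}, \ldots\) of the explanatory variable), run em_lambda on each, and retain the model with the largest marginal log-likelihood \(\ln g\left(\boldsymbol{y} \mid \boldsymbol{\lambda}^{*}\right)\).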
Source: Dirk P. Kroese, Thomas Taimre, Radislav Vaisman, Zdravko Botev, Data Science and Machine Learning: Mathematical and Statistical Methods, 1st Edition. ISBN: 9781118710852.