Question:
Consider Example 2.10 with \(\mathbf{D}=\operatorname{diag}\left(\lambda_{1}, \ldots, \lambda_{p}\right)\) for some nonnegative vector \(\lambda \in \mathbb{R}^{p}\), so that twice the negative logarithm of the model evidence can be written as
\[ -2 \ln g(\boldsymbol{y})=l(\boldsymbol{\lambda}):=n \ln \left[\boldsymbol{y}^{\top}\left(\mathbf{I}-\mathbf{X} \boldsymbol{\Sigma} \mathbf{X}^{\top}\right) \boldsymbol{y}\right]+\ln |\mathbf{D}|-\ln |\boldsymbol{\Sigma}|+c \]
where \(c\) is a constant that depends only on \(n\).
(a) Use the Woodbury identities (A.15) and (A.16) to show that
\[ \begin{aligned} & \mathbf{I}-\mathbf{X} \boldsymbol{\Sigma} \mathbf{X}^{\top}=\left(\mathbf{I}+\mathbf{X} \mathbf{D} \mathbf{X}^{\top}\right)^{-1} \\ & \ln |\mathbf{D}|-\ln |\boldsymbol{\Sigma}|=\ln \left|\mathbf{I}+\mathbf{X} \mathbf{D} \mathbf{X}^{\top}\right| \end{aligned} \]
Deduce that \(l(\lambda)=n \ln \left[\boldsymbol{y}^{\top} \mathbf{C} \boldsymbol{y}\right]-\ln |\mathbf{C}|+c\), where \(\mathbf{C}:=\left(\mathbf{I}+\mathbf{X D X}^{\top}\right)^{-1}\).
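A sketch of the calculation for part (a), assuming (as in Example 2.10) that \(\boldsymbol{\Sigma}=\left(\mathbf{D}^{-1}+\mathbf{X}^{\top} \mathbf{X}\right)^{-1}\) and that (A.15) and (A.16) are the standard Woodbury inverse and determinant identities (for \(\lambda>0\); zero entries follow by continuity):
\[ \begin{aligned} \left(\mathbf{I}+\mathbf{X} \mathbf{D} \mathbf{X}^{\top}\right)^{-1} &=\mathbf{I}-\mathbf{X}\left(\mathbf{D}^{-1}+\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top}=\mathbf{I}-\mathbf{X} \boldsymbol{\Sigma} \mathbf{X}^{\top} \\ \left|\mathbf{I}+\mathbf{X} \mathbf{D} \mathbf{X}^{\top}\right| &=\left|\mathbf{D}^{-1}+\mathbf{X}^{\top} \mathbf{X}\right|\,|\mathbf{D}|=\frac{|\mathbf{D}|}{|\boldsymbol{\Sigma}|} \end{aligned} \]
Substituting these two identities into the display for \(-2 \ln g(\boldsymbol{y})\) yields \(l(\lambda)=n \ln \left[\boldsymbol{y}^{\top} \mathbf{C} \boldsymbol{y}\right]-\ln |\mathbf{C}|+c\).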
(b) Let \(\left[\boldsymbol{v}_{1}, \ldots, \boldsymbol{v}_{p}\right]:=\mathbf{X}\) denote the \(p\) columns/predictors of \(\mathbf{X}\). Show that
\[ \mathbf{C}^{-1}=\mathbf{I}+\sum_{k=1}^{p} \lambda_{k} \boldsymbol{v}_{k} \boldsymbol{v}_{k}^{\top} \]
Explain why setting \(\lambda_{k}=0\) has the effect of excluding the \(k\)-th predictor from the regression model. How can this observation be used for model selection?
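A brief sketch for part (b), assuming (as in Example 2.10) that the regression coefficients have a zero-mean Gaussian prior whose covariance matrix is proportional to \(\mathbf{D}\). Since \(\mathbf{D}\) is diagonal,
\[ \mathbf{X} \mathbf{D} \mathbf{X}^{\top}=\sum_{k=1}^{p} \lambda_{k} \boldsymbol{v}_{k} \boldsymbol{v}_{k}^{\top}, \]
so setting \(\lambda_{k}=0\) removes the \(k\)-th term and \(\mathbf{C}\), and hence \(l\), no longer depends on \(\boldsymbol{v}_{k}\); equivalently, a zero prior variance pins the \(k\)-th coefficient at zero, which excludes that predictor from the model. Minimizing \(l(\lambda)\) over \(\lambda \geqslant 0\) can therefore be read as model selection: predictors whose optimal \(\lambda_{k}\) is (near) zero are candidates for removal.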
(c) Prove the following formulas for the gradient and Hessian elements of \(l(\lambda)\) :
\[ \begin{align*} & \frac{\partial l}{\partial \lambda_{i}}=\boldsymbol{v}_{i}^{\top} \mathbf{C} \boldsymbol{v}_{i}-n \frac{\left(\boldsymbol{v}_{i}^{\top} \mathbf{C} \boldsymbol{y}\right)^{2}}{\boldsymbol{y}^{\top} \mathbf{C} \boldsymbol{y}} \tag{6.41}\\ & \frac{\partial^{2} l}{\partial \lambda_{i} \partial \lambda_{j}}=(n-1)\left(\boldsymbol{v}_{i}^{\top} \mathbf{C} \boldsymbol{v}_{j}\right)^{2}-n\left[\boldsymbol{v}_{i}^{\top} \mathbf{C} \boldsymbol{v}_{j}-\frac{\left(\boldsymbol{v}_{i}^{\top} \mathbf{C} \boldsymbol{y}\right)\left(\boldsymbol{v}_{j}^{\top} \mathbf{C} \boldsymbol{y}\right)}{\boldsymbol{y}^{\top} \mathbf{C} \boldsymbol{y}}\right]^{2} \end{align*} \]
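The gradient formula in (6.41) is easy to sanity-check numerically before handing it to an optimizer. A minimal sketch on synthetic data (the function names make_C, l, and grad_l are illustrative, not from the book); the Hessian can be checked the same way:

import numpy as np

def make_C(X, lam):
    # C = (I + X D X^T)^(-1) with D = diag(lam)
    n = X.shape[0]
    return np.linalg.inv(np.eye(n) + (X * lam) @ X.T)

def l(lam, X, y):
    # l(lambda) = n ln(y^T C y) - ln|C| (the additive constant c is dropped)
    n = X.shape[0]
    C = make_C(X, lam)
    return n * np.log(y @ C @ y) - np.linalg.slogdet(C)[1]

def grad_l(lam, X, y):
    # gradient (6.41): dl/dlam_i = v_i^T C v_i - n (v_i^T C y)^2 / (y^T C y)
    n = X.shape[0]
    C = make_C(X, lam)
    CX, Cy = C @ X, C @ y
    return np.einsum('ij,ij->j', X, CX) - n * (X.T @ Cy) ** 2 / (y @ Cy)

# finite-difference check on random data
rng = np.random.default_rng(0)
n, p = 30, 4
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
lam = rng.uniform(0.1, 2.0, size=p)

eps = 1e-6
fd = np.array([(l(lam + eps * e, X, y) - l(lam - eps * e, X, y)) / (2 * eps)
               for e in np.eye(p)])
print(np.allclose(fd, grad_l(lam, X, y), atol=1e-4))   # expect True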
(d) One method to determine which predictors in \(\mathbf{X}\) are important is to compute \[ \lambda^{*}:=\underset{\lambda \geqslant 0}{\operatorname{argmin}} l(\lambda) \]
using, for example, the interior-point minimization Algorithm B.4.1 with the gradient and Hessian computed from (6.41). Write Python code to compute \(\lambda^{*}\) and use it to select the best polynomial model in Example 2.10.
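A minimal sketch of part (d), using SciPy's bound-constrained L-BFGS-B routine in place of the book's interior-point Algorithm B.4.1, and synthetic polynomial data in place of the Example 2.10 data set (which is not reproduced here); the function and variable names are illustrative:

import numpy as np
from scipy.optimize import minimize

def l_and_grad(lam, X, y):
    # l(lambda) and its gradient (6.41), with C = (I + X D X^T)^(-1)
    n = X.shape[0]
    C = np.linalg.inv(np.eye(n) + (X * lam) @ X.T)
    Cy = C @ y
    quad = y @ Cy
    val = n * np.log(quad) - np.linalg.slogdet(C)[1]
    g = np.einsum('ij,ij->j', X, C @ X) - n * (X.T @ Cy) ** 2 / quad
    return val, g

# synthetic stand-in for the polynomial data of Example 2.10
rng = np.random.default_rng(1)
n, max_degree = 100, 6
u = rng.uniform(0, 1, n)
y = 2 - 4 * u + 5 * u ** 3 + 0.5 * rng.standard_normal(n)   # cubic signal + noise
X = np.vander(u, max_degree + 1, increasing=True)           # columns 1, u, ..., u^6

res = minimize(l_and_grad, x0=np.ones(X.shape[1]), args=(X, y), jac=True,
               method='L-BFGS-B', bounds=[(0, None)] * X.shape[1])
print(np.round(res.x, 4))   # lambda_k near zero suggests dropping predictor k

To follow the exercise more closely, a Newton-type bound-constrained method (for example SciPy's trust-constr with the Hessian from (6.41)) could be substituted for L-BFGS-B.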
Source: Dirk P. Kroese, Thomas Taimre, Radislav Vaisman, Zdravko Botev, Data Science and Machine Learning: Mathematical and Statistical Methods, 1st Edition, ISBN: 9781118710852.