A good penalty function should result in an estimator with three properties:
Unbiasedness: The resulting estimator is nearly unbiased when the true unknown parameter is large, in order to avoid unnecessary modeling bias.
Sparsity: The resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
Continuity: The resulting estimator is continuous in the data \(z\) to avoid instability in model prediction.
We now verify whether OLS, ridge, LASSO, and SCAD satisfy these properties.
Linear model:
\[\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\quad y_i=\beta_0+\sum\limits_{j=1}^p\beta_jx_{ij}+\varepsilon_i,\quad i=1,\dots,n, \]where \(\mathbf{y}=(y_1,\dots,y_n)^\top\), \(\mathbf{X}\) is the \(n\times(p+1)\) design matrix whose \(i\)-th row is \((1,\mathbf{x}_i^\top)\) with \(\mathbf{x}_i=(x_{i1},\dots,x_{ip})^\top,i=1,\dots,n\), \(\boldsymbol{\varepsilon}=(\varepsilon_1,\dots,\varepsilon_n)^\top\), and \(\boldsymbol{\beta}=(\beta_0,\beta_1,\dots,\beta_p)^\top\).
We first consider the ordinary least squares (OLS) estimator:
\[\widehat{\boldsymbol{\beta}}^{\text{ols}}=\arg\min\limits_{\boldsymbol{\beta}}\sum_{i=1}^n\bigg(y_i-\beta_0-\sum\limits_{j=1}^p\beta_jx_{ij}\bigg)^2=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}, \]we know that \(\widehat{\boldsymbol{\beta}}^\text{ols}\) is unbiased, since
\[E(\widehat{\boldsymbol{\beta}}^\text{ols}-\boldsymbol{\beta})=E\big((\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon})-\boldsymbol{\beta}\big)=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top E(\boldsymbol{\varepsilon})=\boldsymbol{0}. \]Clearly, \(\widehat{\boldsymbol{\beta}}^{\text{ols}}\) is also continuous in the data, but it does not have sparsity, since no coefficient is ever set exactly to zero.
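To make the unbiasedness claim concrete, here is a minimal simulation sketch in Python/NumPy (the sample size, seed, and coefficient values are arbitrary choices for illustration): averaging the OLS estimates over many noise realizations recovers the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
beta_true = np.array([1.0, 2.0, -1.5, 0.5])          # (beta_0, beta_1, ..., beta_p)

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Average the OLS estimate over many replications of the noise.
n_rep = 2000
estimates = np.empty((n_rep, p + 1))
for r in range(n_rep):
    y = X @ beta_true + rng.normal(size=n)
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print(estimates.mean(axis=0))   # close to beta_true: OLS is unbiased
```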
Now we consider the penalized least squares regression model, whose objective function is
\[\begin{align*} Q(\boldsymbol{\beta})&=\frac{1}{2}||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|)\\ &=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}+\hat{\mathbf{y}}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}||\mathbf{X}\mathbf{z}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}\sum_{j=1}^p(z_j-\beta_j)^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|). \end{align*} \]Here we write \(\mathbf{z}=\mathbf{X}^\top\mathbf{y}\) and assume that the columns of \(\mathbf{X}\) are orthonormal, i.e. \(\mathbf{X}^\top\mathbf{X}=\mathbf{I}\), so that \(\widehat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}^\top\mathbf{y}=\mathbf{z}\) and \(\hat{\mathbf{y}}=\mathbf{X}\widehat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}\mathbf{z}\). The cross term \((\mathbf{y}-\hat{\mathbf{y}})^\top(\hat{\mathbf{y}}-\mathbf{X}\boldsymbol{\beta})\) vanishes because the residual \(\mathbf{y}-\hat{\mathbf{y}}\) is orthogonal to the column space of \(\mathbf{X}\), and
\[||\mathbf{Xz}-\mathbf{X}\boldsymbol{\beta}||^2=||\mathbf{z}||^2+||\boldsymbol{\beta}||^2-2\mathbf{z}^\top\boldsymbol{\beta}=||\mathbf{z}-\boldsymbol{\beta}||^2. \]Thus, the penalized least squares problem is equivalent to minimizing, componentwise,
\[Q(\theta)=\frac{1}{2}(z-\theta)^2+p_\lambda(|\theta|). \]In order to get the minimizer of \(Q(\theta)\), we set \(\frac{dQ(\theta)}{d\theta}=0\) (for \(\theta\neq0\)) and obtain
\[(\theta-z)+\text{sgn}(\theta)p_\lambda^\prime(|\theta|)=\text{sgn}(\theta)\{|\theta|+p_\lambda^\prime(|\theta|)\}-z=0. \]Here are some observations based on this equation:
1. If \(p_\lambda^\prime(|\theta|)=0\) for large \(|\theta|\), then for large \(|z|\) the solution is \(\hat{\theta}=z\), so the resulting estimator is nearly unbiased when the true parameter is large.
2. If \(\min\limits_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}>0\), the resulting estimator is a thresholding rule: whenever \(|z|\) is below this minimum, \(Q^\prime(\theta)>0\) for all \(\theta>0\) and \(Q^\prime(\theta)<0\) for all \(\theta<0\), so the minimizer is \(\hat{\theta}=0\). In other words, the estimate is set to zero exactly when \[\min\limits_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}>|z|. \]
3. The resulting estimator is continuous in \(z\) if and only if \(\arg\min\limits_{\theta}\{|\theta|+p_\lambda^\prime(|\theta|)\}=0\); otherwise the thresholding rule jumps as \(|z|\) crosses the threshold.

In conclusion, the conditions for the three properties of a good estimator are (Fan & Li, 2001):
Unbiasedness: \(p_\lambda^\prime(|\theta|)=0\) for large \(|\theta|\);
Sparsity: \(\min\limits_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}>0\);
Continuity: \(\arg\min\limits_{\theta}\{|\theta|+p_\lambda^\prime(|\theta|)\}=0\).
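Before verifying each penalty analytically, the componentwise criterion \(Q(\theta)\) can also be inspected numerically: minimizing it by brute force over a fine grid shows directly whether a penalty thresholds small \(|z|\) to zero, leaves large \(|z|\) nearly untouched, and maps \(z\) to \(\hat{\theta}\) continuously. A minimal sketch (the grid range, \(\lambda\), and the test values of \(z\) are arbitrary illustrative choices):

```python
import numpy as np

def pls_minimizer(z, penalty, grid=np.linspace(-10, 10, 200001)):
    """Brute-force minimizer of Q(theta) = 0.5*(z - theta)^2 + p_lambda(|theta|).

    Good enough to see whether small |z| is mapped to 0 (sparsity), whether
    large |z| is left untouched (unbiasedness), and whether z -> theta_hat
    is continuous."""
    q = 0.5 * (z - grid) ** 2 + penalty(np.abs(grid))
    return grid[np.argmin(q)]

# Example: LASSO penalty p_lambda(t) = lam * t with lam = 1 (illustrative value).
lam = 1.0
lasso = lambda t: lam * t
print(pls_minimizer(0.5, lasso))   # 0.0  -> small z is thresholded to zero
print(pls_minimizer(4.0, lasso))   # ~3.0 -> still shifted by lam, i.e. biased
```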
Now we review the OLS estimator, which corresponds to \(p_\lambda(|\theta|)\equiv0\). It is obvious that
\[p_\lambda^\prime(|\theta|)\equiv0,\quad \inf_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}=\inf_{\theta\neq0}|\theta|=0,\quad\text{and}\quad \arg\min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}=0. \]Therefore, OLS satisfies unbiasedness and continuity while it does not satisfy sparsity.
Secondly, we consider ridge regression with \(p_\lambda(|\theta|)=\lambda|\theta|^2\). We can see that
\[\begin{align*} p_\lambda^\prime(|\theta|)&=2\lambda|\theta|\neq0 \quad \text{for large }|\theta|,\\ \min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}&=\min_\theta\{(1+2\lambda)|\theta|\}=0,\quad \text{attained at }\theta=0. \end{align*} \]Therefore, the ridge regression estimator satisfies continuity while it does not satisfy unbiasedness and sparsity.
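For the ridge penalty the componentwise minimizer even has a simple closed form: setting \(\frac{dQ(\theta)}{d\theta}=\theta-z+2\lambda\theta=0\) gives \(\hat{\theta}=z/(1+2\lambda)\), a pure shrinkage rule. A minimal sketch (the test values of \(z\) and \(\lambda\) are arbitrary):

```python
import numpy as np

def ridge_threshold(z, lam):
    """Componentwise ridge solution under an orthonormal design:
    argmin_theta 0.5*(z - theta)^2 + lam*theta^2 = z / (1 + 2*lam).

    Every coefficient is shrunk but never set exactly to zero (no sparsity),
    and the shrinkage persists for large |z| (bias)."""
    return z / (1.0 + 2.0 * lam)

print(ridge_threshold(np.array([0.1, 1.0, 10.0]), lam=1.0))   # approx. [0.0333 0.3333 3.3333]
```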
Next, we consider LASSO regression with \(p_\lambda(|\theta|)=\lambda|\theta|\), so that \(p_\lambda^\prime(|\theta|)=\lambda\). For large \(|\theta|\), we have $$p_\lambda^\prime(|\theta|)=\lambda\neq0,\quad \text{since } \lambda>0,$$ so the unbiasedness condition fails. For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)=|\theta|+\lambda\),
\[\begin{equation*} \begin{cases} H^\prime(\theta)=1>0,& \text{when } \theta>0,\\ H^\prime(\theta)=-1<0,& \text{when } \theta<0, \end{cases} \end{equation*} \]so that \(\arg\min\limits_\theta H(\theta)=0\), and \(H(\theta)\geq\lambda>0\) for every \(\theta\), so the sparsity condition holds. Therefore, the LASSO estimator satisfies sparsity and continuity while it does not satisfy unbiasedness.
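Under the same orthonormal-design setup, the LASSO componentwise minimizer is the familiar soft-thresholding rule \(\hat{\theta}=\text{sgn}(z)(|z|-\lambda)_+\), which makes both conclusions visible: exact zeros for \(|z|\leq\lambda\) (sparsity) and a constant shift of \(\lambda\) for large \(|z|\) (bias). A minimal sketch (test values arbitrary):

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO (soft-thresholding) rule under an orthonormal design:
    argmin_theta 0.5*(z - theta)^2 + lam*|theta| = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([-0.5, 0.5, 3.0]), lam=1.0))   # [-0.  0.  2.]
```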
Last, we consider SCAD with penalty function
\[ \begin{equation*} p_\lambda(|\theta|)=\begin{cases} \lambda|\theta|,& \text{if } 0\leq|\theta|<\lambda,\\ -\dfrac{|\theta|^2-2a\lambda|\theta|+\lambda^2}{2(a-1)},& \text{if } \lambda\leq|\theta|<a\lambda,\\ (a+1)\lambda^2/2,&\text{if } |\theta|\geq a\lambda, \end{cases} \end{equation*} \]where \(a>2\). So that
\[\begin{align*} p_\lambda^\prime(\theta)&=\lambda\bigg\{I(\theta\leq\lambda)+\frac{(a\lambda-\theta)_+}{(a-1)\lambda}I(\theta>\lambda)\bigg\},\quad \theta>0,\\ p^\prime_\lambda(\theta)&=\bigg(\frac{(a+1)\lambda^2}{2}\bigg)^\prime=0,\quad \text{for large } \theta\ (\theta>a\lambda). \end{align*} \]For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)=|\theta|+\lambda\Big\{I(|\theta|\leq\lambda)+\frac{(a\lambda-|\theta|)_+}{(a-1)\lambda}I(|\theta|>\lambda)\Big\}\), we have
\[\begin{equation*} H^\prime(\theta)= \begin{cases} 1>0,& \text{when } 0<\theta \leq\lambda,\\ 1-\frac{1}{a-1}>0,&\text{when } \lambda<\theta\leq a\lambda \ (\text{since } a>2),\\ 1>0,&\text{when } \theta>a\lambda, \end{cases} \end{equation*} \]and \(H(-\theta)=H(\theta)\), so \(H\) is increasing on \((0,\infty)\) and decreasing on \((-\infty,0)\). Hence \(\arg\min\limits_\theta H(\theta)=0\), and \(H(0)=\lambda>0\), so the sparsity condition holds as well. Therefore, the SCAD estimator satisfies all three properties.
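Fan & Li (2001) also give the resulting SCAD thresholding rule in closed form: soft-thresholding for \(|z|\leq2\lambda\), a smooth transition for \(2\lambda<|z|\leq a\lambda\), and the identity for \(|z|>a\lambda\). A minimal sketch (\(\lambda=1\) and the test points are arbitrary; \(a=3.7\) is the value suggested in the paper):

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule (Fan & Li, 2001) under an orthonormal design."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    return np.where(
        absz <= 2 * lam,
        np.sign(z) * np.maximum(absz - lam, 0.0),            # soft-thresholding part
        np.where(
            absz <= a * lam,
            ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),   # smooth transition
            z,                                                # identity: no bias for large |z|
        ),
    )

print(scad_threshold(np.array([0.5, 1.5, 3.0, 10.0]), lam=1.0))
# [ 0.     0.5    2.588  10.  ]  -> zero, shrunk, shrunk less, untouched
```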
|  | OLS | Ridge | LASSO | SCAD |
|---|---|---|---|---|
| Unbiasedness | \(\surd\) | \(\times\) | \(\times\) | \(\surd\) |
| Sparsity | \(\times\) | \(\times\) | \(\surd\) | \(\surd\) |
| Continuity | \(\surd\) | \(\surd\) | \(\surd\) | \(\surd\) |
[1] Fan, J. & Li, R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 2001, 96, 1348-1360.