A good penalty function should result in an estimator with three properties:
Unbiasedness: The resulting estimator is nearly unbiased when the true unknown parameter is large, in order to avoid unnecessary modeling bias.
Sparsity: The resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity.
Continuity: The resulting estimator is continuous in the data \(z\) to avoid instability in model prediction.
We now verify whether OLS, ridge, LASSO, and SCAD satisfy these properties.
Linear model:
\[\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\quad y_i=\beta_0+\sum\limits_{j=1}^p\beta_jx_{ij}+\varepsilon_i,\quad i=1,\dots,n, \]where \(\mathbf{y}=(y_1,\dots,y_n)^\top\), \(\mathbf{X}\) is the \(n\times(p+1)\) design matrix whose \(i\)-th row is \((1,\mathbf{x}_i^\top)\) with \(\mathbf{x}_i=(x_{i1},\dots,x_{ip})^\top,i=1,\dots,n\), \(\boldsymbol{\varepsilon}=(\varepsilon_1,\dots,\varepsilon_n)^\top\), and \(\boldsymbol{\beta}=(\beta_0,\beta_1,\dots,\beta_p)^\top\).
We first consider the ordinary least squares (OLS) estimator:
\[\widehat{\boldsymbol{\beta}}^{\text{ols}}=\arg\min\limits_{\boldsymbol{\beta}}\sum_{i=1}^n\bigg(y_i-\beta_0-\sum\limits_{j=1}^p\beta_jx_{ij}\bigg)^2=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}, \]we know that \(\widehat{\boldsymbol{\beta}}^\text{ols}\) is unbiased, since
\[E(\widehat{\boldsymbol{\beta}}^\text{ols}-\boldsymbol{\beta})=E\big((\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon})-\boldsymbol{\beta}\big)=(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top E(\boldsymbol{\varepsilon})=\boldsymbol{0}. \]Clearly, \(\widehat{\boldsymbol{\beta}}^{\text{ols}}\) is also continuous in the data, but it does not have sparsity, since no coefficient is ever set exactly to zero.
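To make the unbiasedness claim concrete, here is a minimal simulation sketch in Python/NumPy (the sample size, seed, and coefficient values are arbitrary choices for illustration): averaging the OLS estimates over many noise realizations recovers the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
beta_true = np.array([1.0, 2.0, -1.5, 0.5])          # (beta_0, beta_1, ..., beta_p)

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Average the OLS estimate over many replications of the noise.
n_rep = 2000
estimates = np.empty((n_rep, p + 1))
for r in range(n_rep):
    y = X @ beta_true + rng.normal(size=n)
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print(estimates.mean(axis=0))   # close to beta_true: OLS is unbiased
```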
Now we consider the penalized least squares regression model, whose objective function is
\[\begin{align*} Q(\boldsymbol{\beta})&=\frac{1}{2}||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|)\\ &=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}+\hat{\mathbf{y}}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}||\mathbf{X}\mathbf{z}-\mathbf{X}\boldsymbol{\beta}||^2+\sum_{j=1}^p p_\lambda(|\beta_j|)\\&=\frac{1}{2}||\mathbf{y}-\hat{\mathbf{y}}||^2+\frac{1}{2}\sum_{j=1}^p(z_j-\beta_j)^2+\sum\limits_{j=1}^p p_\lambda(|\beta_j|). \end{align*} \]Here we write \(\mathbf{z}=\mathbf{X}^\top\mathbf{y}\) and assume that the columns of \(\mathbf{X}\) are orthonormal, i.e. \(\mathbf{X}^\top\mathbf{X}=\mathbf{I}\), so that \(\widehat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}^\top\mathbf{y}=\mathbf{z}\) and \(\hat{\mathbf{y}}=\mathbf{X}\widehat{\boldsymbol{\beta}}^{\text{ols}}=\mathbf{X}\mathbf{z}\). The cross term \((\mathbf{y}-\hat{\mathbf{y}})^\top(\hat{\mathbf{y}}-\mathbf{X}\boldsymbol{\beta})\) vanishes because the residual \(\mathbf{y}-\hat{\mathbf{y}}\) is orthogonal to the column space of \(\mathbf{X}\), and
\[||\mathbf{Xz}-\mathbf{X}\boldsymbol{\beta}||^2=||\mathbf{z}||^2+||\boldsymbol{\beta}||^2-2\mathbf{z}^\top\boldsymbol{\beta}=||\mathbf{z}-\boldsymbol{\beta}||^2. \]Thus, the penalized least squares problem is equivalent to minimizing, componentwise,
\[Q(\theta)=\frac{1}{2}(z-\theta)^2+p_\lambda(|\theta|). \]In order to get the minimizer of \(Q(\theta)\), we set \(\frac{dQ(\theta)}{d\theta}=0\) (for \(\theta\neq0\)) and obtain
\[(\theta-z)+\text{sgn}(\theta)p_\lambda^\prime(|\theta|)=\text{sgn}(\theta)\{|\theta|+p_\lambda^\prime(|\theta|)\}-z=0. \]Here are some observations based on this equation:
1. If \(p_\lambda^\prime(|\theta|)=0\) for large \(|\theta|\), then for large \(|z|\) the solution is \(\hat{\theta}=z\), so the resulting estimator is nearly unbiased when the true parameter is large.
2. If \(\min\limits_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}>0\), the resulting estimator is a thresholding rule: whenever \(|z|\) is below this minimum, \(Q^\prime(\theta)>0\) for all \(\theta>0\) and \(Q^\prime(\theta)<0\) for all \(\theta<0\), so the minimizer is \(\hat{\theta}=0\). In other words, the estimate is set to zero exactly when \[\min\limits_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}>|z|. \]
3. The resulting estimator is continuous in \(z\) if and only if \(\arg\min\limits_{\theta}\{|\theta|+p_\lambda^\prime(|\theta|)\}=0\); otherwise the thresholding rule jumps as \(|z|\) crosses the threshold.

In conclusion, the conditions for the three properties of a good estimator are (Fan & Li, 2001):
Unbiasedness: \(p_\lambda^\prime(|\theta|)=0\) for large \(|\theta|\);
Sparsity: \(\min\limits_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}>0\);
Continuity: \(\arg\min\limits_{\theta}\{|\theta|+p_\lambda^\prime(|\theta|)\}=0\).
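Before verifying each penalty analytically, the componentwise criterion \(Q(\theta)\) can also be inspected numerically: minimizing it by brute force over a fine grid shows directly whether a penalty thresholds small \(|z|\) to zero, leaves large \(|z|\) nearly untouched, and maps \(z\) to \(\hat{\theta}\) continuously. A minimal sketch (the grid range, \(\lambda\), and the test values of \(z\) are arbitrary illustrative choices):

```python
import numpy as np

def pls_minimizer(z, penalty, grid=np.linspace(-10, 10, 200001)):
    """Brute-force minimizer of Q(theta) = 0.5*(z - theta)^2 + p_lambda(|theta|).

    Good enough to see whether small |z| is mapped to 0 (sparsity), whether
    large |z| is left untouched (unbiasedness), and whether z -> theta_hat
    is continuous."""
    q = 0.5 * (z - grid) ** 2 + penalty(np.abs(grid))
    return grid[np.argmin(q)]

# Example: LASSO penalty p_lambda(t) = lam * t with lam = 1 (illustrative value).
lam = 1.0
lasso = lambda t: lam * t
print(pls_minimizer(0.5, lasso))   # 0.0  -> small z is thresholded to zero
print(pls_minimizer(4.0, lasso))   # ~3.0 -> still shifted by lam, i.e. biased
```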
Now we review the OLS estimator, which corresponds to \(p_\lambda(|\theta|)\equiv0\). It is obvious that
\[p_\lambda^\prime(|\theta|)\equiv0,\quad \inf_{\theta\neq0}\{|\theta|+p_\lambda^\prime(|\theta|)\}=\inf_{\theta\neq0}|\theta|=0,\quad\text{and}\quad \arg\min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}=0. \]Therefore, OLS satisfies unbiasedness and continuity while it does not satisfy sparsity.
Secondly, we consider ridge regression with \(p_\lambda(|\theta|)=\lambda|\theta|^2\). We can see that
\[\begin{align*} p_\lambda^\prime(|\theta|)&=2\lambda|\theta|\neq0 \quad \text{for large }|\theta|,\\ \min_\theta\{|\theta|+p_\lambda^\prime(|\theta|)\}&=\min_\theta\{(1+2\lambda)|\theta|\}=0,\quad \text{attained at }\theta=0. \end{align*} \]Therefore, the ridge regression estimator satisfies continuity while it does not satisfy unbiasedness and sparsity.
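For the ridge penalty the componentwise minimizer even has a simple closed form: setting \(\frac{dQ(\theta)}{d\theta}=\theta-z+2\lambda\theta=0\) gives \(\hat{\theta}=z/(1+2\lambda)\), a pure shrinkage rule. A minimal sketch (the test values of \(z\) and \(\lambda\) are arbitrary):

```python
import numpy as np

def ridge_threshold(z, lam):
    """Componentwise ridge solution under an orthonormal design:
    argmin_theta 0.5*(z - theta)^2 + lam*theta^2 = z / (1 + 2*lam).

    Every coefficient is shrunk but never set exactly to zero (no sparsity),
    and the shrinkage persists for large |z| (bias)."""
    return z / (1.0 + 2.0 * lam)

print(ridge_threshold(np.array([0.1, 1.0, 10.0]), lam=1.0))   # approx. [0.0333 0.3333 3.3333]
```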
Next, we consider LASSO regression with \(p_\lambda(|\theta|)=\lambda|\theta|\), so that \(p_\lambda^\prime(|\theta|)=\lambda\). For large \(|\theta|\), we have $$p_\lambda^\prime(|\theta|)=\lambda\neq0,\quad \text{since } \lambda>0,$$ so the unbiasedness condition fails. For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)=|\theta|+\lambda\),
\[\begin{equation*} \begin{cases} H^\prime(\theta)=1>0,& \text{when } \theta>0,\\ H^\prime(\theta)=-1<0,& \text{when } \theta<0, \end{cases} \end{equation*} \]so that \(\arg\min\limits_\theta H(\theta)=0\), and \(H(\theta)\geq\lambda>0\) for every \(\theta\), so the sparsity condition holds. Therefore, the LASSO estimator satisfies sparsity and continuity while it does not satisfy unbiasedness.
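Under the same orthonormal-design setup, the LASSO componentwise minimizer is the familiar soft-thresholding rule \(\hat{\theta}=\text{sgn}(z)(|z|-\lambda)_+\), which makes both conclusions visible: exact zeros for \(|z|\leq\lambda\) (sparsity) and a constant shift of \(\lambda\) for large \(|z|\) (bias). A minimal sketch (test values arbitrary):

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO (soft-thresholding) rule under an orthonormal design:
    argmin_theta 0.5*(z - theta)^2 + lam*|theta| = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(np.array([-0.5, 0.5, 3.0]), lam=1.0))   # [-0.  0.  2.]
```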
Last, we consider SCAD with penalty function
\[ \begin{equation*} p_\lambda(|\theta|)=\begin{cases} \lambda|\theta|,& \text{if } 0\leq|\theta|<\lambda,\\ -\dfrac{|\theta|^2-2a\lambda|\theta|+\lambda^2}{2(a-1)},& \text{if } \lambda\leq|\theta|<a\lambda,\\ (a+1)\lambda^2/2,&\text{if } |\theta|\geq a\lambda, \end{cases} \end{equation*} \]where \(a>2\). So that
\[\begin{align*} p_\lambda^\prime(\theta)&=\lambda\bigg\{I(\theta\leq\lambda)+\frac{(a\lambda-\theta)_+}{(a-1)\lambda}I(\theta>\lambda)\bigg\},\quad \theta>0,\\ p^\prime_\lambda(\theta)&=\bigg(\frac{(a+1)\lambda^2}{2}\bigg)^\prime=0,\quad \text{for large } \theta\ (\theta>a\lambda). \end{align*} \]For \(H(\theta)=|\theta|+p_\lambda^\prime(|\theta|)=|\theta|+\lambda\Big\{I(|\theta|\leq\lambda)+\frac{(a\lambda-|\theta|)_+}{(a-1)\lambda}I(|\theta|>\lambda)\Big\}\), we have
\[\begin{equation*} H^\prime(\theta)= \begin{cases} 1>0,& \text{when } 0<\theta \leq\lambda,\\ 1-\frac{1}{a-1}>0,&\text{when } \lambda<\theta\leq a\lambda \ (\text{since } a>2),\\ 1>0,&\text{when } \theta>a\lambda, \end{cases} \end{equation*} \]and \(H(-\theta)=H(\theta)\), so \(H\) is increasing on \((0,\infty)\) and decreasing on \((-\infty,0)\). Hence \(\arg\min\limits_\theta H(\theta)=0\), and \(H(0)=\lambda>0\), so the sparsity condition holds as well. Therefore, the SCAD estimator satisfies all three properties.
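Fan & Li (2001) also give the resulting SCAD thresholding rule in closed form: soft-thresholding for \(|z|\leq2\lambda\), a smooth transition for \(2\lambda<|z|\leq a\lambda\), and the identity for \(|z|>a\lambda\). A minimal sketch (\(\lambda=1\) and the test points are arbitrary; \(a=3.7\) is the value suggested in the paper):

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule (Fan & Li, 2001) under an orthonormal design."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    return np.where(
        absz <= 2 * lam,
        np.sign(z) * np.maximum(absz - lam, 0.0),            # soft-thresholding part
        np.where(
            absz <= a * lam,
            ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),   # smooth transition
            z,                                                # identity: no bias for large |z|
        ),
    )

print(scad_threshold(np.array([0.5, 1.5, 3.0, 10.0]), lam=1.0))
# [ 0.     0.5    2.588  10.  ]  -> zero, shrunk, shrunk less, untouched
```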
|  | OLS | Ridge | LASSO | SCAD |
|---|---|---|---|---|
| Unbiasedness | \(\surd\) | \(\times\) | \(\times\) | \(\surd\) |
| Sparsity | \(\times\) | \(\times\) | \(\surd\) | \(\surd\) |
| Continuity | \(\surd\) | \(\surd\) | \(\surd\) | \(\surd\) |
[1] Fan, J. & Li, R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 2001, 96, 1348-1360.