XGBoost official documentation
XGBoost is essentially still a GBDT: an optimized distributed gradient boosting library designed to be efficient, flexible, and portable. XGBoost uses CART decision trees as its base learners and ensembles multiple CART trees via Gradient Tree Boosting to obtain the final model.
Construction of the final XGBoost model:
Following Tianqi Chen's paper, our data set is $\mathcal{D}=\left\{\left(\mathbf{x}_{i}, y_{i}\right)\right\}\left(|\mathcal{D}|=n,\ \mathbf{x}_{i} \in \mathbb{R}^{m},\ y_{i} \in \mathbb{R}\right)$
(1) Constructing the objective function:
Assume there are $K$ trees. The output for the $i$-th sample is $\hat{y}_{i}=\phi\left(\mathbf{x}_{i}\right)=\sum_{k=1}^{K} f_{k}\left(\mathbf{x}_{i}\right), \quad f_{k} \in \mathcal{F}$, where $\mathcal{F}=\left\{f(\mathbf{x})=w_{q(\mathbf{x})}\right\}\left(q: \mathbb{R}^{m} \rightarrow T,\ w \in \mathbb{R}^{T}\right)$
The objective function is therefore constructed as:
$$
\mathcal{L}(\phi)=\sum_{i} l\left(\hat{y}_{i}, y_{i}\right)+\sum_{k} \Omega\left(f_{k}\right)
$$
where $\sum_{i} l\left(\hat{y}_{i}, y_{i}\right)$ is the loss function and $\sum_{k} \Omega\left(f_{k}\right)$ is the regularization term.
(2) Additive Training:
Given a sample $x_i$: $\hat{y}_i^{(0)} = 0$ (the initial prediction), $\hat{y}_i^{(1)} = \hat{y}_i^{(0)} + f_1(x_i)$, $\hat{y}_i^{(2)} = \hat{y}_i^{(0)} + f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)$, and so on. In general: $$\hat{y}_i^{(K)} = \hat{y}_i^{(K-1)} + f_K(x_i)$$ where $\hat{y}_i^{(K-1)}$ is the prediction of the first $K-1$ trees and $f_K(x_i)$ is the prediction of the $K$-th tree.
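The additive structure is easy to see in code. Here is a minimal sketch (my own illustration, not the author's code; the per-tree predictors in `trees` are made-up stand-ins for fitted regression trees):

```python
# Minimal sketch of additive prediction (illustrative only).
# `trees` stands for the K fitted trees f_1, ..., f_K; here they are fake functions of x.
trees = [lambda x: 0.5 * x, lambda x: 0.1 * x, lambda x: -0.2 * x]

def predict(x, trees):
    """y_hat^(K) = sum_k f_k(x), accumulated one tree at a time."""
    y_hat = 0.0                # y_hat^(0) = 0
    for f_k in trees:
        y_hat += f_k(x)        # y_hat^(k) = y_hat^(k-1) + f_k(x)
    return y_hat

print(predict(2.0, trees))     # 0.8
```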
The objective function can therefore be decomposed as:
$$
\mathcal{L}^{(K)}=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(K-1)}+f_{K}\left(\mathbf{x}_{i}\right)\right)+\sum_{k} \Omega\left(f_{k}\right)
$$
Since the regularization term can likewise be decomposed into the complexity of the first $K-1$ trees plus the complexity of the $K$-th tree: $$\mathcal{L}^{(K)}=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(K-1)}+f_{K}\left(\mathbf{x}_{i}\right)\right)+\sum_{k=1}^{K-1}\Omega\left(f_{k}\right)+\Omega\left(f_{K}\right)$$ Because $\sum_{k=1}^{K-1}\Omega\left(f_{k}\right)$ is already fixed by the time the $K$-th tree is built, it is a known constant and can be dropped from the optimization, so:
$$
\mathcal{L}^{(K)}=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(K-1)}+f_{K}\left(\mathbf{x}_{i}\right)\right)+\Omega\left(f_{K}\right)
$$
(3) Approximating the objective function with a Taylor expansion:
$$
\mathcal{L}^{(K)} \simeq \sum_{i=1}^{n}\left[l\left(y_{i}, \hat{y}_{i}^{(K-1)}\right)+g_{i} f_{K}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{K}^{2}\left(\mathbf{x}_{i}\right)\right]+\Omega\left(f_{K}\right)
$$
where $g_{i}=\partial_{\hat{y}_{i}^{(K-1)}} l\left(y_{i}, \hat{y}_{i}^{(K-1)}\right)$ and $h_{i}=\partial_{\hat{y}_{i}^{(K-1)}}^{2} l\left(y_{i}, \hat{y}_{i}^{(K-1)}\right)$ are the first- and second-order derivatives of the loss with respect to the previous round's prediction.
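To make $g_i$ and $h_i$ concrete, here is a minimal sketch (my own illustration, not from the paper) for two common losses, squared error and binary logistic loss:

```python
import numpy as np

def grad_hess_squared_error(y, y_hat_prev):
    """l = (y - y_hat)^2: g_i = 2*(y_hat - y), h_i = 2."""
    g = 2.0 * (y_hat_prev - y)
    h = np.full_like(y, 2.0)
    return g, h

def grad_hess_logistic(y, y_hat_prev):
    """Binary logistic loss on the raw score y_hat: g_i = p - y, h_i = p*(1-p)."""
    p = 1.0 / (1.0 + np.exp(-y_hat_prev))   # predicted probability
    return p - y, p * (1.0 - p)

y = np.array([1.0, 0.0, 1.0])
y_hat_prev = np.array([0.3, -0.2, 1.5])     # predictions of the first K-1 trees
print(grad_hess_logistic(y, y_hat_prev))
```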
Here is a quick refresher on Taylor series:
In mathematics, a Taylor series represents a function as an infinite sum of terms computed from the function's derivatives at a single point. Its concrete form is:
$$
f(x)=\frac{f\left(x_{0}\right)}{0 !}+\frac{f^{\prime}\left(x_{0}\right)}{1 !}\left(x-x_{0}\right)+\frac{f^{\prime \prime}\left(x_{0}\right)}{2 !}\left(x-x_{0}\right)^{2}+\ldots+\frac{f^{(n)}\left(x_{0}\right)}{n !}\left(x-x_{0}\right)^{n}+\ldots
$$
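To connect this general formula to the objective: expand $l\left(y_{i}, \hat{y}_{i}^{(K-1)}+f_{K}\left(\mathbf{x}_{i}\right)\right)$ to second order around $x_{0}=\hat{y}_{i}^{(K-1)}$, with $f_{K}\left(\mathbf{x}_{i}\right)$ playing the role of $x-x_{0}$:
$$
l\left(y_{i}, \hat{y}_{i}^{(K-1)}+f_{K}\left(\mathbf{x}_{i}\right)\right) \approx l\left(y_{i}, \hat{y}_{i}^{(K-1)}\right)+g_{i} f_{K}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{K}^{2}\left(\mathbf{x}_{i}\right)
$$
Summing over all $n$ samples and adding $\Omega\left(f_{K}\right)$ gives exactly the approximate objective above.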
Since $\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(K-1)}\right)$ is already fixed by the time the $K$-th tree is built, it is a known constant and can be dropped from the optimization, giving:
$$
\tilde{\mathcal{L}}^{(K)}=\sum_{i=1}^{n}\left[g_{i} f_{K}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{K}^{2}\left(\mathbf{x}_{i}\right)\right]+\Omega\left(f_{K}\right)
$$
(4) How to define a single tree:
To describe a tree we need a few concepts: the tree structure $q(\mathbf{x})$, which maps a sample to one of the $T$ leaves; the leaf weight vector $w \in \mathbb{R}^{T}$, so that $f_K(\mathbf{x}) = w_{q(\mathbf{x})}$; the sample set of leaf $j$, $I_{j}=\left\{i \mid q\left(\mathbf{x}_{i}\right)=j\right\}$; and the tree complexity $\Omega\left(f_{K}\right)=\gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2}$. For example, for a tree with $T=3$ leaves and 5 samples we might have:
$q(x_1) = 1,q(x_2) = 3,q(x_3) = 1,q(x_4) = 2,q(x_5) = 3$
$I_1 = \{1,3\},\ I_2 = \{4\},\ I_3 = \{2,5\}$, $w = (15,12,20)$
Substituting these symbols into the objective function gives:
$$
\begin{aligned}
\tilde{\mathcal{L}}^{(K)} &=\sum_{i=1}^{n}\left[g_{i} f_{K}\left(\mathbf{x}_{i}\right)+\frac{1}{2} h_{i} f_{K}^{2}\left(\mathbf{x}_{i}\right)\right]+\gamma T+\frac{1}{2} \lambda \sum_{j=1}^{T} w_{j}^{2} \\
&=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T
\end{aligned}
$$
Since our goal is to minimize the objective function, and the objective has now been reduced to a quadratic function of each $w_j$: $$\tilde{\mathcal{L}}^{(K)}=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T$$ we can apply the extremum formula for a quadratic $y=ax^2+bx+c$: the axis of symmetry is at $x=-\frac{b}{2a}$ and the extreme value is $y=\frac{4ac-b^{2}}{4a}$. Therefore:
$$
w_{j}^{*}=-\frac{\sum_{i \in I_{j}} g_{i}}{\sum_{i \in I_{j}} h_{i}+\lambda}
$$
and
$$
\tilde{\mathcal{L}}^{(K)}(q)=-\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_{j}} g_{i}\right)^{2}}{\sum_{i \in I_{j}} h_{i}+\lambda}+\gamma T
$$
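These two formulas are straightforward to compute. The following is a minimal sketch (my own illustration, not the library code; `leaf_index`, `g`, `h` are assumed inputs) of the optimal leaf weights and the structure score for a fixed tree structure $q$:

```python
import numpy as np

def leaf_weights_and_score(leaf_index, g, h, lam=1.0, gamma=0.0):
    """Optimal leaf weights w_j* and structure score for a fixed structure q.

    leaf_index[i] = q(x_i), the (0-based) leaf that sample i falls into.
    """
    T = leaf_index.max() + 1               # number of leaves
    G = np.zeros(T); H = np.zeros(T)       # per-leaf sums of g_i and h_i
    np.add.at(G, leaf_index, g)
    np.add.at(H, leaf_index, h)
    w_star = -G / (H + lam)                                # w_j* = -G_j / (H_j + lambda)
    score = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * T  # structure score
    return w_star, score

# Toy example with the 5 samples above (the g, h values are made up)
leaf_index = np.array([0, 2, 0, 1, 2])     # q(x_1)..q(x_5), 0-based
g = np.array([-1.0, 0.5, -0.8, 0.3, 0.2])
h = np.ones(5)
print(leaf_weights_and_score(leaf_index, g, h))
```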
(5) How to find the shape of the tree:
Note that the discussion so far has assumed the shape of the tree is already fixed and only computed $w$ and $\mathcal{L}$; in practice we also need to find the tree's shape, just as when learning an ordinary decision tree. We therefore borrow the decision-tree approach and use the change in the objective function as the criterion for splitting a node. An example:
The example has 8 samples, split as follows, so:
$$
\begin{aligned}
\tilde{\mathcal{L}}^{(old)} &= -\frac{1}{2}\left[\frac{(g_7 + g_8)^2}{h_7+h_8 + \lambda} + \frac{(g_1 +\cdots+ g_6)^2}{h_1+\cdots+h_6 + \lambda}\right] + 2\gamma \\
\tilde{\mathcal{L}}^{(new)} &= -\frac{1}{2}\left[\frac{(g_7 + g_8)^2}{h_7+h_8 + \lambda} + \frac{(g_1 +\cdots+ g_3)^2}{h_1+\cdots+h_3 + \lambda} + \frac{(g_4 +\cdots+ g_6)^2}{h_4+\cdots+h_6 + \lambda}\right] + 3\gamma \\
\tilde{\mathcal{L}}^{(old)} - \tilde{\mathcal{L}}^{(new)} &= \frac{1}{2}\left[\frac{(g_1 +\cdots+ g_3)^2}{h_1+\cdots+h_3 + \lambda} + \frac{(g_4 +\cdots+ g_6)^2}{h_4+\cdots+h_6 + \lambda} - \frac{(g_1+\cdots+g_6)^2}{h_1+\cdots+h_6+\lambda}\right] - \gamma
\end{aligned}
$$
From the example above, the criterion for splitting a node is to maximize $\tilde{\mathcal{L}}^{(old)} - \tilde{\mathcal{L}}^{(new)}$, i.e.:
$$
\mathcal{L}_{\text {split}}=\frac{1}{2}\left[\frac{\left(\sum_{i \in I_{L}} g_{i}\right)^{2}}{\sum_{i \in I_{L}} h_{i}+\lambda}+\frac{\left(\sum_{i \in I_{R}} g_{i}\right)^{2}}{\sum_{i \in I_{R}} h_{i}+\lambda}-\frac{\left(\sum_{i \in I} g_{i}\right)^{2}}{\sum_{i \in I} h_{i}+\lambda}\right]-\gamma
$$
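The exact greedy algorithm evaluates this gain at every possible split point of every feature. Here is a minimal single-feature sketch (my own illustration, not the library's implementation; `x`, `g`, `h` are assumed inputs):

```python
import numpy as np

def best_split_one_feature(x, g, h, lam=1.0, gamma=0.0):
    """Scan all split points of one feature; return (best_gain, best_threshold)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()                    # statistics of the parent node
    G_L = H_L = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        G_L += g[i]; H_L += h[i]               # move sample i into the left child
        G_R, H_R = G - G_L, H - H_L
        if x[i] == x[i + 1]:
            continue                           # cannot split between equal values
        gain = 0.5 * (G_L**2 / (H_L + lam)
                      + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2.0
    return best_gain, best_thr
```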
The histogram-based approximate algorithm can select the best feature and split point more efficiently. The main idea is: instead of enumerating every possible split point as the exact greedy algorithm does, candidate split points are first proposed from the quantiles of each feature's distribution; samples are mapped into the buckets delimited by these candidates, the gradient statistics $g$ and $h$ are accumulated per bucket, and the best split is then searched for only among the candidate points.
The computation of the histogram-based approximate algorithm proceeds as follows:
Below we use an example to illustrate the histogram-based approximate algorithm:
Suppose we have an age feature whose values are 18, 19, 21, 31, 36, 37, 55, 57, and we want to use the approximate algorithm to find the best split point for the age feature:
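As a rough illustration (my own sketch, using plain unweighted quantiles rather than the paper's weighted quantile sketch), the candidate split points for the age feature can be taken from its quantiles, and the gain formula above is then evaluated only at those candidates:

```python
import numpy as np

ages = np.array([18, 19, 21, 31, 36, 37, 55, 57], dtype=float)

# Propose candidate split points from (unweighted) quantiles, e.g. tertiles
candidates = np.quantile(ages, [1/3, 2/3])
print(candidates)            # about [24.3, 36.7]

# Map each sample to a bucket; g and h are then accumulated per bucket, and the
# split gain is evaluated only at the candidate points rather than at every age value.
bucket = np.searchsorted(candidates, ages)
print(bucket)                # [0 0 0 1 1 2 2 2]
```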
The approximate algorithm implements two strategies for constructing the candidate split points: a global strategy, which proposes the candidates once before building the tree and reuses them at every split, and a local strategy, which re-proposes the candidates after each split.
Getting started with the native XGBoost library:

```python
import xgboost as xgb  # import the library

# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')  # XGBoost's own data format; a DataFrame or ndarray also works
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')    # XGBoost's own data format; a DataFrame or ndarray also works

# specify parameters via map
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}  # XGBoost parameters, passed as a dict
num_round = 2                              # number of boosting rounds
bst = xgb.train(param, dtrain, num_round)  # train

# make prediction
preds = bst.predict(dtest)  # predict
```
XGBoost parameter settings (the names in parentheses are the corresponding parameter names in the sklearn interface):
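For reference, the same getting-started example can also be written against the sklearn interface. This is a minimal sketch (the training and test arrays are assumed to already exist, e.g. loaded from the agaricus data above):

```python
from xgboost import XGBClassifier

# sklearn-style interface: eta corresponds to learning_rate, num_round to n_estimators
model = XGBClassifier(max_depth=2, learning_rate=1.0, n_estimators=2,
                      objective='binary:logistic')
model.fit(X_train, y_train)      # X_train / y_train assumed to exist
preds = model.predict(X_test)
```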
Recommended blog:
Recommended: the official documentation
XGBoost's parameters fall into three categories:
General parameters: (there are two types of booster; because the tree booster performs much better than the linear one, the linear booster is rarely used.)
Learning task parameters: (these control the optimization objective and how the result of each step is measured.)
Command line parameters: (not covered here, since the command-line version is rarely used.)
General steps for parameter tuning:
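As one common illustration of such a tuning workflow (my own sketch; the grid values below are arbitrary assumptions), the sklearn interface can be combined with GridSearchCV, tuning a few related parameters at a time:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# Tune tree complexity first, keeping the remaining parameters at reasonable defaults
param_grid = {'max_depth': [3, 4, 5], 'min_child_weight': [1, 3, 5]}
search = GridSearchCV(XGBClassifier(learning_rate=0.1, n_estimators=100),
                      param_grid, scoring='roc_auc', cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```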
For the detailed API, see: https://xgboost.readthedocs.io/en/latest/python/python_api.html
Recommended GitHub examples: https://github.com/dmlc/xgboost/tree/master/demo/guide-python
For more details, see the Datawhale "集成学习 (Boosting)" material.
Like XGBoost, LightGBM is an ensemble algorithm. On the whole it works the same way as XGBoost and its core algorithm does not differ from XGBoost; rather, it adds a number of optimizations on top of XGBoost:
Advantages of LightGBM:
1) Faster training
2) Lower memory usage
3) Higher accuracy
4) Support for parallel learning
LightGBM parameter reference: recommended document 1, recommended document 2
Tuning LightGBM with grid search; reference code:
1. Core parameters (the names in parentheses are aliases):
2. Parameters that control the model learning process:
3. Metric parameters:
4. GPU parameters:
```python
import pandas as pd
import lightgbm as lgb
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# NOTE: early_stopping_rounds / verbose_eval are lgb.cv keyword arguments in LightGBM < 4.0
canceData = load_breast_cancer()
X = canceData.data
y = canceData.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

### Convert the data
print('Converting data')
lgb_train = lgb.Dataset(X_train, y_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, free_raw_data=False)

### Initial parameters (without the cross-validation parameters)
print('Setting parameters')
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'nthread': 4,
    'learning_rate': 0.1
}

### Cross-validation (tuning)
print('Cross-validation')
max_auc = 0.0
best_params = {}

# Accuracy
print("Step 1: improve accuracy")
for num_leaves in range(5, 100, 5):
    for max_depth in range(3, 8, 1):
        params['num_leaves'] = num_leaves
        params['max_depth'] = max_depth
        cv_results = lgb.cv(params, lgb_train, seed=1, nfold=5, metrics=['auc'],
                            early_stopping_rounds=10, verbose_eval=True)
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['num_leaves'] = num_leaves
            best_params['max_depth'] = max_depth
if 'num_leaves' in best_params and 'max_depth' in best_params:
    params['num_leaves'] = best_params['num_leaves']
    params['max_depth'] = best_params['max_depth']

# Overfitting
print("Step 2: reduce overfitting")
for max_bin in range(5, 256, 10):
    for min_data_in_leaf in range(1, 102, 10):
        params['max_bin'] = max_bin
        params['min_data_in_leaf'] = min_data_in_leaf
        cv_results = lgb.cv(params, lgb_train, seed=1, nfold=5, metrics=['auc'],
                            early_stopping_rounds=10, verbose_eval=True)
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['max_bin'] = max_bin
            best_params['min_data_in_leaf'] = min_data_in_leaf
if 'max_bin' in best_params and 'min_data_in_leaf' in best_params:
    params['min_data_in_leaf'] = best_params['min_data_in_leaf']
    params['max_bin'] = best_params['max_bin']

print("Step 3: reduce overfitting")
for feature_fraction in [0.6, 0.7, 0.8, 0.9, 1.0]:
    for bagging_fraction in [0.6, 0.7, 0.8, 0.9, 1.0]:
        for bagging_freq in range(0, 50, 5):
            params['feature_fraction'] = feature_fraction
            params['bagging_fraction'] = bagging_fraction
            params['bagging_freq'] = bagging_freq
            cv_results = lgb.cv(params, lgb_train, seed=1, nfold=5, metrics=['auc'],
                                early_stopping_rounds=10, verbose_eval=True)
            mean_auc = pd.Series(cv_results['auc-mean']).max()
            boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
            if mean_auc >= max_auc:
                max_auc = mean_auc
                best_params['feature_fraction'] = feature_fraction
                best_params['bagging_fraction'] = bagging_fraction
                best_params['bagging_freq'] = bagging_freq
if ('feature_fraction' in best_params and 'bagging_fraction' in best_params
        and 'bagging_freq' in best_params):
    params['feature_fraction'] = best_params['feature_fraction']
    params['bagging_fraction'] = best_params['bagging_fraction']
    params['bagging_freq'] = best_params['bagging_freq']

print("Step 4: reduce overfitting")
for lambda_l1 in [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    for lambda_l2 in [1e-5, 1e-3, 1e-1, 0.0, 0.1, 0.4, 0.6, 0.7, 0.9, 1.0]:
        params['lambda_l1'] = lambda_l1
        params['lambda_l2'] = lambda_l2
        cv_results = lgb.cv(params, lgb_train, seed=1, nfold=5, metrics=['auc'],
                            early_stopping_rounds=10, verbose_eval=True)
        mean_auc = pd.Series(cv_results['auc-mean']).max()
        boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
        if mean_auc >= max_auc:
            max_auc = mean_auc
            best_params['lambda_l1'] = lambda_l1
            best_params['lambda_l2'] = lambda_l2
if 'lambda_l1' in best_params and 'lambda_l2' in best_params:
    params['lambda_l1'] = best_params['lambda_l1']
    params['lambda_l2'] = best_params['lambda_l2']

print("Step 5: reduce overfitting (2)")
for min_split_gain in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    params['min_split_gain'] = min_split_gain
    cv_results = lgb.cv(params, lgb_train, seed=1, nfold=5, metrics=['auc'],
                        early_stopping_rounds=10, verbose_eval=True)
    mean_auc = pd.Series(cv_results['auc-mean']).max()
    boost_rounds = pd.Series(cv_results['auc-mean']).idxmax()
    if mean_auc >= max_auc:
        max_auc = mean_auc
        best_params['min_split_gain'] = min_split_gain
if 'min_split_gain' in best_params:
    params['min_split_gain'] = best_params['min_split_gain']

print(best_params)
```
{'bagging_fraction': 0.7, 'bagging_freq': 30, 'feature_fraction': 0.8, 'lambda_l1': 0.1, 'lambda_l2': 0.0, 'max_bin': 255, 'max_depth': 4, 'min_data_in_leaf': 81, 'min_split_gain': 0.1, 'num_leaves': 10}
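With the tuned values merged back into `params`, a final model can then be trained. This is a minimal sketch (the choice of 100 boosting rounds is an arbitrary assumption):

```python
# Merge the tuned values into the base parameters and train a final model
params.update(best_params)
gbm = lgb.train(params, lgb_train, num_boost_round=100)

# Evaluate on the held-out test set (predicted probabilities for the binary objective)
y_pred = gbm.predict(X_test)
print('test AUC:', metrics.roc_auc_score(y_test, y_pred))
```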