1. Ensemble Techniques
An ensemble is a collection of predictors. For example, instead of using a single model (say, logistic regression) for a classification problem, we can combine several models (say, logistic regression and decision trees) to make the prediction.
The outputs of the predictors are combined by averaging (weighted or unweighted) or by voting to produce the final prediction. Ensembles often outperform their individual members, which is why they are widely used to build machine learning models.
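To make the three combination rules concrete, here is a minimal sketch. The probabilities and weights are made-up illustrative values, not results from any model in this article.

```python
import numpy as np

# Hypothetical class-1 probabilities from three predictors for four samples.
probs = np.array([
    [0.9, 0.4, 0.2, 0.6],   # predictor A
    [0.8, 0.6, 0.1, 0.4],   # predictor B
    [0.7, 0.3, 0.4, 0.7],   # predictor C
])

# Plain (unweighted) average of the probabilities.
plain_avg = probs.mean(axis=0)

# Weighted average: the weights are illustrative and sum to 1.
weights = np.array([0.5, 0.3, 0.2])
weighted_avg = weights @ probs

# Majority vote: threshold each predictor at 0.5, then keep the label
# chosen by more than half of the predictors.
votes = (probs >= 0.5).astype(int)
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

print(plain_avg)
print(weighted_avg)
print(majority)
```

With an odd number of predictors the vote can never tie, which is one reason small ensembles often use an odd count.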
Two common ways to build an ensemble are bagging and boosting.
2. Bagging
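Difference between Bagging and Boosting
Bagging: each round's training set is drawn from the original set with replacement, and the training sets of the different rounds are independent of one another.
Boosting: the training set stays the same in every round; only the weight of each example changes. The weights are adjusted according to the previous round's classification results, so that examples that were misclassified receive larger weights.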
Bagging is a technique in which we build independent models/predictors, each trained on a random subsample/bootstrap of the data.
The scores from the different predictors are then combined (by a weighted average, a plain average, or a vote) to get the final score/prediction.
The best-known bagging method is random forest.
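The bagging procedure described above can be sketched as follows. This is an illustration under assumptions, not the article's own code: the synthetic data set, the number of models, and the tree depth are arbitrary choices, and a random forest would additionally subsample features at each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

n_models = 25
models = []
for _ in range(n_models):
    # Bootstrap: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    models.append(tree.fit(X[idx], y[idx]))

# Majority vote across the independently trained trees.
all_votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
bagged_pred = (all_votes.mean(axis=0) >= 0.5).astype(int)
bagged_accuracy = (bagged_pred == y).mean()
print(bagged_accuracy)
```

Because each tree sees a different bootstrap sample, the trees can be trained in any order or in parallel.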
3. Boosting
Boosting is a different ensemble technique in which the predictors are not trained independently but sequentially.
For example, we might first build a logistic regression model on a subsample/bootstrap of the original training data set.
We then pass what that model produced (its outputs, and in particular its errors) on to a decision tree to get the next prediction, and so on.
The aim of this sequential training is for each subsequent model to learn from the mistakes of the previous one.
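One classic way to make later models "learn from the mistakes" of earlier ones is to reweight the training examples, as in AdaBoost. The sketch below is a simplified AdaBoost-style loop on synthetic data. The data set, the number of rounds, and the use of decision stumps are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
y_signed = 2 * y - 1                       # labels in {-1, +1}

n_rounds = 10
sample_weights = np.full(len(X), 1.0 / len(X))
stumps, alphas = [], []

for _ in range(n_rounds):
    # The training set is unchanged; only the example weights differ per round.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y_signed, sample_weight=sample_weights)
    pred = stump.predict(X)

    # Weighted error of this round's learner (clipped to avoid log(0)).
    err = np.clip(sample_weights[pred != y_signed].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # lower error gives a larger say

    # Misclassified examples get larger weights, correct ones smaller weights.
    sample_weights = sample_weights * np.exp(-alpha * y_signed * pred)
    sample_weights = sample_weights / sample_weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted sum of the learners' outputs.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
boosted_pred = np.where(scores >= 0, 1, -1)
boosted_accuracy = (boosted_pred == y_signed).mean()
print(boosted_accuracy)
```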
Gradient boosting is one example of a boosting method.
4. Gradient Boosting
The main difference between gradient boosting and other boosting methods is that, instead of increasing the weights of misclassified examples from one learner to the next, each new learner is fit to reduce the loss of the ensemble built so far (in practice, it is fit to the negative gradient of the loss function with respect to the current predictions).
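For squared-error loss the negative gradient is simply the residual, which makes the idea easy to sketch by hand. The following is a minimal illustration on a synthetic regression problem. The data, the learning rate, the tree depth, and the number of rounds are all assumed values, and library implementations add many refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean minimizes squared error).
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)
trees = []

for _ in range(n_rounds):
    # For L = 0.5 * (y - F)^2 the negative gradient w.r.t. F is the residual y - F.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Take a small step in the direction the new tree suggests.
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks step by step.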
We will build a boosted trees classifier, which uses the gradient boosting method under the hood.
We will use the iris data set for classification.
Because the previous section already used the same data set to implement logistic regression, the preprocessing stays the same (i.e., up to the “Build the input pipeline for TensorFlow model” step of the previous example).
The code below repeats those preprocessing steps and then continues with the model training step:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd
import seaborn as sb
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(tf.__version__)
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

train_path = tf.keras.utils.get_file(
    "iris_training.csv",
    "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
    "iris_test.csv",
    "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")

# Keep only species 1 and 2 and relabel them 0 and 1 for binary classification.
# .copy() avoids assigning into a filtered view of the original frame.
train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
train = train[train.Species >= 1].copy()
train['Species'] = train['Species'].replace([1, 2], [0, 1])

test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
test = test[test.Species >= 1].copy()
test['Species'] = test['Species'].replace([1, 2], [0, 1])

train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

# Merge both files; the data is split again further down.
iris_dataset = pd.concat([train, test], axis=0)
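# Summary statistics for every column
iris_dataset.describe()

# Pairwise scatter plots, with kernel density estimates on the diagonal
sb.pairplot(iris_dataset, diag_kind="kde")

# Correlation matrix, rendered with a color gradient
correlation_data = iris_dataset.corr()
correlation_data.style.background_gradient(cmap='coolwarm', axis=None)

# Transposed summary statistics (one row per column)
stats = iris_dataset.describe()
iris_stats = stats.transpose()
iris_stats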
# Separate the features from the target column
X_data = iris_dataset[[i for i in iris_dataset.columns if i not in ['Species']]]
Y_data = iris_dataset[['Species']]

# Hold out 30% of the rows for testing
train_features, test_features, train_labels, test_labels = train_test_split(X_data, Y_data, test_size=0.3)

print('Training Features Rows: ', train_features.shape[0])
print('Test Features Rows: ', test_features.shape[0])
print('Training Features Columns: ', train_features.shape[1])
print('Test Features Columns: ', test_features.shape[1])
print('Training Label Rows: ', train_labels.shape[0])
print('Test Label Rows: ', test_labels.shape[0])
print('Training Label Columns: ', train_labels.shape[1])
print('Test Label Columns: ', test_labels.shape[1])
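These statements print the number of rows and columns of the feature and label matrices for the training and test sets.
(1) train_features.shape[0]: rows of the training feature matrix, i.e., the number of training samples.
(2) test_features.shape[0]: rows of the test feature matrix, i.e., the number of test samples.
(3) train_features.shape[1]: columns of the training feature matrix, i.e., the number of features.
(4) test_features.shape[1]: columns of the test feature matrix, i.e., the number of features.
(5) train_labels.shape[0]: rows of the training label matrix, i.e., the number of training samples.
(6) test_labels.shape[0]: rows of the test label matrix, i.e., the number of test samples.
(7) train_labels.shape[1]: columns of the training label matrix, i.e., the number of label columns.
(8) test_labels.shape[1]: columns of the test label matrix, i.e., the number of label columns.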
# Summary statistics of the training features
stats = train_features.describe()
stats = stats.transpose()
stats

# Summary statistics of the test features
stats = test_features.describe()
stats = stats.transpose()
stats
Normalize Data
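def norm(x, stats):
    # Standardize each column to zero mean and unit variance using the given statistics
    return (x - stats['mean']) / stats['std']

# Use the training-set statistics for both splits so that the test data
# is scaled exactly like the data the model was trained on
train_stats = train_features.describe().transpose()
normed_train_features = norm(train_features, train_stats)
normed_test_features = norm(test_features, train_stats)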
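To see what this standardization does, here is a small worked example on toy data. The two frames and their column names are hypothetical. It follows the common convention of computing the mean and standard deviation on the training data and reusing them for the test data.

```python
import pandas as pd

# Toy frames (hypothetical values, chosen so the arithmetic is easy to follow).
train_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
test_df = pd.DataFrame({"a": [2.0, 4.0], "b": [20.0, 40.0]})

mean = train_df.mean()
std = train_df.std()   # pandas uses the sample standard deviation (ddof=1)

# z-score: subtract the training mean, divide by the training std.
train_scaled = (train_df - mean) / std
test_scaled = (test_df - mean) / std

print(train_scaled)
print(test_scaled)
```

Column a has mean 2 and standard deviation 1, so the training values 1, 2, 3 become -1, 0, 1, and the test value 4 becomes 2.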
def feed_input(features_df, target_df, num_of_epochs=10, shuffle=True, batch_size=35):
    # Returns a zero-argument function that builds the tf.data pipeline on demand
    def input_feed_function():
        dataset = tf.data.Dataset.from_tensor_slices((dict(features_df), target_df))
        if shuffle:
            dataset = dataset.shuffle(1000)
        dataset = dataset.batch(batch_size).repeat(num_of_epochs)
        return dataset
    return input_feed_function

# Shuffled, repeated input for training; single-pass, ordered input for evaluation
train_feed_input = feed_input(normed_train_features, train_labels)
train_feed_input_testing = feed_input(normed_train_features, train_labels, num_of_epochs=1, shuffle=False)
test_feed_input = feed_input(normed_test_features, test_labels, num_of_epochs=1, shuffle=False)
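feature_columns_numeric = [tf.feature_column.numeric_column(k) for k in train_features.columns]

rf_model = tf.estimator.BoostedTreesClassifier(feature_columns=feature_columns_numeric, n_batches_per_layer=1)

This creates a gradient boosted trees classifier with tf.estimator.BoostedTreesClassifier, which is available in TensorFlow versions that still ship the Estimator API. Despite the rf_ prefix in the variable name, the model is a boosted trees ensemble, not a random forest. n_batches_per_layer is the number of batches used to collect statistics for each tree layer.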
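rf_model.train(train_feed_input)

rf_model.train(train_feed_input) trains the boosted trees model rf_model (not a random forest, despite the variable name). train_feed_input is the input function that supplies the training data: batches of feature values together with the corresponding labels, from which the model learns the relationship between the features and the target.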
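For readers who want to try the same binary task without the Estimator API, the sketch below swaps in a different library: scikit-learn's GradientBoostingClassifier on scikit-learn's bundled copy of the iris data. It is not the TensorFlow model used in this article, and the hyperparameters and random seed are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
mask = iris.target >= 1          # keep versicolor (1) and virginica (2)
X = iris.data[mask]
y = iris.target[mask] - 1        # relabel to 0 / 1, as in the article

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Tree-based models do not need feature scaling, so the raw values are used.
gb_model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0)
gb_model.fit(X_train, y_train)

test_accuracy = gb_model.score(X_test, y_test)
print(test_accuracy)
```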
Predictions
# predict() yields one dict of outputs per example
train_predictions = rf_model.predict(train_feed_input_testing)
test_predictions = rf_model.predict(test_feed_input)

# The predicted class comes back as a byte string, so decode it
train_predictions_series = pd.Series([p['classes'][0].decode("utf-8") for p in train_predictions])
test_predictions_series = pd.Series([p['classes'][0].decode("utf-8") for p in test_predictions])

train_predictions_df = pd.DataFrame(train_predictions_series, columns=['predictions'])
test_predictions_df = pd.DataFrame(test_predictions_series, columns=['predictions'])

# Align the indexes so labels and predictions line up row by row
train_labels.reset_index(drop=True, inplace=True)
train_predictions_df.reset_index(drop=True, inplace=True)
test_labels.reset_index(drop=True, inplace=True)
test_predictions_df.reset_index(drop=True, inplace=True)

train_labels_with_predictions_df = pd.concat([train_labels, train_predictions_df], axis=1)
test_labels_with_predictions_df = pd.concat([test_labels, test_predictions_df], axis=1)
Validation
def calculate_binary_class_scores(y_true, y_pred):
    # Predictions arrive as strings ('0' / '1'), so cast them to integers first
    y_pred = y_pred.astype('int64')
    acc_score = accuracy_score(y_true, y_pred)
    prec_score = precision_score(y_true, y_pred)
    rec_score = recall_score(y_true, y_pred)
    return acc_score, prec_score, rec_score

train_accuracy_score, train_precision_score, train_recall_score = calculate_binary_class_scores(train_labels, train_predictions_series)
test_accuracy_score, test_precision_score, test_recall_score = calculate_binary_class_scores(test_labels, test_predictions_series)

print('Training Data Accuracy (%) = ', round(train_accuracy_score*100,2))
print('Training Data Precision (%) = ', round(train_precision_score*100,2))
print('Training Data Recall (%) = ', round(train_recall_score*100,2))
print('-'*50)
print('Test Data Accuracy (%) = ', round(test_accuracy_score*100,2))
print('Test Data Precision (%) = ', round(test_precision_score*100,2))
print('Test Data Recall (%) = ', round(test_recall_score*100,2))
The calculate_binary_class_scores function computes accuracy, precision, and recall; the code above applies it to the training and test sets and prints the results.
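As a reminder of what the three scores measure, the sketch below computes them by hand from the confusion-matrix counts. The label and prediction vectors are made-up toy values, not output from the model above.

```python
import numpy as np

# Hypothetical true labels and predictions for eight examples.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that were found

print(tp, tn, fp, fn)
print(accuracy, precision, recall)
```

Here there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all three scores come out to 0.75.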