1. Ensemble Techniques
An ensemble is a collection of predictors. For example, instead of using a single model (say, logistic regression) for a classification problem, we can use multiple models (say, logistic regression + decision trees, etc.) to perform predictions.
The outputs of the predictors are combined, for example by a simple average, a weighted average, or a vote, to derive the final prediction. Ensembles often outperform any single model they contain, which is why they are heavily used to build machine learning models.
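The three ways of combining outputs mentioned above can be sketched as follows. This is a minimal illustration in plain Python; the function names and the three example predictor outputs are made up for the sketch and do not come from any library.

```python
from collections import Counter

def simple_average(scores):
    """Plain (unweighted) mean of the predictors' scores."""
    return sum(scores) / len(scores)

def weighted_average(scores, weights):
    """Weighted mean; the weights need not sum to 1."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def majority_vote(labels):
    """Most common predicted label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical outputs of three predictors for a single example:
# predicted probabilities of the positive class, and hard labels.
probabilities = [0.9, 0.6, 0.3]
labels = [1, 1, 0]

avg = simple_average(probabilities)
wavg = weighted_average(probabilities, [3, 2, 1])  # trusts the first model most
vote = majority_vote(labels)
print(avg, wavg, vote)
```

Averaging suits predictors that output scores or probabilities, while voting suits predictors that output hard class labels.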
Ensemble methods can be implemented by either bagging or boosting.
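Boosting: the training set itself stays the same from round to round; only the weight that each example carries in the classifier changes. The weights are adjusted according to the classification results of the previous round: the larger the error on an example, the larger its weight becomes.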
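The reweighting described above can be sketched in a few lines. The update rule below is the one used by AdaBoost, which is an assumption here because the text does not name a specific algorithm; the function name `reweight` and the toy weights are also illustrative only.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: raise the weights of misclassified examples.

    weights: current example weights (assumed to sum to 1)
    correct: booleans, True where the current learner was right
    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error rate of the current learner.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # The learner's say in the final vote; larger when the error is smaller.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct examples, grow those of wrong ones.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

# Four equally weighted examples; the learner gets the last one wrong.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_weights, alpha = reweight(weights, correct)
print(new_weights, alpha)
```

After the update, the single misclassified example carries half of the total weight, so the next learner is pushed to get it right.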
Bagging is a technique wherein we build independent models/predictors, using a random subsample/bootstrap of data for each of the models/predictors.
The final score or prediction is then obtained by taking an average (simple or weighted) or a vote over the scores from the different predictors.
The most famous bagging method is random forest.
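As a rough illustration of bagging, the sketch below trains several very simple models on bootstrap samples and combines them by majority vote. It is not a random forest: the one-dimensional threshold "stump", the toy data, and all function names are assumptions made up for this example.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a 1-D threshold classifier: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in points}):
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(points, n_models, rng):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the independently trained stumps by majority vote."""
    votes = [int(x > threshold) for threshold in models]
    return Counter(votes).most_common(1)[0][0]

# Toy data, made up for illustration: the label is 1 when x is above 5.
data = [(x, int(x > 5)) for x in range(11)]
rng = random.Random(0)
models = bagging_fit(data, n_models=25, rng=rng)
print(bagging_predict(models, 0), bagging_predict(models, 9))
```

Each stump sees a slightly different sample, so individual thresholds vary, but the vote over all of them is more stable than any single stump.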
Boosting is a different ensemble technique, in which the predictors are not trained independently but sequentially.
For example, we build a logistic regression model on a subsample/bootstrap of the original training data set.
Then we take the output of this model and feed it to a decision tree, to get the prediction, and so on.
The aim of this sequential training is for the subsequent models to learn from the mistakes of the previous model.
Gradient boosting is an example of a boosting method.
4. Gradient Boosting
The main difference between gradient boosting and other boosting methods is that, instead of increasing the weights of misclassified examples from one learner to the next, each new learner is trained to reduce the loss left by the previous learners.
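To make the idea concrete, here is a minimal sketch of gradient boosting in plain Python. For simplicity it assumes a regression problem with squared loss, where the negative gradient of the loss is just the residual; the classifier built later in this section uses a classification loss instead. The stump learner, the toy data, and all names are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a regression stump (one split, two leaf means) to the residuals."""
    best = None
    for split in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Boost stumps on squared loss; its negative gradient is the residual."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # Each new stump is fit to what the ensemble so far still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data, made up for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
baseline = lambda x: sum(ys) / len(ys)  # always predicts the mean
model = gradient_boost(xs, ys)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

No example weights are updated anywhere; the training error falls because every round fits the remaining residuals.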
We will be building a boosted trees classifier, using the gradient boosting method under the hood.
We will take the iris data set for classification.
As we have already used the same data set for implementing logistic regression in the previous section, we will keep the preprocessing the same (i.e., until the “Build the input pipeline for TensorFlow model” step from the previous example).
We will continue directly with the model training step, as follows:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd
import seaborn as sb
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(tf.__version__)
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

train_path = tf.keras.utils.get_file(
    "iris_training.csv",
    "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
    "iris_test.csv",
    "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")

# Keep only the rows with Species 1 or 2 and relabel them 0 and 1 (binary problem).
# .copy() avoids pandas' SettingWithCopyWarning on the assignment that follows.
train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
train = train[train.Species >= 1].copy()
train['Species'] = train['Species'].replace([1, 2], [0, 1])

test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
test = test[test.Species >= 1].copy()
test['Species'] = test['Species'].replace([1, 2], [0, 1])

train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

# Merge both files; the train/test split is redone below.
iris_dataset = pd.concat([train, test], axis=0)
sb.pairplot(iris_dataset, diag_kind="kde")

correlation_data = iris_dataset.corr()
correlation_data.style.background_gradient(cmap='coolwarm', axis=None)

stats = iris_dataset.describe()
iris_stats = stats.transpose()
iris_stats

X_data = iris_dataset[[i for i in iris_dataset.columns if i not in ['Species']]]
Y_data = iris_dataset[['Species']]

train_features, test_features, train_labels, test_labels = train_test_split(X_data, Y_data, test_size=0.3)

print('Training Features Rows: ', train_features.shape[0])
print('Test Features Rows: ', test_features.shape[0])
print('Training Features Columns: ', train_features.shape[1])
print('Test Features Columns: ', test_features.shape[1])
print('Training Label Rows: ', train_labels.shape[0])
print('Test Label Rows: ', test_labels.shape[0])
print('Training Label Columns: ', train_labels.shape[1])
print('Test Label Columns: ', test_labels.shape[1])

stats = train_features.describe()
stats = stats.transpose()
stats

stats = test_features.describe()
stats = stats.transpose()
stats
Normalize Data
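# Standardize with the training set's statistics so that the training and
# test features share one scale and no test information leaks into the scaling.
train_stats = train_features.describe().transpose()

def norm(x):
    return (x - train_stats['mean']) / train_stats['std']

normed_train_features = norm(train_features)
normed_test_features = norm(test_features)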
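As a small worked example of this normalization step, the sketch below standardizes a single made-up feature column in plain Python. It assumes the common practice of computing the mean and standard deviation on the training values only and reusing them for the test values; the function names and numbers are hypothetical.

```python
import statistics

def fit_scaler(values):
    """Return (mean, sample standard deviation) of the training values."""
    return statistics.mean(values), statistics.stdev(values)

def z_score(values, mean, std):
    """Standardize values with the given mean and standard deviation."""
    return [(v - mean) / std for v in values]

train_values = [4.0, 5.0, 6.0, 7.0, 8.0]  # made-up feature column
test_values = [5.0, 9.0]

mean, std = fit_scaler(train_values)
normed_train = z_score(train_values, mean, std)
normed_test = z_score(test_values, mean, std)  # reuse the training statistics
print(normed_train, normed_test)
```

The standardized training values have mean 0 and standard deviation 1, while the test values are simply expressed on that same scale.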
def feed_input(features_df, target_df, num_of_epochs=10, shuffle=True, batch_size=35):
    def input_feed_function():
        dataset = tf.data.Dataset.from_tensor_slices((dict(features_df), target_df))
        if shuffle:
            dataset = dataset.shuffle(1000)
        dataset = dataset.batch(batch_size).repeat(num_of_epochs)
        return dataset
    return input_feed_function

train_feed_input = feed_input(normed_train_features, train_labels)
train_feed_input_testing = feed_input(normed_train_features, train_labels, num_of_epochs=1, shuffle=False)
test_feed_input = feed_input(normed_test_features, test_labels, num_of_epochs=1, shuffle=False)

feature_columns_numeric = [tf.feature_column.numeric_column(k) for k in train_features.columns]

rf_model = tf.estimator.BoostedTreesClassifier(feature_columns=feature_columns_numeric, n_batches_per_layer=1)
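This creates a gradient boosted trees classifier with TensorFlow's tf.estimator.BoostedTreesClassifier. Despite its name, rf_model is a boosted trees model, not a random forest.
Next, train the model on the training input function:
rf_model.train(train_feed_input)
Here train_feed_input supplies the training data: a feature matrix together with the corresponding target variable (labels), from which the model learns the relationship between the features and the target.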
train_predictions = rf_model.predict(train_feed_input_testing)
test_predictions = rf_model.predict(test_feed_input)

train_predictions_series = pd.Series([p['classes'][0].decode("utf-8") for p in train_predictions])
test_predictions_series = pd.Series([p['classes'][0].decode("utf-8") for p in test_predictions])

train_predictions_df = pd.DataFrame(train_predictions_series, columns=['predictions'])
test_predictions_df = pd.DataFrame(test_predictions_series, columns=['predictions'])

train_labels.reset_index(drop=True, inplace=True)
train_predictions_df.reset_index(drop=True, inplace=True)
test_labels.reset_index(drop=True, inplace=True)
test_predictions_df.reset_index(drop=True, inplace=True)

train_labels_with_predictions_df = pd.concat([train_labels, train_predictions_df], axis=1)
test_labels_with_predictions_df = pd.concat([test_labels, test_predictions_df], axis=1)

def calculate_binary_class_scores(y_true, y_pred):
    acc_score = accuracy_score(y_true, y_pred.astype('int64'))
    prec_score = precision_score(y_true, y_pred.astype('int64'))
    rec_score = recall_score(y_true, y_pred.astype('int64'))
    return acc_score, prec_score, rec_score

train_accuracy_score, train_precision_score, train_recall_score = calculate_binary_class_scores(train_labels, train_predictions_series)
test_accuracy_score, test_precision_score, test_recall_score = calculate_binary_class_scores(test_labels, test_predictions_series)

print('Training Data Accuracy (%) = ', round(train_accuracy_score*100,2))
print('Training Data Precision (%) = ', round(train_precision_score*100,2))
print('Training Data Recall (%) = ', round(train_recall_score*100,2))
print('-'*50)
print('Test Data Accuracy (%) = ', round(test_accuracy_score*100,2))
print('Test Data Precision (%) = ', round(test_precision_score*100,2))
print('Test Data Recall (%) = ', round(test_recall_score*100,2))
The calculate_binary_class_scores function computes the accuracy, precision, and recall for the training and test data sets, and the results are then printed.
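To show what these three scores measure, the sketch below computes them by hand from the counts of true/false positives and negatives. It is a plain-Python illustration rather than the scikit-learn implementation; the function name `binary_scores` and the example labels are made up.

```python
def binary_scores(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels and predictions: 2 TP, 1 FP, 2 FN, 3 TN.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
accuracy, precision, recall = binary_scores(y_true, y_pred)
print(accuracy, precision, recall)
```

In this example the accuracy is 5/8, the precision is 2/3, and the recall is 1/2, which shows that the three scores can differ noticeably on the same predictions.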