
Feature Importance Computation with LOFO and FLOFO

This article introduces two methods for computing feature importance, LOFO and FLOFO, with runnable example code.

1. Introduction

Feature importance can be computed directly with some of scikit-learn's built-in models.
For example, reading feature_importances_ from a RandomForest works like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

data = load_iris()
x_data = data.data
y_data = data.target
print(x_data.shape, y_data.shape)  # (150, 4) (150,)

model = RandomForestClassifier()
model.fit(x_data, y_data)
model.feature_importances_  # e.g. array([0.09366231, 0.02290373, 0.44489138, 0.43854258])

After training, the model's feature_importances_ attribute gives an importance value for each feature; the larger the value, the more important the feature.
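As a small illustration (not part of the snippet above), the importance values line up with data.feature_names, so they can be paired and sorted to rank the features:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0)
model.fit(data.data, data.target)

# Pair each feature name with its importance and sort descending
ranked = sorted(zip(data.feature_names, model.feature_importances_),
                key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name}: {imp:.4f}")
```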

But some models have no built-in feature_importances_ mechanism. How can we obtain feature importances for them?

2. LOFO and FLOFO

  1. LOFO

LOFO stands for Leave One Feature Out. Its approach to computing feature importance is: remove one feature at a time, train a model on the remaining features, and evaluate it on a validation set; the resulting drop in performance measures that feature's importance.

Evaluation is done with KFold cross-validation: K rounds of training and prediction yield K evaluation scores, so LOFO can report both the mean and the standard deviation of each feature's importance.

If no model is passed in, the LOFO implementation in reference [1] uses LightGBM for evaluation by default.
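To make the idea concrete, here is a minimal hand-rolled sketch of the leave-one-feature-out loop (using a RandomForest on iris for simplicity; this is not the lofo library's actual implementation):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

data = load_iris()
X, y = data.data, data.target
cv = KFold(n_splits=5, shuffle=True, random_state=666)
model = RandomForestClassifier(random_state=0)

# Baseline: K-fold scores using all features
base = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")

# Leave each feature out in turn; importance = per-fold drop in score
importances = {}
for i, name in enumerate(data.feature_names):
    X_drop = np.delete(X, i, axis=1)
    scores = cross_val_score(model, X_drop, y, cv=cv, scoring="f1_macro")
    importances[name] = base - scores  # one value per fold

for name, diff in importances.items():
    print(f"{name}: mean={diff.mean():.4f}, std={diff.std():.4f}")
```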

  2. FLOFO

FLOFO stands for Fast LOFO.

LOFO has to repeat the entire "remove one feature, run KFold training and evaluation" loop for every feature, which is time-consuming. FLOFO speeds up (simplifies) this process.

FLOFO first trains one model on all features, then loops over "randomly perturb one feature's values and re-evaluate with the already-trained model". Since no retraining is needed, this is fast. A feature's FLOFO importance is the score before perturbation minus the score after (the score can be AUC, accuracy, etc.; the larger the drop, the more important the feature).
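The mechanism can be sketched as a hand-rolled permutation loop (again, not the lofo library's implementation; accuracy is used as the score here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, random_state=0)

# Train once on all features
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
base = accuracy_score(y_val, model.predict(X_val))

# Perturb one feature at a time; no retraining needed
rng = np.random.default_rng(0)
drops = {}
for i, name in enumerate(data.feature_names):
    X_perm = X_val.copy()
    rng.shuffle(X_perm[:, i])  # randomly permute this feature only
    drops[name] = base - accuracy_score(y_val, model.predict(X_perm))

for name, d in sorted(drops.items(), key=lambda t: -t[1]):
    print(f"{name}: {d:+.4f}")  # larger drop => more important
```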

3. LOFO Example Code

Below we apply LOFO to scikit-learn's built-in breast_cancer dataset:

  1. Import dependencies
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance

Here we import the model, KFold, and LOFO. Note that LOFO works heavily with DataFrames, so pandas is imported as well.

  2. Load the dataset

Load the breast_cancer dataset; note that the data must be converted to DataFrame format.

data = load_breast_cancer(as_frame=True)  # load as a DataFrame
df = data.data
df['target'] = data.target.values
print(df.shape)  # (569, 31)

breast_cancer is a binary classification dataset, i.e. target contains only the values 0 and 1.

  3. Convert the dataset to LOFO's Dataset format

The DataFrame must be wrapped in a Dataset before LOFO's interfaces can be called.

dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])

The target argument is the name of the y column in df, and features is the list of feature column names in df.

  4. Get feature_importance for an arbitrary model

Here we take RandomForestClassifier as the example:

model = RandomForestClassifier()
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1", model=model)
importance_df = lofo_imp.get_importance()

With 5-fold validation, each feature ends up with 5 importance values. importance_df looks like this:

| | feature | importance_mean | importance_std | val_imp_0 | val_imp_1 | val_imp_2 | val_imp_3 | val_imp_4 |
|---|---|---|---|---|---|---|---|---|
| 26 | area error | 0.0104953 | 0.0151496 | 0.0263158 | 0.0175439 | 0.0175439 | 0.00877193 | -0.0176991 |
| 23 | worst perimeter | 0.00878746 | 0.0124054 | 0.0175439 | 0 | -0.00877193 | 0.0263158 | 0.00884956 |
| 29 | mean smoothness | 0.00704859 | 0.00863287 | -0.00877193 | 0.00877193 | 0.00877193 | 0.00877193 | 0.0176991 |
| 24 | mean texture | 0.00704859 | 0.0170287 | -0.00877193 | 0.0350877 | 0 | -0.00877193 | 0.0176991 |
| 1 | mean radius | 0.00527868 | 0.00702537 | 0 | 0.0175439 | 0 | 0 | 0.00884956 |
| 16 | mean compactness | 0.00355535 | 0.0143273 | 0 | 0 | 0.00877193 | -0.0175439 | 0.0265487 |
| 4 | perimeter error | 0.0035243 | 0.0105341 | -0.00877193 | 0.0175439 | -0.00877193 | 0.00877193 | 0.00884956 |
| 9 | worst area | 0.00175439 | 0.00656431 | 0.00877193 | 0.00877193 | 0 | -0.00877193 | 0 |
| 11 | mean symmetry | 0.00175439 | 0.00656431 | 0 | 0.00877193 | 0.00877193 | -0.00877193 | 0 |
| 3 | worst fractal dimension | 0.00175439 | 0.00350877 | 0 | 0 | 0 | 0.00877193 | 0 |
| 22 | worst radius | 0.00175439 | 0.0085947 | -0.00877193 | 0.0175439 | 0 | 0 | 0 |
| 8 | radius error | 0.00173886 | 0.00861375 | -0.00877193 | 0.00877193 | 0.00877193 | 0.00877193 | -0.00884956 |
| 17 | texture error | 3.10511e-05 | 0.0124494 | -0.00877193 | 0 | 0.00877193 | -0.0175439 | 0.0176991 |
| 19 | mean concavity | 1.55255e-05 | 0.00962338 | 0 | 0.00877193 | 0 | -0.0175439 | 0.00884956 |
| 14 | fractal dimension error | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | mean concave points | -1.55255e-05 | 0.00962338 | 0.0175439 | 0 | -0.00877193 | 0 | -0.00884956 |
| 7 | worst concavity | -4.65766e-05 | 0.0147618 | 0 | 0.0175439 | 0.00877193 | 0 | -0.0265487 |
| 10 | concavity error | -0.00173886 | 0.00658923 | 0 | 0 | -0.00877193 | -0.00877193 | 0.00884956 |
| 0 | mean fractal dimension | -0.00175439 | 0.00350877 | 0 | 0 | 0 | -0.00877193 | 0 |
| 13 | worst compactness | -0.00175439 | 0.00350877 | -0.00877193 | 0 | 0 | 0 | 0 |
| 28 | mean perimeter | -0.00350877 | 0.00701754 | 0 | 0 | 0 | -0.0175439 | 0 |
| 12 | smoothness error | -0.0035243 | 0.00431644 | 0 | 0 | -0.00877193 | 0 | -0.00884956 |
| 27 | mean area | -0.00703307 | 0.00656853 | -0.00877193 | 0 | 0 | -0.0175439 | -0.00884956 |
| 18 | compactness error | -0.00880298 | 0.00788074 | -0.00877193 | 0 | -0.0175439 | 0 | -0.0176991 |
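importance_df is an ordinary pandas DataFrame, so the usual operations apply; for example, filtering down to features with a positive mean importance (shown here on a small stand-in frame with the same columns):

```python
import pandas as pd

# Stand-in frame with the same columns as importance_df above
importance_df = pd.DataFrame({
    "feature": ["area error", "worst perimeter", "compactness error"],
    "importance_mean": [0.0104953, 0.00878746, -0.00880298],
    "importance_std": [0.0151496, 0.0124054, 0.00788074],
})

# Keep only features whose mean importance is positive
useful = importance_df[importance_df["importance_mean"] > 0]
print(useful["feature"].tolist())  # ['area error', 'worst perimeter']
```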
  5. Plot the feature_importance chart

LOFO ships with a plotting interface that visualizes importance_df directly:

plot_importance(importance_df, figsize=(12, 20))

This produces the sorted feature importance output:

[Figure: sorted feature-importance bar plot]

  6. Multi-class dataset test

Testing on the iris dataset, the program ran without errors but every value in importance_df was NaN; after changing scoring="f1" to scoring="f1_macro", the values came out normally.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance

data = load_iris(as_frame=True)  # load as a DataFrame
df = data.data
df['target'] = data.target.values

# model
model = RandomForestClassifier()
# dataset
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
# get feature importance
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1_macro", model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)

  7. Final code

Putting the steps above together, the complete runnable code is:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance

data = load_breast_cancer(as_frame=True)  # load as a DataFrame
df = data.data
df['target'] = data.target.values

# model
model = RandomForestClassifier()
# dataset
dataset = Dataset(df=df, target="target", features=[col for col in df.columns if col != 'target'])
# get feature importance
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1", model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)

4. FLOFO Example Code

For Fast LOFO, call FLOFOImportance directly. Reference code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from lofo import FLOFOImportance

# step-01: prepare data
data = load_breast_cancer(as_frame=True)  # load as a DataFrame
x_data = data.data.to_numpy()
y_data = data.target.values
df = data.data
df['target'] = data.target.values
# repeat the rows, since FLOFO needs more than 1000 samples
# (np.repeat replaces the deprecated pd.np.repeat)
df = pd.DataFrame(np.repeat(df.values, 2, axis=0), columns=df.columns)
# step-02: train the model
model = RandomForestClassifier()
model.fit(x_data, y_data)
# step-03: fast-lofo
lofo_imp = FLOFOImportance(validation_df=df, target="target",
                           features=[col for col in df.columns if col != 'target'],
                           scoring="f1", trained_model=model)
importance_df = lofo_imp.get_importance()
print(importance_df)

A few differences between FLOFOImportance and LOFOImportance:

  1. FLOFOImportance no longer requires wrapping the data in a Dataset structure
  2. FLOFOImportance requires training the model first, then calling FLOFO
  3. FLOFOImportance's interface differs slightly from LOFO's

Conclusion

  1. RandomForest can compute feature_importances_ directly on multi-class data
  2. By default, LOFO uses LightGBM to compute feature importances
  3. The LOFO library in reference [1] supports multi-class datasets, but scoring="f1" must be replaced with a multi-class metric such as scoring="f1_macro"
  4. FLOFO (Fast LOFO) runs faster than LOFO

References

  1. https://github.com/aerdem4/lofo-importance
  2. https://juejin.cn/post/7020237735516438564