Feature importance can be computed directly with some of the models that ship with sklearn.
For example, RandomForest exposes it through the feature_importances_ attribute:
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_iris

    data = load_iris()
    x_data = data.data
    y_data = data.target
    print(x_data.shape, y_data.shape)  # (150, 4) (150,)

    model = RandomForestClassifier()
    model.fit(x_data, y_data)
    model.feature_importances_
    # array([0.09366231, 0.02290373, 0.44489138, 0.43854258])
Once the model is trained, reading its feature_importances_ attribute gives the importance of each feature. The larger the value, the more important the feature.
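The raw array is easier to read when paired with the feature names. A minimal sketch, assuming the iris data from above; the fixed `random_state` and the use of a pandas Series are choices made here for illustration, not part of the original code:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0)  # seed fixed only for repeatability
model.fit(data.data, data.target)

# Label each importance with its feature name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=data.feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

For a random forest the importances are normalized, so the values sum to 1.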
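Some models, however, have no built-in mechanism for computing feature importances. How can we get them in that case?
LOFO stands for Leave One Feature Out. Its approach is to drop each feature in turn, train the model on the remaining features, and evaluate it on a validation set; the resulting change in score measures the importance of the dropped feature.
The validation uses KFold: K rounds of training and prediction yield K scores, so LOFO can report both the mean and the standard deviation of each feature's importance.
If no model is passed in, the LOFO implementation in reference 1 uses LightGBM for the evaluation by default.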
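The idea can be sketched by hand with plain sklearn. This is only an illustration of the leave-one-feature-out loop, not the lofo library's implementation. The helper name `lofo_sketch`, the choice of KNeighborsClassifier (a model with no feature_importances_ attribute), and accuracy as the metric are all assumptions made for the example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def lofo_sketch(model, X, y, feature_names, cv, scoring):
    """Return {feature: (mean, std)} of the per-fold score drop when that feature is removed."""
    # Baseline: per-fold scores with all features present.
    base_scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
    result = {}
    for i, name in enumerate(feature_names):
        # Drop column i and re-run the same cross-validation.
        X_without = np.delete(X, i, axis=1)
        scores = cross_val_score(model, X_without, y, cv=cv, scoring=scoring)
        drop = base_scores - scores  # positive: the model got worse without the feature
        result[name] = (drop.mean(), drop.std())
    return result


data = load_iris()
# A fixed random_state makes KFold produce the same splits for every call.
cv = KFold(n_splits=5, shuffle=True, random_state=666)
lofo_result = lofo_sketch(KNeighborsClassifier(), data.data, data.target,
                          data.feature_names, cv, "accuracy")
for name, (mean, std) in sorted(lofo_result.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

Because every feature requires a full K-fold retraining, the cost grows with the number of features times K.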
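FLOFO stands for Fast LOFO.
LOFO has to repeat the whole "remove one feature, then train and evaluate with KFold" cycle for every feature, which is time-consuming. FLOFO is meant to speed up (simplify) this process.
FLOFO first trains a single model on all features, then loops over the features: it randomly perturbs the values of one feature and evaluates the already-trained model on the perturbed data. No retraining is needed, so it is fast. The FLOFO importance is the score before perturbation minus the score after it. (The score can be a metric such as AUC or accuracy; the larger the drop, the more important the feature.)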
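The "train once, perturb one feature, re-score" loop can be sketched as follows. This is a plain permutation-style illustration under stated assumptions, not the lofo library's FLOFOImportance, which may perturb the values differently. The held-out split, the fixed seeds, and the variable names are choices made for this example:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0, stratify=data.target)

# Train a single model on all features; it is never retrained below.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
base_score = f1_score(y_val, model.predict(X_val))

rng = np.random.default_rng(0)
perm_importance = {}
for i, name in enumerate(data.feature_names):
    X_perturbed = X_val.copy()
    # Shuffle one column so it no longer carries information about the target.
    X_perturbed[:, i] = rng.permutation(X_perturbed[:, i])
    perturbed_score = f1_score(y_val, model.predict(X_perturbed))
    # Importance = score before perturbation minus score after it.
    perm_importance[name] = base_score - perturbed_score

for name, value in sorted(perm_importance.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name}: {value:.4f}")
```

Only prediction is repeated per feature, which is why this style of importance is much cheaper than retraining for every feature.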
Below, LOFO is applied to the breast_cancer dataset that ships with sklearn:
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer, load_iris
    from sklearn.model_selection import KFold
    from lofo import LOFOImportance, Dataset, plot_importance
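This imports the model, KFold, and LOFO. LOFO relies heavily on DataFrames, so pandas is imported as well.
Next, load the breast_cancer dataset. Note that the data must be converted to DataFrame format.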
    data = load_breast_cancer(as_frame=True)  # load as dataframe
    df = data.data
    df['target'] = data.target.values
    print(df.shape)  # (569, 31)
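breast_cancer is a binary classification dataset, so target contains only the values 0 and 1.
The DataFrame must be wrapped in a Dataset object before the LOFO interfaces can be called.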
    dataset = Dataset(df=df, target="target",
                      features=[col for col in df.columns if col != 'target'])
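The target parameter is the name of the column in df that holds the y values, and features is the list of feature column names in df.
RandomForestClassifier is used as the example model here: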
    model = RandomForestClassifier()
    cv = KFold(n_splits=5, shuffle=True, random_state=666)
    lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1", model=model)
    importance_df = lofo_imp.get_importance()
The model is validated with 5 folds, so each feature ends up with 5 importance values. The contents of importance_df are as follows:
| | feature | importance_mean | importance_std | val_imp_0 | val_imp_1 | val_imp_2 | val_imp_3 | val_imp_4 |
---|---|---|---|---|---|---|---|---|
26 | area error | 0.0104953 | 0.0151496 | 0.0263158 | 0.0175439 | 0.0175439 | 0.00877193 | -0.0176991 |
23 | worst perimeter | 0.00878746 | 0.0124054 | 0.0175439 | 0 | -0.00877193 | 0.0263158 | 0.00884956 |
29 | mean smoothness | 0.00704859 | 0.00863287 | -0.00877193 | 0.00877193 | 0.00877193 | 0.00877193 | 0.0176991 |
24 | mean texture | 0.00704859 | 0.0170287 | -0.00877193 | 0.0350877 | 0 | -0.00877193 | 0.0176991 |
1 | mean radius | 0.00527868 | 0.00702537 | 0 | 0.0175439 | 0 | 0 | 0.00884956 |
16 | mean compactness | 0.00355535 | 0.0143273 | 0 | 0 | 0.00877193 | -0.0175439 | 0.0265487 |
4 | perimeter error | 0.0035243 | 0.0105341 | -0.00877193 | 0.0175439 | -0.00877193 | 0.00877193 | 0.00884956 |
9 | worst area | 0.00175439 | 0.00656431 | 0.00877193 | 0.00877193 | 0 | -0.00877193 | 0 |
11 | mean symmetry | 0.00175439 | 0.00656431 | 0 | 0.00877193 | 0.00877193 | -0.00877193 | 0 |
3 | worst fractal dimension | 0.00175439 | 0.00350877 | 0 | 0 | 0 | 0.00877193 | 0 |
22 | worst radius | 0.00175439 | 0.0085947 | -0.00877193 | 0.0175439 | 0 | 0 | 0 |
8 | radius error | 0.00173886 | 0.00861375 | -0.00877193 | 0.00877193 | 0.00877193 | 0.00877193 | -0.00884956 |
17 | texture error | 3.10511e-05 | 0.0124494 | -0.00877193 | 0 | 0.00877193 | -0.0175439 | 0.0176991 |
19 | mean concavity | 1.55255e-05 | 0.00962338 | 0 | 0.00877193 | 0 | -0.0175439 | 0.00884956 |
14 | fractal dimension error | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | mean concave points | -1.55255e-05 | 0.00962338 | 0.0175439 | 0 | -0.00877193 | 0 | -0.00884956 |
7 | worst concavity | -4.65766e-05 | 0.0147618 | 0 | 0.0175439 | 0.00877193 | 0 | -0.0265487 |
10 | concavity error | -0.00173886 | 0.00658923 | 0 | 0 | -0.00877193 | -0.00877193 | 0.00884956 |
0 | mean fractal dimension | -0.00175439 | 0.00350877 | 0 | 0 | 0 | -0.00877193 | 0 |
13 | worst compactness | -0.00175439 | 0.00350877 | -0.00877193 | 0 | 0 | 0 | 0 |
28 | mean perimeter | -0.00350877 | 0.00701754 | 0 | 0 | 0 | -0.0175439 | 0 |
12 | smoothness error | -0.0035243 | 0.00431644 | 0 | 0 | -0.00877193 | 0 | -0.00884956 |
27 | mean area | -0.00703307 | 0.00656853 | -0.00877193 | 0 | 0 | -0.0175439 | -0.00884956 |
18 | compactness error | -0.00880298 | 0.00788074 | -0.00877193 | 0 | -0.0175439 | 0 | -0.0176991 |
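As a sanity check, the importance_mean and importance_std columns can be recomputed from the per-fold values. The sketch below copies the "area error" row from the table; that the standard deviation is the population one (ddof=0) is inferred from these numbers, not taken from the library's documentation:

```python
import numpy as np

# Per-fold importances (val_imp_0 .. val_imp_4) of "area error", copied from the table.
area_error_folds = np.array([0.0263158, 0.0175439, 0.0175439, 0.00877193, -0.0176991])

fold_mean = area_error_folds.mean()
fold_std = area_error_folds.std()  # NumPy default ddof=0, i.e. population standard deviation

print(f"mean = {fold_mean:.7f}")  # compare with importance_mean = 0.0104953
print(f"std  = {fold_std:.7f}")   # compare with importance_std  = 0.0151496
```

Note that the standard deviation here is larger than the mean, which is true of most rows in the table, so the ranking on this small dataset should be read with caution.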
LOFO comes with a plotting helper that visualizes importance_df directly:
    plot_importance(importance_df, figsize=(12, 20))
This produces a plot of the features ranked by importance.
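On the iris dataset the program runs without errors, but every value in importance_df is NaN. Changing scoring="f1" to scoring="f1_macro" fixes this: iris has three classes, and sklearn's "f1" scorer only supports binary targets.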
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer, load_iris
    from sklearn.model_selection import KFold
    from lofo import LOFOImportance, Dataset, plot_importance

    data = load_iris(as_frame=True)  # load as dataframe
    df = data.data
    df['target'] = data.target.values

    # model
    model = RandomForestClassifier()

    # dataset
    dataset = Dataset(df=df, target="target",
                      features=[col for col in df.columns if col != 'target'])

    # get feature importance
    cv = KFold(n_splits=5, shuffle=True, random_state=666)
    lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1_macro", model=model)
    importance_df = lofo_imp.get_importance()
    print(importance_df)
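The NaN values are consistent with how sklearn's F1 metric behaves on multi-class targets. The small example below uses made-up labels (an assumption for illustration) to show that the default binary F1 raises an error on three classes, while the macro average works. Presumably the failed scoring inside cross-validation is what surfaces as NaN in importance_df, though that part is not verified here:

```python
from sklearn.metrics import f1_score

# Hypothetical three-class labels, standing in for iris targets.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

try:
    f1_score(y_true, y_pred)  # default average="binary" only supports two classes
    binary_failed = False
except ValueError as err:
    binary_failed = True
    print("binary F1 failed:", err)

# The macro average computes F1 per class and takes the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print("macro F1:", macro_f1)
```

Other multi-class options such as "f1_micro" or "f1_weighted" are also valid scoring strings in sklearn; which one is appropriate depends on how class imbalance should be treated.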
Putting the steps above together gives the following code, which runs as-is:
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer, load_iris
    from sklearn.model_selection import KFold
    from lofo import LOFOImportance, Dataset, plot_importance

    data = load_breast_cancer(as_frame=True)  # load as dataframe
    df = data.data
    df['target'] = data.target.values

    # model
    model = RandomForestClassifier()

    # dataset
    dataset = Dataset(df=df, target="target",
                      features=[col for col in df.columns if col != 'target'])

    # get feature importance
    cv = KFold(n_splits=5, shuffle=True, random_state=666)
    lofo_imp = LOFOImportance(dataset, cv=cv, scoring="f1", model=model)
    importance_df = lofo_imp.get_importance()
    print(importance_df)
For Fast LOFO, call FLOFOImportance directly. Reference code:
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer
    from lofo import FLOFOImportance

    # step-01: prepare data
    data = load_breast_cancer(as_frame=True)  # load as dataframe
    x_data = data.data.to_numpy()
    y_data = data.target.values
    df = data.data
    df['target'] = data.target.values
    # repeat the rows, since FLOFO needs more than 1000 samples
    # (pd.np is no longer available in recent pandas versions, so call numpy directly)
    df = pd.DataFrame(np.repeat(df.values, 2, axis=0), columns=df.columns)

    # step-02: train model
    model = RandomForestClassifier()
    model.fit(x_data, y_data)

    # step-03: fast-lofo
    lofo_imp = FLOFOImportance(validation_df=df, target="target",
                               features=[col for col in df.columns if col != 'target'],
                               scoring="f1", trained_model=model)
    importance_df = lofo_imp.get_importance()
    print(importance_df)
A few differences between FLOFOImportance and LOFOImportance: