I.feature_extraction
1.Overview:
This module performs feature extraction on raw data.
2.Usage:
Convert lists of feature-value mappings to vectors: class sklearn.feature_extraction.DictVectorizer([dtype=<class 'numpy.float64'>,separator='=',sparse=True,sort=True])
Implement feature hashing (the hashing trick): class sklearn.feature_extraction.FeatureHasher([n_features=1048576,input_type='dict',dtype=<class 'numpy.float64'>,alternate_sign=True])
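The two classes above can be compared with a minimal sketch. The sample records and the small n_features value are made up for illustration and are not part of the original notes.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Hypothetical sample data: each dict maps feature names to values.
records = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
]

# DictVectorizer one-hot encodes string values and passes numbers through.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)
feature_names = vectorizer.feature_names_  # column names, sorted because sort=True

# FeatureHasher needs no fitted vocabulary: feature names are hashed into a
# fixed number of columns (8 here, chosen only to keep the example small).
word_counts = [{"dog": 1, "cat": 2}, {"dog": 3, "bird": 1}]
hasher = FeatureHasher(n_features=8, input_type="dict")
hashed = hasher.transform(word_counts)  # scipy sparse matrix

print(feature_names)
print(X)
print(hashed.shape)
```

DictVectorizer keeps a readable mapping from columns back to feature names, while FeatureHasher trades that mapping for a fixed memory footprint.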
3.feature_extraction.image
(1)Overview:
This submodule extracts features from images.
(2)Functions:
Convert a 2D image into a collection of patches: [<patches>=]sklearn.feature_extraction.image.extract_patches_2d(<image>,<patch_size>[,max_patches=None,random_state=None])
Get the graph of the pixel-to-pixel connections: sklearn.feature_extraction.image.grid_to_graph(<n_x>,<n_y>[,n_z=1,mask=None,return_as=<class 'scipy.sparse.coo.coo_matrix'>,dtype=<class 'int'>])
Get the graph of the pixel-to-pixel gradient connections: sklearn.feature_extraction.image.img_to_graph(<img>[,mask=None,return_as=<class 'scipy.sparse.coo.coo_matrix'>,dtype=None])
Reconstruct an image from all of its patches: [<image>=]sklearn.feature_extraction.image.reconstruct_from_patches_2d(<patches>,<image_size>)
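A small round trip through these functions might look like the sketch below. The 4x4 array stands in for a real image and is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.image import (
    extract_patches_2d,
    grid_to_graph,
    reconstruct_from_patches_2d,
)

# Hypothetical 4x4 grayscale "image".
image = np.arange(16, dtype=np.float64).reshape(4, 4)

# With max_patches=None every overlapping 2x2 patch is extracted.
patches = extract_patches_2d(image, (2, 2))

# Overlapping patches are averaged back into an image of the given size.
restored = reconstruct_from_patches_2d(patches, (4, 4))
round_trip_ok = np.allclose(image, restored)

# Connectivity graph of a 4x4 pixel grid: one row/column per pixel.
graph = grid_to_graph(4, 4)

print(patches.shape, round_trip_ok, graph.shape)
```

A 4x4 image has 3x3 = 9 possible positions for a 2x2 patch, which is why the first dimension of patches is 9.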
(3)Classes:
Extract patches from a collection of images: class sklearn.feature_extraction.image.PatchExtractor([patch_size=None,max_patches=None,random_state=None])
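As a rough sketch, PatchExtractor can be applied to a stack of images as follows. The two 4x4 arrays are placeholder data, not from the original notes.

```python
import numpy as np
from sklearn.feature_extraction.image import PatchExtractor

# Hypothetical batch of two 4x4 grayscale images: shape (n_images, height, width).
images = np.arange(2 * 4 * 4, dtype=np.float64).reshape(2, 4, 4)

# The estimator learns nothing from the data; fit() is called only to follow
# the usual estimator workflow.
extractor = PatchExtractor(patch_size=(2, 2))
patches = extractor.fit(images).transform(images)

print(patches.shape)
```

Patches from all images are concatenated along the first axis, so two images with 9 patches each give 18 patches.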
4.feature_extraction.text
(1)Overview:
This submodule extracts features from text documents.
(2)Usage:
Convert a collection of text documents to a matrix of token counts: class sklearn.feature_extraction.text.CountVectorizer([input='content',encoding='utf-8',decode_error='strict',strip_accents=None,lowercase=True,preprocessor=None,tokenizer=None,stop_words=None,token_pattern='(?u)\b\w\w+\b',ngram_range=(1,1),analyzer='word',max_df=1.0,min_df=1,max_features=None,vocabulary=None,binary=False,dtype=<class 'numpy.int64'>])
Convert a collection of text documents to a matrix of token occurrences: class sklearn.feature_extraction.text.HashingVectorizer([input='content',encoding='utf-8',decode_error='strict',strip_accents=None,lowercase=True,preprocessor=None,tokenizer=None,stop_words=None,token_pattern='(?u)\b\w\w+\b',ngram_range=(1,1),analyzer='word',n_features=1048576,binary=False,norm='l2',alternate_sign=True,dtype=<class 'numpy.float64'>])
Transform a count matrix to a normalized tf or tf-idf representation: class sklearn.feature_extraction.text.TfidfTransformer([norm='l2',use_idf=True,smooth_idf=True,sublinear_tf=False])
Convert a collection of raw documents to a matrix of TF-IDF features: class sklearn.feature_extraction.text.TfidfVectorizer([input='content',encoding='utf-8',decode_error='strict',strip_accents=None,lowercase=True,preprocessor=None,tokenizer=None,analyzer='word',stop_words=None,token_pattern='(?u)\b\w\w+\b',ngram_range=(1,1),max_df=1.0,min_df=1,max_features=None,vocabulary=None,binary=False,dtype=<class 'numpy.float64'>,norm='l2',use_idf=True,smooth_idf=True,sublinear_tf=False])
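How the four classes relate can be seen in a short sketch. The three toy documents and the n_features value are assumptions chosen to keep the output small.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat", "the cat saw the dog"]

# Step 1: raw documents -> token count matrix (learns a vocabulary).
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
vocabulary = sorted(count_vec.vocabulary_)

# Step 2: count matrix -> normalized tf-idf matrix.
tfidf_from_counts = TfidfTransformer().fit_transform(counts)

# TfidfVectorizer performs both steps in one estimator.
tfidf_direct = TfidfVectorizer().fit_transform(docs)
same_result = np.allclose(tfidf_from_counts.toarray(), tfidf_direct.toarray())

# HashingVectorizer keeps no vocabulary, so it can transform without fitting.
hashed = HashingVectorizer(n_features=16).transform(docs)

print(vocabulary, counts.shape, same_result, hashed.shape)
```

With default settings, TfidfVectorizer gives the same matrix as CountVectorizer followed by TfidfTransformer, which is what same_result checks.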
II.feature_selection
1.Overview:
This module performs feature selection, including univariate filter selection methods and the recursive feature elimination algorithm.
2.Usage: