scikit-learn: 4.2. Feature extraction (feature extraction, not feature selection)

wzzkaifa  wzzkaifa  2022-09-09  622

http://scikit-learn.org/stable/modules/feature_extraction.html

I am writing this while sick, sitting in an internet cafe. Your support is appreciated.

1. First, two concepts need to be told apart: feature extraction and feature selection (feature extraction is very different from feature selection).

The former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied on these features, that is, choosing the better features from those already extracted.
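The difference can be sketched in a few lines. This is only an illustration: the records, including the constant `planet` field, are hypothetical, and `VarianceThreshold` is just one of several selectors in `sklearn.feature_selection` that could play the selection role here.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import VarianceThreshold

# Hypothetical records; "planet" has the same value in every record.
records = [
    {"city": "Dubai", "temperature": 33.0, "planet": "Earth"},
    {"city": "London", "temperature": 12.0, "planet": "Earth"},
    {"city": "Dubai", "temperature": 18.0, "planet": "Earth"},
]

# Feature extraction: turn arbitrary data (dicts) into a numeric matrix.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

# Feature selection: operate on the already extracted features and
# drop the columns whose variance is zero.
selector = VarianceThreshold()
X_selected = selector.fit_transform(X)

# Names of the columns that survived the selection step.
kept = [name for name, keep in zip(vec.feature_names_, selector.get_support()) if keep]
print(X.shape, X_selected.shape)
print(kept)
```

Extraction produces four columns; selection then removes the constant `planet=Earth` column and leaves the other three untouched.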

The rest is split into four parts. The main one is still part 4, text feature extraction.

2. Loading features from dicts

This is the job of the DictVectorizer class. One example is enough:

>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
>>> vec.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() there
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']

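DictVectorizer is also useful for feature windows extracted around a particular word. For example, suppose an existing algorithm has extracted the PoS (Part of Speech) features around the word 'sat' in the sentence 'The cat sat on the mat.', as follows: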

>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]

The PoS features above can then be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):

>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1.,  1.,  1.,  1.,  1.,  1.]])
>>> vec.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() there
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
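The optional normalization step mentioned above could look roughly like the sketch below. It assumes a plain `make_pipeline` of DictVectorizer and TfidfTransformer with default settings. The second window is invented for illustration, since a single sample gives TF-IDF nothing to compare against.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline

pos_windows = [
    # Window around 'sat' in 'The cat sat on the mat.' (from the text above).
    {"word-2": "the", "pos-2": "DT", "word-1": "cat",
     "pos-1": "NN", "word+1": "on", "pos+1": "PP"},
    # Hypothetical second window, invented for illustration only.
    {"word-2": "a", "pos-2": "DT", "word-1": "dog",
     "pos-1": "NN", "word+1": "under", "pos+1": "PP"},
]

# Vectorize the dicts, then reweight and L2-normalize every row.
pipeline = make_pipeline(DictVectorizer(), TfidfTransformer())
X = pipeline.fit_transform(pos_windows)

# With the default norm="l2", each row has unit Euclidean length.
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(X.shape)
print(row_norms)
```

The three PoS features shared by both windows receive a lower weight than the word features that occur in only one of them.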

3. Feature hashing

The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”.

Because it relies on hashing, it stores only the integer index of each feature and not the original feature name string, so it has no inverse_transform method.

FeatureHasher accepts dicts, (feature, value) pairs, or strings, depending on the constructor parameter input_type. The result is a scipy.sparse matrix. With strings, each value defaults to 1, so ['feat1', 'feat2', 'feat2'] is interpreted as [('feat1', 1), ('feat2', 2)].
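A small sketch of both input types follows. The values `n_features=16` and `alternate_sign=False` are choices made here so that the result is small and easy to read; they are not the library defaults.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# String input: every occurrence of a feature name counts as 1.
hasher_str = FeatureHasher(n_features=16, input_type="string", alternate_sign=False)
X_str = hasher_str.transform([["feat1", "feat2", "feat2"]])

# Dict input: the same sample with explicit values.
hasher_dict = FeatureHasher(n_features=16, input_type="dict", alternate_sign=False)
X_dict = hasher_dict.transform([{"feat1": 1, "feat2": 2}])

# Both inputs hash to the same row of the sparse matrix.
same = np.array_equal(X_str.toarray(), X_dict.toarray())
print(X_str.shape, same)

# Only hashed column indices are kept, so there is no way back to the names.
print(hasattr(hasher_str, "inverse_transform"))
```

With such a small `n_features` two names can collide in one column. Real use normally keeps the much larger default.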

4. Text feature extraction

There is too much material here, so it is written up separately. See this blog post: http://blog.csdn.net/mmc2015/article/details/46997379

5. Image feature extraction

Patch extraction:

The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis. reconstruct_from_patches_2d rebuilds the original image from all of its patches:

>>> import numpy as np
>>> from sklearn.feature_extraction import image

>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0]  # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])

>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
...     random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0,  3],
        [12, 15]],

       [[15, 18],
        [27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
       [27, 30]])

Reconstruction works as follows:

>>> reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
>>> np.testing.assert_array_equal(one_image, reconstructed)

The PatchExtractor class works the same way as extract_patches_2d, except that it accepts several images as input at once:

>>> five_images = np.arange(5 * 4 * 4 * 3).reshape(5, 4, 4, 3)
>>> patches = image.PatchExtractor(patch_size=(2, 2)).transform(five_images)
>>> patches.shape
(45, 2, 2, 3)

Connectivity graph of an image:

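The idea is to decide, mainly from the difference between pixels, whether any two pixels of the image are connected. This information is stored in a connectivity matrix.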

The function img_to_graph returns such a matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given the shape of these images.
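A minimal sketch of both functions on a tiny image follows. The 3 x 4 array of consecutive values is made up for illustration, and the pixel numbering in the comments assumes the row-major order in which the image is flattened.

```python
import numpy as np
from sklearn.feature_extraction import image

# A made-up 3 x 4 grayscale image with consecutive pixel values.
img = np.arange(12, dtype=float).reshape(3, 4)

# Sparse 12 x 12 graph built from the image itself;
# edges between neighbouring pixels are weighted by their gradient.
gradient_graph = image.img_to_graph(img)

# Pure connectivity (no weights): only the shape of the image is needed.
connectivity = image.grid_to_graph(n_x=3, n_y=4)
dense = connectivity.toarray()

print(gradient_graph.shape, connectivity.shape)
# Pixel 0 is (0, 0); pixel 1 is its right neighbour, pixel 4 the one below,
# and pixel 5 the diagonal neighbour, which is not connected.
print(dense[0, 1], dense[0, 4], dense[0, 5])
```

A matrix like `connectivity` is what estimators that take a `connectivity` argument expect, for example AgglomerativeClustering when only neighbouring pixels may be merged.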

Here is an intuitive example: http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html#example-cluster-plot-lena-ward-segmentation-py

My head hurts. Off to sleep.
