scikit-learn Datasets

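This article introduces the sklearn.datasets module. It provides utilities for loading datasets, including functions that load bundled datasets and functions that fetch popular reference datasets. It also includes several artificial data generators.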

sklearn datasets

[Figure: overview of the sklearn datasets]

sklearn.datasets

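(1) datasets.load_*()

Loads small datasets that ship with scikit-learn inside the datasets package.

(2) datasets.fetch_*()

Fetches larger datasets that must be downloaded from the network. These functions accept a data_home parameter that sets the download directory. The default is ~/scikit_learn_data/; to change the default, set the SCIKIT_LEARN_DATA environment variable.

(3) datasets.make_*()

Generates datasets locally.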
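As a rough illustration of the three families, the sketch below calls one load_* function and one make_* function. The fetch_* call is left commented out because it downloads data on first use. make_blobs and fetch_california_housing are not discussed in this article and are chosen here only as examples.

```python
from sklearn import datasets

# load_*: small dataset bundled with scikit-learn, no download needed
iris = datasets.load_iris()
print(iris.data.shape)   # (150, 4)

# make_*: synthetic data generated locally
X, y = datasets.make_blobs(n_samples=50, n_features=2, centers=3, random_state=0)
print(X.shape, y.shape)  # (50, 2) (50,)

# fetch_*: larger dataset, downloaded to data_home on first use
# housing = datasets.fetch_california_housing()
```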

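The load_* and fetch_* functions return a Bunch object (sklearn.utils.Bunch; datasets.base.Bunch in older releases). It is essentially a dict whose key-value pairs can also be accessed as object attributes. It mainly contains the following attributes:
- data: the feature array, a 2-D numpy.ndarray of shape (n_samples, n_features)
- target: the label array, a 1-D numpy.ndarray of length n_samples
- DESCR: a description of the dataset
- feature_names: the feature names
- target_names: the label names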
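A minimal sketch of the two access styles, using load_iris as the example loader. The exact set of keys depends on the scikit-learn version, so the key listing is printed without an expected value.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Key-style and attribute-style access return the same underlying object
print(iris["data"] is iris.data)   # True

# The available keys vary between scikit-learn versions
print(sorted(iris.keys()))

print(iris.data.shape, iris.target.shape)   # (150, 4) (150,)
```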
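The dataset directory can be obtained with datasets.get_data_home(), and clear_data_home(data_home=None) deletes all downloaded data.
- datasets.get_data_home(data_home=None)
Returns the path of the scikit-learn data directory. This folder is used by some of the larger dataset loaders to avoid downloading the data more than once. By default, the data directory is a folder named "scikit_learn_data" in the user's home folder. Alternatively, it can be set through the SCIKIT_LEARN_DATA environment variable or programmatically by passing an explicit folder path. The '~' symbol is expanded to the user's home folder. If the folder does not exist, it is created automatically.
- sklearn.datasets.clear_data_home(data_home=None)
Deletes the data stored in the data directory.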
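The following sketch exercises both functions against a throwaway temporary directory, so that no real cache is touched. The folder name my_sklearn_data is a hypothetical choice made for this example.

```python
import os
import tempfile

from sklearn.datasets import clear_data_home, get_data_home

# Hypothetical location inside a temporary directory
base = tempfile.mkdtemp()
target = os.path.join(base, "my_sklearn_data")

# get_data_home creates the folder if it does not exist yet
path = get_data_home(data_home=target)
existed = os.path.isdir(path)
print(existed)   # True

# Remove whatever has been cached under that folder
clear_data_home(data_home=target)
```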

Loading small datasets

For classification

sklearn.datasets.load_iris

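The iris dataset contains measurements of iris flowers together with the species each flower belongs to. The measurements are sepal length, sepal width, petal length and petal width. There are three classes: Iris Setosa, Iris Versicolour and Iris Virginica. The dataset is suited to multi-class classification problems.
The loader's parameters include:
• return_X_y:

If True, the data is returned as a (data, target) tuple. The default is False, which returns a dict-like Bunch holding all the information about the dataset (including data and target).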

from sklearn.datasets import load_iris
data = load_iris(return_X_y=True)
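Since return_X_y=True yields a tuple, it is usually unpacked directly. A short sketch:

```python
from sklearn.datasets import load_iris

# The tuple is unpacked into the feature matrix and the label vector
X, y = load_iris(return_X_y=True)
print(X.shape)   # (150, 4)
print(y.shape)   # (150,)
```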

from sklearn.datasets import load_iris
data = load_iris()
# List the attributes and methods of data
print(dir(data))
print('*'*80)
# Show the dataset description
print(data.DESCR)
print('*'*80)
# Show the feature names
print(data.feature_names)
#print(data.data)
print('*'*80)
# Show the class names
print(data.target_names)
print('*'*80)
print(data.target)
print('*'*80)
# Show the target values of the 2nd, 11th and 101st samples
print(data.target[[1,10, 100]])

['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']
********************************************************************************
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988
            
   '''       (partially omitted)      '''

********************************************************************************
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
********************************************************************************
['setosa' 'versicolor' 'virginica']
********************************************************************************
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
********************************************************************************
[0 0 2]
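The class distribution reported in DESCR and the integer labels printed above can be cross-checked with a few lines of NumPy. This is only a small sketch; np.bincount is used here as one convenient way to count labels.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Number of samples per class
print(np.bincount(iris.target))   # [50 50 50]

# Translate integer labels into class names for the same three samples
print(iris.target_names[iris.target[[1, 10, 100]]])   # ['setosa' 'setosa' 'virginica']
```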

sklearn.datasets.load_digits

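The handwritten digits dataset contains 1797 samples of the digits 0-9. Each digit is an 8*8 matrix whose values range from 0 to 16 and represent the pixel intensity.
The loader's parameters include:
• return_X_y: if True, the data is returned as a (data, target) tuple. The default is False, which returns a dict-like Bunch holding all the information about the dataset (including data and target);
• n_class: the number of classes to return, 10 by default. For example, n_class=5 returns the samples for digits 0 to 4.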

from sklearn.datasets import load_digits
digits = load_digits(n_class=5,return_X_y=False)
# Show the target values of the first 10 samples
print(digits.target[0:10])

[0 1 2 3 4 0 1 2 3 4]

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
digits = load_digits(n_class=10,return_X_y=False)
print(dir(digits))
print('*'*80)
print(digits.DESCR)
print('*'*80)
print(digits.data)
print('*'*80)
print(digits.target_names)
print('*'*80)
print(digits.target[[2,20,200]])
print('*'*80)
print(digits.images.shape)
plt.matshow(digits.images[1])
plt.savefig('手写数字1')
plt.show()

['DESCR', 'data', 'images', 'target', 'target_names']
********************************************************************************
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998
'''       (partially omitted)      '''
********************************************************************************
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
********************************************************************************
[0 1 2 3 4 5 6 7 8 9]
********************************************************************************
[2 0 1]
********************************************************************************
(1797, 8, 8)

[Figure: 手写数字1.png, the 8*8 image digits.images[1] drawn with plt.matshow]
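The relationship between data and images, the value range, and the effect of n_class can be verified directly. A small sketch:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape, digits.images.shape)   # (1797, 64) (1797, 8, 8)

# Each row of data is the corresponding 8x8 image flattened to 64 values
flat = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flat, digits.data))        # True

# Pixel intensities lie between 0 and 16
print(digits.data.min(), digits.data.max())     # 0.0 16.0

# n_class=5 keeps only the digits 0 to 4
subset = load_digits(n_class=5)
print(np.unique(subset.target))                 # [0 1 2 3 4]
```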

For regression

sklearn.datasets.load_boston

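The Boston housing dataset contains 506 records, each describing a housing tract and its surroundings, including the per-capita crime rate, the nitric oxides concentration, the average number of rooms per dwelling, the weighted distance to employment centres and the median value of owner-occupied homes. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code in this section only runs on older releases.
Attributes of the Boston housing dataset:
- CRIM: per-capita crime rate by town.
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).
- NOX: nitric oxides concentration.
- RM: average number of rooms per dwelling.
- AGE: proportion of owner-occupied units built before 1940.
- DIS: weighted distances to five Boston employment centres.
- RAD: index of accessibility to radial highways.
- TAX: full-value property-tax rate per $10,000.
- PTRATIO: pupil-teacher ratio by town.
- B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
- LSTAT: percentage of the population with lower status.
- MEDV: median value of owner-occupied homes, in thousands of dollars.
The loader's parameters include:
• return_X_y:

If True, the data is returned as a (data, target) tuple. The default is False, which returns a dict-like Bunch holding all the information about the dataset (including data and target).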

from sklearn.datasets import load_boston
boston = load_boston()
print(dir(boston))
print('*'*80)
print(boston.DESCR)
print('*'*80)
print(boston.feature_names)
print(boston.data)
print('*'*80)
print(boston.filename)
print('*'*80)
print(boston.target)

['DESCR', 'data', 'feature_names', 'filename', 'target']
********************************************************************************
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.
'''       (partially omitted)      '''
********************************************************************************
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
********************************************************************************
D:\Anaconda3\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv
********************************************************************************
[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 '''       (partially omitted)      '''
 16.7 12.  14.6 21.4 23.  23.7 25.  21.8 20.6 21.2 19.1 20.6 15.2  7.
  8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
 22.  11.9]
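Because load_boston is not available in every scikit-learn release, a guarded import is one way to keep a script from failing outright. The sketch below assumes that an unavailable loader surfaces as an ImportError, which is how releases without it behave.

```python
import warnings

try:
    from sklearn.datasets import load_boston
except ImportError:
    # Raised by releases where the loader has been removed
    load_boston = None

if load_boston is None:
    print("load_boston is not available in this scikit-learn release")
else:
    with warnings.catch_warnings():
        # Silence the deprecation warning emitted by some releases
        warnings.simplefilter("ignore")
        X, y = load_boston(return_X_y=True)
    print(X.shape, y.shape)
```

On a release that still ships the loader, the shapes printed are (506, 13) and (506,), matching the 506 records and 13 predictive attributes described above.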

sklearn.datasets.load_diabetes

from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(dir(diabetes))
print('*'*80)
print(diabetes.DESCR)
print('*'*80)
print(diabetes.data_filename)
print('*'*80)
print(diabetes.feature_names)
print(diabetes.data)
print('*'*80)
print(diabetes.target_filename)

['DESCR', 'data', 'data_filename', 'feature_names', 'target', 'target_filename']
********************************************************************************
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
'''       (partially omitted)      '''
********************************************************************************
D:\Anaconda3\lib\site-packages\sklearn\datasets\data\diabetes_data.csv.gz
********************************************************************************
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
[[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990842
  -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06832974
  -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286377
  -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04687948
   0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452837
  -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00421986
   0.00306441]]
********************************************************************************
D:\Anaconda3\lib\site-packages\sklearn\datasets\data\diabetes_target.csv.gz
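The note in DESCR says each feature column is mean centered and scaled so that its sum of squares is 1. A quick numerical check, with tolerances chosen loosely for this sketch:

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)   # (442, 10) (442,)

# Per-column mean and sum of squares
col_means = X.mean(axis=0)
col_sq_sums = (X ** 2).sum(axis=0)

# Columns are centered around zero and have unit sum of squares
centered = np.allclose(col_means, 0.0, atol=1e-6)
unit_norm = np.allclose(col_sq_sums, 1.0, atol=1e-3)
print(centered, unit_norm)
```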

Fetching large datasets

sklearn.datasets.fetch_20newsgroups
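The loader's parameters include:

subset: 'train', 'test' or 'all', optional. Selects which part of the dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both.

data_home: optional, default None. Specifies the download and cache folder for the dataset. If None, all scikit-learn data is stored in the '~/scikit_learn_data' subfolder.

categories: list of category names to load; by default all 20 categories are loaded.

shuffle: whether to shuffle the data.

random_state: a numpy random number generator or an integer seed, used when shuffling.

download_if_missing: optional, default True. If the data is not available locally, it is downloaded.

remove: a tuple containing any of ('headers', 'footers', 'quotes'); the corresponding parts of each text are removed.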

from sklearn.datasets import fetch_20newsgroups
data_test = fetch_20newsgroups(subset='test', data_home=None, categories=None,
                               shuffle=True, random_state=42, remove=(),
                               download_if_missing=True)
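To show categories, remove and download_if_missing together without forcing a download, the sketch below only reads from the local cache and reports when the data is absent. It assumes a missing cache surfaces as an OSError; the two category names are taken from the dataset's own category list.

```python
from sklearn.datasets import fetch_20newsgroups

try:
    news = fetch_20newsgroups(
        subset='test',
        categories=['rec.autos', 'sci.space'],    # load only two of the 20 groups
        remove=('headers', 'footers', 'quotes'),  # strip metadata from each post
        download_if_missing=False,                # never touch the network
    )
    status = 'loaded {} documents from the local cache'.format(len(news.data))
except OSError:
    # The dataset has not been downloaded to data_home yet
    news = None
    status = 'dataset not found in the local cache'

print(status)
```

Setting download_if_missing back to True (the default) lets the same call download the data on first use.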

from sklearn.datasets import fetch_20newsgroups
data_test = fetch_20newsgroups(subset='test',shuffle=True,random_state=42)
data_train = fetch_20newsgroups(subset='train',shuffle=True,random_state=42)
print(dir(data_train))
print('*'*80)
#print(data_train.DESCR)
print('*'*80)
print(data_test.data[0])  # the first document in the test set
print('-'*80)
print('Training set category names: {} '.format(data_train.target_names))
print(data_test.target[:10])
print('*'*80)
print('Training set: {} samples'.format(data_train.target.shape[0]))
print('Test set: {} samples'.format(data_test.target.shape[0]))

['DESCR', 'data', 'filenames', 'target', 'target_names']
********************************************************************************
********************************************************************************
From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organization: University at Buffalo
Lines: 10
News-Software: VAX/VMS VNEWS 1.41
Nntp-Posting-Host: ubvmsd.cc.buffalo.edu

 I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.

                        Neil Gandler

--------------------------------------------------------------------------------
Training set category names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'] 
[ 7  5  0 17 19 13 15 15  5  1]
********************************************************************************
Training set: 11314 samples
Test set: 7532 samples

sklearn.datasets.fetch_20newsgroups_vectorized

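Loads the 20 newsgroups dataset and converts it into tf-idf vectors. This is a convenience function: the tf-idf conversion is done with the default settings of sklearn.feature_extraction.text.Vectorizer (the class name used in older scikit-learn documentation).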

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.utils import shuffle
bunch = fetch_20newsgroups_vectorized(subset='all')
X,y = shuffle(bunch.data,bunch.target)
print(X.shape)
# Split the data into a training set (70%) and a test set (30%)
offset = int(X.shape[0]*0.7)
X_train, y_train = X[0:offset], y[0:offset]
X_test, y_test = X[offset:], y[offset:]
print(X_train.shape)
print(X_test.shape)

(18846, 130107)
(13192, 130107)
(5654, 130107)
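The manual shuffle-and-slice split above can also be written with train_test_split, which shuffles by default. To stay runnable without a download, this sketch uses a random sparse matrix as a stand-in for the vectorized newsgroups data; the sizes and the 20 label values are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# Stand-in for the sparse feature matrix and its labels
X = sparse.random(1000, 50, density=0.05, format='csr', random_state=0)
y = np.random.RandomState(0).randint(0, 20, size=1000)

# 70% training data, 30% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape)   # (700, 50)
print(X_test.shape)    # (300, 50)
```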

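Generating data locally

Generating classification data:
- sklearn.datasets.make_classification
- Its parameters include:
  
  n_samples: int, optional (default = 100), the number of samples
  
  n_features: int, optional (default = 20), the total number of features, made up of the n_informative, n_redundant and n_repeated features plus any remaining features, which are filled with random noise
  
  n_informative: the number of informative features
  
  n_redundant: the number of redundant features, generated as random linear combinations of the informative features
  
  n_repeated: the number of duplicated features, drawn at random from the informative and redundant features
  
  n_classes: int, optional (default = 2), the number of classes
  
  n_clusters_per_class: the number of clusters each class is made of
  
  random_state: int, RandomState instance, optional (default = None); if an int, it is the seed used by the random number generator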
```
from sklearn import datasets
import matplotlib.pyplot as plt 
 
data,target = datasets.make_classification(n_samples=100,n_features=2,
                                           n_informative=2,n_redundant=0,n_repeated=0,
                                           n_classes=2,n_clusters_per_class=1,
                                           random_state=0)
print(data.shape)
print(target.shape)
#print(data)
#print(target)
plt.scatter(data[:,0],data[:,1],c=target)
plt.show()
```
```
(100, 2)
(100,)
```
[Figure: 111.png, scatter plot of the generated two-class data]
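The roles of n_informative, n_redundant and n_repeated are easier to see when feature shuffling is turned off, because the columns then keep their generation order: informative, redundant, repeated, then noise. The sketch below relies on the shuffle=False parameter, which is not covered in the list above.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=2, n_redundant=2, n_repeated=1,
                           n_classes=2, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

# Columns 0-3 hold two informative features plus two linear combinations
# of them, so together they span only two dimensions
rank = np.linalg.matrix_rank(X[:, :4])
print(rank)

# Column 4 is an exact copy of one of the first four columns
is_copy = any(np.array_equal(X[:, 4], X[:, j]) for j in range(4))
print(is_copy)

# Column 5 is the single leftover feature, filled with random noise
print(X.shape)
```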

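Generating regression data:
- sklearn.datasets.make_regression
- Its parameters include:
  
  n_samples: int, optional (default = 100), the number of samples
  
  n_features: int, optional (default = 100), the number of features
  
  coef: boolean, optional (default = False); if True, the coefficients of the underlying linear model are returned
  
  random_state: int, RandomState instance, optional (default = None)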
```
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=10, random_state=1)
print(X.shape)
print(y.shape)
```
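With coef=True the generator also returns the coefficients of the underlying linear model. As a sketch, and relying on the defaults noise=0.0 and bias=0.0, the targets can be reproduced exactly from X and those coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(n_samples=100, n_features=10,
                             coef=True, random_state=1)
print(X.shape, y.shape, coef.shape)   # (100, 10) (100,) (10,)

# Without noise or bias, y is an exact linear function of X
print(np.allclose(X @ coef, y))       # True
```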
Image data

With Anaconda, the sample images shipped with sklearn are in the following directory:

D:\Anaconda3\Lib\site-packages\sklearn\datasets\images

It contains china.jpg and flower.jpg.

from sklearn.datasets import load_sample_image
import matplotlib.pyplot as plt
img = load_sample_image('china.jpg')
plt.imshow(img)
plt.show()

[Figure: china.png, the china.jpg sample image displayed with plt.imshow]
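The loaded image is an ordinary NumPy array, so it can be inspected without plotting. The sketch below assumes Pillow is installed, since the loader uses it to decode the JPEG file, and falls back gracefully if it is not.

```python
from sklearn.datasets import load_sample_image

try:
    img = load_sample_image('china.jpg')
except ImportError:
    # The loader needs Pillow to decode the JPEG file
    img = None

if img is not None:
    # Height, width and the three RGB channels
    print(type(img).__name__, img.dtype, img.shape)   # ndarray uint8 (427, 640, 3)
    # A single pixel is an RGB triple
    print(img[0, 0])
```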

