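While browsing the scikit-learn documentation I kept finding more and more to read, so I decided to work through its tutorial instead (only the first part), mainly to see how scikit-learn organizes the body of knowledge. Sure enough, it starts with supervised versus unsupervised learning: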
Machine learning
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data.
Learning problems fall into a few categories:
supervised learning, in which the data comes with additional attributes that we want to predict. This problem can be either:
- classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data.
- regression: if the desired output consists of one or more continuous variables, then the task is called regression.
unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values.
- clustering: The goal in such problems may be to discover groups of similar examples within the data
- density estimation: to determine the distribution of data within the input space
- dimensionality reduction: project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
see more dataset load method
data, targets
A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of a supervised problem, one or more response variables are stored in the .target member.
In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T). More generally, an estimator is any object that learns from data.
An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification.
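- estimator.param1 is a parameter passed in by the user (a hyper-parameter set at construction)
- estimator.param1_ (trailing underscore) is an estimated parameter, available only after fit()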
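The naming convention above can be checked directly. This is a minimal sketch, assuming the iris dataset and an SVC with arbitrarily chosen hyper-parameters (C=10.0 and a linear kernel are illustrative values, not ones from the tutorial):

```python
from sklearn import datasets, svm

iris = datasets.load_iris()

# C and kernel are hyper-parameters: we choose them and they are stored as given.
clf = svm.SVC(C=10.0, kernel="linear")
print(clf.C, clf.kernel)

# Attributes ending in "_" are estimated from data, so they do not exist before fit().
has_before = hasattr(clf, "support_vectors_")

clf.fit(iris.data, iris.target)
has_after = hasattr(clf, "support_vectors_")

print(has_before, has_after)
print(clf.support_vectors_.shape)  # (number of support vectors, n_features)
```

The check with hasattr is just for demonstration; in normal use you simply read the underscore attributes after fitting.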
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
# digits.data.shape, (1797, 64)
# digits.images.shape, (1797, 8, 8)
import matplotlib.pyplot as plt
plt.imshow(digits.images[-1], cmap='gray')
from sklearn import svm
# first, we treat the estimator as a black box, and set params manually
# or we can use `grid search` and `cross validation` to determine the best params
clf = svm.SVC(gamma=0.001, C=100.)
# train(or learn) from all (except last one) digits
# validate with the last digit
clf.fit(digits.data[:-1], digits.target[:-1])
# better practice
from sklearn.model_selection import train_test_split
# flatten the images
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Create a classifier: a support vector classifier
clf = svm.SVC(gamma=0.001)
# Split data into 50% train and 50% test subsets
X_train, X_test, y_train, y_test = train_test_split(
data, digits.target, test_size=0.5, shuffle=False)
# Learn the digits on the train subset
clf.fit(X_train, y_train)
# Predict the value of the digit on the test subset
predicted = clf.predict(X_test)
classification_report builds a text report showing the main classification metrics.
confusion matrix
ConfusionMatrixDisplay (plot_confusion_matrix in scikit-learn versions before 1.2) can be used to visually represent a confusion matrix:
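- Predicted Positive: 4 cases, 3 correct and 1 wrong (TP = 3, FP = 1)
- Predicted Negative: 6 cases, 4 correct and 2 wrong (TN = 4, FN = 2)
- Accuracy: 7/10, i.e. (TP+TN)/total. This is simply the fraction of predictions that were correct.
- Precision: 3/4, i.e. TP/(TP+FP). It looks at a single predicted class (a column-wise count), for example the positives or the negatives, and asks how many of those predictions were wrong.
- In this example only 4 samples were predicted positive, and one of them was wrong.
- Recall: 3/5, i.e. TP/(TP+FN). It also looks at a single class, for example the positives, but asks how many of all the actual positives you managed to find.
- In other words, if you predict every sample as positive, recall reaches 100%.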
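The arithmetic above can be reproduced with sklearn.metrics. This is a small sketch; the label vectors are made up to match the counts in the list (TP = 3, FP = 1, TN = 4, FN = 2) and are not from the tutorial:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# 5 actual positives followed by 5 actual negatives
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# 4 predicted positives (3 right, 1 wrong), 6 predicted negatives (4 right, 2 wrong)
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# For binary labels {0, 1} the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP =", tn, fp, fn, tp)

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
print(accuracy, precision, recall)

# Predicting everything as positive pushes recall to 1.0 at the cost of precision
recall_all_positive = recall_score(y_true, [1] * len(y_true))
print(recall_all_positive)
```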
from sklearn import metrics
print(f"Classification report for classifier {clf}:\n"
f"{metrics.classification_report(y_test, predicted)}\n")
Classification report for classifier SVC(gamma=0.001):
              precision    recall  f1-score   support
           0       1.00      0.99      0.99        88
           1       0.99      0.97      0.98        91
           2       0.99      0.99      0.99        86
           3       0.98      0.87      0.92        91
           4       0.99      0.96      0.97        92
           5       0.95      0.97      0.96        91
           6       0.99      0.99      0.99        91
           7       0.96      0.99      0.97        89
           8       0.94      1.00      0.97        88
           9       0.93      0.98      0.95        92
    accuracy                           0.97       899
   macro avg       0.97      0.97      0.97       899
weighted avg       0.97      0.97      0.97       899
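# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
disp = metrics.ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)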
disp.figure_.suptitle("Confusion Matrix")
print(f"Confusion matrix:\n{disp.confusion_matrix}")
Type Casting
Unless otherwise specified, input will be cast to float64:
>>> import numpy as np
>>> from sklearn import random_projection
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> X.dtype
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
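In the example above, the float32 X is cast to float64 by fit_transform(X). (This is the behaviour described in the tutorial version these notes follow; newer scikit-learn releases may keep float32 for some estimators.)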
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
clf = SVC()
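clf.fit(iris.data, iris.target)
print(clf.predict(iris.data[:3]).tolist())  # integer targets give integer predictions
# fit on string labels instead
clf.fit(iris.data, iris.target_names[iris.target])
print(clf.predict(iris.data[:3]).tolist())  # string targets give string predictions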
[0, 0, 0]
['setosa', 'setosa', 'setosa']
Refitting and updating parameters
Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() again then overwrites whatever was learned by the previous fit().
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)  # note: a different way of loading the data
clf = SVC()
clf.set_params(kernel='linear').fit(X, y)
print('linear', clf.predict(X[:5]))
clf.set_params(kernel='rbf').fit(X, y)
print('rbf', clf.predict(X[:5]))
linear [0 0 0 0 0]
rbf [0 0 0 0 0]
Multiclass vs. multilabel fitting
When using multiclass classifiers, the learning and prediction task that is performed is dependent on the format of the target data fit upon:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer
X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
y = [0, 0, 1, 1, 2]
# note the one-line chained style
classif = OneVsRestClassifier(estimator=SVC(random_state=0))
print('1d:', classif.fit(X, y).predict(X))
# one-hot
y = LabelBinarizer().fit_transform(y)
print('y:', y)
print('one-hot:', classif.fit(X, y).predict(X))  # the predictions are already less accurate here (some rows are all zeros)
1d: [0 0 1 1 2]
y: [[1 0 0]
[1 0 0]
[0 1 0]
[0 1 0]
[0 0 1]]
one-hot: [[1 0 0]
[1 0 0]
[0 1 0]
[0 0 0]
[0 0 0]]
Multilabel
from sklearn.preprocessing import MultiLabelBinarizer
y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]  # each instance gets several labels (the 4th even has 3)
y = MultiLabelBinarizer().fit_transform(y)
classif.fit(X, y).predict(X)
array([[1, 1, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[1, 0, 1, 0, 0],
[1, 0, 1, 0, 0]])
KNN (k nearest neighbors classification)
import numpy as np
from sklearn import datasets
iris_X, iris_y = datasets.load_iris(return_X_y=True)
# Split iris data in train and test data
# A random permutation, to split the data randomly
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
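knn.fit(iris_X_train, iris_y_train)
print(knn.predict(iris_X_test))  # predicted labels for the 10 held-out samples
print(iris_y_test)  # true labels (the split is random, so your output will differ)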
[1 2 1 0 0 0 2 1 2 0]
[1 1 1 0 0 0 2 1 2 0]
The curse of dimensionality
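With nearest-neighbor algorithms, the higher the dimension, the more data you need to guarantee enough neighbors close to any given point. So in general KNN gets worse when there are many features. There are exceptions, of course: once I ran KNN on 20 features and it actually beat a random forest (:з」∠), which was rather awkward…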
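To make the "not enough close neighbors" point concrete, here is a small sketch, assuming uniformly random points in the unit hypercube (the sample size and the dimensions tried are arbitrary choices of mine). With the number of samples held fixed, the average distance to the nearest neighbor grows as features are added:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
n_samples = 500

mean_nn_distance = {}
for n_features in (1, 10, 100):
    X = rng.rand(n_samples, n_features)  # uniform points in the unit hypercube
    # n_neighbors=2 because each point's closest "neighbor" is itself (distance 0)
    distances, _ = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    mean_nn_distance[n_features] = distances[:, 1].mean()

for n_features, d in mean_nn_distance.items():
    print(f"{n_features:>3} features: mean nearest-neighbor distance = {d:.3f}")
```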
If there are few data points per dimension, noise in the observations induces high variance:
train with very few data
from sklearn import linear_model
X = np.c_[ .5, 1].T
y = [.5, 1]
test = np.c_[ 0, 2].T
regr = linear_model.LinearRegression()
import matplotlib.pyplot as plt
for _ in range(6):
    this_X = .1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=20)
Ridge regression:
A solution in high-dimensional statistical learning is to shrink the regression coefficients to zero: any two randomly chosen set of observations are likely to be uncorrelated. This is called Ridge regression.
This is an example of the bias/variance tradeoff: the larger the ridge alpha parameter,
- the higher the bias
- the lower the variance
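Lasso regression and ridge regression are simply standard linear regression with L1 and L2 regularization added, respectively. Rather than dropping some features outright, the penalty shrinks their "contribution" by giving them lower weights (imposing a penalty on the size of the coefficients); ridge only shrinks, while lasso can shrink some coefficients all the way to zero.
Lasso is used more often for estimating sparse coefficients.
The regularized demos and tuning below got mixed in while I was organizing my notes. They are not from the official tutorial, but they are my study notes too and happen to demonstrate some cross validation usage, so I kept them.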
- L1-norm (Lasso)
- L2-norm (Ridge)
- (Elastic Net) (l1+l2)
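For reference, the three penalties in the list above correspond to the following objectives. This is a sketch following the conventions used in the scikit-learn user guide, where n is the number of samples, w the coefficient vector, alpha the regularization strength and rho the l1_ratio argument of ElasticNet:

```latex
\begin{aligned}
\text{Ridge:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:}\quad & \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha\rho \lVert w \rVert_1 + \frac{\alpha(1-\rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

With rho = 1 the Elastic Net objective reduces to the Lasso one, and with rho = 0 only the L2 term remains.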
from sklearn.linear_model import Ridge
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
boston = load_boston()  # same pattern as load_iris() earlier; note load_boston was removed in scikit-learn 1.2
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
model = Ridge(alpha=0.01, normalize=True)  # build a ridge regression model (normalize= was also removed in 1.2)
model.fit(X_train, y_train)  # fit
train_score = model.score(X_train, y_train)  # R^2 on the training set
test_score = model.score(X_test, y_test)  # R^2 on the test set
print(boston.data[:5], boston.target[:5])
print(f"train_score: {train_score}, test_score: {test_score}")
[[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00
6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 3.9690e+02
[2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02
[2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
[3.2370e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.9980e+00
4.5800e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9463e+02
[6.9050e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 7.1470e+00
5.4200e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9690e+02
5.3300e+00]] [24. 21.6 34.7 33.4 36.2]
train_score: 0.723706995939315, test_score: 0.7926416423787221
- Ridge regression is a penalized linear regression model for predicting a numerical value
- and it can be very effective when applied to classification
- the important parameter to tune is the regularization strength (alpha), for example over (0.1, 1.0) with step = 0.1
from sklearn.linear_model import RidgeCV
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
boston = load_boston()
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
# build the model with RidgeCV (ridge with built-in cross validation) instead of Ridge
model = RidgeCV(alphas=[1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001], normalize=True)
model.fit(X_train, y_train)
Lasso demo and tuning
from sklearn.linear_model import Lasso, LassoCV
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)  # note: this fits on the full data, so the test score below is optimistic
print("lasso score", lasso_reg.score(X_test, y_test))
# tuning
lscv = LassoCV(alphas=(1.0, 0.1, 0.01, 0.001, 0.005, 0.0025, 0.001, 0.00025), normalize=True)
lscv.fit(X, y)
print('Lasso optimal alpha: %.3f' % lscv.alpha_)
lasso score 0.7956864030940746
Lasso optimal alpha: 0.010
from sklearn.linear_model import ElasticNet, ElasticNetCV
e_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
e_net.fit(X, y)
print("e_net score:", e_net.score(X_test, y_test))
# tuning
encv = ElasticNetCV(alphas=(0.1, 0.01, 0.005, 0.0025, 0.001),
                    l1_ratio=(0.1, 0.25, 0.5, 0.75, 0.8))
encv.fit(X, y)
print('ElasticNet optimal alpha: %.3f and L1 ratio: %.4f' % (encv.alpha_, encv.l1_ratio_))
e_net score: 0.7926169728251697
ElasticNet optimal alpha: 0.001 and L1 ratio: 0.5000
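# X, y and test here are the two-point toy data from the LinearRegression example above
regr = linear_model.Ridge(alpha=.1)
for _ in range(6):
    this_X = .1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=20)
# Compare the two plots: the first can be read as "overfitting" when there are too few samples.
# Adding the penalty hurts accuracy on the training set, but it does reduce the variance.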
Diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# observe the alpha and the score:
alphas = np.logspace(-4, -1, 6)  # 6 values evenly spaced on a log scale from 10^-4 to 10^-1
print(alphas)
print([f'{regr.set_params(alpha=alpha).fit(diabetes_X_train, diabetes_y_train).score(diabetes_X_test, diabetes_y_test) * 100:.2f}%'
for alpha in alphas])
[0.0001 0.00039811 0.00158489 0.00630957 0.02511886 0.1 ]
['58.51%', '58.52%', '58.55%', '58.56%', '58.31%', '57.06%']
Lasso regression
Lasso = least absolute shrinkage and selection operator
Compared with Ridge, Lasso really does set some feature coefficients to 0 (a sparse method), in line with Occam's razor (prefer simpler models).
regr = linear_model.Lasso()
scores = [regr.set_params(alpha=alpha)
.fit(diabetes_X_train, diabetes_y_train)
.score(diabetes_X_test, diabetes_y_test)
for alpha in alphas]
best_alpha = alphas[scores.index(max(scores))]
regr.alpha = best_alpha  # set_params is only needed for chained calls
regr.fit(diabetes_X_train, diabetes_y_train)
print(regr.coef_)
[ 0. -212.43764548 517.19478111 313.77959962 -160.8303982
-0. -187.19554705 69.38229038 508.66011217 71.84239008]
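The difference between shrinking and zeroing can be seen by counting exact zeros in the fitted coefficients. This is a minimal sketch on the full diabetes dataset; alpha=1.0 is an illustrative value of mine, not a tuned one:

```python
import numpy as np
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)

# alpha=1.0 is an illustrative choice, not a tuned value
ridge = linear_model.Ridge(alpha=1.0).fit(X, y)
lasso = linear_model.Lasso(alpha=1.0).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("ridge coefficients:", np.round(ridge.coef_, 1))
print("lasso coefficients:", np.round(lasso.coef_, 1))
print("exact zeros -> ridge:", n_zero_ridge, "lasso:", n_zero_lasso)
```

Ridge keeps every feature with a smaller weight, while Lasso drops some features entirely, which is why it doubles as a feature selection method.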
Different algorithms for the same problem
Different algorithms can be used to solve the same mathematical problem. For instance the Lasso object in scikit-learn solves the lasso regression problem using a coordinate descent method, which is efficient on large datasets. However, scikit-learn also provides the LassoLars object using the LARS algorithm, which is very efficient for problems in which the weight vector estimated is very sparse (i.e. problems with very few observations).
Logistic Regression
log = linear_model.LogisticRegression(C=1e5)
log.fit(iris_X_train, iris_y_train)
- The C parameter controls the amount of regularization in the LogisticRegression object: a large value for C results in less regularization.
- penalty="l2" gives Shrinkage (i.e. non-sparse coefficients),
- penalty="l1" gives Sparsity.
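The shrinkage-versus-sparsity contrast can be checked by counting zero coefficients under each penalty. This is a sketch under my own assumptions: the digits data is reduced to the two classes 0 and 1 to keep the problem binary, the liblinear solver is used because it supports both penalties, and C=0.1 is an arbitrary, fairly strong regularization setting:

```python
import numpy as np
from sklearn import datasets, linear_model

X, y = datasets.load_digits(return_X_y=True)
# Keep only digits 0 and 1 so this is a plain binary problem
mask = y < 2
X, y = X[mask] / 16.0, y[mask]  # pixel values range from 0 to 16

zeros = {}
for penalty in ("l2", "l1"):
    clf = linear_model.LogisticRegression(penalty=penalty, C=0.1, solver="liblinear")
    clf.fit(X, y)
    zeros[penalty] = int(np.sum(clf.coef_ == 0))

print("zero coefficients out of", X.shape[1], "->", zeros)
```

A few pixels are zero in every image, so even the L2 model can have a handful of exactly-zero weights; the L1 model zeroes out many more.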
from sklearn import datasets, neighbors, linear_model
X_digits, y_digits = datasets.load_digits(return_X_y=True)
X_digits = X_digits / X_digits.max()
n_samples = len(X_digits)
X_train = X_digits[:int(.9 * n_samples)]
y_train = y_digits[:int(.9 * n_samples)]
X_test = X_digits[int(.9 * n_samples):]
y_test = y_digits[int(.9 * n_samples):]
knn = neighbors.KNeighborsClassifier()
logistic = linear_model.LogisticRegression(max_iter=300)
print('KNN score: %f' % knn.fit(X_train, y_train).score(X_test, y_test))
print('LogisticRegression score: %f'
% logistic.fit(X_train, y_train).score(X_test, y_test))
KNN score: 0.961111
LogisticRegression score: 0.933333
Support vector machines (SVMs)
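A support vector machine (SVM) is a binary classification model. Its basic form is the maximum-margin linear classifier defined on the feature space. SVMs also include kernel functions, which turn them into nonlinear classifiers.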
Support Vector Machines belong to the discriminant model family: they try to find a combination of samples to build a plane maximizing the margin between the two classes. Regularization is set by the C parameter:
- a small value for C means the margin is calculated using many or all of the observations around the separating line (more regularization);
- a large value for C means the margin is calculated on observations close to the separating line (less regularization).
SVMs can be used:
- in regression: SVR (Support Vector Regression),
- or in classification: SVC (Support Vector Classification).
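- C is the penalty coefficient, i.e. how tolerant the model is of errors. The higher C is, the less error is tolerated and the easier it is to overfit; the smaller C is, the easier it is to underfit. Either extreme hurts generalization.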
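- gamma is a parameter of the RBF function once it is chosen as the kernel. It implicitly determines how the data is distributed after being mapped to the new feature space: a large gamma makes each training point's influence very local (prone to overfitting), a small gamma makes it broad (a smoother boundary). gamma also changes how many support vectors the model keeps, and the number of support vectors affects training and prediction speed.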
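Rather than memorizing how C and gamma interact, it is easy to measure it. The sketch below, assuming the iris data and a small grid of values I picked for illustration, reports the number of support vectors and the training accuracy for each combination:

```python
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)

results = {}
for C in (0.1, 100.0):
    for gamma in (0.01, 1.0, 100.0):
        clf = svm.SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
        # n_support_ holds the number of support vectors per class
        results[(C, gamma)] = (int(clf.n_support_.sum()), clf.score(X, y))

for (C, gamma), (n_sv, train_acc) in results.items():
    print(f"C={C:<6} gamma={gamma:<6} support vectors={n_sv:<4} train accuracy={train_acc:.3f}")
```

Training accuracy alone says nothing about generalization, so in practice these values should be chosen by cross validation.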
Using kernels
Classes are not always linearly separable in feature space. The solution is to build a decision function that is not linear but may be polynomial instead. This is done using the kernel trick that can be seen as creating a decision energy by positioning kernels on observations:
Linear kernel
svc = svm.SVC(kernel='linear')
Polynomial kernel
svc = svm.SVC(kernel='poly', degree=3)
RBF kernel (Radial Basis Function)
svc = svm.SVC(kernel='rbf')
DEMO: Plot different SVM classifiers in the iris dataset
- LinearSVC minimizes the squared hinge loss while SVC minimizes the regular hinge loss.
- LinearSVC uses the One-vs-All (also known as One-vs-Rest) multiclass reduction while SVC uses the One-vs-One multiclass reduction.
# snippet: define 4 estimators
C = 1.0 # SVM regularization parameter
models = (svm.SVC(kernel='linear', C=C),
svm.LinearSVC(C=C, max_iter=10000),
svm.SVC(kernel='rbf', gamma=0.7, C=C),
svm.SVC(kernel='poly', degree=3, gamma='auto', C=C))
models = (clf.fit(X, y) for clf in models)
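The One-vs-Rest versus One-vs-One difference mentioned above shows up in the shape of the decision function. This is a sketch, assuming the 10-class digits data (my choice, since with only 3 classes both schemes produce 3 columns) and decision_function_shape="ovo" so that SVC exposes its raw pairwise scores:

```python
from sklearn import datasets, svm

X, y = datasets.load_digits(return_X_y=True)  # 10 classes
X = X / 16.0  # pixel values range from 0 to 16

# One-vs-One: one binary classifier per pair of classes, 10 * 9 / 2 = 45
ovo = svm.SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
# One-vs-Rest: one binary classifier per class, 10 in total
ovr = svm.LinearSVC(max_iter=10000).fit(X, y)

ovo_shape = ovo.decision_function(X[:2]).shape
ovr_shape = ovr.decision_function(X[:2]).shape
print("SVC (ovo):", ovo_shape)
print("LinearSVC (ovr):", ovr_shape)
```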
cross validation
RAW DEMO: KFold cross-validation:
import numpy as np
from sklearn import datasets, svm
X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
score = svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
X_folds = np.array_split(X_digits, 3)  # split into 3 folds
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test = X_train.pop(k)  # take out fold k (list.pop accepts any index, not just the last)
    X_train = np.concatenate(X_train)  # join the remaining folds back together
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)
[0.9348914858096828, 0.9565943238731218, 0.9398998330550918]
from sklearn.model_selection import KFold, cross_val_score
# step 1: build a KFold object with the number of folds, then loop over it (split() is really just a generator)
# step 2: compute the score by hand in each iteration
k_fold = KFold(n_splits=3)
scores = [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test]) \
for train, test in k_fold.split(X_digits)]
[0.9348914858096828, 0.9565943238731218, 0.9398998330550918]
# just pass the KFold object in
scores = cross_val_score(svc, X_digits, y_digits, cv=k_fold)
# custom scoring method:
scores = cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro')
[0.93489149 0.95659432 0.93989983]
[0.93969761 0.95911415 0.94041254]
Cross-validation generators:
see more cross-validation generators:
Data transformation with held out data
>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)
>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_val_score(clf, X, y, cv=cv)
array([0.977..., 0.933..., 0.955..., 0.933..., 0.977...])
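The snippet above relies on X, y and cv being defined elsewhere. A self-contained sketch follows, assuming the iris data and a ShuffleSplit splitter with 5 random 70/30 splits (both are my assumptions, not stated in these notes). The point of the pipeline is that the scaler is re-fitted on the training part of every split, so no statistics leak from the held-out part:

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = datasets.load_iris(return_X_y=True)

# Assumed splitter: 5 random splits, 30% held out each time
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

# Scaling and classification are fitted together inside each split
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
scores = cross_val_score(clf, X, y, cv=cv)
print(scores, scores.mean())
```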
cross_validate v.s. cross_val_score
The cross_validate function differs from cross_val_score in two ways:
- It allows specifying multiple metrics for evaluation.
- It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import recall_score
>>> scoring = ['precision_macro', 'recall_macro']
>>> clf = svm.SVC(kernel='linear', C=1, random_state=0)
>>> scores = cross_validate(clf, X, y, scoring=scoring)  # returns the requested metrics as a dict
>>> sorted(scores.keys())
['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
>>> scores['test_recall_macro']
array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])
grid search
Try every combination of the given parameter values in order to maximize the cross-validation score
from sklearn.model_selection import GridSearchCV, cross_val_score
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
print('best score:', clf.best_score_)
print('best estimator.c:', clf.best_estimator_.C)
# Prediction performance on test set is not as good as on train set
score = clf.score(X_digits[1000:], y_digits[1000:])
print('score:', score)
best score: 0.95
best estimator.c: 0.0021544346900318843
score: 0.946047678795483
from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()
X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
lasso.fit(X_diabetes, y_diabetes)
# The estimator chose automatically its lambda (stored as alpha_):
print(lasso.alpha_)
Unsupervised learning: seeking representations of the data
Clustering: grouping observations together
K-means clustering
from sklearn import cluster, datasets
X_iris, y_iris = datasets.load_iris(return_X_y=True)
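k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris)
print(k_means.labels_[::10])  # cluster assigned to every 10th sample
print(y_iris[::10])  # true labels of the same samples
k_means.cluster_centers_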
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
array([[5.006 , 3.428 , 1.462 , 0.246 ],
[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
[6.85 , 3.07368421, 5.74210526, 2.07105263]])