From Linear Regression to Logistic Regression

Binary classification with logistic regression

Probability distribution
The response value represents a probability, in the range [0, 1]

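1 . Ordinary linear regression assumes that the response variable is normally distributed; the normal distribution is also called the Gaussian distribution or bell curve
2 . If the response variable is not normally distributed but instead represents the probability of an event, that assumption does not hold
3 . Generalized linear models use a link function to describe the relationship between the explanatory variables and the response variable
4 . Ordinary linear regression is the special case of the generalized linear model that uses the identity link function: a linear combination of the explanatory variables is related directly to a normally distributed response variable
5 . In logistic regression, if the response variable exceeds a threshold, the prediction is positive; otherwise it is negative
6 . The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function. The logistic function returns a value between 0 and 1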

7 . For the logistic function F(t) = 1 / (1 + e^(-t)), t is equal to a linear combination of the explanatory variables
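The points above can be sketched in a few lines of plain Python. This is only an illustration, not scikit-learn's implementation: the function names, the weights, the intercept and the 0.5 threshold are all assumptions made up for the example.

```python
import math

def logistic(t):
    """Logistic function F(t) = 1 / (1 + e^(-t)); maps a real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_positive(features, weights, intercept, threshold=0.5):
    """Return True when the modelled probability reaches the threshold."""
    # t is the linear combination of the explanatory variables
    t = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(t) >= threshold

# Hypothetical weights and feature values, chosen only for illustration
print(logistic(0.0))
print(predict_positive([1.0, 2.0], weights=[0.5, -0.25], intercept=0.1))
```

With t = 0 the logistic function returns exactly 0.5, which is why 0.5 is a natural default threshold between the two classes.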

Spam filtering (filtering spam SMS messages)

1 . explore the data and calculate some basic summary statistics using pandas

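import pandas as pd
# Tab-separated file without a header row: column 0 is the label ('spam' or 'ham'), column 1 is the message text
df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
print('Number of spam messages:', df[df[0] == 'spam'][0].count())
print('Number of ham messages:', df[df[0] == 'ham'][0].count())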

2 . create a TfidfVectorizer, then fit it with training messages, and transform both the training and test messages

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
# By default 25% of the rows go to the test set; each returned object is a pandas Series
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary and build the training matrix
X_test = vectorizer.transform(X_test_raw)   # reuse that vocabulary; the result is a SciPy sparse matrix
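Why the vectorizer is fitted on the training messages only, and then merely applied to the test messages, can be seen in a simplified sketch. It uses raw word counts and whitespace tokenization instead of TfidfVectorizer's weighting, and the helper names and messages are invented for the example.

```python
from collections import Counter

def fit_vocabulary(messages):
    """Learn a word -> column index mapping from the training messages only."""
    words = sorted({word for message in messages for word in message.lower().split()})
    return {word: index for index, word in enumerate(words)}

def transform(messages, vocabulary):
    """Turn each message into a row of word counts over the learned vocabulary."""
    rows = []
    for message in messages:
        counts = Counter(message.lower().split())
        # Words that were never seen during fitting have no column and are dropped
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_messages = ["win cash now", "see you at lunch"]  # made-up examples
test_messages = ["win a free lunch"]

vocabulary = fit_vocabulary(train_messages)
print(vocabulary)
print(transform(test_messages, vocabulary))  # [[0, 0, 1, 0, 0, 1, 0]]
```

The test row has the same seven columns as the training matrix, so a model trained on one can score the other; "a" and "free" are ignored because they were not in the training vocabulary.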

3 . create an instance of LogisticRegression and train the model

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
# By default 25% of the rows go to the test set; each returned object is a pandas Series
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary and build the training matrix
X_test = vectorizer.transform(X_test_raw)   # the result is a SciPy sparse matrix
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    print('Prediction:%s. Truelabel:%s. Message:%s' % (prediction, y_test.iloc[i], X_test_raw.iloc[i]))
    # iloc (position-based indexing) is required here. X_test_raw[i] would raise an error,
    # because the split shuffles the rows but keeps their original index labels,
    # so label i is usually missing from the test set (this matters most with an integer index)

Binary classification performance metrics

	Predicted positive	Predicted negative
Actual positive	True Positive	False Negative
Actual negative	False Positive	True Negative

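When the code actually runs, the layout is as follows, with the positive class in the bottom row (rows are true labels, columns are predicted labels)

	0	1
0	True Negative	False Positive
1	False Negative	True Positive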

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
y_test = [0,0,0,0,0,1,1,1,1,1]
y_pred = [0,1,0,0,0,0,0,1,1,1]
# Use a separate name for the result so the imported function is not shadowed
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
plt.matshow(matrix)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
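To check the matrix by hand, the four cells can be counted directly for the same y_test and y_pred. This is a plain-Python sketch, not how scikit-learn computes it; the helper name confusion_counts is made up for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells for a binary problem."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tn, fp, fn, tp

y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
tn, fp, fn, tp = confusion_counts(y_test, y_pred)
# Same layout as above: rows are true labels, columns are predicted labels
print([[tn, fp], [fn, tp]])  # [[4, 1], [2, 3]]
```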

Accuracy

Accuracy measures the fraction of the classifier's predictions that are correct

from sklearn.metrics import accuracy_score
y_pred = [0, 1, 1, 0]
y_true = [1, 1, 1, 1]
print('Accuracy:', accuracy_score(y_true, y_pred))  # outcome is 0.5

evaluate the classifier's accuracy

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# 5-fold cross-validation on the training set; the default scoring for a classifier is accuracy
scores = cross_val_score(classifier, X_train, y_train, cv=5)
#y_pre = classifier.predict(X_test)
#for i, pre in enumerate(y_pre[:5]):
#    print(y_pre[i], y_test.iloc[i], X_test_raw.iloc[i])
print('Accuracy', np.mean(scores), scores)
#Outcome:Accuracy 0.955980861244 [ 0.94976077  0.95933014  0.96052632  0.96291866  0.94736842]

Drawbacks
1 . accuracy can't distinguish between false positive errors and false negative errors
2 . accuracy is not an informative metric if the proportions of the classes are skewed in the population
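The second drawback is easy to reproduce with made-up numbers. In this sketch the class proportions (95 ham, 5 spam) are an assumption for illustration, and the "classifier" simply predicts ham for every message.

```python
# Hypothetical skewed population: 95 ham messages (0) and 5 spam messages (1)
y_true = [0] * 95 + [1] * 5
# A useless classifier that labels everything as ham
y_pred = [0] * 100

correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
accuracy = correct / len(y_true)
spam_found = sum(1 for actual, predicted in zip(y_true, y_pred)
                 if actual == 1 and predicted == 1)

print(accuracy)    # 0.95
print(spam_found)  # 0
```

The accuracy looks high even though not a single spam message was detected, and the score says nothing about which kind of error was made.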

Precision and recall

definition

Precision is the fraction of positive predictions that are correct
Recall is the fraction of truly positive instances that the classifier recognizes, that is, the true positives as a share of all actual positives
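Both definitions can be worked through on a small example. The sketch below is plain Python rather than scikit-learn's precision_score and recall_score; the helper name is made up, and returning 0.0 when a denominator is zero is a choice made for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for actual, predicted in pairs if actual == positive and predicted == positive)
    fp = sum(1 for actual, predicted in pairs if actual != positive and predicted == positive)
    fn = sum(1 for actual, predicted in pairs if actual == positive and predicted != positive)
    # Precision: correct positive predictions / all positive predictions
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: correct positive predictions / all actual positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 0.6
```

Here 3 of the 4 positive predictions are correct (precision 0.75), and 3 of the 5 actual positives are found (recall 0.6).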

calculate SMS classifier's precision and recall

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# Note from the original run: this call raised an error and the cause was not identified.
# One possible cause (an assumption, not verified): the 'precision', 'recall' and 'f1'
# scorers treat label 1 as the positive class, so string labels such as 'spam'/'ham'
# would have to be converted to 1/0 first
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precisions), precisions)
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recalls), recalls)
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1:', np.mean(f1s), f1s)
#Outcome:
Precision 0.989910506899 [ 0.98591549  1.          0.98850575  0.98795181  0.98717949]
Recall 0.685907046477 [ 0.60344828  0.69565217  0.74782609  0.71304348  0.66956522]
F1: 0.806840977066 [ 0.84102564  0.81675393  0.8042328   0.79144385  0.78074866]

1 . Precision=0.9899 means almost all of the messages that it predicted as spam were actually spam
2 . Recall=0.686 means it incorrectly classified approximately 31 percent of the spam messages as ham (1 - 0.686 = 0.314)

Calculating the F1 measure
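The F1 measure is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall). A minimal sketch, with the function name and the sample values chosen only for illustration:

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Balanced precision and recall: F1 equals both of them
print(f1_measure(0.5, 0.5))   # 0.5
# Perfect precision but no recall: F1 collapses to zero,
# whereas the arithmetic mean would still report 0.5
print(f1_measure(1.0, 0.0))   # 0.0
# Unequal values: F1 is pulled toward the smaller of the two
print(f1_measure(0.75, 0.6))
```

Unlike the arithmetic mean, the harmonic mean penalizes a classifier whose precision and recall are far apart, which is the situation in the SMS results above (high precision, much lower recall).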

ROC AUC

Unlike accuracy, the ROC curve is insensitive to data sets with unbalanced class proportions
ROC curves plot the classifier's recall against its fall-out
Fall-out, or the false positive rate, is the number of false positives divided by the total number of negatives
AUC (area under the curve) is the area under the ROC curve, which represents the expected performance of the classifier
Plot the ROC curve for the SMS spam classifier

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_curve, auc
df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# predict_proba returns one column per class; column 1 is the probability of the positive class
predictions = classifier.predict_proba(X_test)
# Compare the true labels with the predicted probabilities at every threshold
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)     # area under the ROC curve
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)   # 'b' draws a blue line
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')   # dashed red diagonal: the performance of random guessing
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
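What the (fall-out, recall) pairs and the AUC mean can be traced by hand on four made-up predictions. This plain-Python sketch is an illustration only, not scikit-learn's roc_curve or auc; the function names, labels and scores are assumptions.

```python
def roc_points(y_true, scores):
    """Return (fall_out, recall) pairs, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule over consecutive (fall_out, recall) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

y_true = [0, 0, 1, 1]            # 1 marks the positive class
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
points = roc_points(y_true, scores)
print(points)
print(area_under_curve(points))  # 0.75
```

Lowering the threshold moves the curve from (0, 0) toward (1, 1); a classifier that ranks every positive above every negative would reach an area of 1.0, while random guessing follows the diagonal with an area of 0.5.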

Tuning models with grid search

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# GridSearchCV and train_test_split moved to sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear is chosen here because it supports both the 'l1' and 'l2' penalties in the grid;
    # the default solver in recent scikit-learn versions rejects 'l1'
    ('clf', LogisticRegression(solver='liblinear'))
])
# Keys follow the pattern <pipeline step>__<parameter name>
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
    X, y = df['message'], df['label']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    # predict uses the best estimator, refitted on the whole training set
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    print('Precision:', precision_score(y_test, predictions))
    print('Recall:', recall_score(y_test, predictions))
# The following is the output of the script:
Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   23.8s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   52.3s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 11.2min
[Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 12.4min finished
Best score: 0.985
Best parameters set:
    clf__C: 10
    clf__penalty: 'l2'
    vect__max_df: 0.25
    vect__max_features: 2500
    vect__ngram_range: (1, 2)
    vect__norm: 'l2'
    vect__stop_words: None
    vect__use_idf: True
Accuracy: 0.98493543759
Precision: 0.983333333333
Recall: 0.907692307692
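The "1536 candidates" and "4608 fits" in the output follow directly from the grid: every combination of the listed values is tried, and each combination is fitted once per fold. A small sketch that enumerates the same grid with itertools (the counting is plain Python and does not call scikit-learn):

```python
from itertools import product

parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

# One candidate per element of the Cartesian product of all value lists
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]
folds = 3  # matches cv=3 in the script

print(len(candidates))          # 1536
print(len(candidates) * folds)  # 4608
print(candidates[0])
```

Because the number of fits is the product of all list lengths times the number of folds, adding one more value to any parameter noticeably increases the running time.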

Multi-class classification

One-vs.-all classification uses one binary classifier for each of the possible classes. The class that is predicted with the greatest confidence is assigned to the instance
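A minimal sketch of the one-vs.-all decision rule, assuming each class already has a trained binary logistic model. The class names, weights and intercepts below are hand-picked, hypothetical values; no training is performed.

```python
import math

def logistic(t):
    """Logistic function: maps a real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def one_vs_all_predict(features, classifiers):
    """classifiers maps each class label to the (weights, intercept) of its binary model."""
    confidences = {}
    for label, (weights, intercept) in classifiers.items():
        t = intercept + sum(w * x for w, x in zip(weights, features))
        # Confidence that the instance belongs to this class rather than to any other
        confidences[label] = logistic(t)
    # Assign the class whose binary classifier is most confident
    best_label = max(confidences, key=confidences.get)
    return best_label, confidences

# Hypothetical binary models, one per class
classifiers = {
    'red': ([2.0, -1.0], 0.0),
    'green': ([-1.0, 2.0], 0.0),
    'blue': ([-1.0, -1.0], 0.5),
}

label, confidences = one_vs_all_predict([0.2, 1.5], classifiers)
print(label)  # green
print(confidences)
```

The per-class confidences do not need to sum to 1, because each one comes from an independent binary classifier; only their ranking decides the prediction.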