From Linear Regression to Logistic Regression

Binary classification with logistic regression

  • probability distribution
  • the response value represents a probability, in the range [0, 1]

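1 . Ordinary linear regression assumes that the response variable is normally distributed (the Gaussian distribution, also called the bell curve)
2 . If the response variable is not normally distributed but instead represents the probability of an event, that assumption does not hold
3 . Generalized linear models use a link function to describe the relationship between the explanatory variables and the response variable
4 . Ordinary linear regression is a special case of the generalized linear model that uses the identity link function: a linear combination of the explanatory variables is related directly to a normally distributed response variable
5 . In logistic regression, if the response variable exceeds a threshold, the prediction is the positive class; otherwise it is the negative class
6 . The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function. The logistic function returns a value between 0 and 1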


7 . For the logistic function, F(t) = 1 / (1 + e^(-t)), t is equal to a linear combination of the explanatory variables
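
The points above can be sketched in a few lines of plain Python. This is only an illustration, not scikit-learn's implementation: the function names, the weights, the intercept and the 0.5 threshold are all assumptions chosen for the example.

```python
import math

def logistic(t):
    """Logistic function: maps any real t to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

def predict_positive(features, weights, intercept, threshold=0.5):
    """Return True if the modeled probability exceeds the threshold.

    t is a linear combination of the explanatory variables;
    weights, intercept and threshold are hypothetical values.
    """
    t = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(t) > threshold

print(logistic(0.0))                            # 0.5, the midpoint of the curve
print(predict_positive([1.0], [2.0], -1.0))     # t = 1, probability about 0.73
print(predict_positive([0.0], [2.0], -1.0))     # t = -1, probability about 0.27
```

Raising the threshold makes the classifier more conservative about predicting the positive class, which is the trade-off the metrics later in these notes measure.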

Spam filtering (SMS spam messages)

1 . explore the data and calculate some basic summary statistics using pandas

import pandas as pd
# The file is tab-separated with no header row: column 0 is the label, column 1 is the message
df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
print('Number of spam messages:', df[df[0] == 'spam'][0].count())
print('Number of ham messages:', df[df[0] == 'ham'][0].count())

2 . create a TfidfVectorizer, then fit it with training messages, and transform both the training and test messages

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection in current scikit-learn releases
from sklearn.model_selection import train_test_split
df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
# By default 25% of the rows go to the test set; each split is a pandas Series
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary and build the tf-idf matrix
X_test = vectorizer.transform(X_test_raw)  # reuse the fitted vocabulary; the result is a SciPy sparse matrix

3 . create an instance of LogisticRegression and train the model

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection in current scikit-learn releases
from sklearn.model_selection import train_test_split
df = pd.read_table('/Users/enniu/Desktop/SMSSpamCollection', delimiter='\t', header=None)
# By default 25% of the rows go to the test set; each split is a pandas Series
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)  # learn the vocabulary and build the tf-idf matrix
X_test = vectorizer.transform(X_test_raw)  # reuse the fitted vocabulary; the result is a SciPy sparse matrix
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
for i, prediction in enumerate(predictions[:5]):
    print('Prediction:%s. Truelabel:%s. Message:%s' % (prediction, y_test.iloc[i], X_test_raw.iloc[i]))
    # iloc (position-based indexing) is required here. X_test_raw[i] would fail because
    # the split keeps the original index labels, which are no longer 0, 1, 2, ...;
    # this matters especially for an integer index

Binary classification performance metrics

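                   Predicted positive   Predicted negative
Actual positive    True Positive        False Negative
Actual negative    False Positive       True Negative

When the code is actually run, the matrix is laid out as follows, with the positive class (1) in the bottom row:

             Predicted 0       Predicted 1
Actual 0     True Negative     False Positive
Actual 1     False Negative    True Positive
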
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
# Use a separate name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)
plt.matshow(cm)  # rows are true labels, columns are predicted labels
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
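
To see where the four cells come from, the same toy labels can be tallied by hand. This is a plain-Python sketch rather than a replacement for scikit-learn's function; the helper name `confusion_counts` is made up for the illustration and it assumes the labels are exactly 0 and 1.

```python
def confusion_counts(y_true, y_pred):
    """Tally a 2x2 matrix: rows are true labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

y_test = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 1, 1, 1]

matrix = confusion_counts(y_test, y_pred)
(tn, fp), (fn, tp) = matrix
print(matrix)  # [[4, 1], [2, 3]]
print('TN=%d FP=%d FN=%d TP=%d' % (tn, fp, fn, tp))
```

With the positive class in the bottom row, the diagonal holds the correct predictions (true negatives and true positives) and the off-diagonal cells hold the two kinds of error.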

Accuracy

  • Accuracy measures the fraction of the classifier's predictions that are correct
from sklearn.metrics import accuracy_score
y_pred = [0, 1, 1, 0]
y_true = [1, 1, 1, 1]
print('Accuracy:', accuracy_score(y_true, y_pred))  # outcome is 0.5
  • evaluate the classifier's accuracy
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# train_test_split and cross_val_score live in sklearn.model_selection in current scikit-learn releases
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# 5-fold cross-validation on the training set; the default score for a classifier is accuracy
scores = cross_val_score(classifier, X_train, y_train, cv=5)
# y_pred = classifier.predict(X_test)
# for i, pred in enumerate(y_pred[:5]):
#     print(pred, y_test.iloc[i], X_test_raw.iloc[i])
print('Accuracy', np.mean(scores), scores)
# Outcome: Accuracy 0.955980861244 [ 0.94976077  0.95933014  0.96052632  0.96291866  0.94736842]
  • Drawbacks
    1 . accuracy can't distinguish between false positive errors and false negative errors
    2 . accuracy is not an informative metric if the proportions of the classes are skewed in the population
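
The second drawback is easy to demonstrate with made-up numbers. In the sketch below, the 5/95 class split and the "always predict ham" classifier are assumptions invented for the illustration, not results from the SMS data set.

```python
# Hypothetical skewed population: 5 spam messages (1) and 95 ham messages (0)
y_true = [1] * 5 + [0] * 95
# A useless classifier that labels every message as ham
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
spam_found = true_positives / sum(y_true)  # fraction of the spam that was caught

print('Accuracy:', accuracy)                  # 0.95
print('Fraction of spam found:', spam_found)  # 0.0
```

The classifier scores 95 percent accuracy while catching no spam at all, which is why precision and recall are examined next.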

Precision and recall

  • definition
  1. precision is the fraction of positive predictions that are correct: P = TP / (TP + FP)

  2. recall is the fraction of truly positive instances that the classifier recognizes: R = TP / (TP + FN)
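
As a quick worked example of the two definitions, the counts below come from the toy confusion matrix earlier in these notes (TP = 3, FP = 1, FN = 2). The function names are illustrative helpers, not scikit-learn APIs.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of truly positive instances that were recognized."""
    return tp / (tp + fn)

tp, fp, fn = 3, 1, 2  # counts from the toy confusion matrix
print('Precision:', precision(tp, fp))  # 3 of 4 positive predictions are right: 0.75
print('Recall:', recall(tp, fn))        # 3 of 5 actual positives are found: 0.6
```

Neither helper guards against a zero denominator, which occurs when the classifier makes no positive predictions (precision) or the data contains no positives (recall).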


  • calculate SMS classifier's precision and recall
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# train_test_split and cross_val_score live in sklearn.model_selection in current scikit-learn releases
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# Note: this line raised an error when the author actually ran it, cause unknown.
# One possible cause is string labels such as 'spam'/'ham': scoring='precision'
# expects the positive class to be labeled 1.
precisions = cross_val_score(classifier, X_train, y_train, cv=5, scoring='precision')
print('Precision', np.mean(precisions), precisions)
recalls = cross_val_score(classifier, X_train, y_train, cv=5, scoring='recall')
print('Recall', np.mean(recalls), recalls)
f1s = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
print('F1:', np.mean(f1s), f1s)
# Outcome:
# Precision 0.989910506899 [ 0.98591549  1.          0.98850575  0.98795181  0.98717949]
# Recall 0.685907046477 [ 0.60344828  0.69565217  0.74782609  0.71304348  0.66956522]
# F1: 0.806840977066 [ 0.84102564  0.81675393  0.8042328   0.79144385  0.78074866]

1 . Precision=0.9899 means almost all of the messages that it predicted as spam were actually spam
2 . Recall=0.686 means it incorrectly classified approximately 31 percent of the spam messages as ham

Calculating the F1 measure
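
The F1 measure is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal plain-Python sketch follows; the precision and recall values are made-up inputs for illustration, and `f1_measure` is a hypothetical helper name rather than a scikit-learn function.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r = 0.75, 0.6  # illustrative values
f1 = f1_measure(p, r)
arithmetic_mean = (p + r) / 2

print('F1: %.4f' % f1)                            # 0.6667
print('Arithmetic mean: %.4f' % arithmetic_mean)  # 0.6750
print('F1 with zero recall:', f1_measure(1.0, 0.0))
```

Because the harmonic mean is pulled toward the smaller of the two values, a classifier cannot reach a high F1 by excelling at precision while neglecting recall, or the reverse.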

ROC AUC

  • unlike accuracy, the ROC curve is insensitive to data sets with unbalanced class proportions
  • ROC curves plot the classifier's recall against its fall-out
  • Fall-out, or the false positive rate, is the number of false positives divided by the total number of negatives: F = FP / (TN + FP)
  • AUC (area under the curve) is the area under the ROC curve; it reduces the curve to a single value that represents the expected performance of the classifier
  • plot the ROC curve for SMS spam
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection in current scikit-learn releases
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df['message'], df['label'])
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict_proba(X_test)  # column 1 holds the probability of the positive class
# Compare y_test with the predicted probabilities at every threshold
false_positive_rate, recall, thresholds = roc_curve(y_test, predictions[:, 1])
roc_auc = auc(false_positive_rate, recall)  # compute the AUC value
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, recall, 'b', label='AUC = %0.2f' % roc_auc)  # 'b' draws a blue line
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')  # diagonal: the performance of random guessing
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.ylabel('Recall')
plt.xlabel('Fall-out')
plt.show()
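
To make the threshold sweep behind the curve concrete, here is a plain-Python sketch that computes the (fall-out, recall) points and the trapezoidal area for four made-up scores. It illustrates the idea only and is not how scikit-learn implements `roc_curve` or `auc`; the helper names and the toy data are assumptions.

```python
def roc_points(y_true, scores):
    """Return (fall_out, recall) pairs, sweeping the threshold from strict to loose."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

y_true = [0, 0, 1, 1]            # toy labels
scores = [0.1, 0.4, 0.35, 0.8]   # toy predicted probabilities of the positive class

points = roc_points(y_true, scores)
roc_auc = auc_trapezoid(points)
print(points)
print('AUC:', roc_auc)  # 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (random guessing), and 1.0 corresponds to a classifier that ranks every positive above every negative.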

Tuning models with grid search

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# GridSearchCV and train_test_split live in sklearn.model_selection in current scikit-learn releases
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    ('clf', LogisticRegression(solver='liblinear'))
])
# Keys use the <step name>__<parameter name> convention to reach into the pipeline
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}
if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('/Users/enniu/Desktop/sms.csv')
    X, y = df['message'], df['label']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    # predict() uses the best estimator, refit on the whole training set
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    print('Precision:', precision_score(y_test, predictions))
    print('Recall:', recall_score(y_test, predictions))
# The following is the output of the script:
Fitting 3 folds for each of 1536 candidates, totalling 4608 fits
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   23.8s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   52.3s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 11.2min
[Parallel(n_jobs=-1)]: Done 4608 out of 4608 | elapsed: 12.4min finished
Best score: 0.985
Best parameters set:
    clf__C: 10
    clf__penalty: 'l2'
    vect__max_df: 0.25
    vect__max_features: 2500
    vect__ngram_range: (1, 2)
    vect__norm: 'l2'
    vect__stop_words: None
    vect__use_idf: True
Accuracy: 0.98493543759
Precision: 0.983333333333
Recall: 0.907692307692
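
The candidate count in the log follows directly from the grid: grid search tries every combination, so the number of candidates is the product of the number of values per parameter. The short check below copies the grid from the script; multiplying by 3 assumes the `cv=3` setting used there.

```python
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

# One candidate per combination of values
n_candidates = 1
for values in parameters.values():
    n_candidates *= len(values)

n_folds = 3  # cv=3 in the script
n_fits = n_candidates * n_folds

print('Candidates:', n_candidates)  # 1536
print('Total fits:', n_fits)        # 4608
```

Each extra value added to any one parameter multiplies the total, which is why the search above took more than twelve minutes even on this small data set.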

Multi-class classification

  • One-vs.-all classification uses one binary classifier for each of the possible classes. The class that is predicted with the greatest confidence is assigned to the instance
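
A rough sketch of the one-vs.-all decision rule in plain Python: one binary logistic model per class scores the instance, and the class with the highest confidence wins. The three class labels, the weights and the intercepts are hypothetical values invented for the example, not trained models.

```python
import math

def logistic(t):
    """Logistic function: maps any real t to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

# One hypothetical binary model per class: (weights, intercept)
binary_models = {
    'negative': ([-2.0, 0.5], 0.0),
    'neutral': ([0.1, 0.1], -0.5),
    'positive': ([2.0, -0.5], 0.0),
}

def one_vs_all_predict(features):
    """Score the instance with every binary model and return the most confident class."""
    confidences = {}
    for label, (weights, intercept) in binary_models.items():
        t = intercept + sum(w * x for w, x in zip(weights, features))
        confidences[label] = logistic(t)
    best_label = max(confidences, key=confidences.get)
    return best_label, confidences

label, confidences = one_vs_all_predict([1.0, 0.0])
print(label)  # positive
print(confidences)
```

The per-class confidences need not sum to 1, because each binary model is fit independently against "all other classes".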