Predicting pulsar star in the universe, by pavanraj159
HTRU2 is a data set that describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey.
Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter.
As pulsars rotate, their emission beam sweeps across the sky, and when it crosses our line of sight it produces a detectable pattern of broadband radio emission. Because pulsars rotate rapidly, this pattern repeats periodically. Pulsar searches therefore look for periodic radio signals with large radio telescopes.
Note on the High Time Resolution Universe survey: an ambitious (~6000 hour) project, begun in November 2008, to survey the entire southern sky for radio pulsars and fast (<1 second) transients.
Reference paper: High Time Resolution Universe (HTRU)
Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. A potential signal detection, known as a 'candidate', is therefore averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional information, each candidate could potentially describe a real pulsar. In practice, however, almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.
Machine learning tools are now being used to automatically label pulsar candidates to facilitate rapid analysis. Classification systems in particular are being widely adopted, which treat the candidate data sets as binary classification problems. Here the legitimate pulsar examples are a minority positive class, and spurious examples the majority negative class.
The data set shared here contains 16,259 spurious examples caused by RFI/noise and 1,639 real pulsar examples. All of these examples have been checked by human annotators.
Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive).
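The two class counts quoted above imply a heavy imbalance, and it helps to put a number on it before reading any accuracy figure. The snippet below is only arithmetic on those two counts; the variable names are illustrative.

```python
# Class counts as stated in the data set description.
n_negative = 16_259  # spurious candidates (RFI / noise)
n_positive = 1_639   # real pulsars

n_total = n_negative + n_positive
positive_share = n_positive / n_total
negatives_per_positive = n_negative / n_positive

# Accuracy of a trivial model that labels every candidate "not a pulsar".
always_negative_accuracy = n_negative / n_total

print(f"total examples         : {n_total}")
print(f"positive share         : {positive_share:.2%}")
print(f"negatives per positive : {negatives_per_positive:.1f}")
print(f"all-negative accuracy  : {always_negative_accuracy:.2%}")
```

A classifier therefore has to beat roughly 91% accuracy just to do better than ignoring the pulsars altogether, which is why precision and recall on the positive class deserve as much attention as accuracy.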
Attribute Information
Each candidate is described by 8 continuous variables and a single class variable. The first four are simple statistics obtained from the integrated pulse profile (folded profile), an array of continuous variables that describes a longitude-resolved version of the signal averaged in both time and frequency. The remaining four variables are obtained in the same way from the DM-SNR curve (dispersion measure versus signal-to-noise ratio). These are summarised below:
- Mean of the integrated profile.
- Standard deviation of the integrated profile.
- Excess kurtosis of the integrated profile.
- Skewness of the integrated profile.
- Mean of the DM-SNR curve.
- Standard deviation of the DM-SNR curve.
- Excess kurtosis of the DM-SNR curve.
- Skewness of the DM-SNR curve.
- Class
HTRU2 summary: 17,898 total examples, of which 1,639 are positive and 16,259 are negative.
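To make the four per-curve statistics concrete, here is a minimal sketch that computes mean, standard deviation, excess kurtosis and skewness for a one-dimensional array. The synthetic profile and the function name are hypothetical, and the use of population moments (ddof=0) is an assumption; the data set description does not say which estimator was used to build HTRU2.

```python
import numpy as np

def profile_statistics(profile):
    """Return mean, standard deviation, excess kurtosis and skewness of a 1-D array."""
    values = np.asarray(profile, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0), an assumption
    centred = values - mean
    skewness = np.mean(centred ** 3) / std ** 3
    # "Excess" kurtosis subtracts 3 so that a normal distribution scores 0.
    excess_kurtosis = np.mean(centred ** 4) / std ** 4 - 3.0
    return mean, std, excess_kurtosis, skewness

# Hypothetical folded profile: a flat noise floor plus one narrow Gaussian pulse.
rng = np.random.default_rng(0)
phase = np.linspace(0.0, 1.0, 128)
profile = 0.05 * rng.standard_normal(128) + np.exp(-0.5 * ((phase - 0.5) / 0.02) ** 2)

mean, std, excess_kurtosis, skewness = profile_statistics(profile)
print(f"mean={mean:.3f} std={std:.3f} "
      f"excess_kurtosis={excess_kurtosis:.3f} skewness={skewness:.3f}")
```

A profile dominated by a single narrow pulse is strongly right-skewed and heavy-tailed, which gives some intuition for why kurtosis and skewness of the integrated profile can help separate classes.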
Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import itertools
warnings.filterwarnings("ignore")
%matplotlib inline
from PIL import Image
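1.seaborn: Seaborn is a statistical plotting library built on matplotlib and closely integrated with pandas data structures. It aims to make visualization central to exploring and understanding data.
seaborn 0.9 Chinese documentation
2.warnings: messages often appear during Python development, but a warning usually does not stop the program from running. The following statements suppress warning output.
import warnings
warnings.filterwarnings("ignore")
3.itertools: Python's standard-library module of iterator functions.
itertools Chinese documentation
Liao Xuefeng's itertools tutorial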
data = pd.read_csv(r"../input/predicting-a-pulsar-star/pulsar_stars.csv")
data.head()
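The plotting code further down refers to short column names such as mean_profile and target_class, but this write-up never shows where they come from. One possible way to obtain them, assuming the CSV holds the eight features in the order listed under Attribute Information followed by the class label, is to rename the columns by position. The helper name and the stand-in frame below are hypothetical.

```python
import pandas as pd

SHORT_NAMES = [
    "mean_profile", "std_profile", "kurtosis_profile", "skewness_profile",
    "mean_dmsnr_curve", "std_dmsnr_curve", "kurtosis_dmsnr_curve",
    "skewness_dmsnr_curve", "target_class",
]

def with_short_names(frame):
    """Return a copy of `frame` whose nine columns carry the short names."""
    if frame.shape[1] != len(SHORT_NAMES):
        raise ValueError(
            f"expected {len(SHORT_NAMES)} columns, got {frame.shape[1]}"
        )
    renamed = frame.copy()
    renamed.columns = SHORT_NAMES
    return renamed

# Stand-in for the real CSV: one made-up row with placeholder headers.
raw = pd.DataFrame([[0.0] * 8 + [1]], columns=[f"column_{n}" for n in range(9)])
data_demo = with_short_names(raw)
print(list(data_demo.columns))
```

With the real file this would be applied as data = with_short_names(data) right after reading the CSV. It is worth checking the original headers first, because a positional rename silently mislabels columns if the order differs.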
Data dimensions
print ("Number of rows :",data.shape[0])
print ("Number of columns :",data.shape[1])
Data Information
# data.info() prints its report itself and returns None, so it is not wrapped in print()
data.info()
Missing values
print (data.isnull().sum())
Data summary
plt.figure(figsize=(12,8))
sns.heatmap(data.describe()[1:].transpose(),  # rectangular data set: summary statistics per column
            annot=True,                       # if True, write the data value in each heatmap cell
            linecolor="w",
            linewidths=2,cmap=sns.color_palette("Set2"))
plt.title("Data summary")
plt.show()
1.seaborn.heatmap: plots rectangular data as a color-encoded matrix.
seaborn.heatmap Chinese documentation
CORRELATION BETWEEN VARIABLES
correlation = data.corr()
plt.figure(figsize=(10,8))
# heatmap draws cell borders through its linewidths and linecolor parameters
sns.heatmap(correlation,annot=True,
            cmap=sns.color_palette("magma"),
            linewidths=2,linecolor="k")
plt.title("CORRELATION BETWEEN VARIABLES")
plt.show()
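A heatmap shows every pairwise correlation at once. When the question is simply which features move with the label, the target column of the correlation matrix can be sorted instead. The sketch below uses synthetic data with made-up column names, not HTRU2, so its numbers say nothing about pulsars.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
label = rng.integers(0, 2, size=500)

demo = pd.DataFrame({
    "informative": label + 0.3 * rng.standard_normal(500),  # tracks the label closely
    "weak": 0.2 * label + rng.standard_normal(500),         # tracks it loosely
    "noise": rng.standard_normal(500),                      # unrelated to the label
    "target_class": label,
})

# Absolute Pearson correlation of each feature with the label, strongest first.
ranking = (
    demo.corr()["target_class"]
    .drop("target_class")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

The same three chained calls would work on the correlation frame computed above. Keep in mind that Pearson correlation only captures linear association, so a low value does not prove a feature is useless.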
Proportion of target variable in dataset
plt.figure(figsize=(12,6))
plt.subplot(121)
ax = sns.countplot(y = data["target_class"],
                   palette=["r","g"],
                   linewidth=1,
                   edgecolor="k")
# write each class count next to its bar
for i,j in enumerate(data["target_class"].value_counts().values):
    ax.text(.7,i,j,weight = "bold",fontsize = 27)
plt.title("Count for target variable in dataset")
plt.subplot(122)
plt.pie(data["target_class"].value_counts().values,
labels=["not pulsar stars","pulsar stars"],
autopct="%1.0f%%",wedgeprops={"linewidth":2,"edgecolor":"white"})
my_circ = plt.Circle((0,0),.7,color = "white")
plt.gca().add_artist(my_circ)
plt.subplots_adjust(wspace = .2)
plt.title("Proportion of target variable in dataset")
plt.show()
COMPARING MEAN & STANDARD DEVIATION BETWEEN ATTRIBUTES FOR TARGET CLASSES
compare = data.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve','skewness_dmsnr_curve']].mean().reset_index()
compare = compare.drop("target_class",axis =1)
compare.plot(kind="bar",width=.6,figsize=(13,6),colormap="Set2")
plt.grid(True,alpha=.3)
plt.title("COMPARING MEAN OF ATTRIBUTES FOR TARGET CLASSES")
compare1 = data.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve','skewness_dmsnr_curve']].std().reset_index()
compare1 = compare1.drop("target_class",axis=1)
compare1.plot(kind="bar",width=.6,figsize=(13,6),colormap="Set2")
plt.grid(True,alpha=.3)
plt.title("COMPARING STANDARD DEVIATION OF ATTRIBUTES FOR TARGET CLASSES")
plt.show()
compare_mean = compare.transpose().reset_index()
compare_mean = compare_mean.rename(columns={'index':"features", 0:"not_star", 1:"star"})
plt.figure(figsize=(13,14))
plt.subplot(211)
sns.pointplot(x= "features",y="not_star",data=compare_mean,color="r")
sns.pointplot(x= "features",y="star",data=compare_mean,color="g")
plt.xticks(rotation =60)
plt.xlabel("")
plt.grid(True,alpha=.3)
plt.title("COMPARING MEAN OF ATTRIBUTES FOR TARGET CLASSES")
compare_std = compare1.transpose().reset_index()
compare_std = compare_std.rename(columns={'index':"features", 0:"not_star", 1:"star"})
plt.subplot(212)
sns.pointplot(x= "features",y="not_star",data=compare_std,color="r")
sns.pointplot(x= "features",y="star",data=compare_std,color="g")
plt.xticks(rotation =60)
plt.grid(True,alpha=.3)
plt.title("COMPARING STANDARD DEVIATION OF ATTRIBUTES FOR TARGET CLASSES")
plt.subplots_adjust(hspace =.4)
print ("[GREEN == STAR , RED == NOTSTAR]")
plt.show()
compare_mean
plt.figure(figsize=(10,10))
plt.subplot(211)
sns.barplot(y="features",x="not_star",
data=compare_mean,color="r")
sns.barplot(y="features",x="star",
data=compare_mean,color="g")
plt.title("COMPARING MEAN OF ATTRIBUTES FOR TARGET CLASSES")
plt.subplot(212)
sns.barplot(y="features",x="star",
data=compare_std,color="g")
sns.barplot(y="features",x="not_star",
data=compare_std,color="r")
plt.title("COMPARING STANDARD DEVIATION OF ATTRIBUTES FOR TARGET CLASSES")
plt.subplots_adjust(wspace =.5)
DISTRIBUTION OF VARIABLES IN DATA SET
columns = ['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
           'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve',
           'skewness_dmsnr_curve']
length = len(columns)
colors = ["r","g","b","m","y","c","k","orange"]
plt.figure(figsize=(13,20))
for i,j,k in itertools.zip_longest(columns,range(length),colors):
    # subplot() needs integer grid sizes, hence floor division
    plt.subplot(length//2,length//4,j+1)
    # distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the closest replacement
    sns.distplot(data[i],color=k)
    plt.title(i)
    plt.subplots_adjust(hspace = .3)
    plt.axvline(data[i].mean(),color = "k",linestyle="dashed",label="MEAN")
    plt.axvline(data[i].std(),color = "b",linestyle="dotted",label="STANDARD DEVIATION")
    plt.legend(loc="upper right")
print ("***************************************")
print ("DISTRIBUTION OF VARIABLES IN DATA SET")
print ("***************************************")
1.Visualizing the distribution of a data set: the quickest way to look at a univariate distribution in seaborn is the distplot() function.
2.seaborn.distplot Chinese documentation
PAIR PLOT BETWEEN ALL VARIABLES
sns.pairplot(data,hue="target_class")
plt.title("pair plot for variables")
plt.show()
Scatter plot between variables for target classes
plt.figure(figsize=(14,7))
plt.subplot(121)
plt.scatter(x = "kurtosis_profile",y = "skewness_profile",
data=data[data["target_class"] == 1],alpha=.7,
label="pulsar stars",s=30,color = "g",linewidths=.4,edgecolors="black")
plt.scatter(x = "kurtosis_profile",y = "skewness_profile",
data=data[data["target_class"] == 0],alpha=.6,
label="not pulsar stars",s=30,color ="r",linewidths=.4,edgecolors="black")
plt.axvline(data[data["target_class"] == 1]["kurtosis_profile"].mean(),
color = "g",linestyle="dashed",label="mean pulsar star")
plt.axvline(data[data["target_class"] == 0]["kurtosis_profile"].mean(),
color = "r",linestyle="dashed",label ="mean non pulsar star")
plt.axhline(data[data["target_class"] == 1]["skewness_profile"].mean(),
color = "g",linestyle="dashed")
plt.axhline(data[data["target_class"] == 0]["skewness_profile"].mean(),
color = "r",linestyle="dashed")
plt.legend(loc ="best")
plt.xlabel("kurtosis profile")
plt.ylabel("skewness profile")
plt.title("Scatter plot for skewness and kurtosis for target classes")
plt.subplot(122)
plt.scatter(x = "skewness_dmsnr_curve",y = 'kurtosis_dmsnr_curve',
data=data[data["target_class"] == 0],alpha=.7,
label="not pulsar stars",s=30,color ="r",linewidths=.4,edgecolors="black")
plt.scatter(x = "skewness_dmsnr_curve",y = 'kurtosis_dmsnr_curve',
data=data[data["target_class"] == 1],alpha=.7,
label="pulsar stars",s=30,color = "g",linewidths=.4,edgecolors="black")
# vertical lines mark the x-axis means (skewness), horizontal lines the y-axis means (kurtosis)
plt.axvline(data[data["target_class"] == 1]["skewness_dmsnr_curve"].mean(),
            color = "g",linestyle="dashed",label ="mean pulsar star")
plt.axvline(data[data["target_class"] == 0]["skewness_dmsnr_curve"].mean(),
            color = "r",linestyle="dashed",label ="mean non pulsar star")
plt.axhline(data[data["target_class"] == 1]["kurtosis_dmsnr_curve"].mean(),
            color = "g",linestyle="dashed")
plt.axhline(data[data["target_class"] == 0]["kurtosis_dmsnr_curve"].mean(),
            color = "r",linestyle="dashed")
plt.legend(loc ="best")
plt.xlabel("skewness_dmsnr_curve")
plt.ylabel('kurtosis_dmsnr_curve')
plt.title("Scatter plot for skewness and kurtosis of dmsnr_curve for target classes")
plt.subplots_adjust(wspace =.4)
BOXPLOT FOR VARIABLES IN DATA SET WITH TARGET CLASS
columns = [x for x in data.columns if x not in ["target_class"]]
length = len(columns)
plt.figure(figsize=(13,20))
for i,j in itertools.zip_longest(columns,range(length)):
    plt.subplot(4,2,j+1)
    # sns.lvplot was renamed to sns.boxenplot in seaborn 0.9.0
    sns.boxenplot(x=data["target_class"],y=data[i],palette=["orangered","lime"])
    plt.title(i)
    plt.subplots_adjust(hspace=.3)
    plt.axhline(data[i].mean(),linestyle = "dashed",color ="k",label ="Mean value for data")
    plt.legend(loc="best")
print ("****************************************************")
print ("BOXPLOT FOR VARIABLES IN DATA SET WITH TARGET CLASS")
print ("****************************************************")
1.lvplot (enhanced box plot): first proposed by Heike Hofmann in 2011 under the name letter-value plots. In seaborn 0.9.0 the corresponding function was renamed seaborn.boxenplot().
2.seaborn.boxenplot English documentation
Area plot for attributes of pulsar stars vs non pulsar stars
st = data[data["target_class"] == 1].reset_index()
nst= data[data["target_class"] == 0].reset_index()
new = pd.concat([nst,st]).reset_index()
plt.figure(figsize=(13,10))
plt.stackplot(new.index,new["mean_profile"],
alpha =.5,color="b",labels=["mean_profile"])
plt.stackplot(new.index,new["std_profile"],
alpha=.5,color="r",labels=["std_profile"])
plt.stackplot(new.index,new["skewness_profile"],
alpha=.5,color ="g",labels=["skewness_profile"])
plt.stackplot(new.index,new["kurtosis_profile"],
alpha=.5,color = "m",labels=["kurtosis_profile"])
plt.axvline(x=16259,color = "black",linestyle="dashed",
label = "separating pulsars vs non pulsars")
plt.axhline(new["mean_profile"].mean(),color = "b",
linestyle="dashed",label = "average mean profile")
plt.axhline(new["std_profile"].mean(),color = "r",
linestyle="dashed",label = "average std profile")
plt.axhline(new["skewness_profile"].mean(),color = "g",
linestyle="dashed",label = "average skewness profile")
plt.axhline(new["kurtosis_profile"].mean(),color = "m",
linestyle="dashed",label = "average kurtosis profile")
plt.legend(loc="best")
plt.title("Area plot for attributes for pulsar stars vs non pulsar stars")
plt.show()
Area plot for dmsnr_curve attributes of pulsar stars vs non pulsar stars
plt.figure(figsize=(13,10))
plt.stackplot(new.index,new["mean_dmsnr_curve"],
color="b",alpha=.5,labels=["mean_dmsnr_curve"])
plt.stackplot(new.index,new["std_dmsnr_curve"],
color="r",alpha=.5,labels=["std_dmsnr_curve"])
plt.stackplot(new.index,new["skewness_dmsnr_curve"],color="g",
alpha=.5,labels=["skewness_dmsnr_curve"])
plt.stackplot(new.index,new["kurtosis_dmsnr_curve"],color="m",
alpha=.5,labels=["kurtosis_dmsnr_curve"])
plt.axvline(x=16259,color = "black",linestyle="dashed",
label = "separating pulsars vs non pulsars")
plt.axhline(new["mean_dmsnr_curve"].mean(),color = "b",linestyle="dashed",
label = "average mean dmsnr_curve")
plt.axhline(new["std_dmsnr_curve"].mean(),color = "r",
linestyle="dashed",label = "average std dmsnr_curve")
plt.axhline(new["skewness_dmsnr_curve"].mean(),color = "g",
linestyle="dashed",label = "average skewness dmsnr_curve")
plt.axhline(new["kurtosis_dmsnr_curve"].mean(),color = "m",
linestyle="dashed",label = "average kurtosis dmsnr_curve")
plt.legend(loc="best")
plt.title("Area plot for dmsnr_curve attributes for pulsar stars vs non pulsar stars")
plt.show()
3D PLOT FOR MEAN_PROFILE VS STD_PROFILE VS SKEWNESS_DMSNR_CURVE
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(13,13))
ax = fig.add_subplot(111,projection = "3d")
# split once by class and pass plain columns (Series) to scatter
pulsars = data[data["target_class"] == 1]
non_pulsars = data[data["target_class"] == 0]
ax.scatter(pulsars["mean_profile"],pulsars["std_profile"],pulsars["skewness_dmsnr_curve"],
           alpha=.5,s=80,linewidth=2,edgecolor="k",color="lime",label="Pulsar star")
ax.scatter(non_pulsars["mean_profile"],non_pulsars["std_profile"],non_pulsars["skewness_dmsnr_curve"],
           alpha=.5,s=80,linewidth=2,edgecolor="k",color="r",label="Not pulsar star")
ax.set_xlabel("mean_profile",fontsize=15)
ax.set_ylabel("std_profile",fontsize=15)
ax.set_zlabel("skewness_dmsnr_curve",fontsize=15)
plt.legend(loc="best")
fig.set_facecolor("w")
plt.title("3D PLOT FOR MEAN_PROFILE VS STD_PROFILE VS SKEWNESS_DMSNR_CURVE",fontsize=10)
plt.show()
DENSITY PLOT BETWEEN MEAN_PROFILE & STD_PROFILE
# recent seaborn releases take x and y as keyword arguments; scale is not a jointplot parameter, so it is dropped
sns.jointplot(x="mean_profile",y="std_profile",data=data,kind="kde")
plt.show()
Bubble plot between mean,std for skewness and kurtosis
plt.figure(figsize=(13,7))
plt.subplot(121)
plt.scatter(st["mean_profile"],st["std_profile"],alpha=.5,
s=st["skewness_profile"]*3,linewidths=1,color="g",label="pulsar_star")
plt.scatter(nst["mean_profile"],nst["std_profile"],alpha=.5,
            s=nst["skewness_profile"]*3,linewidths=1,color="r",label="not_pulsar_star")
plt.legend(loc="best")
plt.xlabel("mean_profile")
plt.ylabel("std_profile")
plt.title("Bubble plot for mean,std and skewness")
plt.subplot(122)
plt.scatter(st["mean_profile"],st["std_profile"],alpha=.5,
s=st["kurtosis_profile"]*5,linewidths=1,color="g",label="pulsar_star")
plt.scatter(nst["mean_profile"],nst["std_profile"],alpha=.5,
            s=nst["kurtosis_profile"]*5,linewidths=1,color="r",label="not_pulsar_star")
plt.legend(loc="best")
plt.xlabel("mean_profile")
plt.ylabel("std_profile")
plt.title("Bubble plot for mean,std and kurtosis")
plt.show()
Bubble plot between mean_dmsnr_curve,std_dmsnr_curve for skewness_dmsnr_curve and kurtosis_dmsnr_curve
plt.figure(figsize=(13,7))
plt.subplot(121)
plt.scatter(st["mean_dmsnr_curve"],st["std_dmsnr_curve"],
alpha=.5,s=st["skewness_dmsnr_curve"],linewidths=1,color="g",label="pulsar_star")
plt.scatter(nst["mean_dmsnr_curve"],nst["std_dmsnr_curve"],
            alpha=.5,s=nst["skewness_dmsnr_curve"],linewidths=1,color="r",label="not_pulsar_star")
plt.legend(loc="best")
plt.xlabel("mean_dmsnr_curve")
plt.ylabel("std_dmsnr_curve")
plt.title("Bubble plot for mean,std and skewness of dmsnr_curve")
plt.subplot(122)
plt.scatter(st["mean_dmsnr_curve"],st["std_dmsnr_curve"],
alpha=.5,s=st["kurtosis_dmsnr_curve"],linewidths=1,color="g",label="pulsar_star")
plt.scatter(nst["mean_dmsnr_curve"],nst["std_dmsnr_curve"],
            alpha=.5,s=nst["kurtosis_dmsnr_curve"],linewidths=1,color="r",label="not_pulsar_star")
plt.legend(loc="best")
plt.xlabel("mean_dmsnr_curve")
plt.ylabel("std_dmsnr_curve")
plt.title("Bubble plot for mean,std and kurtosis of dmsnr_curve")
plt.show()
Visualizing the distribution of variables for target class
columns = [x for x in data.columns if x not in ["target_class"]]
length = len(columns)
plt.figure(figsize=(13,25))
for i,j in itertools.zip_longest(columns,range(length)):
    # subplot() needs integer grid sizes, hence floor division
    plt.subplot(length//2,length//4,j+1)
    sns.violinplot(x=data["target_class"],y=data[i],
                   palette=["Orangered","lime"],alpha=.5)
    plt.title(i)
1.seaborn.violinplot: a violin plot combines a box plot with a kernel density estimate.
seaborn.violinplot Chinese documentation
Parallel coordinates plot to compare features between variables
# the old pandas.tools.plotting module no longer exists; the function lives in pandas.plotting
from pandas.plotting import parallel_coordinates
plt.figure(figsize=(14,8))
parallel_coordinates(data,"target_class",alpha=.5)
plt.show()
1.Parallel coordinates: a common visualization technique for high-dimensional geometry and multivariate data.
pandas parallel_coordinates official documentation (English)
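Parallel coordinates share one vertical axis, so a feature with a large numeric range can flatten all the others. A common remedy, sketched here as an optional extra step rather than something the original notebook does, is to rescale each feature to [0, 1] first. The helper name and the toy frame are hypothetical.

```python
import pandas as pd

def min_max_scale(frame, label_column):
    """Rescale every feature column to [0, 1]; leave the label column untouched."""
    features = frame.drop(columns=[label_column])
    span = features.max() - features.min()
    span = span.replace(0, 1)  # constant columns would otherwise divide by zero
    scaled = (features - features.min()) / span
    scaled[label_column] = frame[label_column]
    return scaled

demo = pd.DataFrame({
    "small_range": [0.1, 0.2, 0.3, 0.4],
    "large_range": [10.0, 500.0, 250.0, 1000.0],
    "target_class": [0, 0, 1, 1],
})
scaled_demo = min_max_scale(demo, "target_class")
print(scaled_demo)
```

The scaled frame can then be passed to parallel_coordinates in place of the raw one, with "target_class" still given as the class column.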
Proportion of target class in train & test data
from sklearn.model_selection import train_test_split
train , test = train_test_split(data,test_size = .3,random_state = 123)
# Feature and label sets used by the model calls below. Their definition is not
# shown in this write-up, so this split on target_class is an assumption.
train_X = train.drop("target_class",axis=1)
train_Y = train["target_class"]
test_X = test.drop("target_class",axis=1)
test_Y = test["target_class"]
plt.figure(figsize=(12,6))
plt.subplot(121)
train["target_class"].value_counts().plot.pie(labels = ["not star","star"],
                                              autopct = "%1.0f%%",
                                              shadow = True,explode=[0,.1])
plt.title("proportion of target class in train data")
plt.ylabel("")
plt.subplot(122)
test["target_class"].value_counts().plot.pie(labels = ["not star","star"],
                                             autopct = "%1.0f%%",
                                             shadow = True,explode=[0,.1])
plt.title("proportion of target class in test data")
plt.ylabel("")
plt.show()
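With only about 9% positives, a purely random split can leave the train and test parts with slightly different class proportions. scikit-learn's train_test_split accepts a stratify argument that keeps the ratio the same in both. The sketch below demonstrates it on a synthetic frame; the original notebook does not stratify, so this is a suggested variation and not the author's method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(123)
demo = pd.DataFrame({
    "feature": rng.standard_normal(1000),
    "target_class": [1] * 90 + [0] * 910,  # roughly the 9% positive share of HTRU2
})

train_demo, test_demo = train_test_split(
    demo,
    test_size=0.3,
    random_state=123,
    stratify=demo["target_class"],  # keep the class ratio equal in both parts
)

print("train positive share:", train_demo["target_class"].mean())
print("test positive share :", test_demo["target_class"].mean())
```

Stratifying matters most for the minority class, since a handful of pulsars more or fewer in the test part can noticeably shift recall.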
MODEL
# MODEL FUNCTION: fit a classifier, then print and plot its results on the test set
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,roc_curve,auc
def model(algorithm,dtrain_x,dtrain_y,dtest_x,dtest_y,of_type):
    print ("*****************************************************************************************")
    print ("MODEL - OUTPUT")
    print ("*****************************************************************************************")
    algorithm.fit(dtrain_x,dtrain_y)
    predictions = algorithm.predict(dtest_x)
    print (algorithm)
    print ("\naccuracy_score :",accuracy_score(dtest_y,predictions))
    print ("\nclassification report :\n",(classification_report(dtest_y,predictions)))
    plt.figure(figsize=(13,10))
    plt.subplot(221)
    sns.heatmap(confusion_matrix(dtest_y,predictions),annot=True,fmt = "d",linecolor="k",linewidths=3)
    plt.title("CONFUSION MATRIX",fontsize=20)
    # probability of the positive class (column 1) drives the ROC curve
    predicting_probabilites = algorithm.predict_proba(dtest_x)[:,1]
    fpr,tpr,thresholds = roc_curve(dtest_y,predicting_probabilites)
    plt.subplot(222)
    # the legend label must be a string, so format the AUC value into it
    plt.plot(fpr,tpr,label = "Area under the curve : %.4f" % auc(fpr,tpr),color = "r")
    plt.plot([1,0],[1,0],linestyle = "dashed",color ="k")
    plt.legend(loc = "best")
    plt.title("ROC - CURVE & AREA UNDER CURVE",fontsize=20)
    if of_type == "feat":
        # tree-based models expose feature_importances_
        dataframe = pd.DataFrame(algorithm.feature_importances_,dtrain_x.columns).reset_index()
        dataframe = dataframe.rename(columns={"index":"features",0:"coefficients"})
        dataframe = dataframe.sort_values(by="coefficients",ascending = False)
        plt.subplot(223)
        ax = sns.barplot(x = "coefficients" ,y ="features",data=dataframe,palette="husl")
        plt.title("FEATURE IMPORTANCES",fontsize =20)
        for i,j in enumerate(dataframe["coefficients"]):
            ax.text(.011,i,j,weight = "bold")
    elif of_type == "coef" :
        # linear models expose coef_
        dataframe = pd.DataFrame(algorithm.coef_.ravel(),dtrain_x.columns).reset_index()
        dataframe = dataframe.rename(columns={"index":"features",0:"coefficients"})
        dataframe = dataframe.sort_values(by="coefficients",ascending = False)
        plt.subplot(223)
        ax = sns.barplot(x = "coefficients" ,y ="features",data=dataframe,palette="husl")
        plt.title("FEATURE IMPORTANCES",fontsize =20)
        for i,j in enumerate(dataframe["coefficients"]):
            ax.text(.011,i,j,weight = "bold")
    elif of_type == "none" :
        # models with neither attribute: nothing more to plot
        return (algorithm)
1.sklearn.metrics: evaluation metrics are the quantitative measures used to check how well a machine learning model performs, an unavoidable and very important question.
Model evaluation: quantifying the quality of predictions, 3.3.1.2. Defining your scoring strategy from metric functions
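The numbers printed by accuracy_score and classification_report all derive from the four cells of the confusion matrix. The sketch below recomputes them by hand on ten made-up labels, to show how accuracy can look fine while precision and recall on the pulsar class are mediocre. The function name and the labels are illustrative.

```python
def binary_metrics(true_labels, predicted_labels):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # pulsars found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # pulsars missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # noise rejected

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up example: 10 candidates, 2 real pulsars. The model finds one of them
# and raises one false alarm.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.8 while precision and recall are both only 0.5, the same pattern to watch for in the classification reports of the models below.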
RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rf =RandomForestClassifier()
model(rf,train_X,train_Y,test_X,test_Y,"feat")
1.RandomForest: a random forest is a classifier that trains many trees on the samples and combines their predictions.
Ensemble methods, 1.11.2. Forests of randomized trees
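The idea of "many trees, one prediction" can be sketched by hand: train each tree on a bootstrap sample and let the trees vote. This is a simplified illustration on synthetic data, not a reimplementation of RandomForestClassifier, which averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic two-class data: the label depends on the sum of two features.
X = rng.standard_normal((300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trees = []
for seed in range(15):
    # Bootstrap sample: draw row indices with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # max_features=1 makes each split consider one randomly chosen feature.
    tree = DecisionTreeClassifier(max_features=1, random_state=seed)
    trees.append(tree.fit(X[rows], y[rows]))

votes = np.stack([tree.predict(X) for tree in trees])       # shape (n_trees, n_samples)
forest_prediction = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote over the trees

print("training accuracy of the vote:", (forest_prediction == y).mean())
```

An odd number of trees avoids ties in the vote. Accuracy on the training rows is optimistic by construction, so a held-out set is still needed for an honest estimate.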
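DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
model(dt,train_X,train_Y,test_X,test_Y,"feat")
1.DecisionTree: a decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class.
Decision Trees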
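The node, branch and leaf structure described in the note can be written out directly as nested dictionaries. The tree below is hand-made for illustration: the feature names come from this data set, but the thresholds are invented and not learned from HTRU2.

```python
# Hypothetical hand-written tree; the thresholds are made up, not fitted.
tree = {
    "feature": "kurtosis_profile", "threshold": 1.0,   # internal node: test on an attribute
    "below": {"label": 0},                             # leaf: class 0 (not a pulsar)
    "above": {
        "feature": "mean_profile", "threshold": 100.0,
        "below": {"label": 1},                         # leaf: class 1 (pulsar)
        "above": {"label": 0},
    },
}

def classify(node, candidate):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while "label" not in node:
        goes_below = candidate[node["feature"]] <= node["threshold"]
        node = node["below"] if goes_below else node["above"]
    return node["label"]

first = classify(tree, {"kurtosis_profile": 3.2, "mean_profile": 40.0})
second = classify(tree, {"kurtosis_profile": 0.1, "mean_profile": 120.0})
print(first, second)
```

DecisionTreeClassifier learns the features and thresholds from the training data, but prediction follows the same root-to-leaf walk.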
Extra Tree Classifier
from sklearn.tree import ExtraTreeClassifier
etc = ExtraTreeClassifier()
model(etc,train_X,train_Y,test_X,test_Y,"feat")
1.Extra Tree:
Extremely Randomized Trees (Extra-Trees)
OpenCV 2.4.9 source code analysis: Extremely randomized trees
GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
model(gbc,train_X,train_Y,test_X,test_Y,"feat")
1.GradientBoosting:
Understanding the Gradient Boosting algorithm from scratch
Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
model(nb,train_X,train_Y,test_X,test_Y,"none")
1.Gaussian Naive Bayes:
Naive Bayes
K-Nearest Neighbour Classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
model(knn,train_X,train_Y,test_X,test_Y,"none")
1.K-Nearest Neighbour:
Nearest Neighbors
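For intuition, nearest-neighbour classification fits in a few lines: measure the distance from the query to every training point, keep the k closest, and take the majority label. This is a plain-Python sketch on six made-up points with k=3 chosen for the example; KNeighborsClassifier uses faster search structures and defaults to 5 neighbours.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two small made-up clusters: class 0 near the origin, class 1 near (1, 1).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(points, labels, (0.15, 0.15))
near_one = knn_predict(points, labels, (0.95, 1.0))
print(near_origin, near_one)
```

Because the method relies on raw distances, features with large numeric ranges dominate the result, so scaling the inputs first is usually worth trying.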
AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
model(ada,train_X,train_Y,test_X,test_Y,"feat")
1.AdaBoost:
AdaBoost beginner tutorial: an accessible introduction to the principles