Predicting Pulsar Stars in the Universe with Machine Learning (translated from the English original)

Predicting pulsar star in the universe, by pavanraj159 (link to the original)

Original:
HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey.

Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the inter-stellar medium, and states of matter.

As pulsars rotate, their emission beam sweeps across the sky, and when this crosses our line of sight, produces a detectable pattern of broadband radio emission. As pulsars rotate rapidly, this pattern repeats periodically. Thus pulsar search involves looking for periodic radio signals with large radio telescopes.

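Translation:
HTRU2 is a data set that records a sample of pulsar candidates collected during the High Time Resolution Universe survey (note 1).

Pulsars are a rare kind of neutron star whose radio emission can be detected on Earth. They are of considerable scientific value as probes of space-time, the interstellar medium, and states of matter.

As a pulsar rotates, its emission beam sweeps across the sky; each time the beam crosses our line of sight, it produces a detectable pattern of broadband radio emission. Because pulsars rotate rapidly, this pattern repeats periodically. Searching for pulsars therefore means looking for periodic radio signals with large radio telescopes.

1. High Time Resolution Universe survey: an ambitious project (~6000 hours), begun in November 2008, to survey the entire southern sky for radio pulsars and fast (< 1 s) transients.
Reference paper for the High Time Resolution Universe (HTRU) survey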

Original:
Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. Thus a potential signal detection, known as a 'candidate', is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional info, each candidate could potentially describe a real pulsar. However, in practice almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.

Machine learning tools are now being used to automatically label pulsar candidates to facilitate rapid analysis. Classification systems in particular are being widely adopted, which treat the candidate data sets as binary classification problems. Here the legitimate pulsar examples are a minority positive class, and spurious examples the majority negative class.

The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. These examples have all been checked by human annotators.

Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive).

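Translation:
Each pulsar produces a slightly different emission pattern, which also varies slightly from one rotation to the next. A potential signal detection, known as a "candidate", is therefore averaged over many rotations of the pulsar, the number being set by the length of the observation. Without additional information, any candidate could describe a real pulsar. In practice, however, almost all detections are caused by radio frequency interference (RFI) and noise, which makes genuine signals hard to find.

Machine learning tools are now used to label pulsar candidates automatically so that they can be analysed quickly. Classification systems in particular are widely adopted; they treat the candidate data sets as binary classification problems. Here the genuine pulsar examples form the minority positive class and the spurious examples form the majority negative class.

The data set shared here contains 16,259 spurious examples caused by RFI or noise and 1,639 real pulsar examples. All of these examples have been checked by human annotators.

Each row lists the variables first, with the class label as the final entry. The labels used are 0 (negative) and 1 (positive).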
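
The class counts quoted above already show how imbalanced the problem is. The following is a minimal sketch in plain Python that uses only those two counts; the variable names are illustrative and do not come from the original notebook.

```python
# Class counts quoted in the dataset description.
n_negative = 16259  # spurious candidates (RFI / noise), label 0
n_positive = 1639   # real pulsars, label 1

n_total = n_negative + n_positive
positive_fraction = n_positive / n_total

# A classifier that always answers "not a pulsar" is correct on every
# negative example, so its accuracy equals the negative fraction.
baseline_accuracy = n_negative / n_total

print(f"total examples    : {n_total}")
print(f"positive fraction : {positive_fraction:.4f}")
print(f"baseline accuracy : {baseline_accuracy:.4f}")
```

Because the trivial baseline is already correct on roughly nine examples out of ten, accuracy alone says little here; precision and recall on the positive class are likely to be more informative.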

Original:

Attribute Information

Each candidate is described by 8 continuous variables, and a single class variable. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are similarly obtained from the DM-SNR curve. These are summarised below:

Mean of the integrated profile.
Standard deviation of the integrated profile.
Excess kurtosis of the integrated profile.
Skewness of the integrated profile.
Mean of the DM-SNR curve.
Standard deviation of the DM-SNR curve.
Excess kurtosis of the DM-SNR curve.
Skewness of the DM-SNR curve.
Class

HTRU 2 Summary 17,898 total examples. 1,639 positive examples. 16,259 negative examples.

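Translation:

Attribute information

Each candidate is described by 8 continuous variables and 1 class variable. The first four are simple statistics computed from the integrated pulse profile (folded profile), an array of continuous values describing a longitude-resolved (note 1) version of the signal that has been averaged in both time and frequency. The remaining four are obtained in the same way from the DM-SNR (dispersion measure versus signal-to-noise ratio) curve. They are summarised below:

Mean of the integrated profile.
Standard deviation of the integrated profile.
Excess kurtosis of the integrated profile.
Skewness of the integrated profile.
Mean of the DM-SNR curve.
Standard deviation of the DM-SNR curve.
Excess kurtosis of the DM-SNR curve.
Skewness of the DM-SNR curve.
Class

HTRU2 summary: 17,898 examples in total, 1,639 positive and 16,259 negative.

1. Astronomical longitude: the angle λ between the astronomical meridian plane through a point A and the plane of the prime meridian. (In pulsar work, "longitude" in "longitude-resolved" normally refers to pulse longitude, i.e. rotational phase, rather than a coordinate on the sky.)
Glossary of astronomical abbreviations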
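
To make the four per-curve statistics concrete, here is a minimal sketch that computes them for a small array. It assumes population (biased) moments; the estimator actually used to build HTRU2 is not stated in the text, and the `profile` values and the function name `profile_statistics` are invented for illustration.

```python
import math

def profile_statistics(values):
    """Return (mean, standard deviation, excess kurtosis, skewness).

    Uses population moments; the original feature extraction may use
    a different estimator.
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return mean, std, excess_kurtosis, skewness

# Hypothetical folded profile: flat baseline with one narrow pulse.
profile = [1.0] * 28 + [3.0, 9.0, 3.0] + [1.0]
mean, std, kurt, skew = profile_statistics(profile)
print(mean, std, kurt, skew)
```

A single narrow peak on a flat baseline gives positive skewness and large positive excess kurtosis, which is the intuition behind using these moments as features.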

Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import itertools
warnings.filterwarnings("ignore")
%matplotlib inline
from PIL import Image

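1. seaborn: a statistical plotting library built on matplotlib and integrated with pandas data structures; it puts visualization at the centre of exploring and understanding data.
seaborn 0.9 documentation (Chinese)
2. warnings: warnings come up often during Python development but usually do not stop a program from running. The statements below suppress warning output.
import warnings
warnings.filterwarnings("ignore")
3. itertools: the Python standard-library module of iterator-building functions.
itertools documentation (Chinese)
Liao Xuefeng's itertools tutorial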

data = pd.read_csv(r"../input/predicting-a-pulsar-star/pulsar_stars.csv")
data.head()

Figure: data loading

Data dimensions

print ("Number of rows    :",data.shape[0])
print ("Number of columns :",data.shape[1])

Figure: data dimensions

Data Information

# info() prints its report directly and returns None, so it is not wrapped in print()
data.info()

Figure: data information

Missing values

print (data.isnull().sum())

Figure: missing values

Data summary

plt.figure(figsize=(12,8))
sns.heatmap(data.describe()[1:].transpose(),    # rectangular dataset
            annot=True,     # when True, write the data value in each heatmap cell
            linecolor="w",
            linewidths=2,cmap=sns.color_palette("Set2"))
plt.title("Data summary")
plt.show()

Figure: data summary

1. seaborn.heatmap: plots rectangular data as a color-encoded matrix.
seaborn.heatmap documentation (Chinese)

CORRELATION BETWEEN VARIABLES

correlation = data.corr()
plt.figure(figsize=(10,8))
sns.heatmap(correlation,annot=True,
            cmap=sns.color_palette("magma"),
            linewidths=2,linecolor="k")   # heatmap's own parameters for cell borders
plt.title("CORRELATION BETWEEN VARIABLES")
plt.show()

Figure: correlation between variables
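
For readers unfamiliar with what each cell of the correlation heatmap contains: `DataFrame.corr()` computes the Pearson coefficient by default. The sketch below works through that formula in plain Python; the input values are made up and the function name `pearson` is illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values, for illustration only.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]  # increases linearly with a
c = [8.0, 6.0, 4.0, 2.0]  # decreases linearly with a
print(pearson(a, b), pearson(a, c))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.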

Proportion of target variable in dataset

plt.figure(figsize=(12,6))
plt.subplot(121)
ax = sns.countplot(y = data["target_class"],
                   palette=["r","g"],
                   linewidth=1,
                   edgecolor="k")
for i,j in enumerate(data["target_class"].value_counts().values):
    ax.text(.7,i,j,weight = "bold",fontsize = 27)
plt.title("Count for target variable in dataset")

plt.subplot(122)
plt.pie(data["target_class"].value_counts().values,
        labels=["not pulsar stars","pulsar stars"],
        autopct="%1.0f%%",wedgeprops={"linewidth":2,"edgecolor":"white"})
my_circ = plt.Circle((0,0),.7,color = "white")
plt.gca().add_artist(my_circ)
plt.subplots_adjust(wspace = .2)
plt.title("Proportion of target variable in dataset")
plt.show()

Figure: proportion of target variable in the dataset

1. seaborn.countplot documentation (English)

COMPARING MEAN & STANDARD DEVIATION BETWEEN ATTRIBUTES FOR TARGET CLASSES

compare = data.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve','skewness_dmsnr_curve']].mean().reset_index()
compare = compare.drop("target_class",axis =1)
compare.plot(kind="bar",width=.6,figsize=(13,6),colormap="Set2")
plt.grid(True,alpha=.3)
plt.title("COMPARING MEAN OF ATTRIBUTES FOR TARGET CLASSES")

compare1 = data.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve','skewness_dmsnr_curve']].std().reset_index()
compare1 = compare1.drop("target_class",axis=1)
compare1.plot(kind="bar",width=.6,figsize=(13,6),colormap="Set2")
plt.grid(True,alpha=.3)
plt.title("COMPARING STANDARD DEVIATION OF ATTRIBUTES FOR TARGET CLASSES")

plt.show()

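Figure: comparing the mean of attributes for target classes

Figure: comparing the standard deviation of attributes for target classes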

compare_mean = compare.transpose().reset_index()
compare_mean = compare_mean.rename(columns={'index':"features", 0:"not_star", 1:"star"})
plt.figure(figsize=(13,14))
plt.subplot(211)
sns.pointplot(x= "features",y="not_star",data=compare_mean,color="r")
sns.pointplot(x= "features",y="star",data=compare_mean,color="g")
plt.xticks(rotation =60)
plt.xlabel("")
plt.grid(True,alpha=.3)
plt.title("COMPARING MEAN OF ATTRIBUTES FOR TARGET CLASSES")

compare_std = compare1.transpose().reset_index()
compare_std = compare_std.rename(columns={'index':"features", 0:"not_star", 1:"star"})
plt.subplot(212)
sns.pointplot(x= "features",y="not_star",data=compare_std,color="r")
sns.pointplot(x= "features",y="star",data=compare_std,color="g")
plt.xticks(rotation =60)
plt.grid(True,alpha=.3)
plt.title("COMPARING STANDARD DEVIATION OF ATTRIBUTES FOR TARGET CLASSES")
plt.subplots_adjust(hspace =.4)
print ("[GREEN == STAR , RED == NOTSTAR]")
plt.show()

Figure: legend for target classes

Figure: comparing the mean and standard deviation of attributes for target classes

compare_mean
plt.figure(figsize=(10,10))
plt.subplot(211)
sns.barplot(y="features",x="not_star",
            data=compare_mean,color="r")
sns.barplot(y="features",x="star",
            data=compare_mean,color="g")
plt.title("COMPARING MEAN OF ATTRIBUTES FOR TARGET CLASSES")

plt.subplot(212)
sns.barplot(y="features",x="star",
            data=compare_std,color="g")
sns.barplot(y="features",x="not_star",
            data=compare_std,color="r")
plt.title("COMPARING STANDARD DEVIATION OF ATTRIBUTES FOR TARGET CLASSES")
plt.subplots_adjust(wspace =.5)

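Figure: comparing the mean and standard deviation of attributes for target classes

1. seaborn.pointplot documentation (English)
2. seaborn.barplot documentation (English)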

DISTRIBUTION OF VARIABLES IN DATA SET

columns = ['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
           'mean_dmsnr_curve', 'std_dmsnr_curve', 'kurtosis_dmsnr_curve',
           'skewness_dmsnr_curve']
length  = len(columns)
colors  = ["r","g","b","m","y","c","k","orange"] 

plt.figure(figsize=(13,20))
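for i,j,k in itertools.zip_longest(columns,range(length),colors):
    plt.subplot(length//2,length//4,j+1)   # integer division: subplot needs integer grid sizes
    sns.distplot(data[i],color=k)          # deprecated since seaborn 0.11; sns.histplot(data[i],kde=True,color=k) is the closest replacement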
    plt.title(i)
    plt.subplots_adjust(hspace = .3)
    plt.axvline(data[i].mean(),color = "k",linestyle="dashed",label="MEAN")
    plt.axvline(data[i].std(),color = "b",linestyle="dotted",label="STANDARD DEVIATION")
    plt.legend(loc="upper right")
    
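print ("***************************************")
print ("DISTRIBUTION OF VARIABLES IN DATA SET")
print ("***************************************")

Figure: distribution of variables in the data set

1. Visualizing the distribution of a dataset: in seaborn, the most convenient way to take a quick look at a univariate distribution is the distplot() function.
2. seaborn.distplot documentation (Chinese)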

PAIR PLOT BETWEEN ALL VARIABLES

sns.pairplot(data,hue="target_class")
plt.title("pair plot for variables")
plt.show()

Figure: pair plot between all variables

1. seaborn.pairplot documentation (Chinese)

Scatter plot between variables for target classes

plt.figure(figsize=(14,7))
plt.subplot(121)
plt.scatter(x = "kurtosis_profile",y = "skewness_profile",
            data=data[data["target_class"] == 1],alpha=.7,
            label="pulsar stars",s=30,color = "g",linewidths=.4,edgecolors="black")
plt.scatter(x = "kurtosis_profile",y = "skewness_profile",
            data=data[data["target_class"] == 0],alpha=.6,
            label="not pulsar stars",s=30,color ="r",linewidths=.4,edgecolors="black")
plt.axvline(data[data["target_class"] == 1]["kurtosis_profile"].mean(),
            color = "g",linestyle="dashed",label="mean pulsar star")
plt.axvline(data[data["target_class"] == 0]["kurtosis_profile"].mean(),
            color = "r",linestyle="dashed",label ="mean non pulsar star")
plt.axhline(data[data["target_class"] == 1]["skewness_profile"].mean(),
            color = "g",linestyle="dashed")
plt.axhline(data[data["target_class"] == 0]["skewness_profile"].mean(),
            color = "r",linestyle="dashed")
plt.legend(loc ="best")
plt.xlabel("kurtosis profile")
plt.ylabel("skewness profile")
plt.title("Scatter plot for skewness and kurtosis for target classes")

plt.subplot(122)
plt.scatter(x = "skewness_dmsnr_curve",y = 'kurtosis_dmsnr_curve',
            data=data[data["target_class"] == 0],alpha=.7,
            label="not pulsar stars",s=30,color ="r",linewidths=.4,edgecolors="black")
plt.scatter(x = "skewness_dmsnr_curve",y = 'kurtosis_dmsnr_curve',
            data=data[data["target_class"] == 1],alpha=.7,
            label="pulsar stars",s=30,color = "g",linewidths=.4,edgecolors="black")
# The x axis is skewness and the y axis is kurtosis, so the vertical mean
# lines use skewness and the horizontal mean lines use kurtosis.
plt.axvline(data[data["target_class"] == 1]["skewness_dmsnr_curve"].mean(),
            color = "g",linestyle="dashed",label ="mean pulsar star")
plt.axvline(data[data["target_class"] == 0]["skewness_dmsnr_curve"].mean(),
            color = "r",linestyle="dashed",label ="mean non pulsar star")
plt.axhline(data[data["target_class"] == 1]["kurtosis_dmsnr_curve"].mean(),
            color = "g",linestyle="dashed")
plt.axhline(data[data["target_class"] == 0]["kurtosis_dmsnr_curve"].mean(),
            color = "r",linestyle="dashed")
plt.legend(loc ="best")
plt.xlabel("skewness_dmsnr_curve")
plt.ylabel('kurtosis_dmsnr_curve')
plt.title("Scatter plot for skewness and kurtosis of dmsnr_curve for target classes")
plt.subplots_adjust(wspace =.4)

Figure: scatter plot between variables for target classes

1. seaborn.scatterplot documentation (English)

BOXPLOT FOR VARIABLES IN DATA SET WITH TARGET CLASS

columns = [x for x in data.columns if x not in ["target_class"]]
length  = len(columns)
plt.figure(figsize=(13,20))
for i,j in itertools.zip_longest(columns,range(length)):
    plt.subplot(4,2,j+1)
    sns.boxenplot(x=data["target_class"],y=data[i],palette=["orangered","lime"])   # lvplot was renamed boxenplot in seaborn 0.9
    plt.title(i)
    plt.subplots_adjust(hspace=.3)
    plt.axhline(data[i].mean(),linestyle = "dashed",color ="k",label ="Mean value for data")
    plt.legend(loc="best")
    
print ("****************************************************")
print ("BOXPLOT FOR VARIABLES IN DATA SET WITH TARGET CLASS")
print ("****************************************************")

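Figure: box plots of the variables in the data set by target class

1. lvplot (enhanced box plot): originally called the letter-value plot when Heike Hofmann proposed it in 2011; in seaborn 0.9.0 the function was renamed seaborn.boxenplot().
2. seaborn.boxenplot documentation (English)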

Area plot for attributes of pulsar stars vs non pulsar stars

st = data[data["target_class"] == 1].reset_index()
nst= data[data["target_class"] == 0].reset_index()
new = pd.concat([nst,st]).reset_index()

plt.figure(figsize=(13,10))
plt.stackplot(new.index,new["mean_profile"],
              alpha =.5,color="b",labels=["mean_profile"])
plt.stackplot(new.index,new["std_profile"],
              alpha=.5,color="r",labels=["std_profile"])
plt.stackplot(new.index,new["skewness_profile"],
              alpha=.5,color ="g",labels=["skewness_profile"])
plt.stackplot(new.index,new["kurtosis_profile"],
              alpha=.5,color = "m",labels=["kurtosis_profile"])
plt.axvline(x=16259,color = "black",linestyle="dashed",
            label = "separating pulsars vs non pulsars")
plt.axhline(new["mean_profile"].mean(),color = "b",
            linestyle="dashed",label = "average mean profile")
plt.axhline(new["std_profile"].mean(),color = "r",
            linestyle="dashed",label = "average std profile")
plt.axhline(new["skewness_profile"].mean(),color = "g",
            linestyle="dashed",label = "average skewness profile")
plt.axhline(new["kurtosis_profile"].mean(),color = "m",
            linestyle="dashed",label = "average kurtosis profile")
plt.legend(loc="best")
plt.title("Area plot for attributes for pulsar stars vs non pulsar stars")
plt.show()

Figure: area plot for attributes of pulsar stars vs non pulsar stars

1. How to create a stack plot with Matplotlib

Area plot for dmsnr_curve attributes of pulsar stars vs non pulsar stars

plt.figure(figsize=(13,10))
plt.stackplot(new.index,new["mean_dmsnr_curve"],
              color="b",alpha=.5,labels=["mean_dmsnr_curve"])
plt.stackplot(new.index,new["std_dmsnr_curve"],
              color="r",alpha=.5,labels=["std_dmsnr_curve"])
plt.stackplot(new.index,new["skewness_dmsnr_curve"],color="g",
              alpha=.5,labels=["skewness_dmsnr_curve"])
plt.stackplot(new.index,new["kurtosis_dmsnr_curve"],color="m",
              alpha=.5,labels=["kurtosis_dmsnr_curve"])
plt.axvline(x=16259,color = "black",linestyle="dashed",
            label = "separating pulsars vs non pulsars")
plt.axhline(new["mean_dmsnr_curve"].mean(),color = "b",linestyle="dashed",
            label = "average mean dmsnr_curve")
plt.axhline(new["std_dmsnr_curve"].mean(),color = "r",
            linestyle="dashed",label = "average std dmsnr_curve")
plt.axhline(new["skewness_dmsnr_curve"].mean(),color = "g",
            linestyle="dashed",label = "average skewness dmsnr_curve")
plt.axhline(new["kurtosis_dmsnr_curve"].mean(),color = "m",
            linestyle="dashed",label = "average kurtosis dmsnr_curve")
plt.legend(loc="best")
plt.title("Area plot for dmsnr_curve attributes for pulsar stars vs non pulsar stars")
plt.show()

Figure: area plot for dmsnr_curve attributes of pulsar stars vs non pulsar stars

3D PLOT FOR MEAN_PROFILE VS STD_PROFILE VS SKEWNESS_DMSNR_CURVE

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(13,13))
ax  = fig.add_subplot(111,projection = "3d")

ax.scatter(data[data["target_class"] == 1][["mean_profile"]],data[data["target_class"] == 1][["std_profile"]],data[data["target_class"] == 1][["skewness_dmsnr_curve"]],
           alpha=.5,s=80,linewidth=2,edgecolor="k",color="lime",label="Pulsar star")
ax.scatter(data[data["target_class"] == 0][["mean_profile"]],data[data["target_class"] == 0][["std_profile"]],data[data["target_class"] == 0][["skewness_dmsnr_curve"]],
           alpha=.5,s=80,linewidth=2,edgecolor="k",color="r",label=" NotPulsar star")

ax.set_xlabel("mean_profile",fontsize=15)
ax.set_ylabel("std_profile",fontsize=15)
ax.set_zlabel("skewness_dmsnr_curve",fontsize=15)
plt.legend(loc="best")
fig.set_facecolor("w")
plt.title("3D PLOT FOR MEAN_PROFILE VS STD_PROFILE VS SKEWNESS_DMSNR_CURVE",fontsize=10)
plt.show()

Figure: 3D plot of mean_profile, std_profile and skewness_dmsnr_curve

1. matplotlib mplot3d official documentation (English)

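DENSITY PLOT BETWEEN MEAN_PROFILE & STD_PROFILE

# Recent seaborn requires x and y as keywords; scale=10 is dropped because it is not a documented jointplot parameter.
sns.jointplot(x="mean_profile",y="std_profile",data=data,kind="kde")
plt.show()

Figure: density plot between mean_profile and std_profile

1. seaborn.jointplot documentation (English)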

Bubble plot between mean,std for skewness and kurtosis

plt.figure(figsize=(13,7))
plt.subplot(121)
plt.scatter(st["mean_profile"],st["std_profile"],alpha=.5,
            s=st["skewness_profile"]*3,linewidths=1,color="g",label="pulsar_star")
plt.scatter(nst["mean_profile"],nst["std_profile"],alpha=.5,
            s=nst["skewness_profile"]*3,linewidths=1,color="r",label="not_pulsar_star")
plt.legend(loc="best")
plt.xlabel("mean_profile")
plt.ylabel("std_profile")
plt.title("Bubble plot for mean,std and skewness")


plt.subplot(122)
plt.scatter(st["mean_profile"],st["std_profile"],alpha=.5,
            s=st["kurtosis_profile"]*5,linewidths=1,color="g",label="pulsar_star")
plt.scatter(nst["mean_profile"],nst["std_profile"],alpha=.5,
            s=nst["kurtosis_profile"]*5,linewidths=1,color="r",label="not_pulsar_star")
plt.legend(loc="best")
plt.xlabel("mean_profile")
plt.ylabel("std_profile")
plt.title("Bubble plot for mean,std and kurtosis")
plt.show()

Figure: bubble plots of mean and std, sized by skewness and by kurtosis

Bubble plot between mean_dmsnr_curve,std_dmsnr_curve for skewness_dmsnr_curve and kurtosis_dmsnr_curve

plt.figure(figsize=(13,7))
plt.subplot(121)
plt.scatter(st["mean_dmsnr_curve"],st["std_dmsnr_curve"],
            alpha=.5,s=st["skewness_dmsnr_curve"],linewidths=1,color="g",label="pulsar_star")
plt.scatter(nst["mean_dmsnr_curve"],nst["std_dmsnr_curve"],
            alpha=.5,s=nst["skewness_dmsnr_curve"],linewidths=1,color="r",label="not_pulsar_star")
plt.legend(loc="best")
plt.xlabel("mean_dmsnr_curve")
plt.ylabel("std_dmsnr_curve")
plt.title("Bubble plot for mean,std and skewness of dmsnr_curve")


plt.subplot(122)
plt.scatter(st["mean_dmsnr_curve"],st["std_dmsnr_curve"],
            alpha=.5,s=st["kurtosis_dmsnr_curve"],linewidths=1,color="g",label="pulsar_star")
plt.scatter(nst["mean_dmsnr_curve"],nst["std_dmsnr_curve"],
            alpha=.5,s=nst["kurtosis_dmsnr_curve"],linewidths=1,color="r",label="not_pulsar_star")
plt.legend(loc="best")
plt.xlabel("mean_dmsnr_curve")
plt.ylabel("std_dmsnr_curve")
plt.title("Bubble plot for mean,std and kurtosis of dmsnr_curve")
plt.show()

Figure: bubble plots of mean and std of dmsnr_curve, sized by skewness and by kurtosis

Visualizing the distribution of variables for target class

columns = [x for x in data.columns if x not in ["target_class"]]
length  = len(columns)

plt.figure(figsize=(13,25))

for i,j in itertools.zip_longest(columns,range(length)):
    plt.subplot(length//2,length//4,j+1)   # integer division: subplot needs integer grid sizes
    sns.violinplot(x=data["target_class"],y=data[i],
                   palette=["Orangered","lime"],alpha=.5)
    plt.title(i)

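Figure: distribution of variables for target class

1. seaborn.violinplot: a violin plot combines a box plot with a kernel density estimate.
seaborn.violinplot documentation (Chinese)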

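Parallel coordinates plot to compare features between variables

# pandas.tools.plotting was removed from pandas; the function now lives in pandas.plotting
from pandas.plotting import parallel_coordinates
plt.figure(figsize=(14,8))
parallel_coordinates(data,"target_class",alpha=.5)
plt.show()

Figure: parallel coordinates plot comparing features between variables

1. Parallel coordinates: a common visualization method for high-dimensional geometry and multivariate data.
pandas parallel_coordinates documentation (English)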

Proportion of target class in train & test data

from sklearn.model_selection import train_test_split

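train , test = train_test_split(data,test_size = .3,random_state = 123)

# Feature/label split used by the model(...) calls further down. These names are
# never defined in the source, so this definition is an assumption inferred from their use.
feature_cols = [x for x in data.columns if x != "target_class"]
train_X, train_Y = train[feature_cols], train["target_class"]
test_X,  test_Y  = test[feature_cols],  test["target_class"]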

plt.figure(figsize=(12,6))
plt.subplot(121)
train["target_class"].value_counts().plot.pie(labels = ["not star","star"],
                                              autopct = "%1.0f%%",
                                              shadow = True,explode=[0,.1])
plt.title("proportion of target class in train data")
plt.ylabel("")
plt.subplot(122)
test["target_class"].value_counts().plot.pie(labels = ["not star","star"],
                                             autopct = "%1.0f%%",
                                             shadow = True,explode=[0,.1])
plt.title("proportion of target class in test data")
plt.ylabel("")
plt.show()

Figure: proportion of target class in train and test data
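
With a minority class this small, a purely random split can leave the train and test sets with slightly different class proportions. One common remedy is a stratified split, which splits each class separately. The sketch below is a hand-rolled illustration in plain Python, not the notebook's method; the function name `stratified_split` and the toy labels are assumptions. scikit-learn's `train_test_split` accepts a `stratify` argument for the same purpose.

```python
import random

def stratified_split(labels, test_size=0.3, seed=123):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with a similar kind of imbalance: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(sum(labels[i] for i in test_idx), "positives in", len(test_idx), "test rows")
```

Both resulting subsets keep the 9:1 ratio of the toy labels, so metrics computed on the test subset reflect the same class balance the model was trained on.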

MODEL

# MODEL FUNCTION (model pipeline)

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,roc_curve,auc

def model(algorithm,dtrain_x,dtrain_y,dtest_x,dtest_y,of_type):
    
    print ("*****************************************************************************************")
    print ("MODEL - OUTPUT")
    print ("*****************************************************************************************")
    algorithm.fit(dtrain_x,dtrain_y)
    predictions = algorithm.predict(dtest_x)
    
    print (algorithm)
    print ("\naccuracy_score :",accuracy_score(dtest_y,predictions))
    
    print ("\nclassification report :\n",(classification_report(dtest_y,predictions)))
        
    plt.figure(figsize=(13,10))
    plt.subplot(221)
    sns.heatmap(confusion_matrix(dtest_y,predictions),annot=True,fmt = "d",linecolor="k",linewidths=3)
    plt.title("CONFUSION MATRIX",fontsize=20)
    
    predicting_probabilites = algorithm.predict_proba(dtest_x)[:,1]
    fpr,tpr,thresholds = roc_curve(dtest_y,predicting_probabilites)
    plt.subplot(222)
    plt.plot(fpr,tpr,label = "Area under the curve : %.4f" % auc(fpr,tpr),color = "r")
    plt.plot([1,0],[1,0],linestyle = "dashed",color ="k")
    plt.legend(loc = "best")
    plt.title("ROC - CURVE & AREA UNDER CURVE",fontsize=20)
    
    if  of_type == "feat":
        
        dataframe = pd.DataFrame(algorithm.feature_importances_,dtrain_x.columns).reset_index()
        dataframe = dataframe.rename(columns={"index":"features",0:"coefficients"})
        dataframe = dataframe.sort_values(by="coefficients",ascending = False)
        plt.subplot(223)
        ax = sns.barplot(x = "coefficients" ,y ="features",data=dataframe,palette="husl")
        plt.title("FEATURE IMPORTANCES",fontsize =20)
        for i,j in enumerate(dataframe["coefficients"]):
            ax.text(.011,i,j,weight = "bold")
    
    elif of_type == "coef" :
        
        dataframe = pd.DataFrame(algorithm.coef_.ravel(),dtrain_x.columns).reset_index()
        dataframe = dataframe.rename(columns={"index":"features",0:"coefficients"})
        dataframe = dataframe.sort_values(by="coefficients",ascending = False)
        plt.subplot(223)
        ax = sns.barplot(x = "coefficients" ,y ="features",data=dataframe,palette="husl")
        plt.title("FEATURE IMPORTANCES",fontsize =20)
        for i,j in enumerate(dataframe["coefficients"]):
            ax.text(.011,i,j,weight = "bold")
            
    elif of_type == "none" :
        return (algorithm)

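1. sklearn.metrics: evaluation metrics are the quantitative measures used to check how well a machine learning model performs; choosing them is an unavoidable and important question.
Model evaluation: quantifying the quality of predictions, section 3.3.1.2, Defining your scoring strategy from metric functions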
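
The numbers in the classification report all derive from the four cells of the confusion matrix. The sketch below recomputes them by hand for a binary problem; the function name `binary_metrics` and the example labels are invented, and the matrix layout (rows are true labels, columns are predicted labels) follows the convention of `sklearn.metrics.confusion_matrix`.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented labels: 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall here refer to the positive class (label 1), which is the pulsar class in this dataset.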

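RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
rf =RandomForestClassifier()
model(rf,train_X,train_Y,test_X,test_Y,"feat")

Figure: random forest classifier evaluation report

Figure: random forest classifier evaluation plots

1. RandomForest: a random forest is a classifier that trains multiple trees on the samples and combines their predictions.
Ensemble methods, section 1.11.2, Forests of randomized trees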
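
The idea of combining several trees can be sketched with a simple majority vote. This is only an illustration of the combining step: the three "trees" are hypothetical one-rule classifiers with invented thresholds, no training is shown, and scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical "trees", each a single threshold rule on one feature.
# Feature names follow the dataset columns; the thresholds are invented.
trees = [
    lambda row: 1 if row["kurtosis_profile"] > 1.0 else 0,
    lambda row: 1 if row["mean_profile"] < 80.0 else 0,
    lambda row: 1 if row["std_dmsnr_curve"] > 40.0 else 0,
]

def forest_predict(row):
    """Ask every tree for a label and return the majority answer."""
    return majority_vote([tree(row) for tree in trees])

candidate = {"kurtosis_profile": 2.5, "mean_profile": 60.0, "std_dmsnr_curve": 20.0}
print(forest_predict(candidate))
```

Two of the three rules vote for class 1 on this candidate, so the combined prediction is 1 even though one rule disagrees.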

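DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
model(dt,train_X,train_Y,test_X,test_Y,"feat")

Figure: decision tree classifier evaluation report

Figure: decision tree classifier evaluation plots

1. DecisionTree: a decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class.
Decision trees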
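
The structure described in the note (internal nodes test an attribute, branches are test outcomes, leaves are classes) can be written down directly. The sketch below uses nested dictionaries for a hand-written tree; the thresholds and the tree shape are invented, and nothing here is learned from the data.

```python
# A hand-written tree as nested dicts. Feature names follow the dataset
# columns; the thresholds and structure are invented for illustration.
tree = {
    "feature": "kurtosis_profile", "threshold": 1.0,
    "left": {"label": 0},               # outcome: value <= threshold
    "right": {                          # outcome: value > threshold
        "feature": "mean_profile", "threshold": 90.0,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}

def predict(node, row):
    """Walk from the root to a leaf, following one branch per test."""
    while "label" not in node:
        went_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if went_left else node["right"]
    return node["label"]

first = predict(tree, {"kurtosis_profile": 3.0, "mean_profile": 50.0})
second = predict(tree, {"kurtosis_profile": 0.2, "mean_profile": 110.0})
print(first, second)
```

A trained `DecisionTreeClassifier` chooses the features and thresholds automatically, but prediction follows the same root-to-leaf walk.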

Extra Tree Classifier

from sklearn.tree import ExtraTreeClassifier
etc = ExtraTreeClassifier()
model(etc,train_X,train_Y,test_X,test_Y,"feat")

Figure: extra tree classifier evaluation report

Figure: extra tree classifier evaluation plots

1. Extra Tree:
Extremely Randomized Trees (ExtRa Trees)
OpenCV 2.4.9 source code analysis: Extremely randomized trees

GradientBoostingClassifier

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
model(gbc,train_X,train_Y,test_X,test_Y,"feat")

Figure: gradient boosting classifier evaluation report

Figure: gradient boosting classifier evaluation plots

1. GradientBoosting:
Understanding the Gradient Boosting algorithm from scratch

Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
model(nb,train_X,train_Y,test_X,test_Y,"none")

Figure: Gaussian naive Bayes evaluation report

Figure: Gaussian naive Bayes evaluation plots

1. Gaussian Naive Bayes:
Naive Bayes

K-Nearest Neighbour Classifier

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
model(knn,train_X,train_Y,test_X,test_Y,"none")

Figure: k-nearest neighbour classifier evaluation report

Figure: k-nearest neighbour classifier evaluation plots

1. K-Nearest Neighbour:
Nearest neighbors

Ada Boost Classifier

from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
model(ada,train_X,train_Y,test_X,test_Y,"feat")

Figure: AdaBoost classifier evaluation report

Figure: AdaBoost classifier evaluation plots

1. AdaBoost:
AdaBoost beginner tutorial: an accessible introduction to the principle

Last edited: 2020.07.02 09:36:52
