Basic pandas Usage (Part 1)

What is pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

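This time I copied the English description to look professional; I can't read it anyway.

Why learn pandas

So here is the question: numpy already helps us process data and, combined with matplotlib, solves our data analysis problems. What, then, is the point of learning pandas?

numpy helps us process numerical data, but that is not enough.
Very often our data contains more than numbers: there are strings, time series, and so on.
For example, data collected by a crawler and stored in a database.
For example, the earlier YouTube example, which besides numbers also had country information, video category (tag) information, titles, and so on.

So numpy helps us process numbers, while pandas, which is built on numpy, handles numbers and other types of data as well.

Common pandas data types

Series: a one-dimensional labeled array
DataFrame: two-dimensional, a container of Series

pandas: creating a Series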

In [4]: t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))

In [5]: t
Out[5]: 
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [7]: type(t)
Out[7]: pandas.core.series.Series

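Note a few questions:

What can pd.Series do, and what kinds of data can be passed in to become a Series?
What is the index, where does it sit, and what does it correspond to for ordinary database data or an ndarray?
How do we assign an index to a set of data?

In [8]: a = {string.ascii_uppercase[i]: i for i in range(10)}  # build dictionary a with a dict comprehension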

In [9]: a
Out[9]: 
{'A': 0,
 'B': 1,
 'C': 2,
 'D': 3,
 'E': 4,
 'F': 5,
 'G': 6,
 'H': 7,
 'I': 8,
 'J': 9}

# Create a Series from a dictionary; note that the index is the dictionary's keys
In [10]: pd.Series(a)
Out[10]: 
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64

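# When a different index is specified, labels that match a key take that key's value; labels that do not match become NaN
# Easy to understand, right?
# Someone has 10 kinds of fruit and you ask for apple, banana, and pineapple. They have apple and banana but no pineapple, so pineapple is NaN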
In [12]: pd.Series(a, index=list(string.ascii_uppercase[5:15]))
Out[12]: 
F    5.0
G    6.0
H    7.0
I    8.0
J    9.0
K    NaN
L    NaN
M    NaN
N    NaN
O    NaN
dtype: float64

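Why did the dtype change to float?
In numpy, nan is a float, so pandas automatically changes the Series dtype to fit the data.
So how do we change the dtype?
The same way as in numpy: with astype.

pandas: slicing and indexing a Series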
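
A minimal sketch of the dtype change, reusing the dictionary a from above. One assumption to note: NaN cannot be stored in a plain int64 Series, so the missing values are filled with 0 before converting back (0 is an arbitrary choice for illustration).

```python
import string

import numpy as np
import pandas as pd

# Same dictionary as above: 'A'..'J' mapped to 0..9.
a = {string.ascii_uppercase[i]: i for i in range(10)}

# Reindexing to 'F'..'O' introduces NaN, so the dtype becomes float64.
s = pd.Series(a, index=list(string.ascii_uppercase[5:15]))
print(s.dtype)

# astype works as it does on an ndarray. NaN has no integer
# representation, so fill the gaps first (0 is an arbitrary choice).
s_int = s.fillna(0).astype("int64")
print(s_int.dtype)

# A Series without NaN converts directly.
t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))
t_float = t.astype(float)
print(t_float.dtype)
```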

In [13]: t
Out[13]: 
A    0
B    1
C    2
D    3
E    4
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [14]: t[2:10:2]
Out[14]: 
C    2
E    4
G    6
I    8
dtype: int64

In [15]: t[1]
Out[15]: 1

In [16]: t[[2, 3, 6]]
Out[16]: 
C    2
D    3
G    6
dtype: int64

In [17]: t[t>4]
Out[17]: 
F    5
G    6
H    7
I    8
J    9
dtype: int64

In [18]: t["F"]
Out[18]: 5

In [19]: t[["A", "F", "g"]]
Out[19]: 
A    0.0
F    5.0
g    NaN
dtype: float64

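Slicing: pass start, end, and optionally a step.
Indexing: for a single element pass its position or its index label; for several elements pass a list of positions or labels.
The outputs above come from an older pandas release. Recent versions deprecate integer positions inside [] when the index holds labels (use t.iloc[1] instead), and a list containing a missing label such as "g" raises a KeyError rather than returning NaN (use t.reindex instead).

pandas: the index and values of a Series

Given an unfamiliar Series, how do we find out its index and its values?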
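
The same selections can be written with the explicit accessors, which behave the same way across pandas versions. This is a sketch, not the only way to do it: iloc selects by position, loc by label, and reindex is used here as a stand-in for looking up a label that may not exist.

```python
import string

import numpy as np
import pandas as pd

t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))

by_position = t.iloc[1]        # second element, selected by position
by_label = t.loc["F"]          # element labelled 'F'
several = t.iloc[[2, 3, 6]]    # positions 2, 3 and 6
stepped = t.iloc[2:10:2]       # positional slice with a step of 2
label_slice = t.loc["C":"E"]   # label slice; both ends are included

# reindex returns NaN for labels that are not in the index.
with_missing = t.reindex(["A", "F", "g"])
print(with_missing)
```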

In [20]: t.index
Out[20]: Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'], dtype='object')

In [21]: t.values
Out[21]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [22]: type(t.index)
Out[22]: pandas.core.indexes.base.Index

In [23]: type(t.values)
Out[23]: numpy.ndarray

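A Series object is essentially made of two arrays: one holds the keys (index) and the other holds the values (values), mapping key -> value.

Many ndarray methods also work on a Series, for example argmax and clip.
A Series has a where method too, but its result differs from numpy's where.

pandas: reading external data

Suppose we have a set of statistics about dog names. What should we do to get a feel for the data?

Data source: https://www.kaggle.com/new-york-city/nyc-dog-names/data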
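
A small sketch of that difference, using the same kind of Series as above. Series.where keeps the values where the condition holds and replaces the rest, while np.where either returns positions (one argument) or picks between two alternatives (three arguments).

```python
import numpy as np
import pandas as pd

t = pd.Series(np.arange(10), index=list("ABCDEFGHIJ"))

largest_pos = t.argmax()   # position of the largest value
clipped = t.clip(3, 6)     # values below 3 become 3, values above 6 become 6

# Series.where: keep values where the condition is True, replace the rest.
kept = t.where(t > 4)           # the rest become NaN, so the dtype is float
replaced = t.where(t > 4, -1)   # the rest become -1

# np.where with one argument returns the positions where the condition holds.
positions = np.where(t.values > 4)[0]
# np.where with three arguments chooses element-wise between two inputs.
chosen = np.where(t.values > 4, t.values, -1)

print(kept)
print(positions)
```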

Our data is stored in a csv file, so we can simply use pd.read_csv.

import pandas as pd

# Read the csv file with pandas
df = pd.read_csv("./dogNames2.csv")
# print(df.head())
# print(df.info())

# Sorting a DataFrame
df = df.sort_values(by="Count_AnimalName", ascending=False)

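# Notes on selecting rows or columns
# - a slice inside the brackets, such as [:20], selects rows
# - a string inside the brackets is a column label and selects that column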
# print(df[:20])
# print(df[:20]["Row_Labels"])

print(df[(800 < df["Count_AnimalName"]) & (df["Count_AnimalName"] < 1000)])

Output

      Row_Labels  Count_AnimalName
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823

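This differs a little from what we expected: we thought the result would be a Series, but it is a DataFrame. So next we will get to know this data type.

But there is one more question:
How do we use data stored in a database such as mysql or mongodb?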
pd.read_sql(sql_sentence,connection)

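What about mongodb?
In practice this kind of data usually lives in MongoDB, and pandas cannot read from MongoDB directly. So, as before, read the data with pymongo, build it into a suitable structure, and then process it with pandas.

pandas: DataFrame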
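
To make pd.read_sql concrete, here is a sketch that uses an in-memory SQLite database from the standard library in place of a real mysql server. The table name dog_names and its columns are made up for illustration; the three rows are taken from the dog name data above.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dog_names (name TEXT, times_used INTEGER)")
conn.executemany(
    "INSERT INTO dog_names VALUES (?, ?)",
    [("ROCKY", 823), ("BELLA", 1195), ("MAX", 1153)],
)
conn.commit()

# read_sql runs the query and returns the result as a DataFrame.
df_sql = pd.read_sql(
    "SELECT name, times_used FROM dog_names ORDER BY times_used DESC", conn
)
print(df_sql)

conn.close()
```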

In [26]: t
Out[26]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

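We all know the vertical column on the left is the index; what, then, is the horizontal row across the top?

A DataFrame object has both a row index and a column index.

Row index: labels the individual rows; it is called index and is axis 0 (axis=0)
Column index: labels the individual columns; it is called columns and is axis 1 (axis=1)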
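
A quick way to get a feel for the two axes is to aggregate along each of them. The sketch below rebuilds the 3x4 DataFrame shown above; sum is just one example of a method that takes an axis argument.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape((3, 4)))

print(t.index)    # row index (axis 0)
print(t.columns)  # column index (axis 1)

# axis=0 works down the rows, giving one result per column.
col_totals = t.sum(axis=0)
# axis=1 works across the columns, giving one result per row.
row_totals = t.sum(axis=1)

print(col_totals)
print(row_totals)
```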

In [27]: t = pd.DataFrame(np.arange(12).reshape((3,4)), index=list(string.ascii_uppercase[:3]), columns=list(string.ascii_uppercase[-4:]))

In [28]: t
Out[28]: 
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

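So here are the questions:

How are DataFrame and Series related?
A Series can be built from a dictionary; can a DataFrame take a dictionary as its data too? Can mongodb data be passed in the same way?
A DataFrame has both a row index and a column index; what operations can we perform on it?

With an ndarray, we use shape, ndim, and dtype to learn its basic information. What can we use to learn about a DataFrame?

Basic DataFrame attributes

df.shape  # (number of rows, number of columns)
df.dtypes  # data type of each column
df.ndim  # number of dimensions
df.index  # row index
df.columns  # column index
df.values  # the values as a two-dimensional ndarray

Inspecting a DataFrame as a whole

df.head(3)  # show the first rows; 5 by default
df.tail(3)  # show the last rows; 5 by default
df.info()  # overview: row count, column count, column index, non-null count per column, column dtypes, memory usage
df.describe()  # quick summary statistics: count, mean, standard deviation, minimum, quartiles, maximum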
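
To answer the dictionary question with a sketch: a DataFrame accepts a dictionary of columns, and it also accepts a list of dictionaries, which is the shape of the documents that pymongo hands back. The names and values below are invented sample records, not real data.

```python
import pandas as pd

# A dictionary of columns: each key becomes a column label.
d1 = {"name": ["xiaoming", "xiaogang"], "age": [20, 32], "tel": ["10086", "10010"]}
df1 = pd.DataFrame(d1)

# A list of dictionaries, one per row (hypothetical records shaped like
# MongoDB documents). Keys missing from a record become NaN.
records = [
    {"name": "xiaohong", "age": 32, "tel": "10010"},
    {"name": "xiaogang", "tel": "10000"},
    {"name": "xiaowang", "age": 22},
]
df2 = pd.DataFrame(records)

# Each column of a DataFrame is a Series, which is the sense in which
# a DataFrame is a container of Series.
age_column = df1["age"]

print(df1)
print(df2)
print(type(age_column))
```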

Back to the dog name data we read earlier, let's try the methods just introduced.

# coding=utf-8
import pandas as pd

df = pd.read_csv("./dogNames2.csv")
print(df.head())
print("*" * 100)
print(df.info())
print("*" * 100)

# Sorting a DataFrame
df = df.sort_values(by="Count_AnimalName",ascending=False)
print(df.head(5))
print("*" * 100)

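# Notes on selecting rows or columns
# - a slice inside the brackets, such as [:20], selects rows
# - a string inside the brackets is a column label and selects that column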
print(df[:20])
print("*" * 100)

print(df["Row_Labels"])
print("*" * 100)

print(type(df["Row_Labels"]))
print("*" * 100)

Output

  Row_Labels  Count_AnimalName
0          1                 1
1          2                 2
2      40804                 1
3      90201                 1
4      90203                 1
****************************************************************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16220 entries, 0 to 16219
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Row_Labels        16217 non-null  object
 1   Count_AnimalName  16220 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 253.6+ KB
None
****************************************************************************************************
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
****************************************************************************************************
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
8417        LOLA               795
8552       LUCKY               723
8560        LUCY               710
2032       BUDDY               677
3641       DAISY               649
11703   PRINCESS               603
829       BAILEY               532
9766       MOLLY               519
14466      TEDDY               485
2913       CHLOE               465
14779       TOBY               446
8620        LUNA               432
6515        JACK               425
8788      MAGGIE               393
13762     SOPHIE               383
****************************************************************************************************
1156       BELLA
9140         MAX
2660     CHARLIE
3251        COCO
12368      ROCKY
          ...   
6884        J-LO
6888       JOANN
6890        JOAO
6891     JOAQUIN
16219      39743
Name: Row_Labels, Length: 16220, dtype: object
****************************************************************************************************
<class 'pandas.core.series.Series'>
****************************************************************************************************

So here is a question:

What are the most frequently used names?

df.sort_values(by="Count_AnimalName",ascending=False)

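And another question:

What if my data has 10 columns and I want to sort by the 1st, 3rd, and 8th columns?

pandas: selecting rows or columns

We now know how to sort the data by a column. Suppose we want to study only the 100 most used names; how do we do that?

df_sorted = df.sort_values(by="Count_AnimalName", ascending=False)
df_sorted[:100]

So here is a question:

How do we select one particular column?

df["Count_AnimalName"]

What if we want to select rows and columns at the same time?

df[:100]["Count_AnimalName"]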
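
One way to do this is to pass a list to by; ascending can be a matching list so each column gets its own direction. The small DataFrame below is invented for illustration and has three columns instead of ten, but the idea carries over unchanged.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "count": [2, 5, 7, 5],
    "name": ["LUNA", "MAX", "COCO", "BELLA"],
})

# Sort by group ascending, then count descending, then name ascending.
ordered = df_demo.sort_values(
    by=["group", "count", "name"], ascending=[True, False, True]
)
print(ordered)

# To pick the sort columns by position (for example the 1st and 2nd),
# look their labels up in df_demo.columns first.
by_position = df_demo.sort_values(by=list(df_demo.columns[[0, 1]]))
print(by_position)
```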

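pandas: loc

There are more selection methods, optimized by pandas:

df.loc selects rows (and, optionally, columns) by label
df.iloc selects rows (and, optionally, columns) by position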

In [33]: t = pd.DataFrame(np.arange(12).reshape((3,4)), index=list(string.ascii_uppercase[:3]), columns=list(string.ascii_uppercase[-4:]))

In [34]: t
Out[34]: 
   W  X   Y   Z
A  0  1   2   3
B  4  5   6   7
C  8  9  10  11

In [35]: t.loc["A", "W"]
Out[35]: 0

In [36]: t.loc["A",  ["W","Z"]]
Out[36]: 
W    0
Z    3
Name: A, dtype: int64

In [37]: type(t.loc["A", ["W", "Z"]])
Out[37]: pandas.core.series.Series

# Select non-adjacent rows and columns
In [38]: t.loc[["A", "C"], ["W", "Z"]]
Out[38]: 
   W   Z
A  0   3
C  8  11

In [39]: t.loc["A":, ["W", "Z"]]
Out[39]: 
   W   Z
A  0   3
B  4   7
C  8  11

# A slice in loc is closed: the label after the colon is included
In [40]: t.loc["A":"C", ["W","Z"]]
Out[40]: 
   W   Z
A  0   3
B  4   7
C  8  11

pandas: iloc

In [41]: t.iloc[1:3, [2,3]]
Out[41]: 
    Y   Z
B   6   7
C  10  11

In [42]: t.iloc[1:3, 1:3]
Out[42]: 
   X   Y
B  5   6
C  9  10

Changing data by assignment:

In [43]: t.loc["A", "Y"] = 100

In [44]: t
Out[44]: 
   W  X    Y   Z
A  0  1  100   3
B  4  5    6   7
C  8  9   10  11

In [45]: t.iloc[1:2, 0:2] = 200

In [46]: t
Out[46]: 
     W    X    Y   Z
A    0    1  100   3
B  200  200    6   7
C    8    9   10  11

pandas: boolean indexing

Back to the dog names: suppose we want to find every name used more than 800 times. How do we select them?

In [50]: df[df["Count_AnimalName"] > 800]
Out[50]: 
      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856
3251        COCO               852
9140         MAX              1153
12368      ROCKY               823

Suppose we want to find every name that is used more than 700 times and is longer than 4 characters. How do we select them?

In [51]: df[(df["Row_Labels"].str.len() > 4) & (df["Count_AnimalName"] > 700)]
Out[51]: 
      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856
8552       LUCKY               723
12368      ROCKY               823

& and
| or

Note: each condition must be wrapped in parentheses.

pandas: string methods
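
The parentheses matter because & and | bind more tightly than comparisons such as >. A sketch on a handful of rows copied from the dog name data (naming each condition first is just a readability choice):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "CHARLIE", "COCO", "ROCKY", "LOLA", "LUCKY"],
    "Count_AnimalName": [1195, 1153, 856, 852, 823, 795, 723],
})

long_name = df_demo["Row_Labels"].str.len() > 4
popular = df_demo["Count_AnimalName"] > 1000

both = df_demo[long_name & popular]          # both conditions hold
either = df_demo[long_name | popular]        # at least one condition holds
neither = df_demo[~(long_name | popular)]    # ~ negates a condition

print(both)
print(either)
print(neither)
```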

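Method	Description
cat	Concatenates strings element-wise, with an optional separator
contains	Returns a boolean array indicating whether each string contains the given pattern
count	Number of occurrences of the pattern
endswith, startswith	Equivalent to calling x.endswith(pattern) or x.startswith(pattern) on each element
findall	Returns the list of all pattern matches in each string
get	Gets the i-th character of each element
join	Joins the strings in each element of the Series with the given separator
len	Computes the length of each string
lower, upper	Converts case; equivalent to calling x.lower() or x.upper() on each element
match	Applies re.match to each element with the given regular expression
pad	Adds whitespace to the left, right, or both sides of each string
center	Equivalent to pad(side='both')
repeat	Repeats values; for example, s.str.repeat(3) is equivalent to x * 3 for each string
replace	Replaces occurrences of the pattern with the given string
slice	Takes a substring of each string in the Series
split	Splits each string on a separator or regular expression
strip, rstrip, lstrip	Removes whitespace, including newlines; equivalent to calling x.strip(), x.rstrip(), or x.lstrip() on each element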
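
All of these methods live under the .str accessor of a Series. A short sketch with a few of them, on made-up names with deliberately messy casing and whitespace:

```python
import pandas as pd

names = pd.Series(["BELLA", "MAX", " Charlie ", "coco"])

# Methods can be chained: strip the whitespace, then upper-case.
cleaned = names.str.strip().str.upper()

lengths = cleaned.str.len()
has_c = cleaned.str.contains("C")
starts_b = cleaned.str.startswith("B")
first_char = cleaned.str.get(0)
# regex=False treats "A" as a literal string rather than a pattern.
replaced = cleaned.str.replace("A", "a", regex=False)
# split turns each string into a list of parts.
parts = pd.Series(["a,b", "c,d,e"]).str.split(",")

print(cleaned)
print(lengths)
print(parts)
```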

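Handling missing data

Look at the following data:

[image]

Missing data usually takes one of two forms:
One is an empty value, None, and so on, which pandas represents as NaN (the same as np.nan).
The other is a value we set to 0, as in the blue box.

How did we handle NaN data in numpy?
In pandas it is very easy to handle.

Check whether data is NaN: pd.isnull(df), pd.notnull(df)

Option 1: drop the rows or columns that contain NaN: dropna(axis=0, how='any', inplace=False)
Option 2: fill in the data: t.fillna(t.mean()), t.fillna(t.median()), t.fillna(0)

Handling data that is 0: t[t==0]=np.nan
Of course, data that is 0 does not always need handling.
When computing the mean and similar statistics, nan is left out of the calculation, but 0 is counted.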
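
The steps above can be sketched on a small DataFrame. The values are made up for illustration; whether a 0 really means "missing" depends on the data, so here it is replaced on a copy only.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(
    {"W": [0.0, 4.0, 8.0], "X": [1.0, np.nan, 9.0], "Y": [2.0, 6.0, np.nan]},
    index=list("ABC"),
)

mask = pd.isnull(t)                          # True where a value is NaN
dropped_any = t.dropna(axis=0, how="any")    # drop rows with at least one NaN
dropped_all = t.dropna(axis=0, how="all")    # drop rows where every value is NaN
filled_mean = t.fillna(t.mean())             # fill each column with its own mean
filled_zero = t.fillna(0)

# Treat 0 as missing, working on a copy so t itself is unchanged.
t2 = t.copy()
t2[t2 == 0] = np.nan

# NaN is skipped when computing the mean, while 0 is counted.
mean_with_zero = t["W"].mean()
mean_without_zero = t2["W"].mean()
print(mean_with_zero, mean_without_zero)
```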
