Practice Book 8: Python Data Science Handbook

Vectorized string operations and time series handling in pandas.

Code

import numpy as np
import pandas as pd
import time
from datetime import datetime
import matplotlib as mpl
import matplotlib.pyplot as plt
from dateutil import parser
from pandas.tseries.offsets import BDay

# plt.style.use('classic')
# Matplotlib 3.6 and later rename this style to 'seaborn-v0_8-whitegrid'.
plt.style.use('seaborn-whitegrid')
np.random.seed(0)
# Configure pandas display options
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)

# Vectorized string operations
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
# A plain list comprehension fails here: None has no capitalize() method.
# print([s.capitalize() for s in data])
names = pd.Series(data)
print(names)
print(names.str.capitalize())

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam', 'Eric Idle',
                   'Terry Jones', 'Michael Palin'])
print(monte.str.lower())
print(monte.str.len())
print(monte.str.startswith('T'))
print(monte.str.split())
print(monte.str.extract('([A-Za-z]+)'))
print(monte.str.findall(r'^[^AEIOU].*[^aeiou]$'))
print(monte.str[0:3])
print(monte.str.split().str.get(-1))
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C',
                                    'B|C|D']})
print(full_monte)

print(full_monte['info'].str.get_dummies('|'))
# Working with time series

print(datetime(year=2015, month=7, day=4))
date = parser.parse("4th of July, 2015")
print(date)
print(date.strftime('%A'))

date = np.array('2015-07-04', dtype=np.datetime64)
print(date)
print(date + np.arange(12))
print(np.datetime64('2015-07-04 12:00'))

date = pd.to_datetime('4th of July, 2015')
print(date)
print(date.strftime('%A'))
print(date + pd.to_timedelta(np.arange(12), 'D'))

index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
print(data)
print(data['2014-07-04': '2015-07-04'])
print(data['2015'])

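# • For timestamps, Pandas provides the Timestamp type. As mentioned earlier, it is essentially a replacement for Python's native datetime type,
# but it is built on the more efficient numpy.datetime64 type. The associated index structure is DatetimeIndex.
# • For time periods, Pandas provides the Period type. It encodes a fixed-frequency interval based on numpy.datetime64.
# The associated index structure is PeriodIndex.
# • For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a more efficient replacement for Python's native datetime.timedelta type,
# and is likewise based on numpy.timedelta64. The associated index structure is TimedeltaIndex.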

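# Note: pandas 2.0 and later infer a single format for all strings,
# so a mixed-format list like this one may need format='mixed'.
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
                        '2015-Jul-6', '07-07-2015', '20150708'])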
print(dates)
print(dates.to_period('D'))
print(dates - dates[0])

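# pd.date_range() builds timestamps, pd.period_range() builds periods, and pd.timedelta_range() builds time deltas.
# As we have seen, Python's range() and NumPy's np.arange() turn a start point, an end point, and an optional step into a sequence.
# pd.date_range() similarly takes a start date, an end date or a number of periods, and an optional frequency code.
# Note: the 'H' and 'T' aliases used below are deprecated since pandas 2.2 in favor of 'h' and 'min'.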
print(pd.date_range('2015-07-03', '2015-07-10'))
print(pd.date_range('2015-07-03', periods=8))
print(pd.date_range('2015-07-03', periods=8, freq='H'))

print(pd.period_range('2015-07', periods=8, freq='M'))
print(pd.timedelta_range(0, periods=10, freq='H'))
print(pd.timedelta_range(0, periods=9, freq='2H30T'))
print(pd.date_range('2015-07-01', periods=5, freq=BDay()))
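# When working with time series data, you often need to resample it at a new (higher or lower) frequency.
# You can do this with the resample() method, or with the simpler asfreq() method.
# The main difference between the two is that resample() is fundamentally a data aggregation,
# while asfreq() is fundamentally a data selection.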
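
The script above describes resample() and asfreq() but never calls them. The following is a minimal sketch of the difference, not part of the original notes: the ten-day series `ts` and the 5-day target frequency are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series (an assumption for this sketch):
# values 0.0 .. 9.0 on ten consecutive days.
idx = pd.date_range('2015-01-01', periods=10, freq='D')
ts = pd.Series(np.arange(10.0), index=idx)

# resample() groups the days into 5-day bins and aggregates each bin;
# here the aggregation is the mean of the five values in the bin.
binned_mean = ts.resample('5D').mean()

# asfreq() only selects the value sitting at each new timestamp
# and ignores everything in between.
picked = ts.asfreq('5D')

print(binned_mean)
print(picked)
```

Both results share the same two timestamps (2015-01-01 and 2015-01-06), but resample() reports the bin means 2.0 and 7.0 while asfreq() reports the point values 0.0 and 5.0. Any other aggregation, such as sum() or max(), could stand in for mean().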

Output

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object
0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object
0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64
0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
         0
0   Graham
1     John
2    Terry
3     Eric
4    Terry
5  Michael
0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object
0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D
   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
2015-07-04 00:00:00
2015-07-04 00:00:00
Saturday
2015-07-04
['2015-07-04' '2015-07-05' '2015-07-06' '2015-07-07' '2015-07-08'
 '2015-07-09' '2015-07-10' '2015-07-11' '2015-07-12' '2015-07-13'
 '2015-07-14' '2015-07-15']
2015-07-04T12:00
2015-07-04 00:00:00
Saturday
DatetimeIndex(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
               '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
               '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
              dtype='datetime64[ns]', freq=None)
2014-07-04    0
2014-08-04    1
2015-07-04    2
2015-08-04    3
dtype: int64
2014-07-04    0
2014-08-04    1
2015-07-04    2
dtype: int64
2015-07-04    2
2015-08-04    3
dtype: int64
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
               '2015-07-08'],
              dtype='datetime64[ns]', freq=None)
PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
             '2015-07-08'],
            dtype='period[D]', freq='D')
TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
               '2015-07-03 02:00:00', '2015-07-03 03:00:00',
               '2015-07-03 04:00:00', '2015-07-03 05:00:00',
               '2015-07-03 06:00:00', '2015-07-03 07:00:00'],
              dtype='datetime64[ns]', freq='H')
PeriodIndex(['2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12',
             '2016-01', '2016-02'],
            dtype='period[M]', freq='M')
TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
                '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
                '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00',
                '0 days 09:00:00'],
               dtype='timedelta64[ns]', freq='H')
TimedeltaIndex(['0 days 00:00:00', '0 days 02:30:00', '0 days 05:00:00',
                '0 days 07:30:00', '0 days 10:00:00', '0 days 12:30:00',
                '0 days 15:00:00', '0 days 17:30:00', '0 days 20:00:00'],
               dtype='timedelta64[ns]', freq='150T')
DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-06',
               '2015-07-07'],
              dtype='datetime64[ns]', freq='B')