Preface
Using Pandas
import pandas as pd
import numpy as np
import datetime
I have only just started learning Python. These are notes on problems I ran into during my own research.
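Setting dates as the index: pd.to_datetime
In meteorology we often compute annual, seasonal, or monthly means, and station data may also be hourly, so using the date as the index makes later processing easier.
Common usage
- Assembled from multiple DataFrame columns. The column names can be ['year', 'month', 'day', 'hour', 'minute', 'second', 'ms', 'us', 'ns'] or similar words (for example their plurals).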
df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'day': [4, 5]})
pd.to_datetime(df)
Out[9]:
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
- Converted from a string. The format can be given explicitly with something like format='%Y%m%d'.
pd.to_datetime('197901010600')
Out[2]: Timestamp('1979-01-01 06:00:00')
pd.to_datetime('1979-01-01 00') # the date and the time need a separator between them, e.g. a space
Out[7]: Timestamp('1979-01-01 00:00:00')
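When the strings follow a fixed compact layout, passing the format explicitly avoids guessing. A minimal sketch, assuming a made-up Series of 'YYYYMMDDHHMM' strings (the values are illustrative, not from the ISD files):

```python
import pandas as pd

# Hypothetical timestamps in a compact 'YYYYMMDDHHMM' layout (made-up values).
raw = pd.Series(['197901010000', '197901010600', '197901011200'])

# An explicit format removes any ambiguity about how the digits are split.
times = pd.to_datetime(raw, format='%Y%m%d%H%M')
print(times)
```

The same format argument works for a single string as well as for a whole column.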
Note that the Timestamp produced by to_datetime has a limited range:
In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')
In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')
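So if a date falls outside this range, no Timestamp is produced; with errors='ignore' the value only comes back as a datetime.datetime. (errors='ignore' is deprecated in recent pandas releases.)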
In [5]: pd.to_datetime('13000101',format='%Y%m%d', errors='ignore')
Out[5]: datetime.datetime(1300, 1, 1, 0, 0)
Use errors='coerce' to turn it into NaT instead:
In [6]: pd.to_datetime('13000101',format='%Y%m%d', errors='coerce')
Out[6]: NaT
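If dates outside the Timestamp range really have to be kept rather than turned into NaT, one option documented by pandas is to represent them as Period objects. A small sketch under that assumption (this is an alternative to the approach above, not something the notes themselves used):

```python
import pandas as pd

# A daily Period is not limited to the Timestamp range.
p = pd.Period('1300-01-01', freq='D')
print(p.year, p.month, p.day)

# A sequence of such periods can serve as an index for very early dates.
old_days = pd.period_range('1300-01-01', periods=3, freq='D')
print(old_days)
```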
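Practical examples
Example 1
The data are ISD station data; the original file is a comma-separated CSV. Part of the data was previewed in Excel (screenshot not included here).
The original data already have column names and the date is a single column, so the file can be read directly with pd.read_csv: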
df = pd.read_csv(file, dtype={'STATION':str})
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.set_index('DATE') # set date as index
Example 2
If the date is spread over whitespace-separated columns, which is the more common layout (screenshot not included here), it can likewise be converted with pd.to_datetime.
The columns are, in order: station ID, station type, year, month, day, standard report hour, actual report time...
colNames = ['id', 'fmtflag', 'year', 'month', 'day', 'hour', 'time', 'lat', 'lon', 'elev',
            'ww', 'pw', 'slp', 'stp', 't', 'td']  # ww: present weather, pw: past weather, slp: sea level pressure, stp: station pressure
df = pd.read_table(file, header=None, names=colNames,
                   sep=r'\s+', dtype={'id': str})  # read_csv works too
df['date'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
df = df.set_index('date')  # set date as index
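To try the whitespace-separated case without the real file, the steps can be run on a few in-memory rows. This is only a sketch: the rows and the shortened column list are made up for illustration.

```python
import io
import pandas as pd

# Made-up sample rows: id, year, month, day, hour, temperature.
text = """01001 1979 1 1 0 -3.0
01001 1979 1 1 6 -2.0
01001 1979 1 1 12 -6.0
"""
cols = ['id', 'year', 'month', 'day', 'hour', 't']

sample = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None,
                     names=cols, dtype={'id': str})
# Assemble the datetime from the year/month/day/hour columns.
sample['date'] = pd.to_datetime(sample[['year', 'month', 'day', 'hour']])
sample = sample.set_index('date')
print(sample[['id', 't']])
```

Reading the id as a string keeps the leading zero of the station number.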
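Generating a date sequence
pd.date_range
months = pd.date_range('1973-01', '2020-01', freq='M')  # month-end dates; pandas 2.2+ spells this alias 'ME'
To display them in 'yyyymm' format (useful when reading data, e.g. when the data paths look like /200701/...):
months = pd.date_range('1973-01', '2020-01', freq='M').strftime('%Y%m')  # %m is the month; %M would be the minute
strftime converts the times to strings.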
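A short sketch of building 'yyyymm' labels and using them in paths. It uses the month-start alias 'MS' instead of the month-end alias above, because 'MS' is spelled the same way across pandas versions; the '/data/' root is a hypothetical placeholder.

```python
import pandas as pd

# 'MS' = month start; one entry per month, both endpoints included.
months = pd.date_range('1973-01', '1973-04', freq='MS').strftime('%Y%m')
print(list(months))

# Hypothetical directory layout, only for illustration.
paths = ['/data/' + m + '/' for m in months]
print(paths[0])
```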
Selecting data by date index
1. Selecting directly with a date string
Select a single year:
In [27]: df['1973']
Out[27]:
STATION TMP_VALUE TMP_FLAG ... AY_FLAG AZ_VALUE AZ_FLAG
1973-01-01 00:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1973-01-01 06:00:00 01001099999 -2.0 1 ... 1 NaN NaN
1973-01-01 12:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-01 18:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-02 00:00:00 01001099999 -9.0 1 ... 1 NaN NaN
... ... ... ... ... ... ...
1973-12-31 09:00:00 01001099999 -5.0 1 ... 1 NaN NaN
1973-12-31 12:00:00 01001099999 -5.0 1 ... 1 NaN NaN
1973-12-31 15:00:00 01001099999 -5.0 1 ... 1 NaN NaN
1973-12-31 18:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1973-12-31 21:00:00 01001099999 -3.0 1 ... 1 NaN NaN
[2801 rows x 19 columns]
Or select a range of years:
In [39]: df['1973':'1974'] # 1974 is included
Out[39]:
STATION TMP_VALUE TMP_FLAG ... AY_FLAG AZ_VALUE AZ_FLAG
1973-01-01 00:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1973-01-01 06:00:00 01001099999 -2.0 1 ... 1 NaN NaN
1973-01-01 12:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-01 18:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-02 00:00:00 01001099999 -9.0 1 ... 1 NaN NaN
... ... ... ... ... ... ...
1974-12-31 09:00:00 01001099999 0.0 1 ... 1 NaN NaN
1974-12-31 12:00:00 01001099999 0.0 1 ... 1 NaN NaN
1974-12-31 15:00:00 01001099999 -4.0 1 ... 1 NaN NaN
1974-12-31 18:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1974-12-31 21:00:00 01001099999 -5.0 1 ... 1 NaN NaN
[5517 rows x 19 columns]
Selecting by year is probably the most common case.
If you only need something like annual means, use groupby (to be updated) ......
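Since the groupby part is still marked as to be updated, here is one possible sketch of annual means with groupby on the index year. The toy values are made up, and this is only one way to do it:

```python
import pandas as pd

# Made-up temperatures spanning two years.
idx = pd.to_datetime(['1973-01-01', '1973-07-01', '1974-01-01', '1974-07-01'])
toy = pd.DataFrame({'t': [-3.0, 5.0, -1.0, 7.0]}, index=idx)

# Group rows by the year of the index and average each group.
annual_mean = toy.groupby(toy.index.year)['t'].mean()
print(annual_mean)

# The same grouping also iterates year by year without string keys.
for yr, toy_yr in toy.groupby(toy.index.year):
    print(yr, len(toy_yr))
```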
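More complex year-by-year processing requires looping over all years and handling each year's data separately:
df_years = df.index.to_period('A').unique().year  # view the index as annual periods, take the unique values, then extract the year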
Out[29]:
Int64Index([1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983,
1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994,
1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
2017, 2018, 2019],
dtype='int64')
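for yr in df_years:
    df_yr = df.loc[str(yr)]  # the key has to be a string
    ...  # process df_yr here
A quick look at selecting by month, day, and hour:
# select a month
df['1973-01']  # df['197301'] raises an error
# select a day
df['1973-01-01']  # my data include hours and minutes, so this does not raise an error
df['1973-01-01':'1973-01-01']  # select with a range
# select a time
df['1973-01-01 06']  # (as far as I know, selection goes down to the hour at most)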
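Using .loc also works and is safer; recent pandas versions (2.0 and later) no longer treat a single date string in df[...] as a row selection, so there .loc is required.
df.loc['1973':'1974']  # gives the same result as df['1973':'1974'] above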
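The string-based selections can be checked on a small synthetic frame. A sketch assuming a made-up 6-hourly index (not the ISD data), written with .loc throughout:

```python
import pandas as pd

# Made-up 6-hourly series covering 1973 and 1974.
idx = pd.date_range('1973-01-01 00:00', '1974-12-31 18:00', freq='6h')
toy = pd.DataFrame({'t': range(len(idx))}, index=idx)

one_year = toy.loc['1973']                     # whole year
one_month = toy.loc['1973-01']                 # whole month
two_years = toy.loc['1973':'1974']             # both end years included
one_day = toy.loc['1973-01-01':'1973-01-01']   # a single day as a range
print(len(one_year), len(one_month), len(two_years), len(one_day))
```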
2. Filtering the index by condition
I usually filter through the index. A datetime index exposes the year, month, day, and hour:
In [30]: df.index.month
Out[30]:
Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
...
6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
dtype='int64', length=98404)
Then filter with a boolean mask:
In [31]: df[df.index.month==1]
Out[31]:
STATION TMP_VALUE TMP_FLAG ... AY_FLAG AZ_VALUE AZ_FLAG
1973-01-01 00:00:00 01001099999 -3.0 1 ... 1 NaN NaN
1973-01-01 06:00:00 01001099999 -2.0 1 ... 1 NaN NaN
1973-01-01 12:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-01 18:00:00 01001099999 -6.0 1 ... 1 NaN NaN
1973-01-02 00:00:00 01001099999 -9.0 1 ... 1 NaN NaN
... ... ... ... ... ... ...
2020-01-30 18:00:00 01001099999 1.2 1 ... 1 NaN NaN
2020-01-31 00:00:00 01001099999 1.2 1 ... 1 NaN NaN
2020-01-31 09:00:00 01001099999 0.6 1 ... 1 NaN NaN
2020-01-31 15:00:00 01001099999 -1.0 1 ... 1 NaN NaN
2020-01-31 18:00:00 01001099999 -1.4 1 ... 1 NaN NaN
[8786 rows x 19 columns]
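The boolean masks can be combined or extended to several values. A brief sketch on made-up 6-hourly data, for example to pick a season or a specific hour within a month:

```python
import pandas as pd

idx = pd.date_range('1973-01-01 00:00', '1973-12-31 18:00', freq='6h')
toy = pd.DataFrame({'t': 0.0}, index=idx)

# Winter months (December, January, February).
winter = toy[toy.index.month.isin([12, 1, 2])]

# Combine conditions with & (each condition in parentheses).
jan_noon = toy[(toy.index.month == 1) & (toy.index.hour == 12)]
print(len(winter), len(jan_noon))
```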
3. truncate
Removes the data before or after a given date.
dates = pd.date_range('2016-01-01', '2016-02-01', freq='s')
df = pd.DataFrame(index=dates, data={'A': 1})
# drop the rows before Jan 5 and after Jan 10
df.truncate(before=pd.Timestamp('2016-01-05'),
            after=pd.Timestamp('2016-01-10'))
# this can also be written simply as:
df.truncate('2016-01-05', '2016-01-10')
# truncate treats a bare date as 00:00, so the last value is 2016-01-10 00:00:00
# compare df.loc['2016-01-05':'2016-01-10', :], whose last value is 2016-01-10 23:59:59
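The difference between the two end points can be verified quickly. A sketch using an hourly index rather than a per-second one, only to keep the frame small:

```python
import pandas as pd

idx = pd.date_range('2016-01-01', '2016-02-01', freq='h')
toy = pd.DataFrame({'A': 1}, index=idx)

cut = toy.truncate('2016-01-05', '2016-01-10')
sliced = toy.loc['2016-01-05':'2016-01-10']
print(cut.index[-1])     # stops at midnight on Jan 10
print(sliced.index[-1])  # runs through the last hour of Jan 10
```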
Changing how dates are displayed
1. to_period (to be updated)
df.index.to_period('A').unique().year
# process day by day
dates = df_ww.index.to_period('d').strftime('%Y-%m-%d').unique()
2. asfreq (to be updated)
Honestly, I have not figured out how to use this function yet QAQ
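As far as I understand it, asfreq re-indexes a series onto a regular time grid without aggregating anything: slots with no observation become NaN unless a fill method is given. A small sketch with made-up values:

```python
import pandas as pd

# Made-up observations with the 12:00 report missing.
idx = pd.to_datetime(['1973-01-01 00:00', '1973-01-01 06:00', '1973-01-01 18:00'])
s = pd.Series([-3.0, -2.0, -6.0], index=idx)

regular = s.asfreq('6h')                   # missing slots become NaN
filled = s.asfreq('6h', method='ffill')    # or carry the last value forward
print(regular)
print(filled)
```

Unlike resample, asfreq only picks values that sit exactly on the new grid, so it suits data that are already nearly regular.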
3. astype('datetime64[M]')
Converts to np.datetime64 format; the result is an array.
df.index.values
Out[43]:
array(['1978-01-01T00:00:00.000000000', '1978-01-01T06:00:00.000000000',
'1978-01-01T12:00:00.000000000', ...,
'2007-02-28T17:00:00.000000000', '2007-02-28T19:00:00.000000000',
'2007-02-28T20:00:00.000000000'], dtype='datetime64[ns]')
df.index.values.astype('datetime64[M]')
Out[44]:
array(['1978-01', '1978-01', '1978-01', ..., '2007-02', '2007-02',
'2007-02'], dtype='datetime64[M]')
Batch processing by year/month (count, sum, mean): resample
df.resample('M')
returns a <pandas.core.resample.DatetimeIndexResampler object>; a concrete method such as .count(), .sum(), .mean(), ... has to be appended before any result is shown.
# monthly number of ww observations
In [36]: num = df['WW_VALUE'].resample('M').count()
Out[36]:
1973-01-31 217
1973-02-28 210
1973-03-31 242
1973-04-30 237
1973-05-31 243
2020-02-29 159
2020-03-31 180
2020-04-30 179
2020-05-31 186
2020-06-30 89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64
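The returned index is automatically set to the last day of each month. To keep only the 'year-month' form, use to_period('m'):
In [37]: num = df['WW_VALUE'].resample('M').count().to_period('m') # the case of "m" makes no difference here, but "Y" and "y" behave differently.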
Out[37]:
1973-01 217
1973-02 210
1973-03 242
1973-04 237
1973-05 243
2020-02 159
2020-03 180
2020-04 179
2020-05 186
2020-06 89
Freq: M, Name: WW_VALUE, Length: 570, dtype: int64
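A self-contained sketch of the monthly count on made-up 6-hourly data. It uses the month-start alias 'MS' rather than the month-end alias, since 'MS' is spelled the same in older and newer pandas; the labels then fall on the first day of each month before being converted to periods.

```python
import pandas as pd

idx = pd.date_range('1973-01-01 00:00', '1973-03-31 18:00', freq='6h')
s = pd.Series(1.0, index=idx)

# 'MS' labels each bin by the first day of the month.
counts = s.resample('MS').count()
counts_ym = counts.to_period('M')   # keep only year-month in the index
print(counts_ym)
```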
Selecting data within a time interval
pandas.DataFrame.between_time
A simple, blunt way to select the data within a time-of-day window:
In [38]: df.between_time('5:00', '7:00')
Out[38]:
STATION TMP_VALUE TMP_FLAG ... AY_FLAG AZ_VALUE AZ_FLAG
1973-01-01 06:00:00 01001099999 -2.0 1 ... 1 NaN NaN
1973-01-02 06:00:00 01001099999 -9.0 1 ... 1 NaN NaN
1973-01-03 06:00:00 01001099999 -5.0 1 ... 1 NaN NaN
1973-01-05 06:00:00 01001099999 -10.0 1 ... 1 NaN NaN
1973-01-06 06:00:00 01001099999 -11.0 1 ... 1 NaN NaN
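Two properties of between_time are worth checking on toy data: both end points are included by default, and a start time later than the end time wraps around midnight. A sketch on a made-up hourly index:

```python
import pandas as pd

idx = pd.date_range('1973-01-01 00:00', periods=48, freq='h')
toy = pd.DataFrame({'t': range(48)}, index=idx)

morning = toy.between_time('5:00', '7:00')      # 05, 06, 07 on every day
midnight = toy.between_time('23:00', '1:00')    # wraps around midnight
print(sorted(set(morning.index.hour)), sorted(set(midnight.index.hour)))
```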
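Adding and subtracting dates/times: Timedelta
A time window can also be selected by adding or subtracting time. A nice thing about pandas is that it works out the resulting date automatically.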
In [40]: pd.Timedelta('6h')
Out[40]: Timedelta('0 days 06:00:00')
In [41]: pd.Timedelta(6,unit='h')
Out[41]: Timedelta('0 days 06:00:00')
In [42]: pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
Out[42]: Timedelta('2 days 02:15:30')
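Units accepted by Timedelta:
- 'Y', 'M', 'W', 'D', 'H', 'T', 'S', 'L', 'U', or 'N' (I added 'H' myself; these stand for year, month, week, day, hour, minute, second, millisecond, microsecond, nanosecond). Note that 'Y' and 'M' do not have a fixed length and are rejected by recent pandas versions.
- 'days' or 'day'
- 'hours', 'hour', 'hr', or 'h'
- 'minutes', 'minute', 'min', or 'm'
- 'seconds', 'second', or 'sec'
- ... [see the official documentation for the rest]
How to select the data within one hour either side of 00:00: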
repo00 = pd.to_datetime(date+' 00')
df_date00 = df[
(df.index >= repo00 - pd.Timedelta('1h'))
& (df.index <= repo00 + pd.Timedelta('1h'))
]
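The window selection can be tried on a few made-up observation times. In this sketch `date` is a hypothetical target day, and the end points of the window are included:

```python
import pandas as pd

idx = pd.to_datetime(['1973-01-01 22:45', '1973-01-01 23:15',
                      '1973-01-02 00:30', '1973-01-02 01:30'])
toy = pd.DataFrame({'t': [1.0, 2.0, 3.0, 4.0]}, index=idx)

date = '1973-01-02'                      # hypothetical target day
repo00 = pd.to_datetime(date + ' 00:00')
window = pd.Timedelta('1h')
# Keep the rows no more than one hour away from 00:00.
near00 = toy[(toy.index >= repo00 - window) & (toy.index <= repo00 + window)]
print(near00)
```

Unlike between_time, this window follows the calendar date, so 23:15 on the previous day is matched only for this particular midnight.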
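Adding and subtracting dates
In this project, observation times are often not on the hour (e.g. 5:45, 6:15), and the number of observations per day varies. To restrict the data to four times a day, I need to select the data within 1 hour of 00, 06, 12, and 18 and keep the observation closest to each of those hours. This involves time differences, so it cannot be done with resample.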
# the closest observation within 1h of 06Z, 12Z, 18Z
df_date = df_ww.loc[date]
obs06 = df_date.between_time('5:00', '7:00').index  # index of the observations around 06
delta06 = abs((obs06 - pd.to_datetime(date + ' 06')).values)
if delta06.size > 0:
    df_resampled = pd.concat([df_resampled,
                              df_date.loc[[obs06[np.argmin(delta06)]]]
                              ])
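A different way to get the closest observation within 1 hour, shown here only as an alternative sketch and not as the method used above, is reindex with method='nearest' and a tolerance. It assumes the index is sorted and has no duplicate times; the observations are made up.

```python
import pandas as pd

# Made-up observations at irregular times on one day.
idx = pd.to_datetime(['1973-01-01 05:45', '1973-01-01 06:20',
                      '1973-01-01 11:50', '1973-01-01 19:30'])
obs = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# The four target hours: 00, 06, 12, 18.
targets = pd.date_range('1973-01-01 00:00', periods=4, freq='6h')

# For each target take the nearest observation, but only within 1 hour.
four_daily = obs.reindex(targets, method='nearest',
                         tolerance=pd.Timedelta('1h'))
print(four_daily)
```

Target hours with no observation inside the tolerance come back as NaN, and the result is labelled by the target hour rather than by the original observation time.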
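Note that subtracting two datetimes gives a result stored in ns. To change the unit, use .dt.total_seconds(), or take the raw values with .values and convert them (.dt.seconds returns only the seconds component of each difference, with whole days dropped). See http://blog.gqylpy.com/gqy/22545/#pandas_74
df['diff_time'] = (df['tm_1'] - df['tm_2']).dt.total_seconds()/60 # to minutes
df['diff_time'] = (df['tm_1'] - df['tm_2']).values/np.timedelta64(1, 'h') # to hours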
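The difference between the seconds component and the total duration shows up as soon as a gap exceeds one day. A minimal sketch with made-up times (the names tm_1 and tm_2 follow the snippet above):

```python
import pandas as pd

tm_1 = pd.Series(pd.to_datetime(['1973-01-02 08:00']))
tm_2 = pd.Series(pd.to_datetime(['1973-01-01 06:00']))
diff = tm_1 - tm_2                              # 1 day 2 hours

seconds_part = diff.dt.seconds.iloc[0]          # seconds component only
total_sec = diff.dt.total_seconds().iloc[0]     # whole duration in seconds
hours = (diff / pd.Timedelta('1h')).iloc[0]     # whole duration in hours
print(seconds_part, total_sec, hours)
```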
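Comparison with the numpy datetime64 format
1.
This came up when collocating two datasets, which requires merging the meteorological variables of both datasets at matching times.
One of them, a tabular dataset, is read as a pd.DataFrame, with the dates converted by pd.to_datetime and set as the index:
In [31]: df.index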
Out[31]:
DatetimeIndex(['1979-01-01 00:00:00', '1979-01-01 06:00:00',
'1979-01-01 12:00:00', '1979-01-01 18:00:00',
'1979-01-02 00:00:00', '1979-01-02 06:00:00',
'1979-01-02 12:00:00', '1979-01-02 18:00:00',
'1979-01-03 00:00:00', '1979-01-03 06:00:00',
...
'1979-01-29 12:00:00', '1979-01-29 18:00:00',
'1979-01-30 00:00:00', '1979-01-30 06:00:00',
'1979-01-30 12:00:00', '1979-01-30 18:00:00',
'1979-01-31 00:00:00', '1979-01-31 06:00:00',
'1979-01-31 12:00:00', '1979-01-31 18:00:00'],
dtype='datetime64[ns]', name='date', length=124, freq=None)
The other is nc (netCDF) data whose variables are read with xarray; its time coordinate is in np.datetime64 format by default
In [33]: tk.time
Out[33]:
<xarray.DataArray 'time' (time: 744)>
array(['1979-01-01T00:00:00.000000000', '1979-01-01T01:00:00.000000000',
'1979-01-01T02:00:00.000000000', ..., '1979-01-31T21:00:00.000000000',
'1979-01-31T22:00:00.000000000', '1979-01-31T23:00:00.000000000'],
dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 1979-01-01 ... 1979-01-31T23:00:00
Attributes:
long_name: time
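[I have actually forgotten what the bug I agonized over for so long was]
Writing np.argwhere(df.index[0] == tk.time.values) returned an empty result,
so extract the raw values with .values on both sides and compare those: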
In [77]: np.argwhere(df.index.values[0] == tk.time.values)
Out[77]: array([[0]])
# extract the number with int():
In [78]: int(np.argwhere(df.index.values[0] == tk.time.values))
Out[78]: 0
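To match all times at once instead of one element at a time, one option is Index.get_indexer. The sketch below uses plain NumPy and pandas stand-ins for the two time axes (made up here; no xarray object is involved):

```python
import numpy as np
import pandas as pd

# Stand-ins for the two time axes (made-up): 6-hourly table, hourly grid.
table_time = pd.date_range('1979-01-01 00:00', periods=4, freq='6h')
grid_time = np.arange('1979-01-01T00', '1979-01-02T00',
                      dtype='datetime64[h]').astype('datetime64[ns]')

# Position of every table time inside the grid time axis (-1 = no match).
pos = pd.Index(grid_time).get_indexer(table_time)
print(pos)
```

The resulting integer positions can then be used to index the gridded variable along its time dimension.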
2. Converting between Timestamp, datetime.datetime, and np.datetime64
I do not need this for now. For reference, see https://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64
To turn the result of pd.to_datetime into np.datetime64 format, just wrap it in np.array:
date_n = np.array(pd.to_datetime(df[['year', 'month', 'day', 'hour']]) )
Out[83]:
array(['1979-01-01T00:00:00.000000000', '1979-01-01T06:00:00.000000000',
'1979-01-01T12:00:00.000000000', '1979-01-01T18:00:00.000000000',
...
'1979-01-31T00:00:00.000000000', '1979-01-31T06:00:00.000000000',
'1979-01-31T12:00:00.000000000', '1979-01-31T18:00:00.000000000'],
dtype='datetime64[ns]')
(1) python datetime => datetime64 / Timestamp
dt = datetime.datetime(year=2017, month=10, day=24, hour=4,
minute=3, second=10, microsecond=7199)
>>> np.datetime64(dt)
numpy.datetime64('2017-10-24T04:03:10.007199')
>>> pd.Timestamp(dt) # or pd.to_datetime(dt)
Timestamp('2017-10-24 04:03:10.007199')
(2) numpy datetime64 => Timestamp
In [86]: pd.Timestamp(np.datetime64('2012-05-01T01:00:00.000000'))
Out[86]: Timestamp('2012-05-01 01:00:00')
In [87]: pd.to_datetime('2012-05-01T01:00:00.000000+0100')
Out[87]: Timestamp('2012-05-01 01:00:00+0100', tz='pytz.FixedOffset(60)')
In [88]: pd.to_datetime('2012-05-01T01:00:00.000000+0100').replace(tzinfo=None)
Out[88]: Timestamp('2012-05-01 01:00:00')
Going from np.datetime64 to datetime seems a bit complicated. I will look it up when I need it.
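For the np.datetime64 to datetime direction, two routes that I believe work are sketched below: going through pd.Timestamp, or staying in NumPy at microsecond precision, where .item() returns a datetime.datetime (at nanosecond precision it returns an integer instead).

```python
import datetime
import numpy as np
import pandas as pd

dt64 = np.datetime64('2012-05-01T01:00:00.000000')

# Route 1: go through pandas.
via_pandas = pd.Timestamp(dt64).to_pydatetime()

# Route 2: pure NumPy; at microsecond precision .item() yields a datetime.
via_numpy = dt64.astype('datetime64[us]').item()
print(via_pandas, via_numpy)
```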
(3) Timestamp => datetime / datetime64
>>> ts = pd.Timestamp('2017-10-24 04:24:33.654321')
>>> ts.to_pydatetime() # Python's datetime
datetime.datetime(2017, 10, 24, 4, 24, 33, 654321)
>>> ts.to_datetime64()
numpy.datetime64('2017-10-24T04:24:33.654321000')
Mimicking MATLAB's tic/toc timing
https://blog.csdn.net/u010199776/article/details/69941965
import datetime
tic = datetime.datetime.now()
...
toc = datetime.datetime.now()
print('%s, Elapsed time: %f seconds' % (toc, (toc-tic).total_seconds() ))
2020-07-19 16:23:13.321484, Elapsed time: 5.371069 seconds
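For measuring elapsed time only, time.perf_counter from the standard library is an alternative to subtracting two datetime.now() values. A small sketch, with a placeholder workload standing in for the real processing:

```python
import time

tic = time.perf_counter()
total = sum(range(100000))      # placeholder workload
toc = time.perf_counter()
elapsed = toc - tic
print('Elapsed time: %f seconds' % elapsed)
```

perf_counter has no meaning as a wall-clock date, so keep datetime.now() if the finishing time itself should be printed.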