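Learning to Work with Date and Time Data
Course outline
Chapter 1. Dates and times in R
Chapter 2. Parsing and manipulating date-time data
Chapter 3. Calculating with date-time data
Chapter 4. Practice problems
Chapter 1. Dates and times in R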
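Specifying dates
Dates behave differently from other kinds of data. However, R will not decide on its own that an input such as "2021-07-26" is a date: as far as R can tell, it is just a character string (or possibly a factor). You therefore have to tell R explicitly that the value is a date, which is what as.Date() is for.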
# The date R 3.0.0 was released
x <- "2013-04-03"
# Examine structure of x
str(x)
chr "2013-04-03"
# Use as.Date() to interpret x as a date
x_date <- as.Date(x)
# Examine structure of x_date
str(x_date)
Date[1:1], format: "2013-04-03"
# Store April 10 2014 as a Date
april_10_2014 <- as.Date("2014-04-10")
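as.Date() reads the ISO 8601 layout YYYY-MM-DD without further help, but other layouts need a format argument. The sketch below uses an assumed US-style input string (not from the course data) to show the same day written two ways.

```r
# ISO 8601 input needs no format string
iso_date <- as.Date("2014-04-10")

# Assumed example input: month/day/year needs an explicit format
us_date <- as.Date("04/10/2014", format = "%m/%d/%Y")

# Both are Date objects that represent the same day
class(us_date)
iso_date == us_date

# Internally, a Date is stored as the number of days since 1970-01-01
as.numeric(iso_date)
```

The format codes (%m, %d, %Y and so on) are the ones documented in ?strptime.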
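Parsing dates automatically
Two packages make this very convenient. The first is readr, which recognizes date columns automatically on import.
Read the file with read_csv(), then inspect the structure of the column with str().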
# Load readr, then use read_csv() to import rversions.csv
library(readr)
releases <- read_csv("rversions.csv")
# Examine the structure of the date column
str(releases$date)
Date[1:105], format: "1997-12-04" "1997-12-21" "1998-01-10" "1998-03-14" "1998-05-02" ...
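For comparison, base R's read.csv() does not guess date columns, so the conversion has to be requested. A minimal sketch, assuming a hypothetical two-row CSV supplied inline (the column name `version` and the values `v1`, `v2` are placeholders; the two dates are the first two shown in the output above):

```r
# Hypothetical inline CSV used only for illustration
csv_text <- "version,date\nv1,1997-12-04\nv2,1997-12-21"

# Without guidance, base R leaves the column as text
# (character in R >= 4.0.0, factor in earlier versions)
raw <- read.csv(text = csv_text)
class(raw$date)

# colClasses asks read.csv() to convert the named column to Date
typed <- read.csv(
  text = csv_text,
  colClasses = c(version = "character", date = "Date")
)
class(typed$date)
```

This is the step that read_csv() performs automatically when it recognizes a column as dates.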
The second is the anytime package, which automatically parses dates written in a variety of formats.
# Load the anytime package
library(anytime)
Warning message: running command 'timedatectl' had status 1
# Various ways of writing Sep 10 2009
sep_10_2009 <- c("September 10 2009", "2009-09-10", "10 Sep 2009", "09-10-2009")
# Use anytime() to parse sep_10_2009
anytime(sep_10_2009)
[1] "2009-09-10 UTC" "2009-09-10 UTC" "2009-09-10 UTC" "2009-09-10 UTC"
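anytime() guesses the format of each string. When guessing feels too risky, much the same result can be reached in base R by listing the accepted formats explicitly. The sketch below is only an illustration: parse_mixed() is a hypothetical helper, not part of any package, and the month-name codes (%B, %b) assume an English locale. The order of the formats matters, because strptime() can accept a string such as "09-10-2009" under "%Y-%m-%d" and return a wrong date, so that format is tried last.

```r
# Hypothetical helper: try each format in turn on the elements still unparsed
parse_mixed <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

sep_10_2009 <- c("September 10 2009", "2009-09-10", "10 Sep 2009", "09-10-2009")

# Most specific formats first, the permissive ISO format last
formats <- c("%B %d %Y", "%d %b %Y", "%m-%d-%Y", "%Y-%m-%d")

parsed <- parse_mixed(sep_10_2009, formats)
parsed
```

Unlike anytime(), this returns Date objects rather than datetimes, and anything that matches none of the listed formats stays NA instead of being guessed.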
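Plotting dates
Color the releases by major, then restrict the date range on the x axis and adjust the date breaks and labels to visualize the date data.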
library(ggplot2)
# Set the x axis to the date column
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major)))
# Limit the axis to between 2010-01-01 and 2014-01-01
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  xlim(as.Date("2010-01-01"), as.Date("2014-01-01"))
# Specify breaks every ten years and labels with "%Y"
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  scale_x_date(date_breaks = "10 years", date_labels = "%Y")
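xlim() and scale_x_date() both define the x scale, so adding both to one plot makes ggplot2 replace the first with the second. A minimal sketch, assuming a small made-up data frame `toy` rather than the releases data, that sets the limits and the breaks in a single scale_x_date() call:

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  date = as.Date(c("2009-06-01", "2011-03-01", "2012-09-01", "2015-01-01")),
  value = c(1, 3, 2, 4)
)

# Limits, break spacing, and label format all live in one date scale
p <- ggplot(toy, aes(x = date, y = value)) +
  geom_point() +
  scale_x_date(
    limits = as.Date(c("2010-01-01", "2014-01-01")),
    date_breaks = "1 year",
    date_labels = "%Y"
  )
```

Printing p draws the plot. Points outside the limits are dropped with a warning, which is also what xlim() does.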
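Simple date arithmetic
Take the maximum of the date column in the releases dataset, which is the date of the most recent release. Then compute how long ago that release was.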
# filter() comes from dplyr
library(dplyr)
# Find the largest date
last_release_date <- max(releases$date)
# Filter row for last release
last_release <- filter(releases, date == last_release_date)
# Print last_release
last_release
# How long since last release?
Sys.Date() - last_release$date
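Subtracting two Date objects gives a difftime. Since Sys.Date() changes every day, the sketch below uses the two fixed dates from earlier in this chapter so the result is reproducible, and shows how to choose the units or extract a plain number.

```r
release_a <- as.Date("2013-04-03")
release_b <- as.Date("2014-04-10")

# Subtraction returns a difftime object
gap <- release_b - release_a
gap  # Time difference of 372 days

# difftime() lets you pick the units explicitly
gap_weeks <- difftime(release_b, release_a, units = "weeks")
gap_weeks

# Convert to a plain number for further calculations
as.numeric(gap, units = "days")  # 372
```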
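Datetimes
Dates are handled with as.Date(); datetimes need as.POSIXct().
The expected datetime format is YYYY-MM-DD HH:MM:SS. The tz argument sets the time zone; if it is omitted, R uses the current time zone of the session.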
# Use as.POSIXct to enter the datetime
as.POSIXct("2010-10-01 12:12:00")
# Use as.POSIXct again but set the timezone to `"America/Los_Angeles"`
as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")
# Use read_csv to import rversions.csv
releases <- read_csv("rversions.csv")
# Examine structure of datetime column
str(releases$datetime)
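The tz argument changes which instant a clock time refers to. A short sketch, assuming the system's time zone database knows "America/Los_Angeles", that compares the same clock time in two zones:

```r
# The same clock time, interpreted in two different time zones
t_utc <- as.POSIXct("2010-10-01 12:12:00", tz = "UTC")
t_la <- as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")

# They are different instants: Los Angeles is 7 hours behind UTC on this date
offset_hours <- as.numeric(difftime(t_la, t_utc, units = "hours"))
offset_hours  # 7

# Display the UTC instant as Los Angeles clock time
format(t_utc, tz = "America/Los_Angeles", usetz = TRUE)
```

A POSIXct value stores seconds since 1970-01-01 UTC, so the time zone affects how a string is read and how the value is printed, not how it is stored.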
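As another exercise, define a datetime yourself, then select the rows of the dataset where r_version is 3.2.0 and datetime is later than that point. Finally, plot the distribution of the data.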
# Import "cran-logs_2015-04-17.csv" with read_csv()
logs <- read_csv("cran-logs_2015-04-17.csv")
# Print logs
logs
# Store the release time as a POSIXct object
release_time <- as.POSIXct("2015-04-16 07:13:33", tz = "UTC")
# When is the first download of 3.2.0?
# (inside filter(), refer to columns by name rather than with logs$)
logs %>%
  filter(datetime > release_time,
         r_version == "3.2.0")
# Examine histograms of downloads by version
ggplot(logs, aes(x = datetime)) +
  geom_histogram() +
  geom_vline(aes(xintercept = as.numeric(release_time))) +
  facet_wrap(~ r_version, ncol = 1)
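The filter above lists the matching rows, and the earliest one answers the question about the first download. A base R sketch of that last step, using a made-up `toy_logs` data frame (its rows are invented for illustration and are not taken from the real log file):

```r
release_time <- as.POSIXct("2015-04-16 07:13:33", tz = "UTC")

# Made-up log rows for illustration only
toy_logs <- data.frame(
  datetime = as.POSIXct(
    c("2015-04-16 06:00:00", "2015-04-16 08:30:00",
      "2015-04-16 09:45:00", "2015-04-16 10:00:00"),
    tz = "UTC"
  ),
  r_version = c("3.1.3", "3.2.0", "3.2.0", "3.1.3")
)

# Keep rows for 3.2.0 that are later than the release time
after_release <- subset(toy_logs, datetime > release_time & r_version == "3.2.0")

# The earliest remaining timestamp is the first download
first_download <- min(after_release$datetime)
first_download
```

Comparisons between POSIXct values work on the underlying instants, so they stay correct even when the two sides are stored with different time zones.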