Data Carpentry Workshop - Day 2 -Data Manipulation using dplyr and tidyr

IMG_7436.JPG

Main Contents

  • Describe the purpose of the dplyr and tidyr packages.
  • Select certain columns in a data frame with the dplyr function select.
  • Select certain rows in a data frame according to filtering conditions with the dplyr function filter .
  • Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.
  • Add new columns to a data frame that are functions of existing columns with mutate.
  • Use the split-apply-combine concept for data analysis.
  • Use summarize, group_by, and count to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
  • Describe the concept of a wide and a long table format and for which purpose those formats are useful.
  • Describe what key-value pairs are.
  • Reshape a data frame from long to wide format and back with the spread and gather commands from the tidyr package.
  • Export a data frame to a .csv file.

1. Install and load tidyverse

install.packages("tidyverse")
library("tidyverse")
  • tidyverse including tidyr,dplyr,ggplot2,tibble et.al.

2. What are dplyr and tidyr packages?

  • The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++).
  • The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions.
  • We’ll read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().
Characteristic of tibble:
  • In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
  • Columns of class character are never converted into factors.

3. Functions of dplyr.

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statisitcs on grouped data
  • arrange(): sort results
  • count(): count discrete value
3.1 Selecting columns and filtering rows
select(surveys,plot_id,species_id,weight) #select columns
filter(surveys,year == 1995)  # filter rows
3.2 Pipes
Three ways to do select and filter at the same time:
(1) using intermediate steps
surveys2 <- filter(surveys,weight <5)
surveys_sml1 <- select(surveys2,species_id,sex,weight)
view(surveys_sml)
(2) nest functions
surveys_sml2 <- select(filter(surveys,weight <5),species_id,sex,weight)
surveys_sml2
(3) pipes
surveys_sml3 <- surveys %>% 
filter(weight <5) %>% 
select(species_id,sex,weight)
surveys_sml3
Challenge
surveys_select <- surveys %>% 
filter(year < 1995) %>% 
select(year,sex,weight)
surveys_select
3.3 Split-apply-combine data analysis and the summarize() function
split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.
(1) group_by() & summarize()
# group by single columns
surveys %>% 
group_by(sex) %>% 
summarize(mean_weight = mean(weight,na.rm = TRUE))
# group by multiple columns
surveys_test <- surveys %>% 
filter(!is.na(weight)) %>% 
group_by(sex,species_id) %>% 
summarize(mean_weight = mean(weight)) %>% 
print(n =15)
View(surveys_test)
# summarize multiple variables at the same time
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
          min_weight = min(weight)) %>% 
print(n =10)
(2) arrange
# rearrange by a column content
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
          min_weight = min(weight)) %>%
arrange(min_weight) %>% 
print(n=10)
#sort in descending order
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
          min_weight = min(weight)) %>%
arrange(desc(mean_weight)) %>% 
print(n=10)
(3) counting
 # count a single factor
surveys %>%
count(sex)
surveys %>%
count(sex,sort =TRUE)
# count combination of factors
surveys %>%
count(sex,species)
# count and arrange
surveys %>%
count(sex,species) %>% 
arrange(species,desc(n))
(4) challenge
#counting
surveys %>%
count(plot_id)
# group_by & summarize
surveys %>%
filter(!is.na(hindfoot_length)) %>% 
group_by(species_id) %>% 
summarize(mean_hindfoot_length = mean(hindfoot_length),
        min_hindfoot_length = min(hindfoot_length),
        max_hindfoot_length = max(hindfoot_length),
        n =n())
# arrange
surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>% 
filter(weight == max(weight)) %>% 
select(year,genus,species_id,weight) %>%
arrange(year)

4. tidyr functions: reshaping with spread and gather

4.1 spreading
three principal arguments:
  • (1) the data
  • (2) the key column variable whose value will become new column names;
  • (3) the value column whose values will fill the new column variables.
  • others :fill.
spread code
# step1. prepare for data
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>% 
group_by(genus,plot_id) %>% 
summarize(mean_weight = mean(weight))
str(surveys_gw)
# step2. spread
surveys_spread <- surveys_gw %>% 
spread(key = genus,value = mean_weight)
str(surveys_spread)
# step3. fill
surveys_spread %>%
spread(genus, mean_weight, fill = 0) %>%
head()
spread
4.2 Gathering
four principal arguments:
  • (1) the data
  • (2) the key column variable we wish to create from column names.
  • (3) values column variable we wish to create and fill with values associated with the key.
  • (4) the names of the columns we use to fill the key variable (or to drop).
gather code
surveys_gather <- surveys_spread %>%
gather(key = genus, value = mean_weight, -plot_id)
str(surveys_gather)
gather
challenge

5. Export Data

(1)filter
surveys_complete <- surveys %>%
filter(!is.na(weight),           # remove missing weight
     !is.na(hindfoot_length),  # remove missing hindfoot_length
     !is.na(sex))                # remove missing sex
(2) extract
# Extract the most common species_id
species_counts <- surveys_complete %>%
count(species_id) %>% 
filter(n >= 50)
(3) Only keep the most common species
surveys_complete <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
dim(surveys_complete)
(4) save
write_csv(surveys_complete, path = "C:/Users/home/Desktop/Rcourse/DataCarpentry33/Outputs/surveys_complete-data.csv")
read_csv("C:/Users/home/Desktop/Rcourse/DataCarpentry33/Outputs/surveys_complete-data.csv")

下期预告

Data visualization with ggplot2

©著作权归作者所有,转载或内容合作请联系作者
【社区内容提示】社区部分内容疑似由AI辅助生成,浏览时请结合常识与多方信息审慎甄别。
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

相关阅读更多精彩内容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi阅读 7,878评论 0 10
  • 心灵自由 30天写作 认真的从网上搜索了有关裸贷的材料,翻看了两篇,没有定力再往下看了。一个个年轻的女孩子,裸着自...
    我就是云锦阅读 489评论 0 0
  • 燕地风杀人,今日始方知。 匆匆出门去,性命后悔迟。 北极有寒涡,凌侵如铁甲。 昨夜入关来,风谗更威加。 寒随大风来...
    柏舟BAEJOU阅读 240评论 0 0

友情链接更多精彩内容