Data Carpentry Workshop - Day 2 -Data Manipulation using dplyr and tidyr

IMG_7436.JPG

Main Contents

Describe the purpose of the dplyr and tidyr packages.
Select certain columns in a data frame with the dplyr function select.
Select certain rows in a data frame according to filtering conditions with the dplyr function filter .
Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.
Add new columns to a data frame that are functions of existing columns with mutate.
Use the split-apply-combine concept for data analysis.
Use summarize, group_by, and count to split a data frame into groups of observations, apply a summary statistics for each group, and then combine the results.
Describe the concept of a wide and a long table format and for which purpose those formats are useful.
Describe what key-value pairs are.
Reshape a data frame from long to wide format and back with the spread and gather commands from the tidyr package.
Export a data frame to a .csv file.

1. Install and load tidyverse

install.packages("tidyverse")
library("tidyverse")

tidyverse including tidyr,dplyr,ggplot2,tibble et.al.

2. What are dplyr and tidyr packages?

The package dplyr provides easy tools for the most common data manipulation tasks. It is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++).
The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions.
We’ll read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().

Characteristic of tibble:

In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
Columns of class character are never converted into factors.

3. Functions of dplyr.

select(): subset columns
filter(): subset rows on conditions
mutate(): create new columns by using information from other columns
group_by() and summarize(): create summary statisitcs on grouped data
arrange(): sort results
count(): count discrete value

3.1 Selecting columns and filtering rows

select(surveys,plot_id,species_id,weight) #select columns
filter(surveys,year == 1995)  # filter rows

3.2 Pipes

Three ways to do select and filter at the same time:

(1) using intermediate steps

surveys2 <- filter(surveys,weight <5)
surveys_sml1 <- select(surveys2,species_id,sex,weight)
view(surveys_sml)

(2) nest functions

surveys_sml2 <- select(filter(surveys,weight <5),species_id,sex,weight)
surveys_sml2

(3) pipes

surveys_sml3 <- surveys %>% 
filter(weight <5) %>% 
select(species_id,sex,weight)
surveys_sml3

Challenge

surveys_select <- surveys %>% 
filter(year < 1995) %>% 
select(year,sex,weight)
surveys_select

3.3 Split-apply-combine data analysis and the summarize() function

split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

(1) group_by() & summarize()

# group by single columns
surveys %>% 
group_by(sex) %>% 
summarize(mean_weight = mean(weight,na.rm = TRUE))
# group by multiple columns
surveys_test <- surveys %>% 
filter(!is.na(weight)) %>% 
group_by(sex,species_id) %>% 
summarize(mean_weight = mean(weight)) %>% 
print(n =15)
View(surveys_test)
# summarize multiple variables at the same time
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
          min_weight = min(weight)) %>% 
print(n =10)

(2) arrange

# rearrange by a column content
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
          min_weight = min(weight)) %>%
arrange(min_weight) %>% 
print(n=10)
#sort in descending order
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
          min_weight = min(weight)) %>%
arrange(desc(mean_weight)) %>% 
print(n=10)

(3) counting

 # count a single factor
surveys %>%
count(sex)
surveys %>%
count(sex,sort =TRUE)
# count combination of factors
surveys %>%
count(sex,species)
# count and arrange
surveys %>%
count(sex,species) %>% 
arrange(species,desc(n))

(4) challenge

#counting
surveys %>%
count(plot_id)
# group_by & summarize
surveys %>%
filter(!is.na(hindfoot_length)) %>% 
group_by(species_id) %>% 
summarize(mean_hindfoot_length = mean(hindfoot_length),
        min_hindfoot_length = min(hindfoot_length),
        max_hindfoot_length = max(hindfoot_length),
        n =n())
# arrange
surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>% 
filter(weight == max(weight)) %>% 
select(year,genus,species_id,weight) %>%
arrange(year)

4. tidyr functions: reshaping with spread and gather

4.1 spreading

three principal arguments:

(1) the data
(2) the key column variable whose value will become new column names;
(3) the value column whose values will fill the new column variables.
others :fill.

spread code

# step1. prepare for data
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>% 
group_by(genus,plot_id) %>% 
summarize(mean_weight = mean(weight))
str(surveys_gw)
# step2. spread
surveys_spread <- surveys_gw %>% 
spread(key = genus,value = mean_weight)
str(surveys_spread)
# step3. fill
surveys_spread %>%
spread(genus, mean_weight, fill = 0) %>%
head()

spread

4.2 Gathering

four principal arguments:

(1) the data
(2) the key column variable we wish to create from column names.
(3) values column variable we wish to create and fill with values associated with the key.
(4) the names of the columns we use to fill the key variable (or to drop).

gather code

surveys_gather <- surveys_spread %>%
gather(key = genus, value = mean_weight, -plot_id)
str(surveys_gather)

gather

challenge

5. Export Data

（1）filter

surveys_complete <- surveys %>%
filter(!is.na(weight),           # remove missing weight
     !is.na(hindfoot_length),  # remove missing hindfoot_length
     !is.na(sex))                # remove missing sex

(2) extract

# Extract the most common species_id
species_counts <- surveys_complete %>%
count(species_id) %>% 
filter(n >= 50)

(3) Only keep the most common species

surveys_complete <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
dim(surveys_complete)

(4) save

write_csv(surveys_complete, path = "C:/Users/home/Desktop/Rcourse/DataCarpentry33/Outputs/surveys_complete-data.csv")
read_csv("C:/Users/home/Desktop/Rcourse/DataCarpentry33/Outputs/surveys_complete-data.csv")

Data Carpentry Workshop - Day 2 -Data Manipulation using dplyr and tidyr

Data Carpentry Workshop - Day 2 -Data Manipulation using dplyr and tidyr

Main Contents

1. Install and load tidyverse

2. What are dplyr and tidyr packages?

Characteristic of tibble:

3. Functions of dplyr.

3.1 Selecting columns and filtering rows

3.2 Pipes

Three ways to do select and filter at the same time:

(1) using intermediate steps

(2) nest functions

(3) pipes

Challenge

3.3 Split-apply-combine data analysis and the summarize() function

split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

(1) group_by() & summarize()

(2) arrange

(3) counting

(4) challenge

4. tidyr functions: reshaping with spread and gather

4.1 spreading

three principal arguments:

spread code

4.2 Gathering

four principal arguments:

gather code

challenge

5. Export Data

（1）filter

(2) extract

(3) Only keep the most common species

(4) save

下期预告

Data visualization with ggplot2

相关阅读更多精彩内容

友情链接更多精彩内容