Bioinformatics Learning Basics_R Language 04_data-wrangling: Subsetting Vectors and Factors

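Original article: https://hbctraining.github.io/Intro-to-R/lessons/04_introR-data-wrangling.html

A Chinese-language summary by an expert: https://www.jianshu.com/p/d1365c9f8422

This post is my copy of the original, with my own notes and exercise answers added.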

Learning Objectives

  • Construct data structures to store external data in R.
  • Inspect data structures in R.
  • Demonstrate how to subset data from data structures.

Reading data into R

Regardless of the specific analysis in R we are performing, we usually need to bring data in for the analysis. The function in R we use will depend on the type of data file we are bringing in (e.g. text, Stata, SPSS, SAS, Excel, etc.) and how the data in that file are separated, or delimited. The table below lists functions that can be used to import data from common file formats.

Data type                Extension    Function         Package
Comma separated values   csv          read.csv()       utils (default)
                                      read_csv()       readr (tidyverse)
Tab separated values     tsv          read_tsv()       readr
Other delimited formats  txt          read.table()     utils
                                      read_table()     readr
                                      read_delim()     readr
Stata version 13-14      dta          read_dta()       haven
Stata version 7-12       dta          read.dta()       foreign
SPSS                     sav          read.spss()      foreign
SAS                      sas7bdat     read.sas7bdat()  sas7bdat
Excel                    xlsx, xls    read_excel()     readxl (tidyverse)

For example, if we have a text file separated by commas (comma-separated values), we could use the function read.csv. However, if the data in a text file are separated by a different delimiter, we could use the generic read.table function and specify the delimiter as an argument.
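As a minimal sketch of the read.table approach, the snippet below writes a small semicolon-delimited file to a temporary location and reads it back. The file contents and the names tmp and samples are made up for illustration and are not part of the lesson data.

```r
# Hypothetical example data: a semicolon-delimited text file in a temp location
tmp <- tempfile(fileext = ".txt")
writeLines(c("sample;genotype;replicate",
             "s1;Wt;1",
             "s2;KO;2"), tmp)

# read.table() is the generic reader; the header row and delimiter are given explicitly
samples <- read.table(tmp, header = TRUE, sep = ";")
print(samples)

unlink(tmp)  # remove the temporary file
```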

When working with genomic data, we often have a metadata file containing information on each sample in our dataset. Let’s bring in the metadata file using the read.csv function. Check the arguments for the function to get an idea of the function options:

?read.csv

The read.csv function has one required argument and several options that can be specified. The mandatory argument is a path to the file and filename, which in our case is data/mouse_exp_design.csv. We will put the function to the right of the assignment operator, meaning that any output will be saved as the variable name provided on the left.

metadata <- read.csv(file="data/mouse_exp_design.csv")

Note: In R versions before 4.0.0, read.csv by default converts (= coerces) columns that contain characters (i.e., text) into the factor data type; the output shown in this lesson reflects that behavior. From R 4.0.0 onward, the default is to keep these columns as character. Either way, read.csv() and read.table() have an argument called stringsAsFactors that can be set to TRUE or FALSE to choose the behavior explicitly, depending on what you want to do with the data.
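A minimal sketch of the stringsAsFactors argument, using a made-up inline CSV passed through the text argument instead of a file on disk. The names csv_text, as_factor, and as_char are illustrative only; setting the argument explicitly makes the result independent of the default in the installed R version.

```r
# Hypothetical inline CSV standing in for a file on disk
csv_text <- "genotype,celltype,replicate
Wt,typeA,1
KO,typeB,2"

# Explicitly convert text columns to factors
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Explicitly keep text columns as character
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$genotype)  # "factor"
class(as_char$genotype)    # "character"
```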

Inspecting data structures

There is a wide selection of base functions in R that are useful for inspecting and summarizing your data. Let’s use the metadata data frame that we created to test out data inspection functions.

Take a look at the data frame by typing the variable name metadata and pressing return; the variable contains information describing the samples in our study. Each row holds information for a single sample, and the columns contain categorical information about the sample genotype (Wt or KO), celltype (typeA or typeB), and replicate number (1, 2, or 3).

metadata

          genotype celltype replicate
sample1        Wt    typeA      1
sample2        Wt    typeA      2
sample3        Wt    typeA      3
sample4        KO    typeA      1
sample5        KO    typeA      2
sample6        KO    typeA      3
sample7        Wt    typeB      1
sample8        Wt    typeB      2
sample9        Wt    typeB      3
sample10       KO    typeB      1
sample11       KO    typeB      2
sample12       KO    typeB      3

If we had a larger file, we might not want to display all of its contents in the console. Instead, we could check the top (the first 6 rows) of this data.frame using the function head():

head(metadata)

Previously, we mentioned that character values may be converted to factors when the data are read in. One way to check this is the str() (structure) function, which gives specific details on each column:

str(metadata)

'data.frame':   12 obs. of  3 variables:
 $ genotype : Factor w/ 2 levels "KO","Wt": 2 2 2 1 1 1 2 2 2 1 ...
 $ celltype : Factor w/ 2 levels "typeA","typeB": 1 1 1 1 1 1 2 2 2 2 ...
 $ replicate: num  1 2 3 1 2 3 1 2 3 1 ...

As you can see, the columns genotype and celltype are of the factor class, whereas the replicate column has been interpreted as a numeric data type.

You can also get this information from the “Environment” tab in RStudio.

List of functions for data inspection

We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of data.

  • All data structures - content display:
    • str(): compact display of data contents (similar to what the Environment tab shows)
    • class(): data type (e.g. character, numeric, etc.) of vectors and data structure of dataframes, matrices, and lists.
    • summary(): detailed display, including descriptive statistics, frequencies
    • head(): will print the beginning entries for the variable
    • tail(): will print the end entries for the variable
  • Vector and factor variables:
    • length(): returns the number of elements in the vector or factor
  • Dataframe and matrix variables:
    • dim(): returns dimensions of the dataset
    • nrow(): returns the number of rows in the dataset
    • ncol(): returns the number of columns in the dataset
    • rownames(): returns the row names in the dataset
    • colnames(): returns the column names in the dataset
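The functions listed above can be tried on any small data frame. The sketch below builds a hand-made miniature of the metadata table (the name meta_small and its four rows are assumptions for illustration, not the lesson's file) and calls each function on it.

```r
# Hypothetical miniature of the metadata table, built by hand
meta_small <- data.frame(
  genotype  = factor(c("Wt", "Wt", "KO", "KO")),
  celltype  = factor(c("typeA", "typeB", "typeA", "typeB")),
  replicate = c(1, 2, 1, 2),
  row.names = c("sample1", "sample2", "sample3", "sample4")
)

class(meta_small)            # "data.frame"
dim(meta_small)              # 4 3 (rows, then columns)
nrow(meta_small)             # 4
ncol(meta_small)             # 3
rownames(meta_small)         # "sample1" "sample2" "sample3" "sample4"
colnames(meta_small)         # "genotype" "celltype" "replicate"
length(meta_small$genotype)  # 4, the number of elements in one column
summary(meta_small)          # per-column counts and descriptive statistics
tail(meta_small, n = 2)      # the last two rows
```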

Selecting data using indices and sequences

When analyzing data, we often want to partition the data so that we are only working with selected columns or rows. A data frame or data matrix is simply a collection of vectors combined together. So let’s begin with vectors and how to access different elements, and then extend those concepts to dataframes.

Vectors

Selecting using indices

If we want to extract one or several values from a vector, we must provide one or several indices using square brackets [ ] syntax. The index represents the element number within a vector (or the compartment number, if you think of the bucket analogy). R indices start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.

Let’s start by creating a vector called age:

age <- c(15, 22, 45, 52, 73, 81)

[Figure: vector indices]

If we only wanted the fifth value of this vector, we would use the following syntax:

age[5]

If we wanted all values except the fifth value of this vector, we would use the following:

age[-5]

If we wanted to select more than one element we would still use the square bracket syntax, but rather than using a single value we would pass in a vector of several index values:

idx <- c(3,5,6) # create vector of the elements of interest
age[idx]

To select a sequence of consecutive values from a vector, we would use :, an operator that creates a vector of integers in increasing or decreasing order. Let’s select the first four values from age:

age[1:4]

Alternatively, if you wanted the values in reverse order, you could try 4:1 and see what is returned.
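The indexing forms described above can be combined. The following sketch reuses the age vector from this section and shows a few variations; seq() and length() are base R functions, and the particular selections are chosen only for illustration.

```r
age <- c(15, 22, 45, 52, 73, 81)

age[-c(1, 2)]                     # drop the first two elements: 45 52 73 81
age[4:1]                          # first four values in reverse: 52 45 22 15
age[seq(1, length(age), by = 2)]  # every other element: 15 45 73
age[length(age)]                  # the last element: 81
```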


Exercises

  1. Create a vector called alphabets with the following letters, C, D, X, L, F.
  2. Use the associated indices along with [ ] to do the following:
    • only display C, D and F
    • display all except X
    • display the letters in the opposite order (F, L, X, D, C)

alphabets <- c('C','D','X','L','F')
y <- c(1,2,5)
alphabets[y]
alphabets[-3]
alphabets[5:1]

Selecting using indices with logical operators

We can also use indices with logical operators. Logical operators include greater than (>), less than (<), and equal to (==). A full list of logical operators in R is displayed below:

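Operator   Description
>          greater than
>=         greater than or equal to
<          less than
<=         less than or equal to
==         equal to
!=         not equal to
&          and
|          or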

We can use logical expressions to determine whether a particular condition is true or false. For example, let’s use our age vector:

age

If we wanted to know if each element in our age vector is greater than 50, we could write the following expression:

age > 50

Returned is a vector of logical values the same length as age with TRUE and FALSE values indicating whether each element in the vector is greater than 50.

[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE

We can use these logical vectors to select only the elements in a vector with TRUE values at the same position or index as in the logical vector.

Create an index with logical operators to select all values in the age vector that are greater than 50 or less than 18:

idx <- age > 50 | age < 18

idx

age

age[idx]
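As a further sketch of logical indexing, the example below combines two conditions with & (and) and inverts a logical vector with !. The variable name adult_under_50 and the cut-offs are assumptions made for illustration.

```r
age <- c(15, 22, 45, 52, 73, 81)

# TRUE only where both conditions hold
adult_under_50 <- age >= 18 & age < 50
adult_under_50        # FALSE TRUE TRUE FALSE FALSE FALSE

age[adult_under_50]   # 22 45
age[!adult_under_50]  # everything else: 15 52 73 81

# TRUE counts as 1, so sum() counts the matching elements
sum(age > 50)         # 3
```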

Indexing with logical operators using the which() function

While logical expressions will return a vector of TRUE and FALSE values of the same length, we could use the which() function to output the indices where the values are TRUE. Indexing with either method generates the same results, and personal preference determines which method you choose to use. For example:

idx <- which(age > 50 | age < 18)

idx

age[idx]

Notice that we get the same results whether or not we use which(). Also note that although which() can stand in for a logical expression when indexing, it has other uses where the two are not interchangeable.
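One case where the two approaches differ is a vector containing missing values. The sketch below uses a hypothetical vector age_na (not part of the lesson) to show that a logical index keeps an NA placeholder, while which() returns only the positions that are TRUE.

```r
# Hypothetical vector with one missing value
age_na <- c(15, NA, 45, 52, 73, 81)

age_na > 50                 # FALSE NA FALSE TRUE TRUE TRUE

# Logical indexing: the NA in the index yields an NA in the result
age_na[age_na > 50]         # NA 52 73 81

# which() reports only the TRUE positions, so the NA is dropped
which(age_na > 50)          # 4 5 6
age_na[which(age_na > 50)]  # 52 73 81
```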

Note about Nesting functions:

Instead of creating the idx object in the above sections, we could have just placed the logical operations and/or functions within the brackets.

age[which(age > 50 | age < 18)] is identical to age[idx] above.

Factors

Since factors are special vectors, the same rules for selecting values using indices apply. The elements of the expression factor created previously had the following categories or levels: low, medium, and high.

Let’s extract the values of the factor with high expression, and let’s use nesting here:

expression[expression == "high"]    ## This will only return those elements in the factor equal to "high"

Nesting note:

The piece of code above was more efficient with nesting; we used a single step instead of two steps as shown below:

Step1 (no nesting): idx <- expression == "high"

Step2 (no nesting): expression[idx]
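One detail worth seeing in code is that a subset of a factor keeps all of the original levels. In the sketch below, the expression factor is rebuilt from the values implied by the str() output shown in the releveling section; high_only is an illustrative name, and droplevels() is the base R function for discarding unused levels.

```r
# Rebuild the expression factor (values inferred from the str() output in this lesson)
expression <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))

# Nested logical indexing, as above
high_only <- expression[expression == "high"]
length(high_only)              # 3

# The subset still carries all three levels
levels(high_only)              # "high" "low" "medium"

# droplevels() removes the levels that no longer occur
levels(droplevels(high_only))  # "high"
```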


Exercise

Extract only those elements in samplegroup that are not KO (nesting the logical operation is optional).

samplegroup
samplegroup[samplegroup != "KO"]

Releveling factors

We have briefly talked about factors, but this data type only becomes more intuitive once you’ve had a chance to work with it. Let’s take a slight detour and learn about how to relevel categories within a factor.

To view the integer assignments under the hood you can use str():

expression

str(expression)
Factor w/ 3 levels "high","low","medium": 2 1 3 1 2 3 1

The categories are referred to as “factor levels”. As we learned earlier, the levels in the expression factor were assigned integers alphabetically, with high=1, low=2, medium=3. However, it makes more sense for us if low=1, medium=2 and high=3, i.e. it makes sense for us to “relevel” the categories in this factor.

To relevel the categories, you can add the levels argument to the factor() function, and give it a vector with the categories listed in the required order:

expression <- factor(expression, levels=c("low", "medium", "high"))     # you can re-factor a factor 

str(expression)
Factor w/ 3 levels "low","medium",..: 1 3 2 3 1 2 3

Now we have a releveled factor with low as the lowest or first category, medium as the second and high as the third. This is reflected in the way they are listed in the output of str(), as well as in the numbering of which category is where in the factor.

Note: Releveling becomes necessary when you need a specific category in a factor to be the “base” category, i.e. category that is equal to 1. One example would be if you need the “control” to be the “base” in a given RNA-seq experiment.
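When only the base category needs to change, base R also provides relevel(), which moves one level to the front and leaves the others in their existing order. The sketch below uses a made-up condition factor (not from the lesson) to illustrate this; it is an alternative to passing levels to factor(), not a replacement for it.

```r
# Hypothetical factor; levels default to alphabetical order
condition <- factor(c("treated", "untreated", "treated", "untreated"))
levels(condition)      # "treated" "untreated"

# Make "untreated" the base (first) level
condition <- relevel(condition, ref = "untreated")
levels(condition)      # "untreated" "treated"

# The underlying integer codes follow the new level order
as.integer(condition)  # 2 1 2 1
```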


Exercise

Use the samplegroup factor we created in a previous lesson, and relevel it such that KO is the first level followed by CTL and OE.

samplegroup
samplegroup <- factor(samplegroup,levels = c("KO","CTL","OE"))
str(samplegroup)