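Bioinformatics Learning Basics_R Language 02_R syntax and data structures

Original article: https://hbctraining.github.io/Intro-to-R/lessons/02_introR-syntax-and-data-structures.html

A Chinese-language summary by an expert: https://www.jianshu.com/p/95403e2cf920

This post is a copy of the original text, with my own notes and exercise answers added.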

Learning Objectives

  • Employ variables in R.
  • Describe the various data types used in R.
  • Construct data structures to store data.

The R syntax

Now that we know how to talk with R via the script editor or the console, we want to use R for something more than adding numbers. To do this, we need to know more about the R syntax.

Below is an example script highlighting the many different “parts of speech” for R (syntax):

  • the comments # and how they are used to document functions and their content
  • variables and functions
  • the assignment operator <-
  • the = for arguments in functions

NOTE: indentation and consistent spacing are used to improve clarity and legibility

Example script

# Load libraries
library(Biobase)
library(limma)
library(ggplot2)

# Setup directory variables
baseDir <- getwd()
dataDir <- file.path(baseDir, "data")
metaDir <- file.path(baseDir, "meta")
resultsDir <- file.path(baseDir, "results")

# Load data
meta <- read.delim(file.path(metaDir, '2015-1018_sample_key.csv'), header=TRUE, sep="\t", row.names=1)

Assignment operator

To do useful and interesting things in R, we need to assign values to variables using the assignment operator, <-. For example, we can use the assignment operator to assign the value of 3 to x by executing:

x <- 3

The assignment operator (<-) assigns values on the right to variables on the left.

In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke.

Note: hold down Alt (Option on a Mac) together with the - key to insert <-.

Variables

A variable is a symbolic name for (or reference to) information. Variables in computer programming are analogous to “buckets”, where information can be maintained and referenced. On the outside of the bucket is a name. When referring to the bucket, we use the name of the bucket, not the data stored in the bucket.

In the example above, we created a variable or a ‘bucket’ called x. Inside we put a value, 3.

Let’s create another variable called y and give it a value of 5.

y <- 5

When assigning a value to a variable, R does not print anything to the console. You can force R to print the value by wrapping the assignment in parentheses or by typing the variable name.

y
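The two ways of printing a value mentioned above can be sketched as follows. The variable name z is used only for illustration and is not part of the lesson.

```r
# Assignment alone prints nothing to the console
z <- 7

# Wrapping the assignment in parentheses assigns and prints in one step
(z <- 7)

# Typing the variable name also prints the stored value
z
```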

You can also view information on the variable by looking in your Environment window in the upper right-hand corner of the RStudio interface.

Now we can reference these buckets by name to perform mathematical operations on the values contained within. What do you get in the console for the following operation:

x + y

Try assigning the results of this operation to another variable called number.

number <- x + y


Exercises

  1. Try changing the value of the variable x to 5. What happens to number?
  2. Now try changing the value of variable y to contain the value 10. What do you need to do, to update the variable number?
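A short sketch of what these exercises demonstrate, assuming the values used earlier in this lesson (x is 3 and y is 5): number stores the result of x + y at the moment of assignment, so it has to be reassigned after x or y changes.

```r
x <- 3
y <- 5
number <- x + y   # number stores the result (8), not the formula

x <- 5            # changing x afterwards does not touch number
number            # still 8

y <- 10
number <- x + y   # re-run the assignment to update number
number            # now 15
```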

Tips on variable names

Variables can be given almost any name, such as x, current_temperature, or subject_id. However, there are some rules / suggestions you should keep in mind:

  • Make your names explicit and not too long.
  • Avoid names starting with a number (2x is not valid but x2 is)
  • Avoid names of fundamental functions in R (e.g., if, else, for; see ?Reserved for the complete list of reserved words). In general, even if it’s allowed, it’s best to not use other function names (e.g., c, T, mean, data) as variable names. When in doubt check the help to see if the name is already in use.
  • Avoid dots (.) within a variable name as in my.dataset. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and other programming languages, it’s best to avoid them.
  • Use nouns for object names and verbs for function names
  • Keep in mind that R is case sensitive (e.g., genome_length is different from Genome_length)
  • Be consistent with the styling of your code (where you put spaces, how you name variables, etc.). In R, two popular style guides are Hadley Wickham’s style guide and Google’s.

Data Types

Variables can contain values of specific types within R. The six data types that R uses include:

  • "numeric" for any numerical value
  • "character" for text values, denoted by using quotes (“”) around the value
  • "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
  • "logical" for TRUE and FALSE (the Boolean data type)
  • "complex" to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them
  • "raw" that we won’t discuss further

The table below provides examples of each of the commonly used data types:

Data Type Examples
Numeric: 1, 1.5, 20, pi
Character: “anytext”, “5”, “TRUE”
Integer: 2L, 500L, -17L
Logical: TRUE, FALSE, T, F
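As a quick check, the base R function class() reports the data type of a value. The sketch below applies it to one example of each type from the list above; the variable names are only for illustration.

```r
# One example value per data type
examples <- list(1.5, "5", 2L, TRUE, 1+4i)

# class() returns the data type of each value as a character string
value_classes <- sapply(examples, class)
value_classes
# "numeric" "character" "integer" "logical" "complex"

# Quotes make the difference: "5" is text, 5 is a number
class("5")
class(5)
```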

Data Structures

We know that variables are like buckets, and so far we have seen that bucket filled with a single value. Even when number was created, the result of the mathematical operation was a single value. Variables can store more than just a single value; they can store a multitude of different data structures. These include, but are not limited to, vectors (c), factors (factor), matrices (matrix), data frames (data.frame) and lists (list).

Vectors

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It’s basically just a collection of values, mainly either numbers (a numeric vector), characters (a character vector), or logical values (a logical vector).

Note that all values in a vector must be of the same data type. If you try to create a vector with more than a single data type, R will try to coerce it into a single data type.

For example, if you were to try to create a vector that mixes numbers with character strings, R will coerce every element into a character string, so that the whole vector has a single data type.
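The coercion described above can be tried directly. The values below are made up for illustration and are not the ones from the original figures.

```r
# Numbers mixed with text: everything becomes character
mixed <- c(1, "two", 3)
mixed          # "1" "two" "3"
class(mixed)   # "character"

# Logical values mixed with numbers: TRUE becomes 1 and FALSE becomes 0
logical_mixed <- c(TRUE, 2, FALSE)
logical_mixed          # 1 2 0
class(logical_mixed)   # "numeric"
```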

The analogy for a vector is that your bucket now has different compartments; these compartments in a vector are called elements.

Each element contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a single entity (bucket).

Let’s create a vector of genome lengths and assign it to a variable called glengths.

Each element of this vector contains a single numeric value, and three values will be combined together into a vector using c() (the combine function). All of the values are put within the parentheses and separated with a comma.

glengths <- c(4.6, 3000, 50000)
glengths

Note your environment shows the glengths variable is numeric and tells you the glengths vector starts at element 1 and ends at element 3 (i.e. your vector contains 3 values).

A vector can also contain characters. Create another vector called species with three elements, where each element corresponds with the genome sizes vector (in Mb).

species <- c("ecoli", "human", "corn")
species


Exercise

Create a vector of numeric and character values by combining the two vectors that we just created (glengths and species). Assign this combined vector to a new variable called combined. Hint: you will need to use the combine c() function to do this. Print the combined vector in the console. What looks different compared to the original vectors?

## exercise: c stands for combine
combined <- c(glengths, species)
combined

Factors

A factor is a special type of vector that is used to store categorical data. Each unique category is referred to as a factor level (i.e. category = level). Factors are built on top of integer vectors such that each factor level is assigned an integer value, creating value-label pairs.

Let’s create a factor vector and explore a bit more. We’ll start by creating a character vector describing three different levels of expression:

expression <- c("low", "high", "medium", "high", "low", "medium", "high")

Now we can convert this character vector into a factor using the factor() function:

expression <- factor(expression)

So, what exactly happened when we applied the factor() function?

The expression vector is categorical, in that all the values in the vector belong to a set of categories; in this case, the categories are low, medium, and high. By turning the expression vector into a factor, the categories are assigned integers alphabetically, with high=1, low=2, medium=3. This in effect assigns the different factor levels. You can view the newly created factor variable and the levels in the Environment window.
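The value-label pairs can also be inspected in the console with the base R functions levels() and as.integer(). The second half of the sketch, which sets the level order by hand through the levels argument of factor(), goes beyond the lesson text; the name expression_ordered is hypothetical.

```r
expression <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))

# The levels are sorted alphabetically
levels(expression)       # "high" "low" "medium"

# Each element is stored as the integer code of its level
as.integer(expression)   # 2 1 3 1 2 3 1

# Optional: choose the level order explicitly instead of alphabetically
expression_ordered <- factor(expression, levels = c("low", "medium", "high"))
as.integer(expression_ordered)   # 1 3 2 3 1 2 3
```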

Exercises

Let’s say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each cell type.

  1. Create a vector named samplegroup using the code below. This vector will contain nine elements: 3 control (“CTL”) samples, 3 knock-out (“KO”) samples, and 3 over-expressing (“OE”) samples:

     samplegroup <- c("CTL", "CTL", "CTL", "KO", "KO", "KO", "OE", "OE", "OE")
    
    
  2. Turn samplegroup into a factor data structure.

## exercise 
samplegroup <- c("CTL", "CTL", "CTL", "KO", "KO", "KO", "OE", "OE", "OE")
samplegroup <- factor(samplegroup)
samplegroup

Matrix

A matrix in R is a collection of vectors of the same length and identical data type. Vectors can be combined as columns in the matrix or by row, to create a 2-dimensional structure.

Matrices are used commonly as part of the mathematical machinery of statistics. They are usually of numeric data type and used in computational algorithms to serve as a checkpoint. For example, if the input data are not all of the same data type (numeric, character, etc.), the matrix() function coerces them to a single common type (numbers mixed with text become character), and downstream numeric computations on that matrix will then fail with an error.
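The lesson does not show code for matrices, so here is a minimal sketch using the base R functions matrix() and cbind(). All variable names and values are made up for illustration.

```r
# Fill a 2 x 3 matrix column by column (the default)
m <- matrix(1:6, nrow = 2)
m
dim(m)   # 2 3

# Fill the same values row by row instead
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row[1, ]   # 1 2 3

# Combine two vectors of the same length as columns
cols <- cbind(a = 1:3, b = 4:6)
dim(cols)   # 3 2

# Mixed input is coerced to one data type, here character
mixed_m <- matrix(c(1, "a", 3, "b"), nrow = 2)
class(mixed_m[1, 1])   # "character"
```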

Data Frame

A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is similar to a matrix in that it’s a collection of vectors of the same length and each vector represents a column. However, in a dataframe each vector can be of a different data type (e.g., characters, integers, factors).

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.

We can create a dataframe by bringing vectors together to form the columns. We do this using the data.frame() function, and giving the function the different vectors we would like to bind together. This function will only work for vectors of the same length.

df <- data.frame(species, glengths)

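Beware that in R versions before 4.0.0, data.frame() turns character vectors into factors by default (stringsAsFactors = TRUE); since R 4.0.0 the default is stringsAsFactors = FALSE and character columns stay as characters. Print your data frame to the console:

df

Upon inspection of our dataframe, the species values are printed without quotation marks. This happens whether the column is a character vector or a factor, so the printout alone does not tell you which one you have; in R versions before 4.0.0 the species vector was automatically converted into a factor inside the data frame. We will show you how to change the default behavior of a function in the next lesson.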

Note that you can view your data.frame object by clicking on its name in the Environment window.
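To see how a character column is stored, the column type can be checked with class() or str(). The sketch below sets the stringsAsFactors argument of data.frame() explicitly, so the result does not depend on the R version; the names df_chr and df_fct are hypothetical.

```r
species <- c("ecoli", "human", "corn")
glengths <- c(4.6, 3000, 50000)

# Keep species as a character column
df_chr <- data.frame(species, glengths, stringsAsFactors = FALSE)
class(df_chr$species)   # "character"

# Convert species to a factor column
df_fct <- data.frame(species, glengths, stringsAsFactors = TRUE)
class(df_fct$species)   # "factor"

# str() summarises the type of every column
str(df_chr)
```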

Lists

Lists are a data structure in R that can be perhaps a bit daunting at first, but soon become amazingly useful. A list is a data structure that can hold any number of any types of other data structures.

If you have variables of different data structures you wish to combine, you can put all of those into one list object by using the list() function and placing all the items you wish to combine within parentheses:

list1 <- list(species, df, number)

Print out the list to screen to take a look at the components:

list1

[[1]]
[1] "ecoli" "human" "corn" 

[[2]]
  species glengths
1   ecoli      4.6
2   human   3000.0
3    corn  50000.0

[[3]]
[1] 5

There are three components corresponding to the three different variables we passed in, and what you see is that the structure of each is retained. Each component of a list is referenced based on its number position. We will talk more about how to inspect and manipulate components of lists in later lessons.
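As a brief preview of referencing by position, double square brackets pull out a single component. This is a sketch only; number is set directly to 5 here as an assumption, so that it matches the printed output above.

```r
species <- c("ecoli", "human", "corn")
glengths <- c(4.6, 3000, 50000)
df <- data.frame(species, glengths)
number <- 5   # assumption: set directly to match the printed output

list1 <- list(species, df, number)

length(list1)       # 3 components
list1[[1]]          # the species vector
class(list1[[2]])   # "data.frame": the structure is retained
list1[[3]]          # 5
```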


Exercise

Create a list called list2 containing species, glengths, and number.


list2 <- list(species, glengths, number)
list2

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
