R语言基础知识的一些检验,最好是对照几本R基础语法书籍来理解。
全部答案及视频在:https://github.com/jmzeng1314/R_bilibili
首先做完了周末班工作, 包括软件安装以及R包安装:http://www.bio-info-trainee.com/3727.html
1. 打开 Rstudio 告诉我它的工作目录。
> getwd()
[1] "C:/Users/wlx/Documents"
2.新建6个向量,基于不同的原子类型。(重点是字符串,数值,逻辑值)
> a <- c(1:10)
> b <- c("a","b","c","d")
> c <- c(T,F,F,T)
> a
[1] 1 2 3 4 5 6 7 8 9 10
> b
[1] "a" "b" "c" "d"
> c
[1] TRUE FALSE FALSE TRUE
> a <- c(1, 2, 5, 3, 6, -2, 4)
> b <- c("one", "two", "three")
> c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
> a
[1] 1 2 5 3 6 -2 4
> b
[1] "one" "two" "three"
> c
[1] TRUE TRUE TRUE FALSE TRUE FALSE
>
3 告诉我在你打开的rstudio里面 getwd() 代码运行后返回的是什么?
> getwd()
[1] "C:/Users/wlx/Documents"
4新建一些数据结构,比如矩阵,数组,数据框,列表等重点是数据框,矩阵)
4.1 矩阵
> cnames <- c("C1", "C2")
> rnames <- c("g1", "g2")
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
+ dimnames=list(rnames, cnames))
> mymatrix
C1 C2
g1 1 26
g2 24 68
> mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=FALSE,
+ dimnames=list(rnames, cnames))
> mymatrix
C1 C2
g1 1 24
g2 26 68
4.2 数组
dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
> z
, , C1
B1 B2 B3
A1 1 3 5
A2 2 4 6
, , C2
B1 B2 B3
A1 7 9 11
A2 8 10 12
, , C3
B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
B1 B2 B3
A1 19 21 23
A2 20 22 24
4.3 数据框
patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
4.4 因子
diabetes <- c("Type1", "Type2", "Type1", "Type1")
> diabetes
[1] "Type1" "Type2" "Type1" "Type1"
> diabetes <- factor(diabetes)
> diabetes
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2
> status <- c("Poor", "Improved", "Excellent", "Poor")
> status
[1] "Poor" "Improved" "Excellent" "Poor"
> status <- factor(status, ordered=TRUE)
> status
[1] Poor Improved Excellent Poor
Levels: Excellent < Improved < Poor
> status <- factor(status, order=TRUE,
+ levels=c("Poor", "Improved", "Excellent"))
> status
[1] Poor Improved Excellent Poor
Levels: Poor < Improved < Excellent
> sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female"))
Error in factor(sex, levels = c(1, 2), labels = c("Male", "Female")) :
object 'sex' not found
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, order=TRUE)
> patientdata <- data.frame(patientID, age, diabetes, status)
> str(patientdata)
'data.frame': 4 obs. of 4 variables:
$ patientID: num 1 2 3 4
$ age : num 25 34 28 52
$ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
$ status : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
> summary(patientdata)
patientID age diabetes status
Min. :1.00 Min. :25.00 Type1:3 Excellent:1
1st Qu.:1.75 1st Qu.:27.25 Type2:1 Improved :1
Median :2.50 Median :31.00 Poor :2
Mean :2.50 Mean :34.75
3rd Qu.:3.25 3rd Qu.:38.50
Max. :4.00 Max. :52.00
4.5 list
> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"
$ages
[1] 25 26 18 39
[[3]]
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
[[4]]
[1] "one" "two" "three"
> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
提醒程序员注意的一些事项
经验丰富的程序员通常会发现R语言的某些方面不太寻常。以下是这门语言中你需要了解
的一些特性。
❑对象名称中的句点(.)没有特殊意义,但美元符号(x是指数据框A中的变
量x。
❑R不提供多行注释或块注释功能。你必须以#作为多行注释每行的开始。出于调试目的,
你也可以把想让解释器忽略的代码放到语句if(FALSE){... }中。将FALSE改为TRUE
即允许这块代码执行。
❑将一个值赋给某个向量、矩阵、数组或列表中一个不存在的元素时,R将自动扩展这个
数据结构以容纳新值。举例来说,考虑以下代码:
> x <- c(8, 6, 4)
> x[7] <- 10
> x
[1] 8 6 4 NA NA NA 10
通过赋值,向量x由三个元素扩展到了七个元素。x <- x[1:3]会重新将其缩减回三个
元素。
❑R中没有标量。标量以单元素向量的形式出现。
❑R中的下标不从0开始,而从1开始。在上述向量中,x[1]的值为8。
❑变量无法被声明。它们在首次被赋值时生成。
5. 在你新建的数据框进行切片操作,比如首先取第1,3行, 然后取第4,6列
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
> df[c(1,3), ]
Error in df[c(1, 3), ] : object of type 'closure' is not subsettable
> patientdata[c(1,3), ]
patientID age diabetes status
1 1 25 Type1 Poor
3 3 28 Type1 Excellent
> patientdata[c(1:3), ]
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
> patientdata[,c(2,4) ]
age status
1 25 Poor
2 34 Improved
3 28 Excellent
4 52 Poor
5. 使用data函数来加载R内置数据集 rivers 描述它。并且可以查看更多的R语言内置的数据集:https://mp.weixin.qq.com/s/dZPbCXccTzuj0KkOL7R31g
> data("rivers")
> class(rivers)
[1] "numeric"
> str(rivers)
num [1:141] 735 320 325 392 524 ...
> length(rivers)
[1] 141
> summary(rivers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
135.0 310.0 425.0 591.2 680.0 3710.0
> head(rivers); tail(rivers)
[1] 735 320 325 392 524 450
[1] 500 720 270 430 671 1770
6下载 https://www.ncbi.nlm.nih.gov/sra?term=SRP133642 里面的 RunInfo Table 文件读入到R里面,了解这个数据框,多少列,每一列都是什么属性的元素。(参考B站生信小技巧获取runinfo table) 这是一个单细胞转录组项目的数据,共768个细胞,如果你找不到RunInfo Table 文件,可以点击下载,然后读入你的R里面也可以。
7下载 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111229 里面的样本信息sample.csv读入到R里面,了解这个数据框,多少列,每一列都是什么属性的元素。(参考 https://mp.weixin.qq.com/s/fbHMNXOdwiQX5BAlci8brA 获取样本信息sample.csv)如果你实在是找不到样本信息文件sample.csv,也可以点击下载。
> aread.table("SraRunTable.txt",sep="\t",header=T,check.names=F)
Error in aread.table("SraRunTable.txt", sep = "\t", header = T, check.names = F) :
could not find function "aread.table"
> a<- read.table("SraRunTable.txt",sep="\t",header=T,check.names=F)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'SraRunTable.txt': No such file or directory
> a<- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\SraRunTable.txt",sep="\t",header=T,check.names=F)
> b<- b <- read.csv("C:\\Users\\wlx\\Desktop\\scRNA\\sample.csv",sep="\t",header=T,check.names=F)
> dim(a)
[1] 768 31
> dim(b)
[1] 768 1
> b<- b <- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\sample.csv",sep="\t",header=T,check.names=F)
> b <- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\sample.csv",sep="\t",header=T,check.names=F)
> dim(b)
[1] 768 1
> b <- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\sample.csv")
> dim(b)
[1] 768 6
> a<- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\SraRunTable.txt")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 44 elements
> a<- read.table("C:\\Users\\wlx\\Desktop\\scRNA\\SraRunTable.txt",sep="\t",header=T,check.names=F)
> dim(a)
[1] 768 31
> str(a)
'data.frame': 768 obs. of 31 variables:
$ BioSample : Factor w/ 768 levels "SAMN08619908",..: 5 4 3 2 1 12 11 14 13 7 ...
$ Experiment : Factor w/ 768 levels "SRX3749901","SRX3749902",..: 2 3 4 5 6 7 8 9 10 1 ...
$ MBases : int 16 16 8 8 11 7 18 5 11 15 ...
$ MBytes : int 8 8 4 4 5 4 9 3 6 8 ...
$ Run : Factor w/ 768 levels "SRR6790711","SRR6790712",..: 1 2 3 4 5 6 7 8 9 10 ...
$ SRA_Sample : Factor w/ 768 levels "SRS3006136","SRS3006137",..: 3 13 2 1 14 5 15 7 6 4 ...
$ Sample_Name : Factor w/ 768 levels "GSM3025845","GSM3025846",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Assay_Type : Factor w/ 1 level "RNA-Seq": 1 1 1 1 1 1 1 1 1 1 ...
$ AssemblyName : Factor w/ 1 level "GCF_000001635.20": 1 1 1 1 1 1 1 1 1 1 ...
$ AvgSpotLen : int 43 43 43 43 43 43 43 43 43 43 ...
$ BioProject : Factor w/ 1 level "PRJNA436229": 1 1 1 1 1 1 1 1 1 1 ...
$ Center_Name : Factor w/ 1 level "GEO": 1 1 1 1 1 1 1 1 1 1 ...
$ Consent : Factor w/ 1 level "public": 1 1 1 1 1 1 1 1 1 1 ...
$ DATASTORE_filetype: Factor w/ 1 level "sra": 1 1 1 1 1 1 1 1 1 1 ...
$ DATASTORE_provider: Factor w/ 1 level "ncbi": 1 1 1 1 1 1 1 1 1 1 ...
$ InsertSize : int 0 0 0 0 0 0 0 0 0 0 ...
$ Instrument : Factor w/ 1 level "Illumina HiSeq 2000": 1 1 1 1 1 1 1 1 1 1 ...
$ LibraryLayout : Factor w/ 1 level "SINGLE": 1 1 1 1 1 1 1 1 1 1 ...
$ LibrarySelection : Factor w/ 1 level "cDNA": 1 1 1 1 1 1 1 1 1 1 ...
$ LibrarySource : Factor w/ 1 level "TRANSCRIPTOMIC": 1 1 1 1 1 1 1 1 1 1 ...
$ LoadDate : Factor w/ 1 level "2018-03-01": 1 1 1 1 1 1 1 1 1 1 ...
$ Organism : Factor w/ 1 level "Mus musculus": 1 1 1 1 1 1 1 1 1 1 ...
$ Platform : Factor w/ 1 level "ILLUMINA": 1 1 1 1 1 1 1 1 1 1 ...
$ ReleaseDate : Factor w/ 1 level "2018-11-23": 1 1 1 1 1 1 1 1 1 1 ...
$ SRA_Study : Factor w/ 1 level "SRP133642": 1 1 1 1 1 1 1 1 1 1 ...
$ age : Factor w/ 1 level "14 weeks": 1 1 1 1 1 1 1 1 1 1 ...
$ cell_type : Factor w/ 1 level "cancer-associated fibroblasts (CAFs)": 1 1 1 1 1 1 1 1 1 1 ...
$ marker_genes : Factor w/ 1 level "EpCAM-, CD45-, CD31-, NG2-": 1 1 1 1 1 1 1 1 1 1 ...
$ source_name : Factor w/ 1 level "Mammary tumor fibroblast": 1 1 1 1 1 1 1 1 1 1 ...
$ strain : Factor w/ 1 level "FVB/N-Tg(MMTVPyVT)634Mul/J": 1 1 1 1 1 1 1 1 1 1 ...
$ tissue : Factor w/ 1 level "Mammary tumor fibroblast": 1 1 1 1 1 1 1 1 1 1 ...
> str(b)
'data.frame': 768 obs. of 6 variables:
$ Accession.Title.Sample : Factor w/ 1 level "musculus\",1,GPL13112,GSE111229,\"SRA": 1 1 1 1 1 1 1 1 1 1 ...
$ Type.Taxonomy.Channels.Platform.Series.Supplementary: Factor w/ 1 level "Run": 1 1 1 1 1 1 1 1 1 1 ...
$ Types.Supplementary : Factor w/ 768 levels "Selector\",\"https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRX3749901\",SRX3749901,\"Kristian",..: 2 3 4 5 6 7 8 9 10 1 ...
$ Links.SRA : Factor w/ 1 level "Pietras\",\"Nov": 1 1 1 1 1 1 1 1 1 1 ...
$ Accession.Contact.Release : Factor w/ 1 level "23,": 1 1 1 1 1 1 1 1 1 1 ...
$ Date : Factor w/ 1 level "2018\"": 1 1 1 1 1 1 1 1 1 1 ...
把前面两个步骤的两个表(RunInfo Table 文件,样本信息sample.csv)关联起来,使用merge函数。
基于下午的统计可视化
8对前面读取的 RunInfo Table 文件在R里面探索其MBases列,包括 箱线图(boxplot)和五分位数(fivenum),还有频数图(hist),以及密度图(density) 。
把前面读取的样本信息表格的样本名字根据下划线分割看第3列元素的统计情况。第三列代表该样本所在的plate
根据plate把关联到的 RunInfo Table 信息的MBases列分组检验是否有统计学显著的差异。
分组绘制箱线图(boxplot),频数图(hist),以及密度图(density) 。
使用ggplot2把上面的图进行重新绘制。
使用ggpubr把上面的图进行重新绘制。
rm(list = ls())
options(stringsAsFactors = F)
# 或者下载:http://www.bio-info-trainee.com/tmp/5years/SraRunTable.txt
a=read.table('SraRunTable.txt',sep = '\t',header = T)
# 或者下载:http://www.bio-info-trainee.com/tmp/5years/sample.csv
b=read.csv('sample.csv')
colnames(a)
colnames(b)
d=merge(a,b,by.x = 'Sample_Name',by.y = 'Accession')
e=d[,c("MBases","Title")]
save(e,file = 'input.Rdata')
rm(list = ls())
options(stringsAsFactors = F)
load(file = 'input.Rdata')
e[,2]
plate=unlist(lapply(e[,2],function(x){
# x=e[1,2]
x
strsplit(x,'_')[[1]][3]
}))
table(plate)
boxplot(e[,1]~plate)
t.test(e[,1]~plate)
e$plate=plate
library(ggplot2)
colnames(e)
ggplot(e,aes(x=plate,y=MBases))+geom_boxplot()
library(ggpubr)
p <- ggboxplot(e, x = "plate", y = "MBases",
color = "plate", palette = "jco",
add = "jitter")
# Add p-value
p + stat_compare_means(method = 't.test')