『生信技能树』R语言20题之我的答案版本

SO HOT!


原题目在这里

1. 安装一些R包

1.1 安装

1.1.1 ALL

> if (!requireNamespace("BiocManager", quietly = TRUE))
+     install.packages("BiocManager")
> 
> BiocManager::install("ALL")

1.1.2 pasilla

> if (!requireNamespace("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")
>
> BiocManager::install("pasilla")

1.1.3 airway

> if (!requireNamespace("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")
>
> BiocManager::install("airway")

1.1.4 limma

> if (!requireNamespace("BiocManager", quietly = TRUE))
+     install.packages("BiocManager")
> 
> BiocManager::install("limma")

1.1.5 DESeq2

> if (!requireNamespace("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")

> BiocManager::install("DESeq2")

1.1.6 clusterProfiler

> if (!requireNamespace("BiocManager", quietly = TRUE))
+     install.packages("BiocManager")
> 
> BiocManager::install("clusterProfiler")

1.1.7 reshape2

> install.packages('reshape2')

1.1.8 ggplot2

> install.packages('ggplot2')

1.2 运行

library()检查已安装的包是否能够运行

e.g.

> library(CLL)
# 载入需要的程辑包:affy
# 载入需要的程辑包:BiocGenerics
# 载入需要的程辑包:parallel
# 
# 载入程辑包:‘BiocGenerics’
# 
# The following objects are masked from ‘package:parallel’:
#   
#   clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
# clusterExport, clusterMap, parApply, parCapply, parLapply,
# parLapplyLB, parRapply, parSapply, parSapplyLB
# 
# The following objects are masked from ‘package:stats’:
#   
#   IQR, mad, sd, var, xtabs
# 
# The following objects are masked from ‘package:base’:
#   
#   anyDuplicated, append, as.data.frame, basename, cbind, colnames,
# dirname, do.call, duplicated, eval, evalq, Filter, Find, get,
# grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match,
# mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position,
# rank, rbind, Reduce, rownames, sapply, setdiff, sort, table,
# tapply, union, unique, unsplit, which, which.max, which.min
# 
# 载入需要的程辑包:Biobase
# Welcome to Bioconductor
# 
# Vignettes contain introductory material; view with
# 'browseVignettes()'. To cite Bioconductor, see
# 'citation("Biobase")', and for packages 'citation("pkgname")'.

若试图运行一个没有安装的包:

> available.packages()
#                              Package                         Version         
# A3                            "A3"                            "1.0.0" 
> library(A3)
# Error in library(A3) : 不存在叫‘A3’这个名字的程辑包

2. 了解ExpressionSet对象

比如CLL包里面就有data(sCLLex) ,找到它包含的元素,提取其表达矩阵(使用exprs函数),查看其大小。

提取表达矩阵:

> library('CLL')
> data("sCLLex")
> exprSet <- exprs(sCLLex)

查看大小:

> dim(exprSet)
# [1] 12625    22
> str(exprSet)
#  num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...

3. 了解 str,head,help函数,作用于 第二步提取到的表达矩阵

3.1 str()

> str(exprSet)
#  num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...

Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object. It calls args for (non-primitive) function objects.

用于查看矩阵的大小(列数、行数),查看列名、行名的示例。

3.2 head()

> head(exprSet)
#          CLL11.CEL CLL12.CEL CLL13.CEL CLL14.CEL CLL15.CEL CLL16.CEL
# 1000_at    5.743132  6.219412  5.523328  5.340477  5.229904  4.920686
# 1001_at    2.285143  2.291229  2.287986  2.295313  2.662170  2.278040
# 1002_f_at  3.309294  3.318466  3.354423  3.327130  3.365113  3.568353
# 1003_s_at  1.085264  1.117288  1.084010  1.103217  1.074243  1.073097
# 1004_at    7.544884  7.671801  7.474025  7.152482  6.902932  7.368660
# 1005_at    5.083793  7.610593  7.631311  6.518594  5.059087  4.855161
# ...

Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.

用于查看对象的前几行,默认为前6行,可通过 n = xL 调整。

3.3 help()

> help("exprs")
R Documentation

help is the primary interface to the help systems.

用于调出 R Documentation 对某个函数的说明页面,?fuctionhelp() 的简略写法

4. 安装并了解 hgu95av2.db 包

看看 ls(“package:hgu95av2.db”) 后显示的哪些变量?

安装:

> if (!requireNamespace("BiocManager", quietly = TRUE))
+ install.packages("BiocManager")
>  
> BiocManager::install("hgu95av2.db")
> library(hgu95av2.db)
> ls("package:hgu95av2.db")
# [1] "hgu95av2"              "hgu95av2.db"           "hgu95av2_dbconn"     
# [4] "hgu95av2_dbfile"       "hgu95av2_dbInfo"       "hgu95av2_dbschema"   
# [7] "hgu95av2ACCNUM"        "hgu95av2ALIAS2PROBE"   "hgu95av2CHR"         
# [10] "hgu95av2CHRLENGTHS"    "hgu95av2CHRLOC"        "hgu95av2CHRLOCEND" 
# [13] "hgu95av2ENSEMBL"       "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"   
# [16] "hgu95av2ENZYME"        "hgu95av2ENZYME2PROBE"  "hgu95av2GENENAME"   
# [19] "hgu95av2GO"            "hgu95av2GO2ALLPROBES"  "hgu95av2GO2PROBE"   
# [22] "hgu95av2MAP"           "hgu95av2MAPCOUNTS"     "hgu95av2OMIM"       
# [25] "hgu95av2ORGANISM"      "hgu95av2ORGPKG"        "hgu95av2PATH"       
# [28] "hgu95av2PATH2PROBE"    "hgu95av2PFAM"          "hgu95av2PMID"       
# [31] "hgu95av2PMID2PROBE"    "hgu95av2PROSITE"       "hgu95av2REFSEQ"     
# [34] "hgu95av2SYMBOL"        "hgu95av2UNIGENE"       "hgu95av2UNIPROT"   
> as.list(hgu95av2GO[1]
# $`1000_at`$`GO:0097110`
# $`1000_at`$`GO:0097110`$GOID
# [1] "GO:0097110"

# $`1000_at`$`GO:0097110`$Evidence
# [1] "IEA"

# $`1000_at`$`GO:0097110`$Ontology
# [1] "MF"

1000_at为探针,分别显示GOID, Evidence(基因注释证据代码), Ontology(基因本体论中的三个分类,BP = Biological Process; CC = Cellular Component; MF = Molecular Function).

> as.list(hgu95av2UNIGENE[1])
# $`1000_at`
# [1] "Hs.861"

"Hs.861"为NCBI UniGene数据库的ID.

提取其中的数据:

> get(featureNames(sCLLex)[1],hgu95av2GO)[1]
# $`GO:0000165`
# $`GO:0000165`$GOID
# [1] "GO:0000165"

# $`GO:0000165`$Evidence
# [1] "NAS"

# $`GO:0000165`$Ontology
# [1] "BP"

5. 理解 head(toTable(hgu95av2SYMBOL)) 的用法

找到TP53基因对应的探针ID

> head(toTable(hgu95av2SYMBOL))
#    probe_id  symbol
# 1   1000_at   MAPK3
# 2   1001_at    TIE1
# 3 1002_f_at CYP2C19
# 4 1003_s_at   CXCR5
# 5   1004_at   CXCR5
# 6   1005_at   DUSP1

找到TP53基因对应的探针ID:

> ids <- toTable(hgu95av2SYMBOL)
> grep("TP53",ids$symbol)
# [1]   732   884   966   997  1420  2675  3322  4120  4304  5475  5526 10084
> ids[grep("TP53",ids$symbol),]
#         probe_id  symbol
# 732      1711_at TP53BP1
# 884      1860_at TP53BP2
# 966      1939_at    TP53
# 997    1974_s_at    TP53
# 1420    31618_at    TP53
# 2675    33025_at TP53TG5
# 3322    33749_at TP53TG1
# 4120    34629_at TP53I11
# 4304    34822_at TP53BP2
# 5475    36079_at  TP53I3
# 5526    36136_at TP53I11
# 10084 40986_s_at TP53BP1

6. 理解探针与基因的对应关系

总共多少个基因,基因最多对应多少个探针,是哪些基因,是不是因为这些基因很长,所以在其上面设计多个探针呢?

> summary(hgu95av2SYMBOL)
# SYMBOL map for chip hgu95av2 (object of class "ProbeAnnDbBimap")
# |
# | Lkeyname: probe_id (Ltablename: probes)
# |    Lkeys: "1000_at", "1001_at", ... (total=12625/mapped=11460)
# |
# | Rkeyname: symbol (Rtablename: gene_info)
# |    Rkeys: "A1BG", "A2M", ... (total=61468/mapped=8585)
# |
# | direction: L --> R

共有61468个基因,对应上8585个。

> table(ids$symbol) %>% sort() %>% tail()

# YME1L1  GAPDH INPP4A    MYB PTGER3  STAT1 
#     7      8      8      8      8      8 

列出 ids 的 'symbol' 列及每个元素出现的个数,排序,显示最后6个;

得到对应最多探针的基因。

基因长度与设计在上面的探针数量无关。

7. 第二步提取到的表达矩阵,找到那些不在 hgu95av2.db 包收录的对应着SYMBOL的探针

> mapped_probes <- mappedkeys(hgu95av2SYMBOL)
> exprSetpb <- row.names(exprSet)
> length(grep("FALSE",(exprSetpb %in% mapped_probes)))
# [1] 1165
> grep("FALSE",(exprSetpb %in% mapped_probes))
# [1]     8    52    98    99   100   109   115   117   128   135   157   168
# [13]   169   171   174   195   197   199   203   219   225   282   283   292
# ...
> head(exprSetpb[index])
# [1] "1007_s_at" "1047_s_at" "1089_i_at" "108_g_at"  "1090_f_at" "1099_s_at"

不确定对不对....

8. 过滤表达矩阵

删除那1165个没有对应基因名字的探针。

> e = exprSet[rownames(exprSet) %in% mapped_probes,]
> str(exprSet)
# num [1:12625, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : chr [1:12625] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
#  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
> str(e)
# num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
#  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...

9. 整合表达矩阵

多个探针对应一个基因的情况下,只保留在所有样本里面平均表达量最大的那个探针。

> maxid = by(e,ids$symbol,function(x) rownames(x)[which.max(rowMeans(x))])
# 按照 ids$symbol 将 e 分组,取行平均值最大的行 的 行名
# 但,为什么要用 symbol 分组呢 😂
> uniid = as.character(maxid)
> uni_e = e[rownames(e) %in% uniid,]
> maxid = by(e,ids$probe_id,function(x) rownames(x)[which.max(rowMeans(x))])
> uniid = as.character(maxid)
> uni_e = e[rownames(e) %in% uniid,]
> str(uni_e)
# num [1:11460, 1:22] 5.74 2.29 3.31 1.09 7.54 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : chr [1:11460] "1000_at" "1001_at" "1002_f_at" "1003_s_at" ...
#  ..$ : chr [1:22] "CLL11.CEL" "CLL12.CEL" "CLL13.CEL" "CLL14.CEL" ...
# emmmmm 结局是一样的,ok我懂了

10. 把过滤后的表达矩阵更改行名为基因的symbol

因为这个时候探针和基因是一对一关系了。

> rownames(uni_e) = ids[match(rownames(uni_e),ids$probe_id),2]
# 将ids的第2列(symbol)对应到 uni_e 的行名
> library(reshape2)
> m_e = melt(uni_e)
> colnames(m_e) = c('symbol','sample','value')
reshape2_melt()

分组:

> pd = pData(sCLLex)
> Disease = pd[,2]
> table(Disease)
# Disease
# progres.   stable 
#      14        8 
> m_e$group = rep(Disease,each=nrow(uni_e))

11. 对第10步得到的表达矩阵进行探索

先画第一个样本的所有基因的表达量的boxplot,hist,density,然后画所有样本的这些图

> suppressPackageStartupMessages(library(ggplot2))
> ggplot(m_e,aes(x=sample,y=value,fill=group))+geom_boxplot()
ggplot_box
> ggplot(m_e,aes(value,fill=group)) + geom_histogram(bins = 200)+facet_wrap(~sample, nrow = 4)
ggplot_facetwrap
> ggplot(m_e,aes(value,col=group)) + geom_density()

[图片上传失败...(image-e00840-1559916805026)]

12. 理解统计学指标mean,median,max,min,sd,var,mad

并计算出每个基因在所有样本的这些统计学指标,最后按照mad值排序,取top 50 mad值的基因,得到列表。(注意:这个题目出的并不合规,请仔细看。)

12.1 mean

算术平均值

> meanlist=apply(uni_e, 1, mean)
> head(meanlist)
#  UTF1        EVX1      ADGRB1        SYN1      CHRNB4        ARR3 
# -0.14880683 -0.13705268  0.05817044  0.07589677  0.08317654  0.20826165 

12.2 median

中位数

> mdlist=apply(uni_e, 1, median)
> head(mdlist)
#    MAPK3     TIE1  CYP2C19    CXCR5    DUSP1    MMP10 
# 5.368993 2.333562 3.378740 7.402258 5.902370 3.320628

12.3 max

最大值

> maxlist=apply(uni_e, 1, max)
> head(maxlist)
#    MAPK3     TIE1  CYP2C19    CXCR5    DUSP1    MMP10 
# 6.251678 2.662170 3.798335 8.802729 9.919125 3.574332 

12.4 min

最小值

> minlist=apply(uni_e, 1, min)
> head(minlist)
#    MAPK3     TIE1  CYP2C19    CXCR5    DUSP1    MMP10 
# 4.826131 2.222276 3.247276 6.456285 4.097580 3.213493 

12.5 sd

标准差

> sdlist=apply(uni_e, 1, sd)
> head(sdlist)
#      MAPK3       TIE1    CYP2C19      CXCR5      DUSP1      MMP10 
# 0.36652641 0.09046121 0.12801488 0.58908623 1.73368583 0.10074804 

12.6 var

方差

> varlist=apply(uni_e, 1, var)
> head(varlist)
#      MAPK3       TIE1    CYP2C19      CXCR5      DUSP1      MMP10 
# 0.13434161 0.00818323 0.01638781 0.34702258 3.00566656 0.01015017 

12.7 mad

绝对中位差Median Absolute Deviation

> madlist=apply(uni_e, 1, mad)
> head(madlist)
#      MAPK3       TIE1    CYP2C19      CXCR5      DUSP1      MMP10 
# 0.22779776 0.06516670 0.08294023 0.38497146 1.43854685 0.09476130 

top 50 mad值的基因:

> tail(sort(madlist),50)
#     PFN2   TNFSF9 ARHGAP44   P2RY14  THEMIS2      LPL    ANXA4    DUSP6 
# 1.481294 1.485155 1.488032 1.505107 1.507287 1.532212 1.533327 1.551320 
#    DUSP5     H1FX     FLNA   CLEC2B   TSPYL2   ZNF266   S100A9    NR4A2 
# 1.553782 1.557412 1.574436 1.578055 1.582430 1.587748 1.608285 1.612875 
#    TGFBI     ARF6    APBB2     VCAN    RBM38     CAPG   PLXNC1     RGS2 
# 1.643149 1.654548 1.674443 1.681098 1.703638 1.713747 1.718297 1.770151 
#   RNASE6    VAMP5     CYBB     GNLY     CCL3     OAS1    TRIB2  ZNF804A 
# 1.775430 1.791017 1.811973 1.814364 1.862311 1.883509 1.937294 1.986163 
#      IGH    PCDH9    VIPR1   COBLL1  GUSBP11   S100A8      HBB   LHFPL2 
# 2.090866 2.144223 2.171912 2.179666 2.204212 2.220970 2.267136 2.317045 
#     FCN1    ZAP70    IGLC1   LGALS1      FOS   SLAMF1     TCF7      DMD 
# 2.371590 2.579046 2.590895 2.600604 2.938078 2.944105 2.993352 3.071541 
#  IGF2BP3   FAM30A 
# 3.234011 3.982191 

13. 根据第12步骤得到top 50 mad值的基因列表来取表达矩阵的子集

并且热图可视化子表达矩阵。试试看其它5种热图的包的不同效果。

取子集:

> top50gene=tail(sort(madlist),50)
> top50gene=as.data.frame(top50gene)
> top50genelist=rownames(top50gene)
> top50matrix=uni_e[top50genelist,]
> ct=scale(top50matrix,center=T,scale=F)  ## 中心化
> nmlztop50matrix=scale(ct,center=T,scale=T) ## 标准化
> pheatmap::pheatmap(nmlztop50matrix)
pheatmap
> heatmap(nmlztop50matrix)
heatmap
> ggplot(ml_nmtop50,aes(sample,symbol))+
+   geom_tile(aes(fill=value),colour = "white")+
+   scale_fill_gradient(low = "white",high = "steelblue")
ggplot_heatmap
> library(gplots)
> heatmap.2(nmlztop50matrix,key = T,symkey = FALSE,density.info="none",trace="none")
gplots_heatmap.2
> library(lattice)
> library(latticeExtra)
> x  <- t(as.matrix(scale(nmlztop50matrix)))
> dd.row <- as.dendrogram(hclust(dist(x)))
> row.ord <- order.dendrogram(dd.row)
> levelplot(t(nmlztop50matrix),aspect = "fill",xlab="sample",ylab="symbol",colorkey = list(space = "left"),legend=list(right=list(fun=dendrogramGrob,args=list(x=dd.row,rod=row.ord,side='right',size=5))))
> levelplot(t(nmlztop50matrix),aspect =
+             "fill",xlab="sample",ylab="symbol",colorkey =
+             list(space = "left"),legend=
+             list(right=list(fun=dendrogramGrob,args=
+                               list(x=dd.row,rod=row.ord,side='right',size=5))))
lattice_levelplot

emmm PDF示例里的代码跑出来top50matirx里是很多基因的重复,so这是我琢磨出来的结果

14. 取不同统计学指标mean,median,max,mean,sd,var,mad的各top50基因列表

使用UpSetR包来看他们之间的overlap情况。

各列表:

> e_mean=tail(sort(meanlist),50)
> e_md=tail(sort(mdlist),50)
> e_max=tail(sort(maxlist),50)
> e_min=tail(sort(minlist),50)
> e_sd=tail(sort(sdlist),50)
> e_var=tail(sort(varlist),50)
> e_mad=tail(sort(madlist),50)
> suppressPackageStartupMessages(library("UpSetR"))
> library(magrittr)
> e_all = data.frame(all,
+                    e_mean=ifelse(all %in% names(e_mean),1,0),
+                    e_md=ifelse(all %in% names(e_md),1,0),
+                    e_max=ifelse(all %in% names(e_max),1,0),
+                    e_min=ifelse(all %in% names(e_min),1,0),
+                    e_sd=ifelse(all %in% names(e_sd),1,0),
+                    e_var=ifelse(all %in% names(e_var),1,0),
+                    e_mad=ifelse(all %in% names(e_mad),1,0)
+ )
> upset(e_all,nsets = 7,sets.bar.color = "#BD1111")
UpSetR_magrittr

15. 在第二步的基础上面提取CLL包里面的data(sCLLex) 数据对象的样本的表型数据。

> pd=pData(sCLLex)
> grouplist=as.character(pd[,2])
> table(grouplist)
# grouplist
# progres.   stable 
#       14        8 

16. 对所有样本的表达矩阵进行聚类并且绘图

然后添加样本的临床表型数据信息(更改样本名)

> clust = t(exprSet)
> rownames(clust) = colnames(exprSet)
> View(clust)
> View(exprSet)
> clust_dist = dist(clust,method = "euclidean")
> hc = hclust(clust_dist,"ward.D")
> suppressPackageStartupMessages(library(factoextra))
> fviz_dend(hc, k = 4 ,cex = 0.5,k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),color_labels_by_k = TRUE, rect = TRUE)
hclust_fivzdend

17.对所有样本的表达矩阵进行PCA分析并且绘图,添加表型信息

> pca_data <- prcomp(t(exprSet),scale=TRUE)
> pcx <- data.frame(pca_data$x)
> pcr <- cbind(samples=rownames(pcx),grouplist, pcx) 
> ggplot(pcr, aes(PC1, PC2))+geom_point(size=5, aes(color=grouplist)) +
+   geom_text(aes(label=samples),hjust=-0.1, vjust=-0.3)
procmp_ggplot

18. 根据表达矩阵及样本分组信息进行批量T检验,得到检验结果表格

分组,求p值

> gl = as.factor(grouplist)
> group1 = which(grouplist == levels(gl)[1])
> group2 = which(grouplist == levels(gl)[2])
# e = exprSet[rownames(exprSet) %in% mapped_probes,]
> et1 = e[,group1]
> et2 = e[,group2]
> data_t = cbind(et1,et2)
> pvals = apply(e, 1, function(x){
+   t.test(as.numeric(x)~grouplist)$p.value
+ })    ## p值
> p.adj = p.adjust(pvals, method = "BH")  ## 校正后的p值

求log2FC, 做批量t检验

> data_mean_c = rowMeans(et1) ## control
> data_mean_t = rowMeans(et2) ## treatment
> log2FC = data_mean_t-data_mean_c
> DEG_t.test = cbind(data_mean_c, data_mean_t, log2FC, pvals, p.adj)
> DEG_t.test=DEG_t.test[order(DEG_t.test[,4]),] ##排序
> DEG_t.test=as.data.frame(DEG_t.test)
> head(DEG_t.test)
#         data_mean_c data_mean_t     log2FC        pvals     p.adj
# 36129_at    7.875615    8.791753  0.9161377 1.629755e-05 0.1867699
# 37676_at    6.622749    7.965007  1.3422581 4.058944e-05 0.2211373
# 33791_at    7.616197    5.786041 -1.8301554 6.965416e-05 0.2211373
# 39967_at    4.456446    2.152471 -2.3039752 8.993339e-05 0.2211373
# 34594_at    5.988866    7.058738  1.0698718 9.648226e-05 0.2211373
# 32198_at    4.157971    3.407405 -0.7505660 2.454557e-04 0.3192169

19. 使用limma包对表达矩阵及样本分组信息进行差异分析,得到差异分析表格

重点看logFC和P值,画火山图(就是logFC和-log10(P值)的散点图)。

> suppressPackageStartupMessages(library(limma))
> design1=model.matrix(~factor(grouplist))  ## 设计矩阵
> colnames(design1)=levels(factor(grouplist))
> rownames(design1)=colnames(exprSet)
> fit1 = lmFit(exprSet,design1)
> fit1=eBayes(fit1)
> options(digits = 3) 
> mtx1 = topTable(fit1,coef=2,adjust='BH',n=Inf) 
> DEG_mtx1 = na.omit(mtx1) ##去除缺失值
> head(DEG_mtx1)
#           logFC AveExpr     t  P.Value adj.P.Val    B
# 39400_at  1.028    5.62  5.84 8.34e-06    0.0334 3.23
# 36131_at -0.989    9.95 -5.77 9.67e-06    0.0334 3.12
# 33791_at -1.830    6.95 -5.74 1.05e-05    0.0334 3.05
# 1303_at   1.384    4.46  5.73 1.06e-05    0.0334 3.04
# 36122_at -0.780    7.26 -5.14 4.21e-05    0.1062 1.93
# 36939_at -2.547    6.92 -5.04 5.36e-05    0.1128 1.74

画图:

> DEG=DEG_mtx1
> logFC_cutoff <- with(DEG,mean(abs(logFC)) + 2*sd(abs(logFC)) ) 
> DEG$result = as.factor(ifelse(DEG$P.Value < 0.05 & abs(DEG$logFC) > logFC_cutoff,  ## p.Value<0.05,|logFC|>1
+                               ifelse(DEG$logFC > logFC_cutoff ,'UP','DOWN'),'NOT'))
> title <- paste0('Cutoff for logFC is ',round(logFC_cutoff,3), #round保留小数位数
+                     
+                     '\nThe number of up gene is ',nrow(DEG[DEG$result =='UP',]) ,
+                     
+                     '\nThe number of down gene is ',nrow(DEG[DEG$result =='DOWN',])
+ )
> ggplot(data=DEG, aes(x=logFC, y=-log10(P.Value), color=result)) +
+   geom_point(alpha=0.4, size=1.75) +
+   theme_set(theme_set(theme_bw(base_size=20)))+
+   xlab("log2 fold change") + ylab("-log10 p-value") +
+   ggtitle( title ) + theme(plot.title = element_text(size=15,hjust = 0.5))+
+   scale_colour_manual(values = c('blue','black','red'))
ggplot_volcano

20.对T检验结果的P值和limma包差异分析的P值画散点图,看看哪些基因相差很大?

> DEG_t.test = DEG_t.test[rownames(DEG_mtx1),]
> colnames(DEG_t.test)
# [1] "data_mean_c" "data_mean_t" "log2FC"      "pvals"      
# [5] "p.adj"      
> colnames(DEG_mtx1)
# [1] "logFC"     "AveExpr"   "t"         "P.Value"   "adj.P.Val"
# [6] "B"        

画图:

> plot(DEG_t.test[,4],DEG_mtx1[,4])
> plot(-log10(DEG_t.test[,4]),-log10(DEG_mtx1[,4]))
ggplot_dot
ggplot_dot_lg

最后,向大家隆重推荐生信技能树的一系列干货!

  1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
  2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
  3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,558评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,002评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 159,036评论 0 349
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,024评论 1 285
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,144评论 6 385
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,255评论 1 292
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,295评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,068评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,478评论 1 305
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,789评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,965评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,649评论 4 336
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,267评论 3 318
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,982评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,223评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,800评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,847评论 2 351

推荐阅读更多精彩内容