immunarch是一个R包,旨在成为分析T细胞受体(TCR)和B细胞受体(BCR)的工具箱,主要针对医学科学家和生物信息学家。immunarch的任务是尽可能轻松的进行免疫测序数据分析,帮助你专注于研究而非编程。
为什么是immunarch
- 可分析任何类型的数据:single-cell, bulk, data tables, databases 。
- 以社区为中心:在这个由全世界近3万名研究人员和医学科学家组成的社区中提问、分享知识、茁壮成长。辉瑞(Pfizer)、诺华(Novartis)、Regeneron、斯坦福(Stanford)、加州大学旧金山分校(UCSF)和麻省理工学院(MIT)都信任我们。
- 一个函数一张图:用8行代码写一篇完整的博士论文,或者用5-10行免疫代码复制几乎所有的出版物图表。
- 走在科学前沿:我们定期用最新的方法更新免疫学。让我们知道你需要什么!
- 自动格式检测和分析所有流行的免疫测序格式:从MiXCR和ImmunoSEQ到10XGenomics和ArcherDX 。
immunarch是一个集成的工具,安装的时候同时会安装用于数据分析和可视化的所有广泛使用的R包,因此需要安装相当多的依赖项。一旦完成安装,你只需要一条命令就可以得到AIRR数据分析框架和完整的数据科学软件包生态系统,immunarch将成为单细胞和免疫系统库数据科学的起点。
目前,immunarch已经托管在CRAN上,性能已经基本稳定,你可以这样安装。
install.packages("immunarch")
当然,你也可以安装开发版本:
install.packages("devtools") # skip this if you already installed devtools
devtools::install_github("immunomind/immunarch")
快速开始
我一直认为从示例学习是比较快的,而使用immunarch也是一件轻松愉悦的事,下面将演示如何快速开始。
- 载入包和示例数据
library(immunarch)
data(immdata)
载入示例数据当然简单,关键事能不能导入我们自己的数据。immunarch可以直接导入的数据有:
- "immunoseq" - ImmunoSEQ of any version. http://www.adaptivebiotech.com/immunoseq
- "mitcr" - MiTCR. https://github.com/milaboratory/mitcr
- "mixcr" - MiXCR (the "all" files) of any version. https://github.com/milaboratory/mixcr
- "migec" - MiGEC. http://migec.readthedocs.io/en/latest/
- "migmap" - For parsing IgBLAST results postprocessed with MigMap. https://github.com/mikessh/migmap
- "tcr" - tcR, our previous package. https://imminfo.github.io/tcr/
- "vdjtools" - VDJtools of any version. http://vdjtools-doc.readthedocs.io/en/latest/
- "imgt" - IMGT HighV-QUEST. http://www.imgt.org/HighV-QUEST/
- "airr" - adaptive immune receptor repertoire (AIRR) data format. http://docs.airr-community.org/en/latest/datarep/overview.html
- "10x" - 10XGenomics clonotype annotations tables. https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation
- "archer" - ArcherDX clonotype tables. https://archerdx.com/
你可以通过?repLoad
获得更详细的数据导入信息。作为快速入门的示例,我们看看数据格式如何:
immdata$meta
# A tibble: 12 x 6
Sample ID Sex Age Status Lane
<chr> <chr> <chr> <dbl> <chr> <chr>
1 A2-i129 C1 M 11 C A
2 A2-i131 C2 M 9 C A
3 A2-i133 C4 M 16 C A
4 A2-i132 C3 F 6 C A
5 A4-i191 C8 F 22 C B
6 A4-i192 C9 F 24 C B
7 MS1 MS1 M 12 MS C
8 MS2 MS2 M 30 MS C
9 MS3 MS3 M 8 MS C
10 MS4 MS4 F 14 MS C
11 MS5 MS5 F 15 MS C
12 MS6 MS6 F 15 MS C
immdata$data$`A2-i129`
# A tibble: 6,532 x 15
Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int>
1 173 0.0204 TGCGCC~ CASSQE~ TRBV4~ TRBD1 TRBJ2~ 16
2 163 0.0192 TGCGCC~ CASSYR~ TRBV4~ TRBD1 TRBJ2~ 11
3 66 0.00776 TGTGCC~ CATSTN~ TRBV15 TRBD1 TRBJ2~ 11
4 54 0.00635 TGTGCC~ CATSIG~ TRBV15 TRBD2 TRBJ2~ 11
5 48 0.00565 TGTGCC~ CASSPW~ TRBV27 TRBD1 TRBJ1~ 11
6 48 0.00565 TGCGCC~ CASQGD~ TRBV4~ TRBD1 TRBJ1~ 8
7 40 0.00471 TGCGCC~ CASSQD~ TRBV4~ TRBD1 TRBJ2~ 16
8 31 0.00365 TGTGCC~ CASSEE~ TRBV2 TRBD1 TRBJ1~ 15
9 30 0.00353 TGCGCC~ CASSQP~ TRBV4~ TRBD1 TRBJ2~ 14
10 28 0.00329 TGTGCC~ CASSWV~ TRBV6~ TRBD1 TRBJ2~ 12
# ... with 6,522 more rows, and 7 more variables:
# D.start <int>, D.end <int>, J.start <int>, VJ.ins <dbl>,
# VD.ins <dbl>, DJ.ins <dbl>, Sequence <lgl>
可见,它是为每个样本准备了一个数据框,包含哪些信息呢?
colnames(immdata$data$`A2-i129`)
[1] "Clones" "Proportion" "CDR3.nt" "CDR3.aa"
[5] "V.name" "D.name" "J.name" "V.end"
[9] "D.start" "D.end" "J.start" "VJ.ins"
[13] "VD.ins" "DJ.ins" "Sequence"
- 基本数据统计和可视化
repExplore(immdata$data, "lens") %>% vis() # Visualise the length distribution of CDR3
repClonality(immdata$data, "homeo") %>% vis() # Visualise the relative abundance of clonotypes
- 探索和比较T细胞/B细胞的序列
repOverlap(immdata$data) %>% vis() # Build the heatmap of public clonotypes shared between repertoires
geneUsage(immdata$data[[1]]) %>% vis() # Visualise the V-gene distribution for the first repertoire
repDiversity(immdata$data) %>% vis(.by = "Status", .meta = immdata$meta) # Visualise the Chao1 diversity of repertoires, grouped by the patient status
你可以用?repDiversity
来查看immunarch 能帮你计算多少种多样性指标。
Pick a method used for estimation out of a following list: chao1, hill, div, gini.simp, inv.simp, gini, raref, d50, dxx.
在repDiversity函数中实现了对曲目多样性估计的几种方法。与上述函数相似的。method参数设置了多样性估计的方法。你可以选择以下方法之一:
-
Chao1
estimator is a nonparameteric asymptotic estimator of species richness (number of species in a population). -
Hill
numbers are a mathematically unified family of diversity indices (differing only by an exponent q). -
div
- True diversity, or the effective number of types, refers to the number of equally-abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant. -
gini.simp
- The Gini-Simpson index is the probability of interspecific encounter, i.e., probability that two entities represent different types. -
inv.simp
- Inverse Simpson index is the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest. -
gini
- The Gini coefficient measures the inequality among values of a frequency distribution (for example levels of income). A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of one (or 100 percents ) expresses maximal inequality among values (for example where only one person has all the income). -
raref
- Rarefaction is a technique to assess species richness from the results of sampling through extrapolation.