单细胞免疫组学分析练习-1:cellranger multi
immunarch是一个R包,旨在成为分析T细胞受体(TCR)和B细胞受体(BCR)的工具箱
感觉主要是用来单独分析免疫组的数据,而非将免疫组与转录组结合分析
1. 安装immunarch
install.packages("immunarch")
immunarch有内置数据集
data("immdata")
#查看
immdata$data
immdata$meta
2. 导入10x数据
immunarch可以直接导入的数据有:
- "immunoseq" - ImmunoSEQ of any version. http://www.adaptivebiotech.com/immunoseq
- "mitcr" - MiTCR. https://github.com/milaboratory/mitcr
- "mixcr" - MiXCR (the "all" files) of any version. https://github.com/milaboratory/mixcr
- "migec" - MiGEC. http://migec.readthedocs.io/en/latest/
- "migmap" - For parsing IgBLAST results postprocessed with MigMap. https://github.com/mikessh/migmap
- "tcr" - tcR, our previous package. https://imminfo.github.io/tcr/
- "vdjtools" - VDJtools of any version. http://vdjtools-doc.readthedocs.io/en/latest/
- "imgt" - IMGT HighV-QUEST. http://www.imgt.org/HighV-QUEST/
- "airr" - adaptive immune receptor repertoire (AIRR) data format. http://docs.airr-community.org/en/latest/datarep/overview.html
- "10x" - 10XGenomics clonotype annotations tables. https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation
- "archer" - ArcherDX clonotype tables. https://archerdx.com/
在您的R环境中运行下面的代码,以将数据加载为Immunarch的格式。您可以在包含Cellranger输出文件的整个文件夹上运行它。repLoad将忽略不支持的文件格式。
TCR_10x <- repLoad("/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t")
== Step 1/3: loading repertoire files... ==
Processing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t" ...
-- [1/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/airr_rearrangement.tsv" -- airr
-- [2/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/cell_barcodes.json" -- -- [3/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/clonotypes.csv" -- unsupported format, skipping
-- [4/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/concat_ref.bam.bai" -- unsupported format, skipping
-- [5/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/consensus_annotations.csv" -- 10x (consensus)
-- [6/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/consensus.bam.bai" -- unsupported format, skipping
-- [7/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/filtered_contig_annotations.csv" -- 10x (filt.contigs)
-- [8/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/vdj_contig_info.pb" -- unsupported format, skipping
-- [9/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/vloupe.vloupe" -- unsupported format, skipping
== Step 2/3: checking metadata files and merging files... ==
Processing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t" ...
-- Metadata file not found; creating a dummy metadata...
== Step 3/3: processing paired chain data... ==
Done!
通过分析load的过程我们可以看到,repLoad()
可以自动识别目标文件
为了直接读取filtered_contig_annotations.csv
,也可以
TCR_10x <- repLoad("/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/filtered_contig_annotations.csv")
如果有多个样本,需要构建一个文件夹作为file_path
,把每个filtered_contig_annotations.csv
重命名为样本名并复制到该文件夹下,再做一个metadata.txt
表明分组情况。
举个例子:
> imm_10x <- repLoad("/home/user/myh/raw_data/BALBC/immunarch/inputs")
== Step 1/3: loading repertoire files... ==
Processing "/home/user/myh/raw_data/BALBC/immunarch/inputs" ...
-- [1/3] Parsing "/home/user/myh/raw_data/BALBC/immunarch/inputs/BCR.csv" -- 10x (filt.contigs)
-- [2/3] Parsing "/home/user/myh/raw_data/BALBC/immunarch/inputs/metadata.txt" -- metadata
-- [3/3] Parsing "/home/user/myh/raw_data/BALBC/immunarch/inputs/TCR.csv" -- 10x (filt.contigs) 0s
== Step 2/3: checking metadata files and merging files... ==
Processing "/home/user/myh/raw_data/BALBC/immunarch/inputs" ...
-- Everything is OK!
== Step 3/3: processing paired chain data... ==
Done!
看一看包括的内容:
> colnames(imm_10x$data$TCR)
[1] "Clones" "Proportion" "CDR3.nt" "CDR3.aa" "V.name" "D.name" "J.name"
[8] "V.end" "D.start" "D.end" "J.start" "VJ.ins" "VD.ins" "DJ.ins"
[15] "Sequence" "chain" "Barcode" "raw_clonotype_id" "ContigID"
3. Data filtering
4.基本情况展示
(1)repExplore
repExplore(imm_10x$data, "lens") %>% vis() # Visualise the length distribution of CDR3
可以按照分组情况展示,并进行统计学计算:
exp_vol <- repExplore(immdata$data, .method = "volume")
p1 <- vis(exp_vol, .by = c("Status"), .meta = immdata$meta)
p2 <- vis(exp_vol, .by = c("Status", "Sex"), .meta = immdata$meta)
p1 + p2
如果对默认出的图不满意,你可以用fixVis打开一个shiny窗口,一点一点调整直到可以发表。
这个真的很好用!
还可以看abundance
exp_cnt <- repExplore(immdata$data, .method = "count")
p2 <- vis(exp_cnt)
p5 <- vis(exp_cnt, .by = "Sex", .meta = immdata$meta)
(2)repClonality
克隆性,clonality(不是clonotypes)
来读一下help("repClonality“)
Description
repClonality
function encompasses several methods to measure clonal proportions in a given repertoire.
Usage
repClonality(
.data,
.method = c("clonal.prop", "homeo", "top", "rare"),
.perc = 10,
.clone.types = c(Rare = 1e-05, Small = 1e-04, Medium = 0.001, Large = 0.01,
Hyperexpanded = 1),
.head = c(10, 100, 1000, 3000, 10000, 30000, 1e+05),
.bound = c(1, 3, 10, 30, 100)
)
Arguments
| .perc
|
A single numerical value ranging from 0 to 100.
| .method
|
A String with one of the following options: "clonal.prop"
, "homeo"
, "top"
or "rare"
.
Set
"clonal.prop"
to compute clonal proportions or in other words percentage of clonotypes required to occupy specified by.perc
percent of the total immune repertoire.Set
"homeo"
to analyse relative abundance (also known as clonal space homeostasis), which is defined as the proportion of repertoire occupied by clonal groups with specific abundances.Set
"top"
to estimate relative abundance for the groups of top clonotypes in repertoire, e.g., ten most abundant clonotypes. Use".head"
to define index intervals, such as 10, 100 and so on.Set
"rare"
to estimate relative abundance for the groups of rare clonotypes with low counts. Use".bound"
to define the boundaries of clonotype groups.
| .clone.types
|
A named numerical vector with the boundaries of the half-closed intervals that mark off clonal groups.
| .head
|
A numerical vector with ranges of the top clonotypes.
| .bound
|
A numerical vector with ranges of abundance for the rare clonotypes in the dataset.
repClonality(imm_10x$data, "homeo") %>% vis() # Visualise the relative abundance of clonotypes
Top clonal proportion
imm_top <- repClonality(immdata$data, .method = "top", .head = c(10, 100, 1000, 3000, 10000))
imm_top
imm_top %>% vis()
稀有克隆分布
imm_rare <- repClonality(immdata$data, .method = "rare")
imm_rare
imm_rare %>% vis()
相对丰度
imm_hom <- repClonality(immdata$data,
.method = "homeo",
.clone.types = c(Small = .0001, Medium = .001, Large = .01, Hyperexpanded = 1)
)
imm_hom # Visualise the relative abundance of clonotypes
vis(imm_hom) + vis(imm_hom, .by = c("Status", "Sex"), .meta = immdata$meta)
(3)repOverlap: 比较不同样本TCR/BCR repertoire的相似性
免疫组重叠(Repertoire overlap)是最常用的度量Repertoire 相似度的方法。它是通过计算在给定的Repertoire 之间共享的克隆类型的特定统计量来实现的,也称为“公共”克隆类型。immunarch 提供了几个指标:-公共克隆型数量(.method = "public")) -一个经典的重叠相似性度量。
- overlap coefficient (.method = "overlap") : 重叠相似度的标准化度量。它被定义为交集的大小除以两个集合中较小的部分。
- Jaccard index (.method = "jaccard") : 它度量有限样本集之间的相似性,定义为交集的大小除以样本集并集的大小。
- Tversky index (.method = "tversky") : 一种对集合的非对称相似度量,用来比较一个变体和一个原型。如果使用默认参数,它类似于Dice的系数。
- cosine similarity (.method = "cosine") : 两个非零向量之间的相似性度量
- Morisita’s overlap index (.method = "morisita") : 个体在总体中分散的统计方法。它用于比较样本之间的重叠。
- incremental overlap : overlaps of the N most abundant clonotypes with incrementally growing N (.method = "inc+METHOD", e.g., "inc+public" or "inc+morisita").
包含所描述方法的函数是repOverlap。同样,当输出被传递到vis()函数时,输出很容易被可视化,它完成了所有的工作。
repOverlap(immdata$data) %>% vis() # Build the heatmap of public clonotypes shared between repertoires
因为我下载的数据是单个样本的,所以用了示例数据集里的样本
还可以用热图展示
repOverlap(immdata$data) %>% vis() # Build the heatmap of public clonotypes shared between repertoires
imm_ov1 <- repOverlap(immdata$data, .method = "public", .verbose = F)
imm_ov2 <- repOverlap(immdata$data, .method = "morisita", .verbose = F)
p1 <- vis(imm_ov1)
p2 <- vis(imm_ov2, .text.size = 2)
p1 + p2
vis(imm_ov1, "heatmap2")
也可以用circos图展示,不过感觉就是看起来比较fancy,实际上意义并没有特别大
vis(imm_ov1, "circos")
(4)geneUsage: 比较不同基因片段的使用情况
免疫组库的多样性是由VDJ基因重排带来的,这里的VDJ是区域名称,每个区域又有不同的基因基因片段,每个TCR重排的过程就是筛选这些基因片段的过程。geneUsage指的就是这些基因的出现频率(次数)
参考资料
immunarch提供了一个基因片段数据表,包含几个物种的已知基因片段数量,按照IMGT的命名法。调用gene_stats()函数可以得到基因的当前统计信息:
如果看智人的TRBV基因,你需要使用hs.trbv
如果你计划使用Mus musmus_ighj基因,你需要使用musmus.ighj。
.gene
A character vector of length one with the name of the gene you want to analyse of the specific species. If you provide a vector of different length, only first element will be used. The string should also contain the species of interest, for example, valid ".gene" arguments are "hs.trbv", "HomoSapiens.TRBJ" or "macmul.IGHV". For details run gene_stats().
imm_gu <- geneUsage(immdata$data, "hs.trbj")#
imm_gu
p1 <- vis(imm_gu[c(1, 2)])
p2 <- vis(imm_gu[c(1, 2)]) + coord_polar()
p3<- vis(imm_gu[c(1, 2)]) + coord_flip() + theme_bw()
library(patchwork)
p1 + p2 +p3
imm_gu <- geneUsage(immdata$data[c(1, 2)], "hs.trbv", .norm = T)
vis(imm_gu)
imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T)
vis(imm_gu, .grid = T)
vis(imm_gu, .by = "Status", .meta = immdata$meta)
vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box")
VDJ基因之间的流向
library(ggforce)
immdata$data$`A2-i129` %>%
gather_set_data(c(6,7,5)) %>%
ggplot(aes(x, id = id, split = y, value = 1)) +
geom_parallel_sets(aes(fill = J.name), show.legend = FALSE, alpha = 0.3) +
geom_parallel_sets_axes(axis.width = 0.1, color = "lightgrey", fill = "white") +
geom_parallel_sets_labels(angle = 0) +
theme_no_axes()
圈图
library(sankeywheel)#来做一个圈图
sankeywheel(
from = immdata$data$`A2-i129`$V.name, to = immdata$data$`A2-i129`$J.name,
weight = immdata$data$`A2-i129`$Clones,#type = "sankey",
title = " circus plots",
subtitle = "j_gene & v_gene",
seriesName = "", width = "100%", height = "600px"
)
geneUsage(immdata$data[[1]]) %>% vis() # Visualise the V-gene distribution for the first repertoire
做geneusage有什么意义呢?
CDR3 sequence length distribution, amino acid conservation and gene usage variability for ATL patients resembled those of peripheral blood cells from ACs and healthy donors. Thus, determining monoclonal architecture and clonal diversity by RNA sequencing might be useful for prognostic purposes and for personalizing ATL diagnosis and assessment of treatments.
(5)repDiversity: 比较TCR或BCR的多样性
repDiversity(imm_10x$data) %>% vis(.by = "cell_type", .meta = imm_10x$meta) # Visualise the Chao1 diversity of repertoires, grouped by the patient status
也可以用其他的方法:
repDiversity(
.data,
.method = "chao1",
.col = "aa",
.max.q = 6,
.min.q = 1,
.q = 5,
.step = NA,
.quantile = c(0.025, 0.975),
.extrapolation = NA,
.perc = 50,
.norm = TRUE,
.verbose = TRUE,
.do.norm = NA,
.laplace = 0
)
.method
Pick a method used for estimation out of a following list: chao1, hill, div, gini.simp, inv.simp, gini, raref, d50, dxx.
.col
A string that specifies the column(s) to be processed. Pass one of the following strings, separated by the plus sign: "nt" for nucleotide sequences, "aa" for amino acid sequences, "v" for V gene segments, "j" for J gene segments. E.g., pass "aa+v" to compute diversity estimations on CDR3 amino acid sequences paired with V gene segments, i.e., in this case a unique clonotype is a pair of CDR3 amino acid and V gene segment. Clonal counts of equal clonotypes will be summed up.
关于这些方法的解释
-
Chao1
estimator is a nonparameteric asymptotic estimator of species richness (number of species in a population). -
Hill
numbers are a mathematically unified family of diversity indices (differing only by an exponent q). -
div
True diversity, or the effective number of types, refers to the number of equally-abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant. -
gini.simp
The Gini-Simpson index is the probability of interspecific encounter, i.e., probability that two entities represent different types. -
inv.simp
Inverse Simpson index is the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest. -
gini
The Gini coefficient measures the inequality among values of a frequency distribution (for example levels of income). A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of one (or 100 percents ) expresses maximal inequality among values (for example where only one person has all the income). -
raref
Rarefaction is a technique to assess species richness from the results of sampling through extrapolation.
# Compute statistics and visualise them
# Chao1 diversity measure
div_chao <- repDiversity(immdata$data, "chao1")
# Hill numbers
div_hill <- repDiversity(immdata$data, "hill")
# D50
div_d50 <- repDiversity(immdata$data, "d50")
# Ecological diversity measure
div_div <- repDiversity(immdata$data, "div")
p1 <- vis(div_chao)
p2 <- vis(div_chao, .by = c("Status", "Sex"), .meta = immdata$meta)
p3 <- vis(div_hill, .by = c("Status", "Sex"), .meta = immdata$meta)
p4 <- vis(div_d50)
p5 <- vis(div_d50, .by = "Status", .meta = immdata$meta)
p6 <- vis(div_div)
p3+p5
imm_raref <- repDiversity(immdata$data, "raref", .verbose = F)
p1 <- vis(imm_raref)
p2 <- vis(imm_raref, .by = "Status", .meta = immdata$meta)
p1 + p2
repDiversity(immdata$data, "raref", .verbose = F) %>% vis(.log = TRUE)
关于稀释曲线
稀释性曲线 rarefaction curve
稀释性曲线是从样本中随机抽取一定数量的个体,统计这些个体所代表的物种数目,并以个体数与物种数来构建曲线。它可以用来比较测序数据量不同的样本中物种的丰富度,也可以用来说明样本的测序数据量是否合理。采用对序列进行随机抽样的方法,以抽到的序列数与它们所能代表OTU 的数目构建rarefaction curve,当曲线趋向平坦时,说明测序数据量合理,更多的数据量只会产生少量新的OTU,反之则表明继续测序还可能产生较多新的OTU。因此,通过作稀释性曲线,可得出样品的测序深度情况。
我理解的就是相同sample size下纵坐标越高提示样本的TCR克隆多样性越高
VDJ 多样性是免疫组库分析的核心,我们引进生态学多样性的指标来刻画克隆型,为我们从总体上来看VDJ的状态。我们为什么要做多样性研究?还不是为了找出异质性吗。
下面的描述来自生物学的某个wiki,对我们研究VDJ不无启发意义。
以下是三种公认的生物物种多样性假说。它们包括:(1)异质性假说,(2)竞争假说,(3)捕食假说。
衡量多样性有三个主要原因:(1)衡量稳定性,以确定一个环境是否在退化,(2)比较两个或更多的环境,和(3)消除对广泛列表的需要(形成数据概览,即用一个多样性指标,说明了需要列举很多概念才能说清楚的东西)。多样性指数提供了一个群落组成的重要信息。这些指数不仅衡量了物种的丰富度,还考虑了物种的相对丰富度或均匀度。在测量物种多样性时,物种丰富度和均匀度必须同时考虑。此外,指数还提供了物种稀有度和共性的重要信息。
关于TCR的clonality和diversity的一点思考
这是两个相对(负相关)的概念吗?好像不是
Clonality感觉更多是看clonal proportion
比如有两个T细胞亚群
A亚群:50%都是clonotype1,剩下49%分别是49种clonotype
B亚群:有33种clonotype,每种3%的T细胞
从Clonality看,A亚群有一个很明显的主克隆
从Diversity看,同样也是A亚群多样性更好
这样好像就能理解为什么这是两个评价指标,他们并不是完全负相关的
5.降维聚类
5.1mds: multi-dimensional scaling
repOverlapAnalysis(imm_ov1, "mds") %>% vis()
用K-means聚类
repOverlapAnalysis(imm_ov1, "mds+kmeans") %>% vis()
用hclust聚类
repOverlapAnalysis(imm_ov1, "mds+hclust") %>% vis()
5.2 tsne
也可以用tsne降维
repOverlapAnalysis(imm_ov1, "tsne")%>% vis()
repOverlapAnalysis(imm_ov1, "tsne+kmeans") %>% vis()
tsne与mds得到的结果不全相同
小结
在这一部分,我们梳理一下immunarch的作用及分析的思路
- repExplore:主要是看每个样本或分组的numbers of clonotypes
- repOverlap:比较不同样本repertoire的相似性
- repOverlapAnalysis:根据overlap相似性进行降维聚类
- repClonality:主要是看clonal proportion,即主要克隆、寡克隆等分布
- repDiversity: 即看克隆多样性,有很多算法,比如Chao1, Hill, raref等
- geneUsage:即看不同片段的使用情况