单细胞免疫组学分析练习-2.1:immunarch_1

单细胞免疫组学分析练习-1:cellranger multi

immunarch是一个R包,旨在成为分析T细胞受体(TCR)和B细胞受体(BCR)的工具箱

感觉主要是用来单独分析免疫组的数据,而非将免疫组与转录组结合分析

1. 安装immunarch

install.packages("immunarch")

immunarch有内置数据集

data("immdata")
#查看
immdata$data
immdata$meta

关于这个包

2. 导入10x数据

immunarch可以直接导入的数据有:

在您的R环境中运行下面的代码,以将数据加载为Immunarch的格式。您可以在包含Cellranger输出文件的整个文件夹上运行它。repLoad将忽略不支持的文件格式。

TCR_10x <- repLoad("/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t")

== Step 1/3: loading repertoire files... ==

Processing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t" ...
  -- [1/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/airr_rearrangement.tsv" -- airr
  -- [2/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/cell_barcodes.json" --   -- [3/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/clonotypes.csv" -- unsupported format, skipping
  -- [4/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/concat_ref.bam.bai" -- unsupported format, skipping
  -- [5/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/consensus_annotations.csv" -- 10x (consensus)
  -- [6/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/consensus.bam.bai" -- unsupported format, skipping                              
  -- [7/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/filtered_contig_annotations.csv" -- 10x (filt.contigs)
  -- [8/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/vdj_contig_info.pb" -- unsupported format, skipping                             
  -- [9/9] Parsing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/vloupe.vloupe" -- unsupported format, skipping

== Step 2/3: checking metadata files and merging files... ==

Processing "/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t" ...
  -- Metadata file not found; creating a dummy metadata...

== Step 3/3: processing paired chain data... ==

Done!

通过分析load的过程我们可以看到,repLoad()可以自动识别目标文件
为了直接读取filtered_contig_annotations.csv,也可以

TCR_10x <- repLoad("/home/user/myh/raw_data/BALBC/runs/BALBC_multi/outs/per_sample_outs/BALBC_multi/vdj_t/filtered_contig_annotations.csv")

如果有多个样本,需要构建一个文件夹作为file_path,把每个filtered_contig_annotations.csv重命名为样本名并复制到该文件夹下,再做一个metadata.txt表明分组情况。

举个例子:


file_path下的文件
> imm_10x <- repLoad("/home/user/myh/raw_data/BALBC/immunarch/inputs")

== Step 1/3: loading repertoire files... ==

Processing "/home/user/myh/raw_data/BALBC/immunarch/inputs" ...
  -- [1/3] Parsing "/home/user/myh/raw_data/BALBC/immunarch/inputs/BCR.csv" -- 10x (filt.contigs)
  -- [2/3] Parsing "/home/user/myh/raw_data/BALBC/immunarch/inputs/metadata.txt" -- metadata                                                                                              
  -- [3/3] Parsing "/home/user/myh/raw_data/BALBC/immunarch/inputs/TCR.csv" -- 10x (filt.contigs)                                                     0s
                                                                                                                                                                                          
== Step 2/3: checking metadata files and merging files... ==

Processing "/home/user/myh/raw_data/BALBC/immunarch/inputs" ...
  -- Everything is OK!

== Step 3/3: processing paired chain data... ==

Done!

看一看包括的内容:

> colnames(imm_10x$data$TCR)
 [1] "Clones"           "Proportion"       "CDR3.nt"          "CDR3.aa"          "V.name"           "D.name"           "J.name"          
 [8] "V.end"            "D.start"          "D.end"            "J.start"          "VJ.ins"           "VD.ins"           "DJ.ins"          
[15] "Sequence"         "chain"            "Barcode"          "raw_clonotype_id" "ContigID"      

3. Data filtering

见此处

4.基本情况展示

(1)repExplore

repExplore(imm_10x$data, "lens") %>% vis()  # Visualise the length distribution of CDR3

可以按照分组情况展示,并进行统计学计算:

exp_vol <- repExplore(immdata$data, .method = "volume")
p1 <- vis(exp_vol, .by = c("Status"), .meta = immdata$meta)
p2 <- vis(exp_vol, .by = c("Status", "Sex"), .meta = immdata$meta)
p1 + p2


如果对默认出的图不满意,你可以用fixVis打开一个shiny窗口,一点一点调整直到可以发表。

这个真的很好用!

还可以看abundance

exp_cnt <- repExplore(immdata$data, .method = "count")
p2 <- vis(exp_cnt)
p5 <- vis(exp_cnt, .by = "Sex", .meta = immdata$meta)

(2)repClonality

克隆性,clonality(不是clonotypes)

来读一下help("repClonality“)


Description

repClonality function encompasses several methods to measure clonal proportions in a given repertoire.

Usage

repClonality(
.data,
.method = c("clonal.prop", "homeo", "top", "rare"),
.perc = 10,
.clone.types = c(Rare = 1e-05, Small = 1e-04, Medium = 0.001, Large = 0.01,
Hyperexpanded = 1),
.head = c(10, 100, 1000, 3000, 10000, 30000, 1e+05),
.bound = c(1, 3, 10, 30, 100)
)

Arguments

| .perc |

A single numerical value ranging from 0 to 100.

| .method |

A String with one of the following options: "clonal.prop", "homeo", "top" or "rare".

  • Set "clonal.prop" to compute clonal proportions or in other words percentage of clonotypes required to occupy specified by .perc percent of the total immune repertoire.

  • Set "homeo" to analyse relative abundance (also known as clonal space homeostasis), which is defined as the proportion of repertoire occupied by clonal groups with specific abundances.

  • Set "top" to estimate relative abundance for the groups of top clonotypes in repertoire, e.g., ten most abundant clonotypes. Use ".head" to define index intervals, such as 10, 100 and so on.

  • Set "rare" to estimate relative abundance for the groups of rare clonotypes with low counts. Use ".bound" to define the boundaries of clonotype groups.

| .clone.types |

A named numerical vector with the boundaries of the half-closed intervals that mark off clonal groups.

| .head |

A numerical vector with ranges of the top clonotypes.

| .bound |

A numerical vector with ranges of abundance for the rare clonotypes in the dataset.


repClonality(imm_10x$data, "homeo") %>% vis()  # Visualise the relative abundance of clonotypes

Top clonal proportion

imm_top <- repClonality(immdata$data, .method = "top", .head = c(10, 100, 1000, 3000, 10000))
imm_top
imm_top %>% vis()

稀有克隆分布

imm_rare <- repClonality(immdata$data, .method = "rare")
imm_rare
imm_rare %>% vis()
结果与top clonal proportion正好相反

相对丰度

imm_hom <- repClonality(immdata$data,
                        .method = "homeo",
                        .clone.types = c(Small = .0001, Medium = .001, Large = .01, Hyperexpanded = 1)
)
imm_hom  # Visualise the relative abundance of clonotypes
vis(imm_hom) + vis(imm_hom, .by = c("Status", "Sex"), .meta = immdata$meta)

(3)repOverlap: 比较不同样本TCR/BCR repertoire的相似性

参考资料

免疫组重叠(Repertoire overlap)是最常用的度量Repertoire 相似度的方法。它是通过计算在给定的Repertoire 之间共享的克隆类型的特定统计量来实现的,也称为“公共”克隆类型。immunarch 提供了几个指标:-公共克隆型数量(.method = "public")) -一个经典的重叠相似性度量。

  • overlap coefficient (.method = "overlap") : 重叠相似度的标准化度量。它被定义为交集的大小除以两个集合中较小的部分。
  • Jaccard index (.method = "jaccard") : 它度量有限样本集之间的相似性,定义为交集的大小除以样本集并集的大小。
  • Tversky index (.method = "tversky") : 一种对集合的非对称相似度量,用来比较一个变体和一个原型。如果使用默认参数,它类似于Dice的系数。
  • cosine similarity (.method = "cosine") : 两个非零向量之间的相似性度量
  • Morisita’s overlap index (.method = "morisita") : 个体在总体中分散的统计方法。它用于比较样本之间的重叠。
  • incremental overlap : overlaps of the N most abundant clonotypes with incrementally growing N (.method = "inc+METHOD", e.g., "inc+public" or "inc+morisita").

包含所描述方法的函数是repOverlap。同样,当输出被传递到vis()函数时,输出很容易被可视化,它完成了所有的工作。

repOverlap(immdata$data) %>% vis()  # Build the heatmap of public clonotypes shared between repertoires

因为我下载的数据是单个样本的,所以用了示例数据集里的样本


还可以用热图展示

repOverlap(immdata$data) %>% vis()  # Build the heatmap of public clonotypes shared between repertoires
imm_ov1 <- repOverlap(immdata$data, .method = "public", .verbose = F)
imm_ov2 <- repOverlap(immdata$data, .method = "morisita", .verbose = F)

p1 <- vis(imm_ov1)
p2 <- vis(imm_ov2, .text.size = 2)

p1 + p2

vis(imm_ov1, "heatmap2")

也可以用circos图展示,不过感觉就是看起来比较fancy,实际上意义并没有特别大

vis(imm_ov1, "circos")

(4)geneUsage: 比较不同基因片段的使用情况

免疫组库的多样性是由VDJ基因重排带来的,这里的VDJ是区域名称,每个区域又有不同的基因基因片段,每个TCR重排的过程就是筛选这些基因片段的过程。geneUsage指的就是这些基因的出现频率(次数)
参考资料
immunarch提供了一个基因片段数据表,包含几个物种的已知基因片段数量,按照IMGT的命名法。调用gene_stats()函数可以得到基因的当前统计信息:

这里可以看到,小鼠TCR的alpha链的V基因有145种

如果看智人的TRBV基因,你需要使用hs.trbv
如果你计划使用Mus musmus_ighj基因,你需要使用musmus.ighj。
.gene
A character vector of length one with the name of the gene you want to analyse of the specific species. If you provide a vector of different length, only first element will be used. The string should also contain the species of interest, for example, valid ".gene" arguments are "hs.trbv", "HomoSapiens.TRBJ" or "macmul.IGHV". For details run gene_stats().

imm_gu <- geneUsage(immdata$data, "hs.trbj")#
imm_gu
一个丰度表
p1 <- vis(imm_gu[c(1, 2)])
p2 <- vis(imm_gu[c(1, 2)]) + coord_polar()
p3<- vis(imm_gu[c(1, 2)]) + coord_flip() + theme_bw()

library(patchwork)
p1 + p2 +p3 
imm_gu <- geneUsage(immdata$data[c(1, 2)], "hs.trbv", .norm = T)
vis(imm_gu)
imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T)
vis(imm_gu, .grid = T)
看每个基因TRBV的利用情况
vis(imm_gu, .by = "Status", .meta = immdata$meta)
vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box")
还可以比较组间利用度

boxplot也很好看

VDJ基因之间的流向

library(ggforce)
immdata$data$`A2-i129` %>%
    gather_set_data(c(6,7,5)) %>%
    ggplot(aes(x, id = id, split = y, value = 1))  +
    geom_parallel_sets(aes(fill = J.name), show.legend = FALSE, alpha = 0.3) +
    geom_parallel_sets_axes(axis.width = 0.1, color = "lightgrey", fill = "white") +
    geom_parallel_sets_labels(angle = 0) +
    theme_no_axes()

圈图

library(sankeywheel)#来做一个圈图
sankeywheel(
  from = immdata$data$`A2-i129`$V.name, to = immdata$data$`A2-i129`$J.name,
  weight = immdata$data$`A2-i129`$Clones,#type = "sankey",
  title = " circus plots",
  subtitle = "j_gene &  v_gene",
  seriesName = "", width = "100%", height = "600px"
)
geneUsage(immdata$data[[1]]) %>% vis()  # Visualise the V-gene distribution for the first repertoire

做geneusage有什么意义呢?

CDR3 sequence length distribution, amino acid conservation and gene usage variability for ATL patients resembled those of peripheral blood cells from ACs and healthy donors. Thus, determining monoclonal architecture and clonal diversity by RNA sequencing might be useful for prognostic purposes and for personalizing ATL diagnosis and assessment of treatments.

(5)repDiversity: 比较TCR或BCR的多样性

参考资料

repDiversity(imm_10x$data) %>% vis(.by = "cell_type", .meta = imm_10x$meta)  # Visualise the Chao1 diversity of repertoires, grouped by the patient status

也可以用其他的方法:

repDiversity(
  .data,
  .method = "chao1",
  .col = "aa",
  .max.q = 6,
  .min.q = 1,
  .q = 5,
  .step = NA,
  .quantile = c(0.025, 0.975),
  .extrapolation = NA,
  .perc = 50,
  .norm = TRUE,
  .verbose = TRUE,
  .do.norm = NA,
  .laplace = 0
)
.method 
Pick a method used for estimation out of a following list: chao1, hill, div, gini.simp, inv.simp, gini, raref, d50, dxx.
.col    
A string that specifies the column(s) to be processed. Pass one of the following strings, separated by the plus sign: "nt" for nucleotide sequences, "aa" for amino acid sequences, "v" for V gene segments, "j" for J gene segments. E.g., pass "aa+v" to compute diversity estimations on CDR3 amino acid sequences paired with V gene segments, i.e., in this case a unique clonotype is a pair of CDR3 amino acid and V gene segment. Clonal counts of equal clonotypes will be summed up.

关于这些方法的解释

  • Chao1 estimator is a nonparameteric asymptotic estimator of species richness (number of species in a population).
  • Hill numbers are a mathematically unified family of diversity indices (differing only by an exponent q).
  • div True diversity, or the effective number of types, refers to the number of equally-abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant.
  • gini.simp The Gini-Simpson index is the probability of interspecific encounter, i.e., probability that two entities represent different types.
  • inv.simp Inverse Simpson index is the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest.
  • gini The Gini coefficient measures the inequality among values of a frequency distribution (for example levels of income). A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of one (or 100 percents ) expresses maximal inequality among values (for example where only one person has all the income).
  • raref Rarefaction is a technique to assess species richness from the results of sampling through extrapolation.
# Compute statistics and visualise them
# Chao1 diversity measure
div_chao <- repDiversity(immdata$data, "chao1")

# Hill numbers
div_hill <- repDiversity(immdata$data, "hill")

# D50
div_d50 <- repDiversity(immdata$data, "d50")

# Ecological diversity measure
div_div <- repDiversity(immdata$data, "div")


p1 <- vis(div_chao)
p2 <- vis(div_chao, .by = c("Status", "Sex"), .meta = immdata$meta)
p3 <- vis(div_hill, .by = c("Status", "Sex"), .meta = immdata$meta)

p4 <- vis(div_d50)
p5 <- vis(div_d50, .by = "Status", .meta = immdata$meta)
p6 <- vis(div_div)

p3+p5
imm_raref <- repDiversity(immdata$data, "raref", .verbose = F)
p1 <- vis(imm_raref)
p2 <- vis(imm_raref, .by = "Status", .meta = immdata$meta)
p1 + p2
repDiversity(immdata$data, "raref", .verbose = F) %>% vis(.log = TRUE)

关于稀释曲线

稀释性曲线 rarefaction curve
  
稀释性曲线是从样本中随机抽取一定数量的个体,统计这些个体所代表的物种数目,并以个体数与物种数来构建曲线。它可以用来比较测序数据量不同的样本中物种的丰富度,也可以用来说明样本的测序数据量是否合理。采用对序列进行随机抽样的方法,以抽到的序列数与它们所能代表OTU 的数目构建rarefaction curve,当曲线趋向平坦时,说明测序数据量合理,更多的数据量只会产生少量新的OTU,反之则表明继续测序还可能产生较多新的OTU。因此,通过作稀释性曲线,可得出样品的测序深度情况。


我理解的就是相同sample size下纵坐标越高提示样本的TCR克隆多样性越高

VDJ 多样性是免疫组库分析的核心,我们引进生态学多样性的指标来刻画克隆型,为我们从总体上来看VDJ的状态。我们为什么要做多样性研究?还不是为了找出异质性吗。

下面的描述来自生物学的某个wiki,对我们研究VDJ不无启发意义。

以下是三种公认的生物物种多样性假说。它们包括:(1)异质性假说,(2)竞争假说,(3)捕食假说。
衡量多样性有三个主要原因:(1)衡量稳定性,以确定一个环境是否在退化,(2)比较两个或更多的环境,和(3)消除对广泛列表的需要(形成数据概览,即用一个多样性指标,说明了需要列举很多概念才能说清楚的东西)。多样性指数提供了一个群落组成的重要信息。这些指数不仅衡量了物种的丰富度,还考虑了物种的相对丰富度或均匀度。在测量物种多样性时,物种丰富度和均匀度必须同时考虑。此外,指数还提供了物种稀有度和共性的重要信息。

关于TCR的clonality和diversity的一点思考

这是两个相对(负相关)的概念吗?好像不是

Clonality感觉更多是看clonal proportion

比如有两个T细胞亚群

A亚群:50%都是clonotype1,剩下49%分别是49种clonotype
B亚群:有33种clonotype,每种3%的T细胞

从Clonality看,A亚群有一个很明显的主克隆
从Diversity看,同样也是A亚群多样性更好

这样好像就能理解为什么这是两个评价指标,他们并不是完全负相关的

5.降维聚类

5.1mds: multi-dimensional scaling

repOverlapAnalysis(imm_ov1, "mds") %>% vis()

用K-means聚类

repOverlapAnalysis(imm_ov1, "mds+kmeans") %>% vis()

用hclust聚类

repOverlapAnalysis(imm_ov1, "mds+hclust") %>% vis()

5.2 tsne

也可以用tsne降维

repOverlapAnalysis(imm_ov1, "tsne")%>% vis()
repOverlapAnalysis(imm_ov1, "tsne+kmeans") %>% vis()

tsne与mds得到的结果不全相同


小结

在这一部分,我们梳理一下immunarch的作用及分析的思路

  1. repExplore:主要是看每个样本或分组的numbers of clonotypes
  2. repOverlap:比较不同样本repertoire的相似性
  3. repOverlapAnalysis:根据overlap相似性进行降维聚类
  4. repClonality:主要是看clonal proportion,即主要克隆、寡克隆等分布
  5. repDiversity: 即看克隆多样性,有很多算法,比如Chao1, Hill, raref等
  6. geneUsage:即看不同片段的使用情况
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
禁止转载,如需转载请通过简信或评论联系作者。
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 212,185评论 6 493
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,445评论 3 385
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 157,684评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,564评论 1 284
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,681评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,874评论 1 290
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,025评论 3 408
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,761评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,217评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,545评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,694评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,351评论 4 332
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,988评论 3 315
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,778评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,007评论 1 266
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,427评论 2 360
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,580评论 2 349