immunarch — Fast and Seamless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires in R
前情提要:
10× Genomics单细胞免疫组库VDJ分析必知必会
免疫组库数据分析||immunarch教程:克隆型分析
免疫组库数据分析||immunarch教程:探索性数据分析
免疫组库数据分析||immunarch教程:载入10X数据
免疫组库数据分析||immunarch教程:快速开始
今天,我们继续我们的免疫组库数据分析的Demos,这一次我们来谈谈GeneUsage分析。像我这样刚入门免疫组库的人首先会问什么是GeneUsage?
我不由自主地拿出了周光炎老师的《免疫学原理》:
免疫组库的多样性是由VDJ基因重排带来的,这里的VDJ是区域名称,每个区域又有基因,虽然这些基因的名称带着明显的区域的印记。geneUsage指的就是这些基因的出现频率(次数)。在文章中它出现的形式是这样的:
Analysis of TRA variable (TRAV) and TRB variable (TRBV) gene segment usage among analyzed samples. a−d Analysis of (a, b) TRAV and (c, d) TRBV gene segment usage, including (a, c) heatmap representation of TRAV and TRBV gene usage across samples and (~) frequency of observed TRAV and TRBV gene usages among the samples
下面开始immunarch的表演。
ibrary(immunarch); data(immdata) # Load the package and the test dataset
?geneUsage
immunarch提供了一个基因片段数据表,包含几个物种的已知基因片段数量,按照IMGT的命名法。调用gene_stats()函数可以得到基因的当前统计信息:
gene_stats()
alias species ighd ighj ighv igij igkj igkv iglj iglv traj trav trbd trbj trbv trdd
1 bt BosTaurus 21 4 25 0 1 6 5 26 46 0 0 0 0 5
2 cd CamelusDromedarius 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 clf CanisLupusFamiliaris 0 0 0 0 0 0 0 0 0 0 2 8 19 0
4 dr DanioRerio 7 7 0 3 0 0 0 0 0 0 0 0 0 0
5 hs HomoSapiens 30 13 248 0 5 64 7 69 57 60 3 14 64 3
6 macmul MacacaMulatta 24 7 19 0 4 83 5 0 0 0 2 15 58 0
7 mmc MusMusculusCastaneus 0 0 0 0 0 4 0 0 0 0 0 0 0 0
8 mmd MusMusculusDomesticus 0 0 0 0 0 2 0 0 0 0 0 0 0 0
9 musmus MusMusculus 32 8 225 0 8 109 3 5 42 145 2 14 23 2
10 oa OrnithorhynchusAnatinus 3 10 0 0 0 0 0 0 0 0 0 0 0 0
11 oc OryctolagusCuniculus 10 11 39 0 8 26 2 20 0 0 0 0 0 0
12 om OncorhynchusMykiss 9 7 6 0 0 0 0 0 0 0 1 9 0 0
13 rn RattusNorvegicus 30 4 113 0 6 132 2 8 0 0 0 0 0 0
14 smth MusMusculusMolossinus 0 0 0 0 0 1 0 0 0 0 0 0 0 0
15 smth MusMusculusMusculus 0 0 0 0 0 1 0 0 0 0 0 0 0 0
16 smth MusSpretus 0 0 0 0 0 2 0 2 0 0 0 0 0 0
17 ss SusScrofa 5 5 15 0 8 19 4 14 0 0 0 0 0 0
trdj trdv trgj trgv
1 3 0 6 15
2 0 7 2 2
3 0 0 7 8
4 0 0 0 0
5 4 6 4 10
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 3 7 0 11
10 0 0 0 0
11 0 0 0 0
12 0 0 0 0
13 0 0 0 0
14 0 0 0 0
15 0 0 0 0
16 0 0 0 0
17 0 0 0 0
我们来验证一下,打开IMGT 官网: http://www.imgt.org/找到http://www.imgt.org/IMGTrepertoire/LocusGenes/#H
点进去我们就看到:
HomoSapiens 的 trdd 确实是有3个。
为了计算基因的分布,immunarch包含了geneUsage函数。它接收到一个或多个免疫库(我们读进来的数据对象)作为输入,以及你想要得到统计数据的基因和物种。例如,如果你计划使用智人的TRBV基因,你需要使用hs.trbv,函数中的trbv字符串,其中hs来自别名列,trbv是基因名称。如果你计划使用Mus musmus_ighj基因,你需要使用musmus_ighj。
# Next four function calls are equal. "hs" is from the "alias" column.
imm_gu <- geneUsage(immdata$data, "hs.trbv")
imm_gu
# A tibble: 48 x 13
Names `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191` `A4-i192` MS1 MS2 MS3 MS4 MS5 MS6
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 TRBV10-1 24 28 NA 16 29 6 19 4 43 19 13 21
2 TRBV10-2 42 60 8 29 28 16 25 35 63 39 22 43
3 TRBV10-3 230 282 135 108 376 215 195 142 304 262 179 442
4 TRBV11-1 21 14 26 17 17 16 14 10 16 32 14 20
5 TRBV11-2 183 172 125 161 95 113 94 105 174 174 122 94
6 TRBV11-3 8 11 5 24 2 7 4 13 13 13 3 32
7 TRBV12-4 603 459 313 433 333 557 406 606 452 646 537 951
8 TRBV12-5 37 54 8 38 18 17 7 17 25 32 16 24
9 TRBV13 44 53 45 78 29 43 39 28 31 33 28 47
10 TRBV14 65 91 48 73 40 30 23 94 21 84 26 7
# ... with 38 more rows
imm_gu 不就是一个丰度表吗?行是基因,列是样本,然后我们可以走我们的那一套了,pca啦,聚类啦,巴拉巴拉。
immunarch 提供了方便的可视化函数,与ggplot2无缝衔接。
p1 <- vis(imm_gu[c(1, 2)])
p2 <- vis(imm_gu[c(1, 2)]) + coord_polar()
p3<- vis(imm_gu[c(1, 2)]) + coord_flip() + theme_bw()
library(patchwork)
p1 + p2 +p3
当然这个函数的细节还是有一些需要注意的,就是你到底要计算什么。基因分布可以通过单个克隆型的计数来计算。quant = "count")或不使用它们(quant = NA)。要计算等位基因水平或家族水平分布,请更改.type参数。Parameter .norm控制immunarch是否将数据归一化,以确保所有频率的和是否等于1。
为了明确其计算的细节,我们用debug(geneUsage)来进入函数内部一探究竟。当然debug(geneUsage)是一个动态过程,我们仅演示主要结果。
debug(geneUsage)
imm_gu <- geneUsage(immdata$data, "hs.trbv")
Browse[3]> .gene
[1] "TRBV10-1" "TRBV10-2" "TRBV10-3" "TRBV11-1" "TRBV11-2" "TRBV11-3" "TRBV12-3" "TRBV12-4" "TRBV12-5"
[10] "TRBV13" "TRBV14" "TRBV15" "TRBV16" "TRBV18" "TRBV19" "TRBV2" "TRBV20-1" "TRBV24-1"
[19] "TRBV25-1" "TRBV27" "TRBV28" "TRBV29-1" "TRBV3-1" "TRBV30" "TRBV4-1" "TRBV4-2" "TRBV4-3"
[28] "TRBV5-1" "TRBV5-4" "TRBV5-5" "TRBV5-6" "TRBV5-8" "TRBV6-1" "TRBV6-2" "TRBV6-3" "TRBV6-4"
[37] "TRBV6-5" "TRBV6-6" "TRBV6-8" "TRBV6-9" "TRBV7-2" "TRBV7-3" "TRBV7-4" "TRBV7-6" "TRBV7-7"
[46] "TRBV7-8" "TRBV7-9" "TRBV9"
Browse[3]>
debug: .data[[gene_col]] <- return_segments(.data[[gene_col]])
Browse[3]>
debug: if (has_class(.data, "data.table")) {
.data <- .data %>% lazy_dt()
}
Browse[3]>
debug: dataset <- .data %>% select(Gene = gene_col, IMMCOL$count) %>%
group_by(Gene) %>% rename(Quant = IMMCOL$count)
Browse[3]>
debug: if (is.na(.quant)) {
dataset <- dataset %>% summarise(count = n())
} else {
dataset <- dataset %>% summarise(count = sum(Quant))
}
Browse[3]>
debug: dataset <- dataset %>% summarise(count = n())
Browse[3]>
debug: dataset <- dataset %>% collect(n = Inf)
Browse[3]> dataset
# A tibble: 48 x 2
Gene count
<chr> <int>
1 TRBV10-1 28
2 TRBV10-2 60
3 TRBV10-3 282
4 TRBV11-1 14
5 TRBV11-2 172
6 TRBV11-3 11
7 TRBV12-4 459
8 TRBV12-5 54
9 TRBV13 53
10 TRBV14 91
# ... with 38 more rows
Browse[3]> .data
# A tibble: 6,553 x 15
Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 111 0.0131 TGCAGT~ CSASRG~ TRBV2~ TRBD1 TRBJ2~ 11 12 17 25 -1 0 7
2 93 0.0109 TGTGCC~ CASSVA~ TRBV9 TRBD1 TRBJ2~ 15 21 23 24 -1 5 0
3 66 0.00776 TGTGCC~ CASSRM~ TRBV13 TRBD1 TRBJ2~ 11 18 24 32 -1 6 7
4 59 0.00694 TGTGCC~ CASSPT~ TRBV6~ TRBD2 TRBJ2~ 10 14 19 33 -1 3 13
5 57 0.00671 TGCGCC~ CASSLD~ TRBV5~ TRBD2 TRBJ1~ 15 17 20 26 -1 1 5
6 47 0.00553 TGTGCC~ CASRGL~ TRBV6~ TRBD2 TRBJ2~ 10 11 16 20 -1 0 3
7 46 0.00541 TGCAGC~ CSVTGV~ TRBV2~ TRBD1 TRBJ2~ 8 9 13 20 -1 0 6
8 30 0.00353 TGTGCC~ CASSYL~ TRBV6~ TRBD2 TRBJ1~ 15 17 19 20 -1 1 0
9 29 0.00341 TGTGCC~ CASSLA~ TRBV5~ TRBD1 TRBJ1~ 15 21 26 32 -1 5 5
10 29 0.00341 TGTGCC~ CASSYI~ TRBV6~ TRBD1 TRBJ1~ 14 17 20 25 -1 2 4
# ... with 6,543 more rows, and 1 more variable: Sequence <lgl>
Browse[3]> gene_col
[1] "V.name"
Browse[3]> IMMCOL$count
[1] "Clones"
Browse[3]> dataset
# A tibble: 48 x 2
Gene count
<chr> <int>
1 TRBV10-1 28
2 TRBV10-2 60
3 TRBV10-3 282
4 TRBV11-1 14
5 TRBV11-2 172
6 TRBV11-3 11
7 TRBV12-4 459
8 TRBV12-5 54
9 TRBV13 53
10 TRBV14 91
# ... with 38 more rows
Browse[3]> .data$V.name[1]
[1] "TRBV20-1"
Browse[3]> dataset %>% filter(Gene == 'TRBV20-1')
# A tibble: 1 x 2
Gene count
<chr> <int>
1 TRBV20-1 570
Browse[3]> .data %>% filter(V.name == 'TRBV20-1')
# A tibble: 570 x 15
Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 111 0.0131 TGCAGT~ CSASRG~ TRBV2~ TRBD1 TRBJ2~ 11 12 17 25 -1 0 7
2 11 0.00129 TGCAGT~ CSATGL~ TRBV2~ TRBD2 TRBJ2~ 9 12 18 23 -1 2 4
3 7 0.000824 TGCAGT~ CSARPG~ TRBV2~ TRBD1 TRBJ1~ 11 14 19 24 -1 2 4
4 6 0.000706 TGCAGT~ CSARGG~ TRBV2~ TRBD2 TRBJ2~ 12 15 23 24 -1 2 0
5 5 0.000588 TGCAGT~ CSARDG~ TRBV2~ TRBD2 TRBJ1~ 13 16 20 21 -1 2 0
6 5 0.000588 TGCAGT~ CSGASG~ TRBV2~ TRBD1 TRBJ2~ 6 7 10 23 -1 0 12
7 4 0.000471 TGCAGT~ CSVGGG~ TRBV2~ TRBD2 TRBJ2~ 6 14 20 22 -1 7 1
8 4 0.000471 TGCAGT~ CSARDR~ TRBV2~ TRBD1 TRBJ1~ 13 14 18 19 -1 0 0
9 4 0.000471 TGCAGT~ CSASQF~ TRBV2~ TRBD1 TRBJ2~ 10 12 14 17 -1 1 2
10 4 0.000471 TGCAGT~ CSAGDR~ TRBV2~ TRBD1 TRBJ1~ 7 12 17 21 -1 4 3
# ... with 560 more rows, and 1 more variable: Sequence <lgl>
Browse[3]> .data %>% filter(V.name == 'TRBV20-1') %>% dim()
[1] 570 15
Browse[3]> .data %>% filter(V.name == 'TRBV10-1') %>% dim()
[1] 28 15
Browse[3]> .data %>% filter(V.name == 'TRBV10-1') %>% DT::datatable()
undebug(geneUsage)
大家看到这个28 的来源了吗? 你可以从不同的角度来观察基因使用的直方图:
imm_gu <- geneUsage(immdata$data[c(1, 2)], "hs.trbv", .norm = T)
vis(imm_gu)
imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T)
vis(imm_gu, .grid = T)
如果样本有分组信息,还可以做组间的差异统计:
imm_gu <- geneUsage(immdata$data, "hs.trbv", .norm = T
vis(imm_gu, .by = "Status", .meta = immdata$meta)
vis(imm_gu, .by = "Status", .meta = immdata$meta, .plot = "box")
由于一些克隆型的基因序列不明确,geneUsage有以下选项来处理不明确的数据,也许这些不明确的基因才是潜藏在数据中的瑰宝啊。
- .ambig = "inc" - includes all possible combinations of ambiguous gene alignments from the data. NOTE: ImmunoSEQ formats use non-standart gene segment names, so it is preferable to use this argument value with ImmunoSEQ formats. This argument is ON by default to ease the gene manipulation. Feel free to change it to "exc" in case of other data formats. It is ON by default, we recommend it to leave it that way.
- .ambig = "exc" - filters out all clonotypes with ambiguous gene alignments.
- .ambig = "wei" - introduces weighted approach (divides by n (1/n) the frequency for each entry of the corresponding gene if there are n genes for a clonotype).
- .ambig = "maj" - chooses only the first gene segment.
作为福利,我们在文章中经常看到这样的V-J基因的圈图,其实也是一种桑基图啦:
咋做?
做一个不成熟的桑基图:
library(ggforce)
immdata$data$`A2-i129` %>%
gather_set_data(c(6,7,5)) %>%
ggplot(aes(x, id = id, split = y, value = 1)) +
geom_parallel_sets(aes(fill = J.name), show.legend = FALSE, alpha = 0.3) +
geom_parallel_sets_axes(axis.width = 0.1, color = "lightgrey", fill = "white") +
geom_parallel_sets_labels(angle = 0) +
theme_no_axes()
VDJ之间的流向:
然后是一个不成熟的圈图:
library(sankeywheel)
sankeywheel(
from = immdata$data$`A2-i129`$V.name, to = immdata$data$`A2-i129`$J.name,
weight = immdata$data$`A2-i129`$Clones,#type = "sankey",
title = " circus plots",
subtitle = "j_gene & v_gene",
seriesName = "", width = "100%", height = "600px"
)
关键还是要理解这里面的数据结构和算法及其背后的生物学意义啊。咦,我们为什么要做geneUsage?
因为geneUsage也是一种异质性啊,主要是可以找到哪些geneUsage较高或者受到抑制,这样我们可以对症下药啊。一如文章所言:
CDR3 sequence length distribution, amino acid conservation and gene usage variability for ATL patients resembled those of peripheral blood cells from ACs and healthy donors. Thus, determining monoclonal architecture and clonal diversity by RNA sequencing might be useful for prognostic purposes and for personalizing ATL diagnosis and assessment of treatments.
[1] Farmanbar, A., Kneller, R. & Firouzi, S. RNA sequencing identifies clonal structure of T-cell repertoires in patients with adult T-cell leukemia/lymphoma. npj Genom. Med. 4, 10 (2019). https://doi.org/10.1038/s41525-019-0084-9