scATAC数据分析流程(上)

Data processing using Cell Ranger ATAC software

用Cell Ranger软件做的事情:alignment, deduplication and identification of
transposase cut sites

注意:barcode在测序过程中是可能引入测序错误的(或者别的错误,包括标签错配等等,详细可参考:测序致命伤),所以需要对不在已知标签列表的barcode进行校正:

Barcode sequences were obtained from the i5 index reads. An observed barcode not present in the whitelist of barcodes can be corrected to a whitelist barcode if it is within 2 Hamming distance away and has >90% probability of being the real barcode (based on the abundance of the barcode and quality value of incorrect bases)

数据质控的指标

Reads less than 25 bp were not aligned and flagged as unmapped. Fragments were identified as read pairs with mapping quality (MAPQ) > 30, nonmitochondrial reads and not chimerically mapped. The start and end of the fragments were adjusted (+4 for +strand and −5 for −strand) to account for the 9-bp region that the transposase enzyme occupies during the transposition.

scATAC-seq data analysis

为了解决数据稀疏性的问题,作者采用了两次聚类法:将hg19基因组切分成bins,先应用TF-IDF对约13万个bins进行降维,得到合并的bins特征,再进行初步LSI聚类,选取重要的bins把细胞粗略分成不同的cluster(利于合并peaks,获得bulk peaks),最后再做下游的二次聚类(正式细胞分群)

data_analysis_workflow.jpg

Filtering cells by TSS enrichment and unique fragments

作者以细胞在tss处的富集程度衡量数据质量并进行过滤。

TSS 位点的获取来自 ‘TxDb.Hsapiens.UCSC.hg19.knownGene’. 因为要计算fragments在tss附近的富集分数,作者探究了tss上下游2000bp的fragments分布情况,并利用时间序列分析里常用的滑动平均值计算方法,对每51bp windows里的counts数取了平均(利于绘制平滑曲线),而富集分数就取平滑序列里的最大值。细胞的过滤基于: 单个细胞至少1000个unique fragments,tss 富集分数至少为8。当然还要考虑 doublets 的影响,作者移除了那些有45000以上unique fragments的细胞

Filter_Cells.png

Generating a counts matrix

统计每个细胞内peaks和feature重叠的counts数

we used ‘findOverlaps’ to find all overlaps with the feature by insertions. Then we added a column with the unique identity (ID) (integer) cell barcode to the overlaps object and fed this into a sparseMatrix in R. To calculate the fraction of Tn5 insertions in peaks, we used the colSums of the sparseMatrix and divided it by the number of insertions for each cell ID barcode using ‘table’ in R.

Generating union peak sets with LSI

降维方法比较独特,是TF-IDF结合SVD的降维方法:

This normalization resulted in a TF-IDF matrix that was used as the input to irlba’s singular value decomposition (SVD) implementation in R.

用TF-IDF降维以后,作者保留了第2至第25个dimensions(原因是first dimension was associated with cell read depth) ,并创建了Seurat对象,进行了第一轮聚类:注意Seurat版本

Seurat’s SNN graph clustering (v.2.3) with ‘FindClusters’ with a default resolution of 0.8

同时作者在sup里也补充说明了第一轮聚类只是粗略地聚类,每个cluster至少要有200个细胞,作者自己手写的addCluster函数中的一个细节就是对cluster size进行判断,如果少于200,就降低resolution的值

If the minimum cluster size was below 200 cells, the resolution was decreased until this criterion was reached, leading to a final resolution of 0.8 × N (where N represents the iterations until the minimum cluster size is 200 cells).

Reads-in-peaks-normalized bigwigs and sequencing tracks

可视化,peaks count文件转成bigwig文件

we created a run-length encoding using ‘coverage’ in R and normalized the total reads to a scale factor that normalized the reads-in-peaks to 10 million reads within peaks. This object was then converted into a bigwig using rtracklayer ‘export.bw’ in R. For plotting tracks, the bigwigs were read into R using rtracklayer ‘import.bw(as = ”Rle”)’ and plotted within R or visualized with WashU Epigenome browser

ATAC-seq-centric LSI clustering and visualization

We then used the first 50 reduced dimensions as input into a Seurat object, and crude clusters were identified using Seurat’s (v2.3) SNN graph clustering ‘FindClusters’ with a default resolution of 0.8. We found that there was detectable batch effect that confounded further analyses. To attenuate this batch effect, we calculated the cluster sums from the binarized accessibility matrix and then log-normalized using edgeR’s ‘cpm(matrix, log = TRUE, prior.count = 3)’ in R

第二轮聚类的结果,基于合并的所有的peaks,以及用25000个在cluster间变异性最大的peaks做聚类,注意到peaks的变异性需要log转换(logMat)。原文用的UMAP,我用的TSNE:


LSI-Clustering-Peaks.png

TF footprinting

使用来自Inferred Sequence Binding Preferences (CIS-BP) motifs (from chromVAR motifs human_pwms_v1)的motif数据集进行比对,matchMotifs获取motif比对情况;详细计算原理是:

对每个sample计算获得hexamer bias table(we computed the Tn5 bias for each sample by constructing a hexamer bias table,意思应该是计算Tn5本身结合序列的偏好性,用来校正背景expected Tn5 hexamer frequency),以及对每个TF计算hexamers在距离motif 中心±250 bp范围内的counts数(observed Tn5 hexamer frequency)

寻找显著的TF occupancy:计算方法如下

Using the sample’s hexamer frequency table, we could then compute the expected Tn5 insertions by multiplying the hexamer position frequency table by the observed/expected Tn5 hexamer frequency. For analysis of TF motifs present in the +5 kb enhancer of PDCD1, we searched for CIS-BP motifs with a LogOdds threshold greater than 10.

同时为了保证footprints的可重复性,作者使用了降采样方法

To assess the reproducibility of footprints, we subsampled fragments in each cluster 2 times at a sampling rate of 60% to have maximum variability.

ChromVAR

这个接着上一步,计算global TF activity:计算TF在不同样本里活性的GC bias-corrected deviation scores,详细可以见http://www.bioconductor.org/packages/release/bioc/vignettes/chromVAR/inst/doc/Introduction.html,具体这个包的函数怎么用的我会在后续代码实战中介绍

Computing gene activity scores using Cicero co-accessibility

很重要的包Cicero,专门处理scATAC的数据,解决了这类稀疏性数据预测enhancer-promoter pairs的问题:http://www.bioconductor.org/packages/release/bioc/vignettes/cicero/inst/doc/website.html#introduction

原理是给定k值先进行k近邻聚类,聚类的每个cluster作为bulk整合peaks数据(k值的设定很重要),并制作cicero_cds对象:

Cicero calculates peak-to-peak links based on their co-accessibility across groups of cells that are aggregated using a nearest-neighbor approach (k = 50)

we created a ‘cicero_cds’ with k = 50 and the ‘reduced_coordinates’ from the corresponding UMAP coordinates. This function returns aggregated accessibility across groupings of cells based on nearest-neighbor rules.

以peaks中心对所有peaks进行resize,每个peaks都归一化成+-250kb的窗口,计算这些归一化后的peaks和原来peaks中心的重叠情况,意思是如果peaks之间距离小于250kb就会算成是一个peak-to-peak link

We then used this aggregated accessibility matrix to identify all peak-to-peak linkages that were within 250 kb by resizing the peaks to 250 kb and then overlapping them with the peak summits/centers

获得peak-to-peak link后,对每一个link计算Pearson相关系数,获得connections data.frame,前两列分别为peaki和peakj,第三列为Pearson相关系数

接着,作者利用TxDb.Hsapiens. UCSC.hg19.knownGene制作了gene data.frame,全部resize成其tss位点,并以tss为中心制作±2.5 kb的windows,这样一来就可以利用该基因注释信息对cicero_cds对象进行注释,并进一步利用build_gene_activity_matrix函数(输入connectons和cicero_cds)和基于co-accessibility计算gene score(比如一个tss如果它的linked peaks对应的TF counts数多,那么这个tss对应的gene得分就高)

Analysis of autoimmune variants using Cicero co-accessibility and chromVAR

当然除了用cicero分析gene activity,还可以分析开放染色质区和snp这类疾病易感位点的"links"

首先从http://pubs.broadinstitute.org/pubs/finemapping/dataportal.php上下载casal snps及其enhancer注释,然后把snp转成GRange对象,求其与ATAC-seq peaks的overlap;相当于把上一步的peak-to-peak link改成peak-to-snp overlap;而且作者还考虑到了那些同样拥有snp位点的peaks之间的link:

beyond direct overlaps, we used Cicero co-accessibility links to include peaks that were co-accessible with those containing SNPs. To do this analysis, for every peak with an SNP overlap, we calculated and included peaks that were co-accessible above 0.35.

同样地,在样本间snp分布的偏向性,可以用chromVAR计算deviation score

引用:[https://doi.org/10.1038/s41587-019-0206-z]

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 221,695评论 6 515
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 94,569评论 3 399
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 168,130评论 0 360
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 59,648评论 1 297
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 68,655评论 6 397
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 52,268评论 1 309
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,835评论 3 421
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,740评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 46,286评论 1 318
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 38,375评论 3 340
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,505评论 1 352
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 36,185评论 5 350
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,873评论 3 333
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 32,357评论 0 24
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,466评论 1 272
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,921评论 3 376
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 45,515评论 2 359

推荐阅读更多精彩内容