scATAC数据分析流程（上）

Data processing using Cell Ranger ATAC software

用Cell Ranger软件做的事情：alignment, deduplication and identification of
transposase cut sites

注意：barcode在测序过程中是可能引入测序错误的（或者别的错误，包括标签错配等等，详细可参考：测序致命伤），所以需要对不在已知标签列表的barcode进行校正：

Barcode sequences were obtained from the i5 index reads. An observed barcode not present in the whitelist of barcodes can be corrected to a whitelist barcode if it is within 2 Hamming distance away and has >90% probability of being the real barcode (based on the abundance of the barcode and quality value of incorrect bases)

数据质控的指标

Reads less than 25 bp were not aligned and flagged as unmapped. Fragments were identified as read pairs with mapping quality (MAPQ) > 30, nonmitochondrial reads and not chimerically mapped. The start and end of the fragments were adjusted (+4 for +strand and −5 for −strand) to account for the 9-bp region that the transposase enzyme occupies during the transposition.

scATAC-seq data analysis

为了解决数据稀疏性的问题，作者采用了两次聚类法：将hg19基因组切分成bins，先应用TF-IDF对约13万个bins进行降维，得到合并的bins特征，再进行初步LSI聚类，选取重要的bins把细胞粗略分成不同的cluster（利于合并peaks，获得bulk peaks)，最后再做下游的二次聚类（正式细胞分群）

data_analysis_workflow.jpg

Filtering cells by TSS enrichment and unique fragments

作者以细胞在tss处的富集程度衡量数据质量并进行过滤。

TSS 位点的获取来自 ‘TxDb.Hsapiens.UCSC.hg19.knownGene’. 因为要计算fragments在tss附近的富集分数，作者探究了tss上下游2000bp的fragments分布情况，并利用时间序列分析里常用的滑动平均值计算方法，对每51bp windows里的counts数取了平均（利于绘制平滑曲线），而富集分数就取平滑序列里的最大值。细胞的过滤基于: 单个细胞至少1000个unique fragments，tss 富集分数至少为8。当然还要考虑 doublets 的影响，作者移除了那些有45000以上unique fragments的细胞

Filter_Cells.png

Generating a counts matrix

统计每个细胞内peaks和feature重叠的counts数

we used ‘findOverlaps’ to find all overlaps with the feature by insertions. Then we added a column with the unique identity (ID) (integer) cell barcode to the overlaps object and fed this into a sparseMatrix in R. To calculate the fraction of Tn5 insertions in peaks, we used the colSums of the sparseMatrix and divided it by the number of insertions for each cell ID barcode using ‘table’ in R.

Generating union peak sets with LSI

降维方法比较独特，是TF-IDF结合SVD的降维方法：

This normalization resulted in a TF-IDF matrix that was used as the input to irlba’s singular value decomposition (SVD) implementation in R.

用TF-IDF降维以后，作者保留了第2至第25个dimensions（原因是first dimension was associated with cell read depth），并创建了Seurat对象，进行了第一轮聚类：注意Seurat版本

Seurat’s SNN graph clustering (v.2.3) with ‘FindClusters’ with a default resolution of 0.8

同时作者在sup里也补充说明了第一轮聚类只是粗略地聚类，每个cluster至少要有200个细胞，作者自己手写的addCluster函数中的一个细节就是对cluster size进行判断，如果少于200，就降低resolution的值

If the minimum cluster size was below 200 cells, the resolution was decreased until this criterion was reached, leading to a final resolution of 0.8 × N (where N represents the iterations until the minimum cluster size is 200 cells).

Reads-in-peaks-normalized bigwigs and sequencing tracks

可视化，peaks count文件转成bigwig文件

we created a run-length encoding using ‘coverage’ in R and normalized the total reads to a scale factor that normalized the reads-in-peaks to 10 million reads within peaks. This object was then converted into a bigwig using rtracklayer ‘export.bw’ in R. For plotting tracks, the bigwigs were read into R using rtracklayer ‘import.bw(as = ”Rle”)’ and plotted within R or visualized with WashU Epigenome browser

ATAC-seq-centric LSI clustering and visualization

We then used the first 50 reduced dimensions as input into a Seurat object, and crude clusters were identified using Seurat’s (v2.3) SNN graph clustering ‘FindClusters’ with a default resolution of 0.8. We found that there was detectable batch effect that confounded further analyses. To attenuate this batch effect, we calculated the cluster sums from the binarized accessibility matrix and then log-normalized using edgeR’s ‘cpm(matrix, log = TRUE, prior.count = 3)’ in R

第二轮聚类的结果，基于合并的所有的peaks，以及用25000个在cluster间变异性最大的peaks做聚类，注意到peaks的变异性需要log转换（logMat)。原文用的UMAP，我用的TSNE:

LSI-Clustering-Peaks.png

TF footprinting

使用来自Inferred Sequence Binding Preferences (CIS-BP) motifs (from chromVAR motifs human_pwms_v1)的motif数据集进行比对，matchMotifs获取motif比对情况；详细计算原理是：

对每个sample计算获得hexamer bias table（we computed the Tn5 bias for each sample by constructing a hexamer bias table，意思应该是计算Tn5本身结合序列的偏好性，用来校正背景expected Tn5 hexamer frequency)，以及对每个TF计算hexamers在距离motif 中心±250 bp范围内的counts数（observed Tn5 hexamer frequency）

寻找显著的TF occupancy：计算方法如下

Using the sample’s hexamer frequency table, we could then compute the expected Tn5 insertions by multiplying the hexamer position frequency table by the observed/expected Tn5 hexamer frequency. For analysis of TF motifs present in the +5 kb enhancer of PDCD1, we searched for CIS-BP motifs with a LogOdds threshold greater than 10.

同时为了保证footprints的可重复性，作者使用了降采样方法

To assess the reproducibility of footprints, we subsampled fragments in each cluster 2 times at a sampling rate of 60% to have maximum variability.

ChromVAR

这个接着上一步，计算global TF activity：计算TF在不同样本里活性的GC bias-corrected deviation scores，详细可以见http://www.bioconductor.org/packages/release/bioc/vignettes/chromVAR/inst/doc/Introduction.html，具体这个包的函数怎么用的我会在后续代码实战中介绍

Computing gene activity scores using Cicero co-accessibility

很重要的包Cicero，专门处理scATAC的数据，解决了这类稀疏性数据预测enhancer-promoter pairs的问题：http://www.bioconductor.org/packages/release/bioc/vignettes/cicero/inst/doc/website.html#introduction

原理是给定k值先进行k近邻聚类，聚类的每个cluster作为bulk整合peaks数据（k值的设定很重要），并制作cicero_cds对象：

Cicero calculates peak-to-peak links based on their co-accessibility across groups of cells that are aggregated using a nearest-neighbor approach (k = 50)

we created a ‘cicero_cds’ with k = 50 and the ‘reduced_coordinates’ from the corresponding UMAP coordinates. This function returns aggregated accessibility across groupings of cells based on nearest-neighbor rules.

以peaks中心对所有peaks进行resize，每个peaks都归一化成+-250kb的窗口，计算这些归一化后的peaks和原来peaks中心的重叠情况，意思是如果peaks之间距离小于250kb就会算成是一个peak-to-peak link

We then used this aggregated accessibility matrix to identify all peak-to-peak linkages that were within 250 kb by resizing the peaks to 250 kb and then overlapping them with the peak summits/centers

获得peak-to-peak link后，对每一个link计算Pearson相关系数，获得connections data.frame，前两列分别为peaki和peakj，第三列为Pearson相关系数

接着，作者利用TxDb.Hsapiens. UCSC.hg19.knownGene制作了gene data.frame，全部resize成其tss位点，并以tss为中心制作±2.5 kb的windows，这样一来就可以利用该基因注释信息对cicero_cds对象进行注释，并进一步利用build_gene_activity_matrix函数（输入connectons和cicero_cds)和基于co-accessibility计算gene score（比如一个tss如果它的linked peaks对应的TF counts数多，那么这个tss对应的gene得分就高）

Analysis of autoimmune variants using Cicero co-accessibility and chromVAR

当然除了用cicero分析gene activity，还可以分析开放染色质区和snp这类疾病易感位点的"links"

首先从http://pubs.broadinstitute.org/pubs/finemapping/dataportal.php上下载casal snps及其enhancer注释，然后把snp转成GRange对象，求其与ATAC-seq peaks的overlap；相当于把上一步的peak-to-peak link改成peak-to-snp overlap；而且作者还考虑到了那些同样拥有snp位点的peaks之间的link：

beyond direct overlaps, we used Cicero co-accessibility links to include peaks that were co-accessible with those containing SNPs. To do this analysis, for every peak with an SNP overlap, we calculated and included peaks that were co-accessible above 0.35.

同样地，在样本间snp分布的偏向性，可以用chromVAR计算deviation score

引用：[https://doi.org/10.1038/s41587-019-0206-z]