参照 Signac 官方分析流程 Analyzing PBMC scATAC-seq • Signac (stuartlab.org)
安装
setRepositories(ind=1:3) # needed to automatically install Bioconductor dependencies
install.packages("Signac")
或使用 conda 进行安装:
conda install -c bioconda r-signac
安装基因组组装和注释R包
# Human hg 19
BiocManager::install(c('BSgenome.Hsapiens.UCSC.hg19', 'EnsDb.Hsapiens.v75'))
# Human hg38
BiocManager::install(c('BSgenome.Hsapiens.UCSC.hg38', 'EnsDb.Hsapiens.v86'))
# Mouse mm10
BiocManager::install(c('BSgenome.Mmusculus.UCSC.mm10', 'EnsDb.Mmusculus.v79'))
参考数据
以 pbmc 为例进行分析
wget https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz.tbi
数据预处理
加载 R 包
library(Signac)
library(Seurat)
library(EnsDb.Hsapiens.v75)
library(ggplot2)
library(patchwork)
输入文件:
- Peak/Cell matrux: atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5 数据矩阵,行代表一个区域(peak),值表示每个peak在barcode的Tn5整合位点的数量;
- Fragment file: atac_v1_pbmc_10k_fragments.tsv.gz 所有细胞中所有唯一fragment的列表;
在 cellranger-atac count 的输出文件夹中都存在这些文件。
创建 Seurat 对象:
counts <- Read10X_h5(filename = "../vignette_data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
file = "../vignette_data/atac_v1_pbmc_10k_singlecell.csv",
header = TRUE,
row.names = 1
)
chrom_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
fragments = '../vignette_data/atac_v1_pbmc_10k_fragments.tsv.gz',
min.cells = 10,
min.features = 200
)
pbmc <- CreateSeuratObject(
counts = chrom_assay,
assay = "peaks",
meta.data = metadata
)
展示 pbmc,也可以添加一些附加信息 pbmc[['peaks']]
使用 grange(pbmc)
查看对象中的每个特征相关的基因组范围等信息。
pbmc
## An object of class Seurat
## 87561 features across 8728 samples within 1 assay
## Active assay: peaks (87561 features, 0 variable features)
## 2 layers present: counts, data
另一种读入文件创建 Seurat 对象:
counts <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx")
barcodes <- readLines("filtered_peak_bc_matrix/barcodes.tsv")
peaks <- read.table("filtered_peak_bc_matrix/peaks.bed", sep="\t")
peaknames <- paste(peaks$V1, peaks$V2, peaks$V3, sep="-")
colnames(counts) <- barcodes
rownames(counts) <- peaknames
fragpath <- './data/atac_v1_pbmc_10k_fragments.tsv.gz'
# Define cells
# If you already have a list of cell barcodes to use you can skip this step
total_counts <- CountFragments(fragpath)
cutoff <- 1000 # Change this number depending on your dataset!
barcodes <- total_counts[total_counts$frequency_count > cutoff, ]$CB
# Create a fragment object
frags <- CreateFragmentObject(path = fragpath, cells = barcodes)
# First call peaks on the dataset
# If you already have a set of peaks you can skip this step
peaks <- CallPeaks(frags)
# Quantify fragments in each peak
counts <- FeatureMatrix(fragments = frags, features = peaks, cells = barcodes)
利用 EnsDb 数据库中的R包为 pbmc 添加注释信息:
# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75)
# change to UCSC style since the data was mapped to hg19
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg19"
计算 QC 指标
一般使用以下指标来评估数据质量,设定过滤参数:
- Nucleosome banding pattern:核小体缠绕模式。计算每一个细胞中单核小体与无核小体片段的近似比率 (nucleosome_signal),DNA片段大小的直方图显示出与包裹在单个核小体周围的DNA长度相对应的强核小体带模式。
- TSS enrichment score:转录起始位点(TSS)富集分数。以TSS为中心的片段与TSS侧翼区域片段的比例定义了 TSS.enrichment,数值通常越高越好。
- Total number of fragments in peaks:峰中片段总数。衡量细胞测序深度/复杂性。剔除测序深度低,reads很少的细胞。具有极高水平的细胞可能表现为双峰、核团或其他人工产物,也应该剔除。
- Fraction of fragments in peaks:片段在峰中的比例。表示所有片段落在ATAC-seq峰中的比例。值较低的细胞(即<15-20%)通常是应该移除的低质量细胞或技术伪影。
- Ration reads in genomic blacklist regions:黑名单区域列表,通常是与人工信号相关的reads,应剔除。具有高比例的reads比对到这些区域的细胞(与reads比对到峰值的细胞相比)通常表示技术伪影,应该删除。blacklist 包括人类(hg19和GRCh38)、小鼠(mm10)、果蝇(dm3)和秀丽隐杆线虫(ce10)。
计算这些指标,并可视化:
# compute nucleosome signal score per cell
pbmc <- NucleosomeSignal(object = pbmc)
# compute TSS enrichment score per cell
pbmc <- TSSEnrichment(object = pbmc, fast = FALSE)
# add blacklist ratio and fraction of reads in peaks
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$passed_filters * 100
pbmc$blacklist_ratio <- pbmc$blacklist_region_fragments / pbmc$peak_region_fragments
可视化变量之间的相关性:
设置 quantiles = TRUE
可以快速找到适合不同QC指标的cut off
DensityScatter(pbmc, x = 'nCount_peaks', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
可视化 TSS 富集信号:
pbmc$high.tss <- ifelse(pbmc$TSS.enrichment > 3, 'High', 'Low')
TSSPlot(pbmc, group.by = 'high.tss') + NoLegend()
可视化所有细胞的片段长度周期性:
pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(object = pbmc, group.by = 'nucleosome_group')
使用小提琴图可视化各个QC
VlnPlot(
object = pbmc,
features = c('nCount_peaks', 'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal', 'pct_reads_in_peaks'),
pt.size = 0.1,
ncol = 5
)
通过各个QC指标,设定过滤参数:
pbmc <- subset(
x = pbmc,
subset = nCount_peaks > 3000 &
nCount_peaks < 30000 &
pct_reads_in_peaks > 15 &
blacklist_ratio < 0.05 &
nucleosome_signal < 4 &
TSS.enrichment > 3
)
pbmc
## An object of class Seurat
## 87561 features across 7307 samples within 1 assay
## Active assay: peaks (87561 features, 0 variable features)
## 2 layers present: counts, data
标准化和线性降维 (LSI)
- 标准化:TF-IDF,既可以对细胞间标准化,以纠正细胞测序深度的差异,也可以对peaks间进行标准化;
- 特征选择:可以像scRNA-seq那样,选择前n%的特征进行降维;
- 降维:使用上述特征对 TD-IDF矩阵进行奇异值分解(SVD),返回类似于scRNA-seq中的PCA输出。
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
可视化结果:
DepthCor(pbmc)
注:上述图中,LSI组分1和总计数有很强的相关性,为避免因测序深度将细胞分在一起,后续分析应去除组分1。
非线性降维和聚类
使用scRNA-seq的 Seurat 中的方法来进行基于图的聚类和非线性降维实现可视化。
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3)
DimPlot(object = pbmc, label = TRUE) + NoLegend()