Signac是一个可以处理scATAC-seq、scCUT&Tag和scNTT-seq等检测单细胞水平染色质状态的R包,其主要步骤包括:
Create object
QC
Normalization
Dimension Reduction and Clustering
Inference of gene activity
Identify differential peaks
Plotting genomic regions
本文只讨论代码实现,具体原理可以参考Stuart et al. 2021 (Single-cell chromatin state analysis with Signac)
Data Download
The inputs files of signac can be generated from cellranger
本文用到的文件包括
Raw data: Peak-by-Cell matrix. 行为基因组区域(peaks),列为细胞barcode,其中的值为比对到该peaks上的read counts(详见https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/output/matrices)
Metadata
-
Fragments file: 在所有细胞中检测到的不重复的测序片段数目(详见10x Genomics website)。
```{r} # example lines of fragments file #chr1 9994 10157 AAAGGTTAGTTGGGCC-1 1 #chr1 10000 10144 AGGTCTTAGCCAGGTC-1 1 #chr1 10001 10103 GTTCTCATCCGGTTGA-1 1 #chr1 10001 10208 GCTTAGTAGTTATCTC-1 1 ```
Fragments file index
# raw data
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.h5
# metadata
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_singlecell.csv
# fragments file
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz
# fragments files index
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz.tbi
Installation
if (!requireNamespace("Signac", quietly = TRUE)) install.packages("Signac")
if (!requireNamespace("Seurat", quietly = TRUE)) install.packages("Seurat")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("patchwork", quietly = TRUE)) install.packages("patchwork")
if (!requireNamespace("ensembldb", quietly = TRUE)) BiocManager::install('ensembldb')
if (!requireNamespace("biovizBase", quietly = TRUE)) BiocManager::install('biovizBase')
Pre-processing
library(Signac)
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 'SeuratObject' was built under R 4.4.0 but the current version is
## 4.4.1; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
library(GenomicRanges)
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following object is masked from 'package:SeuratObject':
##
## intersect
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:sp':
##
## %over%
## Loading required package: GenomeInfoDb
library(ggplot2)
library(patchwork)
Create Seurat object using peak/cell matrix and metadata
counts <- Read10X_h5(filename = "10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
file = "10k_pbmc_ATACv2_nextgem_Chromium_Controller_singlecell.csv",
header = TRUE,
row.names = 1
)
head(metadata)
## total duplicate chimeric unmapped lowmapq mitochondrial
## NO_BARCODE 20081988 3972351 3423 2118408 1159348 23059
## AAACGAAAGAAACGCC-1 4 0 0 0 0 0
## AAACGAAAGAAAGCAG-1 3 0 0 0 0 0
## AAACGAAAGAAAGGGT-1 2 0 0 0 0 0
## AAACGAAAGAAATACC-1 17 0 0 0 0 0
## AAACGAAAGAAATCTG-1 1 0 0 0 0 0
## nonprimary passed_filters is__cell_barcode excluded_reason
## NO_BARCODE 4131 12801268 0 0
## AAACGAAAGAAACGCC-1 0 4 0 0
## AAACGAAAGAAAGCAG-1 0 3 0 0
## AAACGAAAGAAAGGGT-1 0 2 0 0
## AAACGAAAGAAATACC-1 0 17 0 0
## AAACGAAAGAAATCTG-1 0 1 0 2
## TSS_fragments DNase_sensitive_region_fragments
## NO_BARCODE 0 0
## AAACGAAAGAAACGCC-1 1 0
## AAACGAAAGAAAGCAG-1 0 0
## AAACGAAAGAAAGGGT-1 0 0
## AAACGAAAGAAATACC-1 11 0
## AAACGAAAGAAATCTG-1 0 0
## enhancer_region_fragments promoter_region_fragments
## NO_BARCODE 0 0
## AAACGAAAGAAACGCC-1 0 0
## AAACGAAAGAAAGCAG-1 0 0
## AAACGAAAGAAAGGGT-1 0 0
## AAACGAAAGAAATACC-1 0 0
## AAACGAAAGAAATCTG-1 0 0
## on_target_fragments blacklist_region_fragments
## NO_BARCODE 0 0
## AAACGAAAGAAACGCC-1 1 0
## AAACGAAAGAAAGCAG-1 0 0
## AAACGAAAGAAAGGGT-1 0 0
## AAACGAAAGAAATACC-1 11 0
## AAACGAAAGAAATCTG-1 0 0
## peak_region_fragments peak_region_cutsites
## NO_BARCODE 0 0
## AAACGAAAGAAACGCC-1 2 4
## AAACGAAAGAAAGCAG-1 1 2
## AAACGAAAGAAAGGGT-1 1 2
## AAACGAAAGAAATACC-1 14 27
## AAACGAAAGAAATCTG-1 0 0
chrom_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
fragments = "10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz",
min.cells = 10,
min.features = 200
)
## Computing hash
chrom_assay
## ChromatinAssay data with 165434 features for 10246 cells
## Variable features: 0
## Genome:
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 1
pbmc_atac <- CreateSeuratObject(
counts = chrom_assay,
assay = "peaks",
meta.data = metadata
)
pbmc_atac
## An object of class Seurat
## 165434 features across 10246 samples within 1 assay
## Active assay: peaks (165434 features, 0 variable features)
## 2 layers present: counts, data
pbmc_atac@meta.data[1:3,]
## orig.ident nCount_peaks nFeature_peaks total duplicate
## AAACGAAAGAGAGGTA-1 SeuratProject 32618 12605 59781 33159
## AAACGAAAGCAGGAGG-1 SeuratProject 13293 5392 19399 9003
## AAACGAAAGGAAGAAC-1 SeuratProject 36155 13306 64452 31013
## chimeric unmapped lowmapq mitochondrial nonprimary
## AAACGAAAGAGAGGTA-1 1 633 3228 221 17
## AAACGAAAGCAGGAGG-1 1 231 1137 2 3
## AAACGAAAGGAAGAAC-1 4 549 4538 182 4
## passed_filters is__cell_barcode excluded_reason
## AAACGAAAGAGAGGTA-1 22522 1 0
## AAACGAAAGCAGGAGG-1 9022 1 0
## AAACGAAAGGAAGAAC-1 28162 1 0
## TSS_fragments DNase_sensitive_region_fragments
## AAACGAAAGAGAGGTA-1 9962 0
## AAACGAAAGCAGGAGG-1 5362 0
## AAACGAAAGGAAGAAC-1 13887 0
## enhancer_region_fragments promoter_region_fragments
## AAACGAAAGAGAGGTA-1 0 0
## AAACGAAAGCAGGAGG-1 0 0
## AAACGAAAGGAAGAAC-1 0 0
## on_target_fragments blacklist_region_fragments
## AAACGAAAGAGAGGTA-1 9962 0
## AAACGAAAGCAGGAGG-1 5362 0
## AAACGAAAGGAAGAAC-1 13887 0
## peak_region_fragments peak_region_cutsites
## AAACGAAAGAGAGGTA-1 17067 32618
## AAACGAAAGCAGGAGG-1 6888 13293
## AAACGAAAGGAAGAAC-1 19132 36155
pbmc_atac@assays$peaks
## ChromatinAssay data with 165434 features for 10246 cells
## Variable features: 0
## Genome:
## Annotation present: FALSE
## Motifs present: FALSE
## Fragment files: 1
如果没有h5文件,可以通过counts matrix (.mtx), cell barcodes (barcodes.tsv), peak file (.bed) 来创建对象
counts <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx") barcodes <- readLines("filtered_peak_bc_matrix/barcodes.tsv") peaks <- read.table("filtered_peak_bc_matrix/peaks.bed", sep="\t") peaknames <- paste(peaks$V1, peaks$V2, peaks$V3, sep="-") colnames(counts) <- barcodes rownames(counts) <- peaknames
或者直接用fragments文件创建
fragpath <- '10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz' # Define cells # If you already have a list of cell barcodes to use you can skip this step total_counts <- CountFragments(fragpath) cutoff <- 1000 # Change this number depending on your dataset! barcodes <- total_counts[total_counts$frequency_count > cutoff, ]$CB # Create a fragment object frags <- CreateFragmentObject(path = fragpath, cells = barcodes) # First call peaks on the dataset # If you already have a set of peaks you can skip this step peaks <- CallPeaks(frags) # Quantify fragments in each peak counts <- FeatureMatrix(fragments = frags, features = peaks, cells = barcodes)
接下来,只保留比对到标准染色体(22+2)上的peaks
seqlevels(granges(pbmc_atac))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5"
## [6] "chr6" "chr7" "chr8" "chr9" "chr10"
## [11] "chr11" "chr12" "chr13" "chr14" "chr15"
## [16] "chr16" "chr17" "chr18" "chr19" "chr20"
## [21] "chr21" "chr22" "chrX" "chrY" "GL000194.1"
## [26] "GL000195.1" "GL000205.2" "GL000218.1" "GL000219.1" "KI270711.1"
## [31] "KI270713.1" "KI270721.1" "KI270726.1" "KI270727.1" "KI270734.1"
peaks.keep <- seqnames(granges(pbmc_atac)) %in% standardChromosomes(granges(pbmc_atac))
pbmc_atac <- pbmc_atac[as.vector(peaks.keep), ]
Peak annotation
在该示例中,10x Genomics 使用的参考基因组文件为“GRCh38-2020-A”,其对应的注释为Ensembl v98
其他版本的对应关系可在该网站找到:https://www.ensembl.org/info/website/archives/assembly.html
接下来,使用AnnotationHub进行基因组区域注释
library(AnnotationHub)
## Loading required package: BiocFileCache
## Loading required package: dbplyr
ah <- AnnotationHub()
# Search for the Ensembl 98 EnsDb for Homo sapiens on AnnotationHub
query(ah, "EnsDb.Hsapiens.v98")
## AnnotationHub with 1 record
## # snapshotDate(): 2024-04-30
## # names(): AH75011
## # $dataprovider: Ensembl
## # $species: Homo sapiens
## # $rdataclass: EnsDb
## # $rdatadateadded: 2019-05-02
## # $title: Ensembl 98 EnsDb for Homo sapiens
## # $description: Gene and protein annotations for Homo sapiens based on Ensem...
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("98", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
## # "Protein", "Transcript")
## # retrieve record with 'object[["AH75011"]]'
ensdb_v98 <- ah[["AH75011"]]
## loading from cache
## require("ensembldb")
# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = ensdb_v98)
# change to UCSC style since the data was mapped to hg38
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg38"
# add the gene information to the object
Annotation(pbmc_atac) <- annotations
Annotation(pbmc_atac)
## GRanges object with 3200207 ranges and 5 metadata columns:
## seqnames ranges strand | tx_id gene_name
## <Rle> <IRanges> <Rle> | <character> <character>
## ENSE00001489430 chrX 276322-276394 + | ENST00000399012 PLCXD1
## ENSE00001536003 chrX 276324-276394 + | ENST00000484611 PLCXD1
## ENSE00002160563 chrX 276353-276394 + | ENST00000430923 PLCXD1
## ENSE00001750899 chrX 281055-281121 + | ENST00000445062 PLCXD1
## ENSE00001719251 chrX 281194-281256 + | ENST00000429181 PLCXD1
## ... ... ... ... . ... ...
## ENST00000361739 chrMT 7586-8269 + | ENST00000361739 MT-CO2
## ENST00000361789 chrMT 14747-15887 + | ENST00000361789 MT-CYB
## ENST00000361851 chrMT 8366-8572 + | ENST00000361851 MT-ATP8
## ENST00000361899 chrMT 8527-9207 + | ENST00000361899 MT-ATP6
## ENST00000362079 chrMT 9207-9990 + | ENST00000362079 MT-CO3
## gene_id gene_biotype type
## <character> <character> <factor>
## ENSE00001489430 ENSG00000182378 protein_coding exon
## ENSE00001536003 ENSG00000182378 protein_coding exon
## ENSE00002160563 ENSG00000182378 protein_coding exon
## ENSE00001750899 ENSG00000182378 protein_coding exon
## ENSE00001719251 ENSG00000182378 protein_coding exon
## ... ... ... ...
## ENST00000361739 ENSG00000198712 protein_coding cds
## ENST00000361789 ENSG00000198727 protein_coding cds
## ENST00000361851 ENSG00000228253 protein_coding cds
## ENST00000361899 ENSG00000198899 protein_coding cds
## ENST00000362079 ENSG00000198938 protein_coding cds
## -------
## seqinfo: 25 sequences (1 circular) from hg38 genome
Computing QC Metrics
scATAC-seq 一般用到的QC指标包括:
- Nucleosome banding pattern: 考虑到核小体由~147bp缠绕的组蛋白八聚体组成,质量较好的scATAC-seq数据的片段大小会展示出核小体的特征(~147bp).
Signac
computes the ratio of fragments between 147 bp and 294 bp (mononucleosome) to fragments < 147 bp (nucleosome-free). Greater NS indicates poor quality.In ATAC-seq experiments, tagmentation of Tn5 transposases produces signature size pattern of fragments derived from nucleosome-free regions (NFR), mononucleosome, dinucleosome, trinucleosome and longer oligonucleosome from open chromatin regions (Figure below, adapted from Li et al ).
Transcriptional start site (TSS) enrichment score: 正在转录的基因的TSS区域常呈现开放的状态,质量较好的scATAC-seq数据的read counts也会在TSS呈现富集的特征
Total number of fragments in peaks
Fraction of fragments in peaks
Ratio reads in genomic blacklist region: ENCODE定义了一系列黑名单区域,比对到该区域的片段有可能是由于实验误差引起的. 这里Signac统计每个细胞中fragments比对到黑名单区域的比例。
blacklist regions provided by ENCODE (https://github.com/Boyle-Lab/Blacklist)
# compute nucleosome signal score per cell
pbmc_atac <- NucleosomeSignal(object = pbmc_atac)
# compute TSS enrichment score per cell
pbmc_atac <- TSSEnrichment(object = pbmc_atac)
## Extracting TSS positions
## Extracting fragments at TSSs
##
## Computing TSS enrichment score
# add fraction of reads in peaks
pbmc_atac$pct_reads_in_peaks <- pbmc_atac$peak_region_fragments / pbmc_atac$passed_filters * 100
# add blacklist ratio
pbmc_atac$blacklist_ratio <- FractionCountsInRegion(
object = pbmc_atac,
assay = 'peaks',
regions = blacklist_hg38_unified
)
pbmc_atac@meta.data[1:3,]
## orig.ident nCount_peaks nFeature_peaks total duplicate
## AAACGAAAGAGAGGTA-1 SeuratProject 32614 12603 59781 33159
## AAACGAAAGCAGGAGG-1 SeuratProject 13287 5390 19399 9003
## AAACGAAAGGAAGAAC-1 SeuratProject 36139 13300 64452 31013
## chimeric unmapped lowmapq mitochondrial nonprimary
## AAACGAAAGAGAGGTA-1 1 633 3228 221 17
## AAACGAAAGCAGGAGG-1 1 231 1137 2 3
## AAACGAAAGGAAGAAC-1 4 549 4538 182 4
## passed_filters is__cell_barcode excluded_reason
## AAACGAAAGAGAGGTA-1 22522 1 0
## AAACGAAAGCAGGAGG-1 9022 1 0
## AAACGAAAGGAAGAAC-1 28162 1 0
## TSS_fragments DNase_sensitive_region_fragments
## AAACGAAAGAGAGGTA-1 9962 0
## AAACGAAAGCAGGAGG-1 5362 0
## AAACGAAAGGAAGAAC-1 13887 0
## enhancer_region_fragments promoter_region_fragments
## AAACGAAAGAGAGGTA-1 0 0
## AAACGAAAGCAGGAGG-1 0 0
## AAACGAAAGGAAGAAC-1 0 0
## on_target_fragments blacklist_region_fragments
## AAACGAAAGAGAGGTA-1 9962 0
## AAACGAAAGCAGGAGG-1 5362 0
## AAACGAAAGGAAGAAC-1 13887 0
## peak_region_fragments peak_region_cutsites nucleosome_signal
## AAACGAAAGAGAGGTA-1 17067 32618 0.5054022
## AAACGAAAGCAGGAGG-1 6888 13293 0.4906667
## AAACGAAAGGAAGAAC-1 19132 36155 0.6323430
## nucleosome_percentile TSS.enrichment TSS.percentile
## AAACGAAAGAGAGGTA-1 0.06 6.403815 0.92
## AAACGAAAGCAGGAGG-1 0.04 8.516617 1.00
## AAACGAAAGGAAGAAC-1 0.40 4.847291 0.10
## pct_reads_in_peaks blacklist_ratio
## AAACGAAAGAGAGGTA-1 75.77924 0.0001839701
## AAACGAAAGCAGGAGG-1 76.34671 0.0001505231
## AAACGAAAGGAAGAAC-1 67.93552 0.0006087606
可视化QC指标
DensityScatter(pbmc_atac, x = 'nCount_peaks', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
summary(pbmc_atac$nucleosome_signal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2255 0.5921 0.6608 0.8395 0.7429 19.8889
pbmc_atac$nucleosome_group <- ifelse(pbmc_atac$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(pbmc_atac, group.by = 'nucleosome_group')
## Warning: Removed 63 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).
VlnPlot(
object = pbmc_atac,
features = c('nCount_peaks', 'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal', 'pct_reads_in_peaks'),
pt.size = 0.1,
ncol = 5
)
以下使用标准过滤pbmc的数据,实际cutoff需要根据数据调整
pbmc_atac <- subset(
x = pbmc_atac,
subset = nCount_peaks > 9000 &
nCount_peaks < 100000 &
pct_reads_in_peaks > 40 &
blacklist_ratio < 0.01 &
nucleosome_signal < 4 &
TSS.enrichment > 4
)
pbmc_atac
## An object of class Seurat
## 165376 features across 9651 samples within 1 assay
## Active assay: peaks (165376 features, 0 variable features)
## 2 layers present: counts, data
Normalization and linear dimension reduction
- Normalization: Signac perform term frequency-inverse document frequency (TF-IDF) normalization. 这一步既对不同细胞的测序深度进行校正,也对rare peaks赋予更高的值
- Feature Selection: use top n% of features for dimension reduction. Here, use all features.
- Dimension reduction: run singular value decomposition (SVD) on the TF-IDF matrix. SVD与PCA类似都用于线性降维
pbmc_atac <- RunTFIDF(pbmc_atac)
## Performing TF-IDF normalization
pbmc_atac <- FindTopFeatures(pbmc_atac, min.cutoff = 'q0')
pbmc_atac <- RunSVD(pbmc_atac)
## Running SVD
## Scaling cell embeddings
降维后,可以观察singular values和测序深度的关系,避免降维矩阵捕捉到测序引起的技术误差而非生物学变量相关的信息
DepthCor(pbmc_atac)
考虑到第一个成分和细胞测序深度有较强的相关性,在后续分析中排除掉该成分进行分析。
换别的Normalization是否可以消除测序深度的variation?
Non-linear dimension reduction and clustering
pbmc_atac <- RunUMAP(pbmc_atac, reduction = 'lsi', dims = 2:30)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 16:16:18 UMAP embedding parameters a = 0.9922 b = 1.112
## 16:16:18 Read 9651 rows and found 29 numeric columns
## 16:16:18 Using Annoy for neighbor search, n_neighbors = 30
## 16:16:18 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 16:16:18 Writing NN index file to temp file /var/folders/yk/896pglhn555drshbgn27f71h0000gn/T//Rtmp2EI0Kk/file68bc42d57dde
## 16:16:18 Searching Annoy index using 1 thread, search_k = 3000
## 16:16:20 Annoy recall = 100%
## 16:16:20 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:16:22 Initializing from normalized Laplacian + noise (using RSpectra)
## 16:16:22 Commencing optimization for 500 epochs, with 384638 positive edges
## 16:16:29 Optimization finished
pbmc_atac <- FindNeighbors(pbmc_atac, reduction = 'lsi', dims = 2:30)
## Computing nearest neighbor graph
## Computing SNN
pbmc_atac <- FindClusters(pbmc_atac, verbose = FALSE, algorithm = 3)
DimPlot(pbmc_atac, label = TRUE) + NoLegend()
Create gene activity matrix
计算每个基因的染色质开放程度来评估基因活性。这里提取每个基因及其上游2kb的区域(作为promoter),并计算区域内fragments的数目。
注意这里是分析人类的数据,因此promoter区域设置为上游2kb,不同物种需要调整该区域的长度
gene.activities <- GeneActivity(pbmc_atac)
## Extracting gene coordinates
## Warning in SingleFeatureMatrix(fragment = fragments[[x]], features = features,
## : 13 features are on seqnames not present in the fragment file. These will be
## removed.
## Extracting reads overlapping genomic regions
# add the gene activity matrix to the Seurat object as a new assay and normalize it
pbmc_atac[['RNA']] <- CreateAssayObject(counts = gene.activities)
pbmc_atac <- NormalizeData(
object = pbmc_atac,
assay = 'RNA',
normalization.method = 'LogNormalize',
scale.factor = median(pbmc_atac$nCount_RNA))
DefaultAssay(pbmc_atac) <- 'RNA'
FeaturePlot(
object = pbmc_atac,
features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
pt.size = 0.1,
max.cutoff = 'q95',
ncol = 3
)
Find differentially accessible peaks between clusters
接下来,鉴定不同clusters之间的差异peaks。
在做这一步之前,也可以基于ATAC信号进行细胞注释。如果是同时测了单细胞转录组的数据(e.g., 10x Multiome),还可以与注释好的scRNA-seq进行整合,并转移 cell labels.
这里简单展示clusters之间的差异peaks鉴定过程
# peak-level assessment
DefaultAssay(pbmc_atac) <- 'peaks'
de_peak12 <- FindMarkers(pbmc_atac,
ident.1 = "1",
ident.2 = "2",
test.use = 'wilcox',
min.pct = 0.1)
head(de_peak12)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## chr1-159076655-159077519 1.329425e-103 3.166030 0.478 0.056 2.198550e-98
## chr14-80930680-80931604 2.548539e-87 2.594749 0.465 0.076 4.214672e-82
## chr14-64755601-64756513 5.721793e-87 3.072965 0.418 0.048 9.462473e-82
## chr15-97960010-97960929 6.801525e-86 -1.258617 0.407 0.791 1.124809e-80
## chr16-56261934-56262834 6.700869e-76 3.941570 0.333 0.024 1.108163e-70
## chr1-235999140-235999977 5.621705e-71 2.158735 0.449 0.093 9.296951e-66
For all clusters
de_peak_all <- FindAllMarkers(pbmc_atac,
test.use = 'wilcox',
min.pct = 0.1)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
## Calculating cluster 13
## Calculating cluster 14
## Calculating cluster 15
## Calculating cluster 16
## Calculating cluster 17
Visualization
Selected peak
plot1 <- VlnPlot(
object = pbmc_atac,
features = 'chr17-78198651-78199583',
pt.size = 0
)
plot2 <- FeaturePlot(
object = pbmc_atac,
features = 'chr17-78198651-78199583',
pt.size = 0.1
)
wrap_elements(plot1) + plot2 + plot_layout(nrow=1, widths = c(2,1))
Enrichment analysis
接下来,可以找到与该染色体坐标较近的基因进行分析。这里我们与cluster 0的特异性peaks为例
peaks_cl0 <- subset(de_peak_all, cluster == '0' & avg_log2FC >= 4 & p_val_adj < 0.05)$gene
closest_genes_cl0 <- ClosestFeature(pbmc_atac, regions = peaks_cl0)
head(closest_genes_cl0)
## tx_id gene_name gene_id gene_biotype type
## ENSE00001143632 ENST00000232424 HES1 ENSG00000114315 protein_coding exon
## ENST00000615008 ENST00000615008 TMEM230 ENSG00000089063 protein_coding utr
## ENST00000366988 ENST00000366988 NENF ENSG00000117691 protein_coding utr
## ENSE00001818341 ENST00000493582 HDAC4 ENSG00000068024 protein_coding exon
## ENST00000544990 ENST00000544990 CPSF7 ENSG00000149532 protein_coding gap
## ENST00000610470 ENST00000610470 PRKN ENSG00000185345 protein_coding gap
## closest_region query_region distance
## ENSE00001143632 chr3-194136148-194136488 chr3-194089270-194090185 45962
## ENST00000615008 chr20-5068232-5068259 chr20-5065671-5066557 1674
## ENST00000366988 chr1-212446007-212446379 chr1-212485664-212486562 39284
## ENSE00001818341 chr2-239401402-239401654 chr2-239465135-239466062 63480
## ENST00000544990 chr11-61416520-61419948 chr11-61417414-61418287 0
## ENST00000610470 chr6-161360206-161386793 chr6-161369071-161369942 0
进行GO富集分析
library(clusterProfiler)
##
## clusterProfiler v4.12.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
##
## Please cite:
##
## Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He.
## clusterProfiler: an R package for comparing biological themes among
## gene clusters. OMICS: A Journal of Integrative Biology. 2012,
## 16(5):284-287
##
## Attaching package: 'clusterProfiler'
## The following objects are masked from 'package:ensembldb':
##
## filter, select
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following object is masked from 'package:IRanges':
##
## slice
## The following object is masked from 'package:S4Vectors':
##
## rename
## The following object is masked from 'package:stats':
##
## filter
library(org.Hs.eg.db)
##
library(enrichplot)
cl0_ego <- enrichGO(gene = closest_genes_cl0$gene_id,
keyType = "ENSEMBL",
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE)
barplot(cl0_ego,showCategory = 20)
Plot genomic regions
根据clusters之间的相似程度进行cluster排序
pbmc_atac <- SortIdents(pbmc_atac)
## Creating pseudobulk profiles for 18 variables across 165376 features
## Computing euclidean distance between pseudobulk profiles
## Clustering distance matrix
levels(Idents(pbmc_atac))
## [1] "17" "15" "7" "16" "14" "0" "10" "12" "2" "6" "11" "8" "5" "1" "13"
## [16] "4" "3" "9"
这里,我们可视化NMNAT2基因的结合谱图,该基因具有cluster 0中富集的peak
# find DA peaks overlapping gene of interest
regions_highlight <- subsetByOverlaps(StringToGRanges(peaks_cl0), LookupGeneCoords(pbmc_atac, "NMNAT2"))
CoveragePlot(
object = pbmc_atac,
region = "NMNAT2",
region.highlight = regions_highlight,
extend.upstream = 1000,
extend.downstream = 1000
)
Summary
使用Signac对单细胞染色质相关数据进行常规分析,我们通常需要:
- 创建对象
- 根据数据特征进行质量控制
- 数据校正
- 降维及聚类
- 细胞注释
- 鉴定细胞特异性染色质开放区域
- 后续的定制化分析,例如富集分析、可视化特定区域等。
以上就是对Signac分析流程的简单介绍。
完。