Signac｜入门

Signac是一个可以处理scATAC-seq、scCUT&Tag和scNTT-seq等检测单细胞水平染色质状态的R包，其主要步骤包括：

Create object
QC
Normalization
Dimension Reduction and Clustering
Inference of gene activity
Identify differential peaks
Plotting genomic regions

本文只讨论代码实现，具体原理可以参考Stuart et al. 2021 (Single-cell chromatin state analysis with Signac)

Data Download

The inputs files of signac can be generated from cellranger

本文用到的文件包括

Raw data: Peak-by-Cell matrix. 行为基因组区域（peaks），列为细胞barcode，其中的值为比对到该peaks上的read counts（详见https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/output/matrices）
Metadata

Fragments file: 在所有细胞中检测到的不重复的测序片段数目（详见10x Genomics website）。

```{r}
# example lines of fragments file
#chr1    9994    10157   AAAGGTTAGTTGGGCC-1      1
#chr1    10000   10144   AGGTCTTAGCCAGGTC-1      1
#chr1    10001   10103   GTTCTCATCCGGTTGA-1      1
#chr1    10001   10208   GCTTAGTAGTTATCTC-1      1
```

Fragments file index

# raw data
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.h5
# metadata
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_singlecell.csv
# fragments file
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz
# fragments files index
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz.tbi

Installation

if (!requireNamespace("Signac", quietly = TRUE)) install.packages("Signac")
if (!requireNamespace("Seurat", quietly = TRUE)) install.packages("Seurat")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("patchwork", quietly = TRUE)) install.packages("patchwork")
if (!requireNamespace("ensembldb", quietly = TRUE)) BiocManager::install('ensembldb')
if (!requireNamespace("biovizBase", quietly = TRUE)) BiocManager::install('biovizBase')

Pre-processing

library(Signac)
library(Seurat)

## Loading required package: SeuratObject

## Loading required package: sp

## 'SeuratObject' was built under R 4.4.0 but the current version is
## 4.4.1; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed

## 
## Attaching package: 'SeuratObject'

## The following objects are masked from 'package:base':
## 
##     intersect, t

library(GenomicRanges)

## Loading required package: stats4

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following object is masked from 'package:SeuratObject':
## 
##     intersect

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
##     tapply, union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:sp':
## 
##     %over%

## Loading required package: GenomeInfoDb

library(ggplot2)
library(patchwork)

Create Seurat object using peak/cell matrix and metadata

counts <- Read10X_h5(filename =                 "10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.h5")

metadata <- read.csv(
  file = "10k_pbmc_ATACv2_nextgem_Chromium_Controller_singlecell.csv",
  header = TRUE,
  row.names = 1
)
head(metadata)

##                       total duplicate chimeric unmapped lowmapq mitochondrial
## NO_BARCODE         20081988   3972351     3423  2118408 1159348         23059
## AAACGAAAGAAACGCC-1        4         0        0        0       0             0
## AAACGAAAGAAAGCAG-1        3         0        0        0       0             0
## AAACGAAAGAAAGGGT-1        2         0        0        0       0             0
## AAACGAAAGAAATACC-1       17         0        0        0       0             0
## AAACGAAAGAAATCTG-1        1         0        0        0       0             0
##                    nonprimary passed_filters is__cell_barcode excluded_reason
## NO_BARCODE               4131       12801268                0               0
## AAACGAAAGAAACGCC-1          0              4                0               0
## AAACGAAAGAAAGCAG-1          0              3                0               0
## AAACGAAAGAAAGGGT-1          0              2                0               0
## AAACGAAAGAAATACC-1          0             17                0               0
## AAACGAAAGAAATCTG-1          0              1                0               2
##                    TSS_fragments DNase_sensitive_region_fragments
## NO_BARCODE                     0                                0
## AAACGAAAGAAACGCC-1             1                                0
## AAACGAAAGAAAGCAG-1             0                                0
## AAACGAAAGAAAGGGT-1             0                                0
## AAACGAAAGAAATACC-1            11                                0
## AAACGAAAGAAATCTG-1             0                                0
##                    enhancer_region_fragments promoter_region_fragments
## NO_BARCODE                                 0                         0
## AAACGAAAGAAACGCC-1                         0                         0
## AAACGAAAGAAAGCAG-1                         0                         0
## AAACGAAAGAAAGGGT-1                         0                         0
## AAACGAAAGAAATACC-1                         0                         0
## AAACGAAAGAAATCTG-1                         0                         0
##                    on_target_fragments blacklist_region_fragments
## NO_BARCODE                           0                          0
## AAACGAAAGAAACGCC-1                   1                          0
## AAACGAAAGAAAGCAG-1                   0                          0
## AAACGAAAGAAAGGGT-1                   0                          0
## AAACGAAAGAAATACC-1                  11                          0
## AAACGAAAGAAATCTG-1                   0                          0
##                    peak_region_fragments peak_region_cutsites
## NO_BARCODE                             0                    0
## AAACGAAAGAAACGCC-1                     2                    4
## AAACGAAAGAAAGCAG-1                     1                    2
## AAACGAAAGAAAGGGT-1                     1                    2
## AAACGAAAGAAATACC-1                    14                   27
## AAACGAAAGAAATCTG-1                     0                    0

chrom_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"),
  fragments = "10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz",
  min.cells = 10,
  min.features = 200
)

## Computing hash

chrom_assay

## ChromatinAssay data with 165434 features for 10246 cells
## Variable features: 0 
## Genome: 
## Annotation present: FALSE 
## Motifs present: FALSE 
## Fragment files: 1

pbmc_atac <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = metadata
)
pbmc_atac

## An object of class Seurat 
## 165434 features across 10246 samples within 1 assay 
## Active assay: peaks (165434 features, 0 variable features)
##  2 layers present: counts, data

pbmc_atac@meta.data[1:3,]

##                       orig.ident nCount_peaks nFeature_peaks total duplicate
## AAACGAAAGAGAGGTA-1 SeuratProject        32618          12605 59781     33159
## AAACGAAAGCAGGAGG-1 SeuratProject        13293           5392 19399      9003
## AAACGAAAGGAAGAAC-1 SeuratProject        36155          13306 64452     31013
##                    chimeric unmapped lowmapq mitochondrial nonprimary
## AAACGAAAGAGAGGTA-1        1      633    3228           221         17
## AAACGAAAGCAGGAGG-1        1      231    1137             2          3
## AAACGAAAGGAAGAAC-1        4      549    4538           182          4
##                    passed_filters is__cell_barcode excluded_reason
## AAACGAAAGAGAGGTA-1          22522                1               0
## AAACGAAAGCAGGAGG-1           9022                1               0
## AAACGAAAGGAAGAAC-1          28162                1               0
##                    TSS_fragments DNase_sensitive_region_fragments
## AAACGAAAGAGAGGTA-1          9962                                0
## AAACGAAAGCAGGAGG-1          5362                                0
## AAACGAAAGGAAGAAC-1         13887                                0
##                    enhancer_region_fragments promoter_region_fragments
## AAACGAAAGAGAGGTA-1                         0                         0
## AAACGAAAGCAGGAGG-1                         0                         0
## AAACGAAAGGAAGAAC-1                         0                         0
##                    on_target_fragments blacklist_region_fragments
## AAACGAAAGAGAGGTA-1                9962                          0
## AAACGAAAGCAGGAGG-1                5362                          0
## AAACGAAAGGAAGAAC-1               13887                          0
##                    peak_region_fragments peak_region_cutsites
## AAACGAAAGAGAGGTA-1                 17067                32618
## AAACGAAAGCAGGAGG-1                  6888                13293
## AAACGAAAGGAAGAAC-1                 19132                36155

pbmc_atac@assays$peaks

## ChromatinAssay data with 165434 features for 10246 cells
## Variable features: 0 
## Genome: 
## Annotation present: FALSE 
## Motifs present: FALSE 
## Fragment files: 1

如果没有h5文件，可以通过counts matrix (.mtx), cell barcodes (barcodes.tsv), peak file (.bed) 来创建对象

counts <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx")
barcodes <- readLines("filtered_peak_bc_matrix/barcodes.tsv")
peaks <- read.table("filtered_peak_bc_matrix/peaks.bed", sep="\t")
peaknames <- paste(peaks$V1, peaks$V2, peaks$V3, sep="-")

colnames(counts) <- barcodes
rownames(counts) <- peaknames

或者直接用fragments文件创建

fragpath <- '10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz'

# Define cells
 # If you already have a list of cell barcodes to use you can skip this step
 total_counts <- CountFragments(fragpath)
 cutoff <- 1000 # Change this number depending on your dataset!
 barcodes <- total_counts[total_counts$frequency_count > cutoff, ]$CB
 
# Create a fragment object
 frags <- CreateFragmentObject(path = fragpath, cells = barcodes)
 
# First call peaks on the dataset
 # If you already have a set of peaks you can skip this step
 peaks <- CallPeaks(frags)

# Quantify fragments in each peak
 counts <- FeatureMatrix(fragments = frags, features = peaks, cells = barcodes)

接下来，只保留比对到标准染色体（22+2）上的peaks

seqlevels(granges(pbmc_atac))

##  [1] "chr1"       "chr2"       "chr3"       "chr4"       "chr5"      
##  [6] "chr6"       "chr7"       "chr8"       "chr9"       "chr10"     
## [11] "chr11"      "chr12"      "chr13"      "chr14"      "chr15"     
## [16] "chr16"      "chr17"      "chr18"      "chr19"      "chr20"     
## [21] "chr21"      "chr22"      "chrX"       "chrY"       "GL000194.1"
## [26] "GL000195.1" "GL000205.2" "GL000218.1" "GL000219.1" "KI270711.1"
## [31] "KI270713.1" "KI270721.1" "KI270726.1" "KI270727.1" "KI270734.1"

peaks.keep <- seqnames(granges(pbmc_atac)) %in% standardChromosomes(granges(pbmc_atac))
pbmc_atac <- pbmc_atac[as.vector(peaks.keep), ]

Peak annotation

在该示例中，10x Genomics 使用的参考基因组文件为“GRCh38-2020-A”，其对应的注释为Ensembl v98

其他版本的对应关系可在该网站找到：https://www.ensembl.org/info/website/archives/assembly.html

接下来，使用AnnotationHub进行基因组区域注释

library(AnnotationHub)

## Loading required package: BiocFileCache

## Loading required package: dbplyr

ah <- AnnotationHub()

# Search for the Ensembl 98 EnsDb for Homo sapiens on AnnotationHub
query(ah, "EnsDb.Hsapiens.v98")

## AnnotationHub with 1 record
## # snapshotDate(): 2024-04-30
## # names(): AH75011
## # $dataprovider: Ensembl
## # $species: Homo sapiens
## # $rdataclass: EnsDb
## # $rdatadateadded: 2019-05-02
## # $title: Ensembl 98 EnsDb for Homo sapiens
## # $description: Gene and protein annotations for Homo sapiens based on Ensem...
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("98", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
## #   "Protein", "Transcript") 
## # retrieve record with 'object[["AH75011"]]'

ensdb_v98 <- ah[["AH75011"]]

## loading from cache

## require("ensembldb")

# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = ensdb_v98)

# change to UCSC style since the data was mapped to hg38
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg38"
# add the gene information to the object
Annotation(pbmc_atac) <- annotations

Annotation(pbmc_atac) 

## GRanges object with 3200207 ranges and 5 metadata columns:
##                   seqnames        ranges strand |           tx_id   gene_name
##                      <Rle>     <IRanges>  <Rle> |     <character> <character>
##   ENSE00001489430     chrX 276322-276394      + | ENST00000399012      PLCXD1
##   ENSE00001536003     chrX 276324-276394      + | ENST00000484611      PLCXD1
##   ENSE00002160563     chrX 276353-276394      + | ENST00000430923      PLCXD1
##   ENSE00001750899     chrX 281055-281121      + | ENST00000445062      PLCXD1
##   ENSE00001719251     chrX 281194-281256      + | ENST00000429181      PLCXD1
##               ...      ...           ...    ... .             ...         ...
##   ENST00000361739    chrMT     7586-8269      + | ENST00000361739      MT-CO2
##   ENST00000361789    chrMT   14747-15887      + | ENST00000361789      MT-CYB
##   ENST00000361851    chrMT     8366-8572      + | ENST00000361851     MT-ATP8
##   ENST00000361899    chrMT     8527-9207      + | ENST00000361899     MT-ATP6
##   ENST00000362079    chrMT     9207-9990      + | ENST00000362079      MT-CO3
##                           gene_id   gene_biotype     type
##                       <character>    <character> <factor>
##   ENSE00001489430 ENSG00000182378 protein_coding     exon
##   ENSE00001536003 ENSG00000182378 protein_coding     exon
##   ENSE00002160563 ENSG00000182378 protein_coding     exon
##   ENSE00001750899 ENSG00000182378 protein_coding     exon
##   ENSE00001719251 ENSG00000182378 protein_coding     exon
##               ...             ...            ...      ...
##   ENST00000361739 ENSG00000198712 protein_coding      cds
##   ENST00000361789 ENSG00000198727 protein_coding      cds
##   ENST00000361851 ENSG00000228253 protein_coding      cds
##   ENST00000361899 ENSG00000198899 protein_coding      cds
##   ENST00000362079 ENSG00000198938 protein_coding      cds
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome

Computing QC Metrics

scATAC-seq 一般用到的QC指标包括：

Nucleosome banding pattern: 考虑到核小体由～147bp缠绕的组蛋白八聚体组成，质量较好的scATAC-seq数据的片段大小会展示出核小体的特征（～147bp）.

Signac computes the ratio of fragments between 147 bp and 294 bp (mononucleosome) to fragments < 147 bp (nucleosome-free). Greater NS indicates poor quality.

Quailty Control for ATAC-seq

In ATAC-seq experiments, tagmentation of Tn5 transposases produces signature size pattern of fragments derived from nucleosome-free regions (NFR), mononucleosome, dinucleosome, trinucleosome and longer oligonucleosome from open chromatin regions (Figure below, adapted from Li et al ).

Transcriptional start site (TSS) enrichment score: 正在转录的基因的TSS区域常呈现开放的状态，质量较好的scATAC-seq数据的read counts也会在TSS呈现富集的特征
Total number of fragments in peaks
Fraction of fragments in peaks
Ratio reads in genomic blacklist region: ENCODE定义了一系列黑名单区域，比对到该区域的片段有可能是由于实验误差引起的. 这里Signac统计每个细胞中fragments比对到黑名单区域的比例。

blacklist regions provided by ENCODE (https://github.com/Boyle-Lab/Blacklist)

# compute nucleosome signal score per cell
pbmc_atac <- NucleosomeSignal(object = pbmc_atac)

# compute TSS enrichment score per cell
pbmc_atac <- TSSEnrichment(object = pbmc_atac)

## Extracting TSS positions

## Extracting fragments at TSSs

## 
## Computing TSS enrichment score

# add fraction of reads in peaks
pbmc_atac$pct_reads_in_peaks <- pbmc_atac$peak_region_fragments / pbmc_atac$passed_filters * 100

# add blacklist ratio
pbmc_atac$blacklist_ratio <- FractionCountsInRegion(
  object = pbmc_atac, 
  assay = 'peaks',
  regions = blacklist_hg38_unified
)
pbmc_atac@meta.data[1:3,]

##                       orig.ident nCount_peaks nFeature_peaks total duplicate
## AAACGAAAGAGAGGTA-1 SeuratProject        32614          12603 59781     33159
## AAACGAAAGCAGGAGG-1 SeuratProject        13287           5390 19399      9003
## AAACGAAAGGAAGAAC-1 SeuratProject        36139          13300 64452     31013
##                    chimeric unmapped lowmapq mitochondrial nonprimary
## AAACGAAAGAGAGGTA-1        1      633    3228           221         17
## AAACGAAAGCAGGAGG-1        1      231    1137             2          3
## AAACGAAAGGAAGAAC-1        4      549    4538           182          4
##                    passed_filters is__cell_barcode excluded_reason
## AAACGAAAGAGAGGTA-1          22522                1               0
## AAACGAAAGCAGGAGG-1           9022                1               0
## AAACGAAAGGAAGAAC-1          28162                1               0
##                    TSS_fragments DNase_sensitive_region_fragments
## AAACGAAAGAGAGGTA-1          9962                                0
## AAACGAAAGCAGGAGG-1          5362                                0
## AAACGAAAGGAAGAAC-1         13887                                0
##                    enhancer_region_fragments promoter_region_fragments
## AAACGAAAGAGAGGTA-1                         0                         0
## AAACGAAAGCAGGAGG-1                         0                         0
## AAACGAAAGGAAGAAC-1                         0                         0
##                    on_target_fragments blacklist_region_fragments
## AAACGAAAGAGAGGTA-1                9962                          0
## AAACGAAAGCAGGAGG-1                5362                          0
## AAACGAAAGGAAGAAC-1               13887                          0
##                    peak_region_fragments peak_region_cutsites nucleosome_signal
## AAACGAAAGAGAGGTA-1                 17067                32618         0.5054022
## AAACGAAAGCAGGAGG-1                  6888                13293         0.4906667
## AAACGAAAGGAAGAAC-1                 19132                36155         0.6323430
##                    nucleosome_percentile TSS.enrichment TSS.percentile
## AAACGAAAGAGAGGTA-1                  0.06       6.403815           0.92
## AAACGAAAGCAGGAGG-1                  0.04       8.516617           1.00
## AAACGAAAGGAAGAAC-1                  0.40       4.847291           0.10
##                    pct_reads_in_peaks blacklist_ratio
## AAACGAAAGAGAGGTA-1           75.77924    0.0001839701
## AAACGAAAGCAGGAGG-1           76.34671    0.0001505231
## AAACGAAAGGAAGAAC-1           67.93552    0.0006087606

可视化QC指标

DensityScatter(pbmc_atac, x = 'nCount_peaks', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)

summary(pbmc_atac$nucleosome_signal)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2255  0.5921  0.6608  0.8395  0.7429 19.8889

pbmc_atac$nucleosome_group <- ifelse(pbmc_atac$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
FragmentHistogram(pbmc_atac, group.by = 'nucleosome_group')

## Warning: Removed 63 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).

VlnPlot(
  object = pbmc_atac,
  features = c('nCount_peaks', 'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal', 'pct_reads_in_peaks'),
  pt.size = 0.1,
  ncol = 5
)

以下使用标准过滤pbmc的数据，实际cutoff需要根据数据调整

pbmc_atac <- subset(
  x = pbmc_atac,
  subset = nCount_peaks > 9000 &
    nCount_peaks < 100000 &
    pct_reads_in_peaks > 40 &
    blacklist_ratio < 0.01 &
    nucleosome_signal < 4 &
    TSS.enrichment > 4
)
pbmc_atac

## An object of class Seurat 
## 165376 features across 9651 samples within 1 assay 
## Active assay: peaks (165376 features, 0 variable features)
##  2 layers present: counts, data

Normalization and linear dimension reduction

Normalization: Signac perform term frequency-inverse document frequency (TF-IDF) normalization. 这一步既对不同细胞的测序深度进行校正，也对rare peaks赋予更高的值
Feature Selection: use top n% of features for dimension reduction. Here, use all features.
Dimension reduction: run singular value decomposition (SVD) on the TF-IDF matrix. SVD与PCA类似都用于线性降维

pbmc_atac <- RunTFIDF(pbmc_atac)

## Performing TF-IDF normalization

pbmc_atac <- FindTopFeatures(pbmc_atac, min.cutoff = 'q0')
pbmc_atac <- RunSVD(pbmc_atac)

## Running SVD

## Scaling cell embeddings

降维后，可以观察singular values和测序深度的关系，避免降维矩阵捕捉到测序引起的技术误差而非生物学变量相关的信息

DepthCor(pbmc_atac)

考虑到第一个成分和细胞测序深度有较强的相关性，在后续分析中排除掉该成分进行分析。

换别的Normalization是否可以消除测序深度的variation？

Non-linear dimension reduction and clustering

pbmc_atac <- RunUMAP(pbmc_atac, reduction = 'lsi', dims = 2:30)

## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session

## 16:16:18 UMAP embedding parameters a = 0.9922 b = 1.112

## 16:16:18 Read 9651 rows and found 29 numeric columns

## 16:16:18 Using Annoy for neighbor search, n_neighbors = 30

## 16:16:18 Building Annoy index with metric = cosine, n_trees = 50

## 0%   10   20   30   40   50   60   70   80   90   100%

## [----|----|----|----|----|----|----|----|----|----|

## **************************************************|
## 16:16:18 Writing NN index file to temp file /var/folders/yk/896pglhn555drshbgn27f71h0000gn/T//Rtmp2EI0Kk/file68bc42d57dde
## 16:16:18 Searching Annoy index using 1 thread, search_k = 3000
## 16:16:20 Annoy recall = 100%
## 16:16:20 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 16:16:22 Initializing from normalized Laplacian + noise (using RSpectra)
## 16:16:22 Commencing optimization for 500 epochs, with 384638 positive edges
## 16:16:29 Optimization finished

pbmc_atac <- FindNeighbors(pbmc_atac, reduction = 'lsi', dims = 2:30)

## Computing nearest neighbor graph
## Computing SNN

pbmc_atac <- FindClusters(pbmc_atac, verbose = FALSE, algorithm = 3)
DimPlot(pbmc_atac, label = TRUE) + NoLegend()

Create gene activity matrix

计算每个基因的染色质开放程度来评估基因活性。这里提取每个基因及其上游2kb的区域（作为promoter），并计算区域内fragments的数目。

注意这里是分析人类的数据，因此promoter区域设置为上游2kb，不同物种需要调整该区域的长度

gene.activities <- GeneActivity(pbmc_atac)

## Extracting gene coordinates

## Warning in SingleFeatureMatrix(fragment = fragments[[x]], features = features,
## : 13 features are on seqnames not present in the fragment file. These will be
## removed.

## Extracting reads overlapping genomic regions

# add the gene activity matrix to the Seurat object as a new assay and normalize it
pbmc_atac[['RNA']] <- CreateAssayObject(counts = gene.activities)
pbmc_atac <- NormalizeData(
  object = pbmc_atac,
  assay = 'RNA',
  normalization.method = 'LogNormalize',
  scale.factor = median(pbmc_atac$nCount_RNA))

DefaultAssay(pbmc_atac) <- 'RNA'

FeaturePlot(
  object = pbmc_atac,
  features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
  pt.size = 0.1,
  max.cutoff = 'q95',
  ncol = 3
)

Find differentially accessible peaks between clusters

接下来，鉴定不同clusters之间的差异peaks。

在做这一步之前，也可以基于ATAC信号进行细胞注释。如果是同时测了单细胞转录组的数据（e.g., 10x Multiome），还可以与注释好的scRNA-seq进行整合，并转移 cell labels.

这里简单展示clusters之间的差异peaks鉴定过程

# peak-level assessment
DefaultAssay(pbmc_atac) <- 'peaks'

de_peak12 <- FindMarkers(pbmc_atac,
                         ident.1 = "1",
                         ident.2 = "2",
                         test.use = 'wilcox',
                         min.pct = 0.1)
head(de_peak12)

##                                  p_val avg_log2FC pct.1 pct.2    p_val_adj
## chr1-159076655-159077519 1.329425e-103   3.166030 0.478 0.056 2.198550e-98
## chr14-80930680-80931604   2.548539e-87   2.594749 0.465 0.076 4.214672e-82
## chr14-64755601-64756513   5.721793e-87   3.072965 0.418 0.048 9.462473e-82
## chr15-97960010-97960929   6.801525e-86  -1.258617 0.407 0.791 1.124809e-80
## chr16-56261934-56262834   6.700869e-76   3.941570 0.333 0.024 1.108163e-70
## chr1-235999140-235999977  5.621705e-71   2.158735 0.449 0.093 9.296951e-66

For all clusters

de_peak_all <- FindAllMarkers(pbmc_atac,
                         test.use = 'wilcox',
                         min.pct = 0.1)

## Calculating cluster 0

## Calculating cluster 1

## Calculating cluster 2

## Calculating cluster 3

## Calculating cluster 4

## Calculating cluster 5

## Calculating cluster 6

## Calculating cluster 7

## Calculating cluster 8

## Calculating cluster 9

## Calculating cluster 10

## Calculating cluster 11

## Calculating cluster 12

## Calculating cluster 13

## Calculating cluster 14

## Calculating cluster 15

## Calculating cluster 16

## Calculating cluster 17

Visualization

Selected peak

plot1 <- VlnPlot(
  object = pbmc_atac,
  features = 'chr17-78198651-78199583',
  pt.size = 0
)
plot2 <- FeaturePlot(
  object = pbmc_atac,
  features = 'chr17-78198651-78199583',
  pt.size = 0.1
)

wrap_elements(plot1) + plot2 + plot_layout(nrow=1, widths = c(2,1))

Enrichment analysis

接下来，可以找到与该染色体坐标较近的基因进行分析。这里我们与cluster 0的特异性peaks为例

peaks_cl0 <- subset(de_peak_all, cluster == '0' & avg_log2FC >= 4 & p_val_adj < 0.05)$gene
closest_genes_cl0 <- ClosestFeature(pbmc_atac, regions = peaks_cl0)
head(closest_genes_cl0)

##                           tx_id gene_name         gene_id   gene_biotype type
## ENSE00001143632 ENST00000232424      HES1 ENSG00000114315 protein_coding exon
## ENST00000615008 ENST00000615008   TMEM230 ENSG00000089063 protein_coding  utr
## ENST00000366988 ENST00000366988      NENF ENSG00000117691 protein_coding  utr
## ENSE00001818341 ENST00000493582     HDAC4 ENSG00000068024 protein_coding exon
## ENST00000544990 ENST00000544990     CPSF7 ENSG00000149532 protein_coding  gap
## ENST00000610470 ENST00000610470      PRKN ENSG00000185345 protein_coding  gap
##                           closest_region             query_region distance
## ENSE00001143632 chr3-194136148-194136488 chr3-194089270-194090185    45962
## ENST00000615008    chr20-5068232-5068259    chr20-5065671-5066557     1674
## ENST00000366988 chr1-212446007-212446379 chr1-212485664-212486562    39284
## ENSE00001818341 chr2-239401402-239401654 chr2-239465135-239466062    63480
## ENST00000544990  chr11-61416520-61419948  chr11-61417414-61418287        0
## ENST00000610470 chr6-161360206-161386793 chr6-161369071-161369942        0

进行GO富集分析

library(clusterProfiler)

## 

## clusterProfiler v4.12.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
## 
## Please cite:
## 
## Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He.
## clusterProfiler: an R package for comparing biological themes among
## gene clusters. OMICS: A Journal of Integrative Biology. 2012,
## 16(5):284-287

## 
## Attaching package: 'clusterProfiler'

## The following objects are masked from 'package:ensembldb':
## 
##     filter, select

## The following object is masked from 'package:AnnotationDbi':
## 
##     select

## The following object is masked from 'package:IRanges':
## 
##     slice

## The following object is masked from 'package:S4Vectors':
## 
##     rename

## The following object is masked from 'package:stats':
## 
##     filter

library(org.Hs.eg.db)

## 

library(enrichplot)

cl0_ego <- enrichGO(gene = closest_genes_cl0$gene_id,
                keyType = "ENSEMBL",
                OrgDb = org.Hs.eg.db,
                ont = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff = 0.05,
                qvalueCutoff = 0.05,
                readable = TRUE)

barplot(cl0_ego,showCategory = 20)

Plot genomic regions

根据clusters之间的相似程度进行cluster排序

pbmc_atac <- SortIdents(pbmc_atac)

## Creating pseudobulk profiles for 18 variables across 165376 features

## Computing euclidean distance between pseudobulk profiles

## Clustering distance matrix

levels(Idents(pbmc_atac))

##  [1] "17" "15" "7"  "16" "14" "0"  "10" "12" "2"  "6"  "11" "8"  "5"  "1"  "13"
## [16] "4"  "3"  "9"

这里，我们可视化NMNAT2基因的结合谱图，该基因具有cluster 0中富集的peak

# find DA peaks overlapping gene of interest
regions_highlight <- subsetByOverlaps(StringToGRanges(peaks_cl0), LookupGeneCoords(pbmc_atac, "NMNAT2"))

CoveragePlot(
  object = pbmc_atac,
  region = "NMNAT2",
  region.highlight = regions_highlight,
  extend.upstream = 1000,
  extend.downstream = 1000
)

Summary

使用Signac对单细胞染色质相关数据进行常规分析，我们通常需要：

创建对象
根据数据特征进行质量控制
数据校正
降维及聚类
细胞注释
鉴定细胞特异性染色质开放区域
后续的定制化分析，例如富集分析、可视化特定区域等。

以上就是对Signac分析流程的简单介绍。

完。

Ref: https://stuartlab.org/signac/articles/pbmc_vignette