CellBender[Background Removal Guidance for Single Cell Gene Expression Datasets Using CellBender]

使用 CellBender 的单细胞基因表达数据集背景去除指南

介绍

10x Genomics 单细胞分析通常会产生低水平的背景 RNA 计数。一个来源是细胞悬浮液中的无细胞 RNA 分子（环境 mRNA），这可能导致 UMI 计数膨胀，即使在无细胞 GEM 中也是如此。环境 mRNA 可能来自破裂、死亡或垂死的细胞，或其他外源污染源。如果不加以考虑，尤其是如果以中等到高水平存在，背景 UMI 计数可能会导致下游分析中的系统偏差或批次效应（Fleming 等人）。

在某些组织类型中，高背景计数往往更频繁地出现（Madisson 等人），尤其是在处理单细胞核样本时。细胞核制备方案通常将细胞质 RNA 释放到溶液中（请参阅我们的细胞核分离产品）。由于背景 RNA 增加，条形码等级图可能没有陡峭的悬崖。然而，即使没有背景减法，这些数据仍然可用。

为了通过计算识别和去除环境 RNA（特别是在细胞核数据集中），可以使用社区开发的工具。在本文中，我们将使用CellBender python 包中实现的删除背景算法（Fleming 等人）。CellBender 使用机器学习模型来训练背景 RNA 谱，区分含细胞液滴和空液滴，并检索更清晰的基因表达谱。然后我们从 python 切换到 R，以进一步探索 Seurat 中的下游聚类。

先决条件

本文假设原始序列数据已经使用或bcl2fastq管道进行解复用mkfastq（即您从 FASTQ 文件开始）。此外，您必须运行计数或多cellranger 管道来对齐和注释读取、条形码错误校正和生成特征条形码矩阵。

CellBender 中的所有步骤均使用 python。如果您的系统中尚未安装 python，请安装 python。您也可以考虑使用任何IDE运行 python 。如果您没有 conda 环境，您可以尝试miniconda。

本文所用到的软件均可以在Linux服务器上运行，本文所用到的软件版本包括：

Python 3.8
CellBender v0.2.0
R v4.0
Seurat v4

安装 CellBender

创建conda环境并激活，我们使用python3.8来适合我们的系统环境。

conda create -n CellBender python=3.8
source activate CellBender

按照此处的说明安装必备模块和 CellBender 。本文使用的 CellBender 版本是 v0.2.0。

如何运行 CellBender

本节使用CellBender 教程中的命令。如果您从本节开始，请激活 CellBender 环境。

使用以下命令从输出中加载原始特征条形码矩阵文件。cellranger

cellbender remove-background \
                 --input raw_feature_bc_matrix.h5 \
                 --output output.h5 \
                 --expected-cells (value) \
                 --total-droplets-included (value) \
                 --fpr 0.01 \
                 --epochs 150

上面列出的 CellBender 参数是什么意思？

remove-background：这是 CellBender 包中的一个模块。它实现了一种快速且可扩展的最大似然 (ML) 推理算法，该算法输入原始的基因细胞计数矩阵和 UMI 信息以及某些参数。然后，它从计数矩阵中过滤任何环境 RNA（和条形码交换读取），生成一个新的计数矩阵并确定哪些条形码包含真实细胞。

--expected-cells:这对应于实验设计中预期的恢复/目标细胞数。选择您定位的数量。估算此数量的一种方法是使用Cell Ranger 输出的web_summary.html文件中找到的细胞指标。对于本教程中使用的良好和受损数据集，我们分别使用了 10k 和 5k 个细胞。

--total-droplets-included:选择一个将几千个条形码延伸到“空液滴平台”的数字。请参见下面的 UMI 曲线（图 1，参考），其中合适的初始选择是 15,000 或 20,000。但是，这会随数据集中的噪声程度而变化。UMI 曲线上此数字右侧的每个液滴都应该几乎是空的。这种 UMI 曲线可以在cellranger count 的 web summary.html 输出中看到（图 2，技术说明）。尝试包括一些您认为是空的液滴。但请注意，这个数字越大，算法运行的时间越长。在本教程中，我们使用 30,000。

image.png

左图：读取 UMI 曲线（注意过渡区域是我们在 CellBender 中感兴趣的区域） 右图：来自 Cell Ranger 网页摘要的典型条形码等级图

--fpr:默认值为 0.01。您可以通过选择几个值（例如 0.01 0.05、0.1、0.3）来生成几个输出计数矩阵，并比较下游结果（通过可视化簇和细胞类型注释结果，更多详细信息请参阅后面的部分）。在本教程中，我们对良好数据集使用默认值（0.01），对受损数据集使用 0.3。

--epochs:epochs 是机器学习算法中使用的训练数据集的完整传递次数。epochs 的数量可以设置为 1 到无穷大之间的整数值。对于单细胞 GEX 数据集，150 通常是一个不错的起始选择（参考）。

所有其他可选参数都在此处讨论。

结果与讨论

使用的数据集：在本教程中，我们将使用良好的 PBMC 数据集和受损的数据集。您也可以使用自己的任何数据集来遵循本教程。

外周血单个核细胞 10k

您可以在此处下载干净的 PBMC 10k v3 数据集，或者使用以下命令行下载：

wget https://cf.10xgenomics.com/samples/cell-exp/6.1.2/10k_PBMC_3p_nextgem_Chromium_X/10k_PBMC_3p_nextgem_Chromium_X_raw_feature_bc_matrix.h5

对于此数据集，以下命令用于运行 CellBender。在具有 12 个内核和 96 GB RAM 的本地服务器上运行此 PBMC 10k 数据集大约需要 5 小时。如果您有 CUDA GPU，您可以考虑使用模式以缩短分析时间。 --cuda

cellbender remove-background \
                 --input 10k_PBMC_3p_nextgem_Chromium_X_raw_feature_bc_matrix.h5 \
                 --output output.h5 \
                 --expected-cells 10000 \
                 --total-droplets-included 30000 \
                 --fpr 0.01 \
                 --epochs 150

在这个 10K 数据集上运行 CellBender 后，我们获得的输出之一是 PDF，其中显示了下面按等级排序的总体 UMI 图，其中包含处于过渡区域的细胞。

image.png

从生成的cell_barcodes.csv文件中，我们可以确定模糊区域包含另外 272 个细胞。这是一个干净的数据集，我们不希望有大量背景。CellBender 生成一个新的过滤计数矩阵，确定哪些条形码包含可用于进一步下游分析的细胞。CellBender 运行的所有输出文件均在此处说明。

受损的细胞核数据集

对于不太干净的数据集，经过几次迭代后，使用以下参数来运行 CellBender。

cellbender remove-background \
                 --input raw_feature_bc_matrix.h5 \
                 --output output.h5 \
                 --expected-cells 5000 \
                 --total-droplets-included 30000 \
                 --fpr 0.3 \
                 --epochs 150

在这个细胞核数据集上运行 CellBender 之后，我们得到了下面按等级排序的总体 UMI 图，该图显示了处于过渡区域的细胞。

image.png

从结果中，我们可以确定模糊区域包含另外 1,430 个细胞（基于生成的cell_barcodes.csv文件）。此模型数据集的背景水平较高，因此我们预计remove-background分析管道将返回更多细胞。CellBender 生成一个新的过滤计数矩阵，确定哪些条形码包含可用于进一步下游分析的细胞。CellBender 运行的所有输出文件均在此处说明。

下游分析：聚类和细胞类型注释

使用 CellBender 清理数据后，您可能有兴趣进行其他下游分析，例如聚类和细胞类型注释（另请参阅有关细胞类型注释方法的网络资源）。您还可以考虑在进行任何下游分析之前直接在 CellBender 生成的计数矩阵上应用任何质量控制过滤器。

在本文中，我们将进一步分析，说明如何使用Seurat 的R引导式聚类教程remove-background对 CellBender 的输出进行聚类分析。此分析是在受损的细胞核数据集上进行的。来自 CellBender管道的过滤特征条形码矩阵用作此分析的输入。细胞类型注释是使用自定义方法执行的，您可以考虑使用最适合您的数据集的注释方法。下面显示的是分别使用 Cell Ranger 和 CellBender 的输出的 UMAP 和细胞类型注释结果。

image.png

在我们的分析中，我们发现使用 Seurat 参数中 R 代码的相同值可以获得相同的簇FindClusters。两次运行都返回了相同的细胞类型，这表明直接使用 Cell Ranger 结果对于这种情况是合适的。完整的会话信息可通过下面的代码模板获得。

# ----------------------------------------------------------------------------------------------------
# Disclaimer: The code block below is for illustrations only.
# It has been referenced from various resources and modified for the dataset used for this article.
# 10x Genomics does not support or guarantee third-party tools.
======================================================================================================

####################
# Install packages #
####################
install.packages('remotes')
remotes::install_version(package = 'Seurat', version = package_version('x.y.z'))
library(Matrix)
library(Seurat)

# ----------------------------------------------------------------------------------------------------------
# The below library is for CellBender output version compatibility with the current version of Seurat v4 Tutorial.
# Please find the issue commit here (https://github.com/broadinstitute/CellBender/issues/57)
# This may or may not be applicable to CellBender versions above v0.2.0
# ============================================================================================================

ReadCB_h5 <- function(filename, use.names = TRUE, unique.features = TRUE) {
  if (!requireNamespace('hdf5r', quietly = TRUE)) {
    stop("Please install hdf5r to read HDF5 files")
  }
  if (!file.exists(filename)) {
    stop("File not found")
  }
  infile <- hdf5r::H5File$new(filename = filename, mode = 'r')
  genomes <- names(x = infile)
  output <- list()
  if (hdf5r::existsGroup(infile, 'matrix')) {
    # cellranger version 3
    message('CellRanger version 3+ format H5')
    if (use.names) {
      feature_slot <- 'features/name'
    } else {
      feature_slot <- 'features/id'
    }
  } else {
    message('CellRanger version 2 format H5')
    if (use.names) {
      feature_slot <- 'gene_names'
    } else {
      feature_slot <- 'genes'
    }
  }
  for (genome in genomes) {
    counts <- infile[[paste0(genome, '/data')]]
    indices <- infile[[paste0(genome, '/indices')]]
    indptr <- infile[[paste0(genome, '/indptr')]]
    shp <- infile[[paste0(genome, '/shape')]]
    features <- infile[[paste0(genome, '/', feature_slot)]][]
    barcodes <- infile[[paste0(genome, '/barcodes')]]
    sparse.mat <- sparseMatrix(
      i = indices[] + 1,
      p = indptr[],
      x = as.numeric(x = counts[]),
      dims = shp[],
      giveCsparse = FALSE
    )
    if (unique.features) {
      features <- make.unique(names = features)
    }
    rownames(x = sparse.mat) <- features
    colnames(x = sparse.mat) <- barcodes[]
    sparse.mat <- as(object = sparse.mat, Class = 'dgCMatrix')
    # Split v3 multimodal
    if (infile$exists(name = paste0(genome, '/features'))) {
      types <- infile[[paste0(genome, '/features/feature_type')]][]
      types.unique <- unique(x = types)
      if (length(x = types.unique) > 1) {
        message("Genome ", genome, " has multiple modalities, returning a list of matrices for this genome")
        sparse.mat <- sapply(
          X = types.unique,
          FUN = function(x) {
            return(sparse.mat[which(x = types == x), ])
          },
          simplify = FALSE,
          USE.NAMES = TRUE
        )
      }
    }
    output[[genome]] <- sparse.mat
  }
  infile$close_all()
  if (length(x = output) == 1) {
    return(output[[genome]])
  } else{
    return(output)
  }
}

# --------------------------------------------------------------------------------------------------------
# Below is an example template for running clustering analysis based on Seurat Guided clustering tutorial.
# You may consider fine tuning the parameters specific for your datasets.
# ==========================================================================================================

file <- "/Users/nisha.pillai/Documents/cbfirstrun_filtered.h5"
data <- ReadCB_h5(filename = file, use.names = TRUE)

obj <- CreateSeuratObject(counts = data, min.cells = 3, min.features = 200)
VlnPlot(obj, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)
cbres <- NormalizeData(obj)
cbres <- FindVariableFeatures(cbres, selection.method = "vst", nfeatures = 2000)
cbres <- ScaleData(cbres)
cbres <- FindVariableFeatures(cbres)
cbres = RunPCA(cbres)
cbres <- FindNeighbors(cbres, dims = 1:10)
cbres <- FindClusters(cbres, resolution = 0.5)
cbres <- RunUMAP(cbres, dims = 1:9)
DimPlot(cbres, reduction = "umap")
#cbres <- FindNeighbors(cbres, k.param = 50)

# ---------------------------------------------------------------------------------------------------------
# Differentially expressed genes for the first cluster.
# You may loop this for remaining clusters to find the top significant genes for cell annotation purposes.
# ===========================================================================================================

cluster0.markers <- FindMarkers(cbres, ident.1 = 0, min.pct = 0.25, only.pos = TRUE)

# ----------------------------------------------------------------------------------------------------
# The below annotations is based on the the top significant genes per cluster.
# For the compromised dataset, the annotations were generated using custom methods.
# You may consider using any cell type annotation method of your choice for your specific projects.
# The below rename of cluster ids works for the compromised dataset that had returned 9 clusters in UMAP;
# Please customize the renaming of cluster ids according to the UMAP clusters & annotations specific for your data.
# =====================================================================================================
new.cluster.ids <- c("Cholangiocytes", "Basal Cells", "T-cells", "Tuft cells", "Smooth muscle cells", "Fibroblasts", "Enterocytes", "Endothelial cells", "Unknown")
names(new.cluster.ids) <- levels(cbres)
cbres <- RenameIdents(cbres, new.cluster.ids)
DimPlot(cbres, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

# -----------------------------------------------
# session info used for the clustering analysis
# ===============================================
print(sessionInfo())
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS  10.16
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] sp_1.5-0           SeuratObject_4.1.0 Seurat_4.1.1       Matrix_1.4-1

结论

总之，cellbender remove-background恢复了更多细胞，其中许多细胞位于 UMI 曲线的下方。为了验证这些细胞是否与其他细胞调用聚类在一起，可以进行聚类分析。可以从 CellBender 运行生成的过滤矩阵 h5 文件可用于第三方工具（如上图所示），或者可以将 CellBender 运行中的细胞条形码 CSV 文件作为参数提供给管道，以在cellranger reanalyzeLoupe Browser 上可视化聚类结果。

CellBender[Background Removal Guidance for Single Cell Gene Expression Datasets Using CellBender]

使用 CellBender 的单细胞基因表达数据集背景去除指南

介绍

先决条件

安装 CellBender

如何运行 CellBender

结果与讨论

外周血单个核细胞 10k

受损的细胞核数据集

下游分析：聚类和细胞类型注释

结论

推荐阅读更多精彩内容