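As the year draws to a close, I wanted to summarise some earlier material. cell2location is one tool I would particularly like to share again and study carefully.
Cell2location is a principled Bayesian model that can resolve fine-grained cell types in spatial transcriptomic data and create comprehensive cellular maps of diverse tissues.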
Given cell type annotation for each cell, the corresponding reference cell type signatures (g_{f,g}), which represent the average mRNA count of each gene (g) in each cell type (f), can be estimated from sc/snRNA-seq data using 2 provided methods (see below). Cell2location needs untransformed unnormalised spatial mRNA counts as input. You also need to provide cell2location with the expected average cell abundance per location which is used as a prior to guide estimation of absolute cell abundance. This value depends on the tissue and can be estimated by counting nuclei for a few locations in the paired histology image but can be approximate (see paper methods for more guidance).
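Cell2location provides 2 methods for estimating reference cell type signatures from scRNA-seq data:
- A statistical method based on negative binomial (NB) regression. NB regression is generally recommended because it allows data to be combined robustly across technologies and batches, which improves spatial mapping accuracy.
- A hard-coded computation of per-cluster average mRNA counts for individual genes (cell2location.cluster_averages.compute_cluster_averages). When batch effects are small, this faster hard-coded calculation of per-cluster averages provides similarly high accuracy.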
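To make the second option concrete, here is a minimal sketch of what a per-cluster average means, written with plain pandas. The count matrix, gene names and cluster labels are made up for illustration, and this is not the code inside compute_cluster_averages, only the same idea on toy data.

```python
import pandas as pd

# Toy count matrix: 6 cells (rows) x 3 genes (columns); all values are made up.
counts = pd.DataFrame(
    [[4, 0, 1],
     [6, 0, 3],
     [0, 5, 2],
     [0, 7, 2],
     [1, 1, 9],
     [3, 1, 11]],
    columns=["geneA", "geneB", "geneC"],
)
# Hypothetical cluster label for each cell, in the same row order.
clusters = pd.Series(["T", "T", "B", "B", "FDC", "FDC"], name="cluster")

# Average raw count of every gene within each cluster, transposed to
# genes x clusters (the layout of a signature table).
cluster_means = counts.groupby(clusters).mean().T
print(cluster_means)
```

Because this is a plain average of raw counts, any batch or technology effect in the cells goes straight into the signature, which is why the NB regression is preferred when batches differ.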
The code is also well worth studying.
Loading packages
import sys
# if branch is "stable", install from PyPI; otherwise install from source
branch = "github"
user = "BayraktarLab"
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB and branch == "stable":
    !pip install --quiet cell2location
elif IN_COLAB and branch != "stable":
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/BayraktarLab/cell2location#egg=cell2location[tutorials]
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import cell2location
import scvi
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text
import seaborn as sns
First, let’s define where we save the results of our analysis:
results_folder = './results/lymph_nodes_analysis/'
# create paths and names to results folders for reference regression and cell2location models
ref_run_name = f'{results_folder}/reference_signatures'
run_name = f'{results_folder}/cell2location_map'
- I really like this idiom: f'{results_folder}/reference_signatures'
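One small detail about this idiom: results_folder already ends with a slash, so the f-string produces a double slash. That is harmless on POSIX file systems, but if you prefer a normalised path, pathlib is one alternative. The sketch below only uses the standard library; the variable ref_run_path is my own name, not something from the tutorial.

```python
from pathlib import PurePosixPath

results_folder = './results/lymph_nodes_analysis/'

# The f-string keeps the trailing slash of results_folder, giving a double slash.
ref_run_name = f'{results_folder}/reference_signatures'
print(ref_run_name)   # ./results/lymph_nodes_analysis//reference_signatures

# pathlib drops the redundant separator (and the leading "./").
ref_run_path = PurePosixPath(results_folder) / 'reference_signatures'
print(ref_run_path)   # results/lymph_nodes_analysis/reference_signatures
```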
Loading Visium and scRNA-seq reference data
First, we read the spatial Visium data from the 10X Space Ranger output. This dataset can be conveniently downloaded and imported with scanpy.
adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]
# rename genes to ENSEMBL
adata_vis.var['SYMBOL'] = adata_vis.var_names
adata_vis.var_names = adata_vis.var['gene_ids']
adata_vis.var_names.name = None
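- The way extra information is added to the object read in by scanpy is well worth learning from.
Note! Mitochondria-encoded genes (gene names start with the prefix mt- or MT-) are irrelevant for spatial mapping because their expression represents technical artifacts in the single-cell and single-nucleus data rather than the biological abundance of mitochondria. Yet these genes make up 15-40% of the mRNA in each location. Hence, to avoid mapping artifacts, we strongly recommend removing mitochondrial genes.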
# find mitochondria-encoded (MT) genes
adata_vis.var['MT_gene'] = [gene.startswith('MT-') for gene in adata_vis.var['SYMBOL']]
# remove MT genes for spatial mapping (keeping their counts in the object)
adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()
adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]
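To see what the mitochondrial share looks like in numbers, here is a small NumPy-only sketch that computes the fraction of counts coming from MT genes at each location and then drops those genes. The gene symbols and counts are invented; on real data you would compute the same quantity from adata_vis before filtering.

```python
import numpy as np

# Toy data: 3 locations x 4 genes; the counts and gene symbols are made up.
symbols = np.array(["MT-CO1", "CD3D", "MT-ND1", "CR2"])
counts = np.array([
    [30, 50, 10, 10],
    [20, 60, 0, 20],
    [15, 45, 25, 15],
])

# Boolean mask of mitochondria-encoded genes (human symbols start with "MT-").
is_mt = np.char.startswith(symbols, "MT-")

# Fraction of each location's total counts that comes from MT genes.
mt_fraction = counts[:, is_mt].sum(axis=1) / counts.sum(axis=1)
print(mt_fraction)

# Drop MT genes before mapping, mirroring the tutorial code above.
counts_no_mt = counts[:, ~is_mt]
```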
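The single-cell reference includes scRNA-seq datasets spanning lymph nodes, spleen and tonsils, to ensure that we capture the full diversity of immune cell states likely to exist in the spatial transcriptomic dataset.
# Download data if not already here
import os
os.makedirs('./data', exist_ok=True)  # make sure the target folder exists
if not os.path.exists('./data/sc.h5ad'):
    !cd ./data/ && wget https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad
# Read data
adata_ref = sc.read('./data/sc.h5ad')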
# Use ENSEMBL as gene IDs to make sure IDs are unique and correctly matched
adata_ref.var['SYMBOL'] = adata_ref.var.index
adata_ref.var.index = adata_ref.var['GeneID-2'].copy()
adata_ref.var_names = adata_ref.var['GeneID-2'].copy()
adata_ref.var.index.name = None
adata_ref.raw.var['SYMBOL'] = adata_ref.raw.var.index
adata_ref.raw.var.index = adata_ref.raw.var['GeneID-2'].copy()
adata_ref.raw.var.index.name = None
# before estimating the reference cell type signatures we recommend performing very permissive gene selection
# in this 2D histogram the orange rectangle lies over the excluded genes.
# In this case, the downloaded dataset was already filtered using this method,
# hence no density under the orange rectangle
from cell2location.utils.filtering import filter_genes
selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)
# filter the object
adata_ref = adata_ref[:, selected].copy()
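For intuition about what "very permissive" selection could look like, here is a NumPy sketch of one such rule: keep a gene if it is detected in more than a given fraction of cells, or if it is detected in only a few cells but at a high average level in those cells. The function name permissive_gene_filter and the exact rule are my assumptions based on the parameter names; the real filter_genes may differ in detail.

```python
import numpy as np

def permissive_gene_filter(X, cell_count_cutoff=5,
                           cell_percentage_cutoff2=0.03,
                           nonz_mean_cutoff=1.12):
    """Return a boolean mask over genes (columns of a dense cells x genes matrix).

    Hypothetical rule, assumed from the parameter names only.
    """
    n_cells = (X > 0).sum(axis=0)                       # cells detecting each gene
    nonz_mean = X.sum(axis=0) / np.maximum(n_cells, 1)  # mean count in those cells
    in_many_cells = n_cells > X.shape[0] * cell_percentage_cutoff2
    in_few_but_high = (n_cells > cell_count_cutoff) & (nonz_mean > nonz_mean_cutoff)
    return in_many_cells | in_few_but_high

# Toy matrix: 1000 cells x 4 genes with made-up counts.
X = np.zeros((1000, 4))
X[:50, 0] = 1    # detected in 5% of cells: kept
X[:3, 1] = 10    # detected in only 3 cells: dropped
X[:6, 2] = 5     # 6 cells, high non-zero mean: kept
X[:6, 3] = 1     # 6 cells, low non-zero mean: dropped
selected_mask = permissive_gene_filter(X)
print(selected_mask)
```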
Estimation of reference cell type signatures (NB regression)
The signatures are estimated from scRNA-seq data, accounting for batch effects, using a negative binomial regression model. (Batch effects need to be taken into account here.)
# prepare anndata for the regression model
scvi.data.setup_anndata(adata=adata_ref,
                        # 10X reaction / sample / batch
                        batch_key='Sample',
                        # cell type, covariate used for constructing signatures
                        labels_key='Subset',
                        # multiplicative technical effects (platform, 3' vs 5', donor effect)
                        categorical_covariate_keys=['Method']
                       )
scvi.data.view_anndata_setup(adata_ref)
# create and train the regression model
from cell2location.models import RegressionModel
mod = RegressionModel(adata_ref)
# Use all data for training (validation not implemented yet, train_size=1)
mod.train(max_epochs=250, batch_size=2500, train_size=1, lr=0.002, use_gpu=True)
# plot ELBO loss history during training, removing first 20 epochs from the plot
mod.plot_history(20)
# In this section, we export the estimated reference expression signatures (summary of the posterior distribution).
adata_ref = mod.export_posterior(
    adata_ref, sample_kwargs={'num_samples': 1000, 'batch_size': 2500, 'use_gpu': True}
)
# Save model
mod.save(f"{ref_run_name}", overwrite=True)
# Save anndata object with results
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref.write(adata_file)
Examine QC plots
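- 1. Reconstruction accuracy, to assess whether there are any issues with inference.
- 2. Because of batch effects, the estimated expression signatures differ from the average expression in each cluster. For scRNA-seq datasets that do not suffer from batch effects (this dataset does), cluster average expression can be used instead of estimating signatures with a model. When this plot is very different from a diagonal plot (e.g. very low values on the Y-axis, density everywhere), it indicates problems with signature estimation.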
mod.plot_QC()
The model and output h5ad can be loaded later like this:
mod = cell2location.models.RegressionModel.load(f"{ref_run_name}", adata_ref)
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref = sc.read_h5ad(adata_file)
# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
    inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
else:
    inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']
inf_aver.iloc[0:5, 0:5]
Cell2location: spatial mapping
# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, intersect].copy()
inf_aver = inf_aver.loc[intersect, :].copy()
# prepare anndata for cell2location model
scvi.data.setup_anndata(adata=adata_vis, batch_key="sample")
scvi.data.view_anndata_setup(adata_vis)
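The gene-matching step above is worth a closer look: np.intersect1d returns the shared IDs in sorted order, and indexing both objects with that same array is what guarantees the genes line up. Here is a toy sketch with made-up ENSEMBL-like IDs and signature values, using a pandas DataFrame in place of inf_aver.

```python
import numpy as np
import pandas as pd

# Hypothetical gene IDs in the spatial data and in the reference signatures.
spatial_genes = ["ENSG3", "ENSG1", "ENSG4", "ENSG2"]
signature_genes = ["ENSG2", "ENSG5", "ENSG1"]
signatures = pd.DataFrame(
    {"B_naive": [1.0, 0.2, 3.0], "FDC": [0.1, 2.0, 0.5]},
    index=signature_genes,
)

# Shared genes, returned sorted and without duplicates.
shared = np.intersect1d(spatial_genes, signature_genes)
print(shared)

# Subset (and reorder) the signature table with the same array
# that would be used to subset the spatial data.
signatures_shared = signatures.loc[shared, :].copy()
print(signatures_shared)
```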
Note! While you can often use the default value of the detection_alpha hyperparameter, it is useful to adapt the expected cell abundance N_cells_per_location to every tissue. This value can be estimated from paired histology images, as described in the note above. Change the value presented in this tutorial (N_cells_per_location=30) to the value observed in your tissue.
# create and train the model
mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver,
    # the expected average cell abundance: tissue-dependent
    # hyper-prior which can be estimated from paired histology:
    N_cells_per_location=30,
    # hyperparameter controlling normalisation of
    # within-experiment variation in RNA detection (using default here):
    detection_alpha=200
)
mod.train(max_epochs=30000,
          # train using full data (batch_size=None)
          batch_size=None,
          # use all data points in training because
          # we need to estimate cell abundance at all locations
          train_size=1,
          use_gpu=True)
# plot ELBO loss history during training, removing first 1000 epochs from the plot
mod.plot_history(1000)
plt.legend(labels=['full data training']);
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
    adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'use_gpu': True}
)
# Save model
mod.save(f"{run_name}", overwrite=True)
# mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
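"Summary of the posterior distribution" can sound abstract, so here is a NumPy sketch of the idea: given many posterior draws of cell abundance, take the mean and the 5% and 95% quantiles per location and cell type. The samples below are random numbers from a gamma distribution that I chose purely for illustration, not real model output, and the array layout is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior samples: 1000 draws x 4 locations x 2 cell types.
samples = rng.gamma(shape=5.0, scale=1.0, size=(1000, 4, 2))

# Summaries over the draws (axis 0), one value per location and cell type.
q05 = np.quantile(samples, 0.05, axis=0)   # "at least this amount is present"
mean = samples.mean(axis=0)
q95 = np.quantile(samples, 0.95, axis=0)

print(q05.shape)  # (4, 2)
```

The 5% quantile is a conservative summary: it is only large when nearly all posterior draws agree that the cell type is present, which is why the plotting section below uses q05_cell_abundance_w_sf.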
Visualising cell abundance in spatial coordinates
# add 5% quantile, representing confident cell abundance, 'at least this amount is present',
# to adata.obs with nice names for plotting
adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']
# select one slide
from cell2location.utils import select_slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black',
                     'figure.figsize': [4.5, 5]}):
    sc.pl.spatial(slide, cmap='magma',
                  # show first 8 cell types
                  color=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC',
                         'B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo'],
                  ncols=4, size=1.3,
                  img_key='hires',
                  # limit color scale at 99.2% quantile of cell abundance
                  vmin=0, vmax='p99.2'
                 )
# Now we use cell2location plotter that allows showing multiple cell types in one panel
from cell2location.plt import plot_spatial
# select up to 6 clusters
clust_labels = ['T_CD4+_naive', 'B_naive', 'FDC']
clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(
        adata=slide,
        # labels to show on a plot
        color=clust_col, labels=clust_labels,
        show_img=True,
        # 'fast' (white background) or 'dark_background'
        style='fast',
        # limit color scale at 99.2% quantile of cell abundance
        max_color_quantile=0.992,
        # size of locations (adjust depending on figure size)
        circle_diameter=6,
        colorbar_position='right'
    )
Downstream analysis
Identifying discrete tissue regions by Leiden clustering
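We identify tissue regions that differ in their cell composition by clustering locations using the cell abundance estimated by cell2location.
We find tissue regions by clustering Visium spots using the estimated abundance of each cell type. We construct a K-nearest neighbour (KNN) graph representing the similarity of locations in estimated cell abundance and then apply Leiden clustering. The number of KNN neighbours should be adapted to the size of the dataset and the size of anatomically defined regions (e.g. hippocampal regions are rather small and so could be masked by a large n_neighbors). This can be done for a range of KNN neighbours and Leiden clustering resolutions until a clustering that matches the anatomical structure of the tissue is obtained.
Clustering is done jointly across all Visium sections/batches, so the region identities are directly comparable. When there are strong technical effects between multiple batches (not the case here), sc.external.pp.bbknn can in principle be used to account for those effects during KNN construction.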
The resulting clusters are saved in adata_vis.obs['region_cluster'].
# compute KNN using the cell2location output stored in adata.obsm
sc.pp.neighbors(adata_vis, use_rep='q05_cell_abundance_w_sf',
n_neighbors = 15)
# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1.1)
# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"].astype("category")
# compute UMAP using KNN graph based on the cell2location output
sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)
# show regions in UMAP coordinates
with mpl.rc_context({'axes.facecolor': 'white',
                     'figure.figsize': [8, 8]}):
    sc.pl.umap(adata_vis, color=['region_cluster'], size=30,
               color_map = 'RdPu', ncols = 2, legend_loc='on data',
               legend_fontsize=20)
    sc.pl.umap(adata_vis, color=['sample'], size=30,
               color_map = 'RdPu', ncols = 2,
               legend_fontsize=20)
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black',
                     'figure.figsize': [4.5, 5]}):
    sc.pl.spatial(adata_vis, color=['region_cluster'],
                  size=1.3, img_key='hires', alpha=0.5)
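To build intuition for the KNN step, here is a brute-force sketch in NumPy that finds, for each location, the locations with the most similar cell abundance profile. The helper knn_indices and the toy abundance matrix are mine; scanpy's sc.pp.neighbors uses an approximate search and builds a weighted graph, so treat this only as an illustration of what "nearest neighbours in abundance space" means.

```python
import numpy as np

def knn_indices(abundance, n_neighbors):
    """Indices of the n_neighbors nearest rows (Euclidean distance), excluding self."""
    diff = abundance[:, None, :] - abundance[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # a location is not its own neighbour
    return np.argsort(dist, axis=1)[:, :n_neighbors]

# Toy abundance matrix: 4 locations x 2 cell types (made-up values).
# Locations 0 and 1 are dominated by the first cell type, 2 and 3 by the second.
abundance = np.array([
    [10.0, 0.5],
    [9.0, 1.0],
    [0.5, 8.0],
    [1.0, 9.0],
])
neighbours = knn_indices(abundance, n_neighbors=1)
print(neighbours)
```

With a larger n_neighbors, small groups of similar locations get linked to their bigger neighbours, which is the reason small anatomical regions can be masked when n_neighbors is too large.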
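Identifying cellular compartments / tissue zones using matrix factorisation (NMF) (this part appears to be new content)
Here, we use the cell2location mapping results to identify the spatial co-occurrence of cell types in order to better understand tissue organisation and predict cellular interactions. We performed non-negative matrix factorisation (NMF) of the cell type abundance estimates from cell2location. Similar to the established benefits of applying NMF to conventional scRNA-seq, the additive NMF decomposition yields a set of spatial cell type abundance profiles, grouped into components that capture co-localised cell types. This NMF-based decomposition naturally accounts for the fact that multiple cell types and microenvironments can co-exist at the same Visium locations, while sharing information across tissue areas (e.g. individual germinal centres).
Tip: In practice, it is better to train NMF for a range of factors (R={5, .., 30}) and select (R) as a balance between capturing fine tissue zones and splitting known compartments. If you want to find a few most distinct cellular compartments, use a small number of factors. If you want to find very strong co-location signal and assume that most cell types do not co-locate, use a lot of factors (> 30, used here).
Below we show how to perform this analysis. To aid this analysis, we wrapped the analysis shown in the notebook on advanced downstream analysis into a pipeline that automates training of the NMF model with varying number of factors: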
from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
    adata_vis,
    model_name='CoLocatedGroupsSklearnNMF',
    train_args={
      'n_fact': np.arange(11, 13), # IMPORTANT: use a wider range of the number of factors (5-30)
      'sample_name_col': 'sample', # columns in adata_vis.obs that identifies sample
      'n_restarts': 3 # number of training restarts
    },
    export_args={'path': f'{run_name}/CoLocatedComb/'}
)
For every factor number, the model produces the following list of folder outputs:
- cell_type_fractions_heatmap/: a dot plot of the estimated NMF weights of cell types (rows) across NMF components (columns)
- cell_type_fractions_mean/: the data used for dot plot
- factor_markers/: tables listing top 10 cell types most specific to each NMF factor
- models/: saved NMF models
- predictive_accuracy/: 2D histogram plot showing how well NMF explains cell2location output
- spatial/: NMF weights across locations in spatial coordinates
- location_factors_mean/: the data used for the plot in spatial coordinates
- stability_plots/: stability of NMF weights between training restarts
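The key output to examine is the files in cell_type_fractions_heatmap/, which show a dot plot of the estimated NMF weights of cell types (rows) across NMF components (columns), where the components correspond to cellular compartments. Shown are relative weights, normalised across components for every cell type.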
# Here we plot the NMF weights (Same as saved to `cell_type_fractions_heatmap`)
res_dict['n_fact12']['mod'].plot_cell_type_loadings()
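If NMF itself feels like a black box, the following NumPy sketch shows the basic idea on a toy abundance matrix, using the classic multiplicative updates for the squared error. This is a swapped-in textbook algorithm, not what CoLocatedGroupsSklearnNMF runs internally; the function nmf, the toy matrices and the final normalisation are my own illustration.

```python
import numpy as np

def nmf(V, n_components, n_iter=500, seed=0, eps=1e-9):
    """Factorise V (locations x cell types) as W @ H with non-negative W and H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_components)) + eps
    H = rng.random((n_components, V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data: 5 locations, 3 cell types, built from 2 made-up compartments.
# Cell types 0 and 1 always co-occur; cell type 2 lives in its own compartment.
true_W = np.array([[5, 0], [4, 0], [0, 6], [0, 3], [2, 2]], dtype=float)
true_H = np.array([[1, 1, 0], [0, 0, 1]], dtype=float)
V = true_W @ true_H

W, H = nmf(V, n_components=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(rel_err)

# Relative weights: normalise each cell type's loadings across components,
# similar in spirit to what the dot plot displays.
rel_H = H / H.sum(axis=0, keepdims=True)
print(rel_H.round(2))
```

In the normalised loadings, co-located cell types end up with most of their weight on the same component, which is how the components can be read as cellular compartments.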
Advanced use
Estimate cell-type specific expression of every gene in the spatial data
For this, we adapt the approach of estimating conditional expected expression proposed by the RCTD method (Cable et al.). With cell2location, we can look at the posterior distribution rather than just point estimates of cell type specific expression (see mod.samples.keys() and the next section on using the full distribution).
# Compute expected expression per cell type
expected_dict = mod.module.model.compute_expected_per_cell_type(
    mod.samples["post_sample_q05"], mod.adata
)
# Add to anndata layers
for i, n in enumerate(mod.factor_names_):
    adata_vis.layers[n] = expected_dict['mu'][i]
# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file
# Look at cell type specific expression in spatial coordinates,
# Here we highlight CD3D, pan T-cell marker expressed by
# 2 subtypes of T cells in distinct locations but not expressed by co-located B cells
ctypes = ['T_CD4+_TfH_GC', 'T_CD4+_naive', 'B_GC_LZ']
genes = ['CD3D', 'CR2']
with mpl.rc_context({'axes.facecolor': 'black'}):
    # select one slide
    slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
    from tutorial_utils import plot_genes_per_cell_type
    plot_genes_per_cell_type(slide, genes, ctypes);
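The conditional expected expression idea can be shown with a few lines of NumPy: for one gene, the observed count at each location is split among cell types in proportion to each type's expected contribution (abundance times signature expression). This sketches the general RCTD-style idea as I understand it; all numbers are made up, and compute_expected_per_cell_type in cell2location is an adaptation of this approach, so it may compute something different.

```python
import numpy as np

# Made-up inputs for a single gene.
w = np.array([[4.0, 1.0],      # estimated abundance: 3 locations x 2 cell types
              [0.0, 5.0],
              [2.0, 2.0]])
g = np.array([3.0, 0.5])       # signature expression of the gene in each cell type
d = np.array([25.0, 10.0, 14.0])  # observed counts of the gene at each location

# Expected contribution of each cell type at each location.
contrib = w * g
# Share of the location's expression attributed to each cell type.
share = contrib / contrib.sum(axis=1, keepdims=True)
# Observed counts split by cell type; rows still sum to the observed counts.
per_type = d[:, None] * share
print(per_type)
```

At the first location the gene is attributed almost entirely to the first cell type, because that type is both more abundant there and expresses the gene more strongly.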
I just wanted to revisit this. Of course, the software has been updated a lot since then.
Life is good, and even better with you.