10X single-cell and spatial joint analysis: handling unmatched single-cell data

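It's Wednesday again, the prime time of the week. Between the pandemic, my parents being in quarantine, and work, I've been writing less lately. All I hope for now is that the pandemic ends soon and stops holding me back. Lying flat is nice, but lie flat for too long and you go to waste. One last question for everyone: what do you think of BGI (华大基因)?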

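Today we look at a very common problem. In joint single-cell and spatial analysis we often cannot get matched single-cell and spatial data: usually only the spatial data are at hand, and the single-cell data have to be found in other publications. In most such cases the two datasets are not matched, and forcing a joint analysis can cause problems. How should this be handled? Here I introduce one method for your reference. The reference article is "spSeudoMap: Cell type mapping of spatial transcriptomics using unmatched single-cell RNA-seq data".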

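Spatial transcriptomics has been widely used as a tool to explore genome-wide spatial RNA expression in various tissues. It has paved the way for thoroughly studying the spatial context of cells and their interactions. One limitation of spatial transcriptomics data is that a spot cannot be interpreted directly as a cell. Multiple computational methods have therefore been proposed to localize cell types accurately by integrating spatial and single-cell transcriptomics. They can further be used to infer the spatial infiltration patterns of cell types that play key roles in the pathophysiology of various diseases. In this setting, certain cell populations, such as immune cell subtypes, can be characterized in detail by jointly analyzing spatial data and single-cell RNA sequencing (scRNA-seq) data obtained with cell sorting strategies based on cell-surface markers. However, there is one major drawback in the practical use of spatial mapping methods that rely on scRNA-seq data. Most methods assume that the cell types in the two transcriptomes are similar, and they use cell-type-specific signatures defined from the single-cell data to decipher the spatial cell composition. Consequently, when sorted scRNA-seq data are used as the reference, the data explain only part of the cell types in the spatial data, and the integration methods produce biased estimates of cell fractions. A computational model is needed that flexibly integrates scRNA-seq data of cell subpopulations with spatial transcriptomic data.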

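spSeudoMap builds on the joint single-cell and spatial approach of CellDART (see my earlier post, "10X单细胞-10X空间转录组联合分析之七----CellDART"). More specifically, the cell types that exist only in the spatial data are defined and called the pseudotype. The fraction and expression profile of the pseudotype in each cell mixture are then assigned with reference to the spatial and single-cell transcriptomes. The rest of the mixture is filled with cells randomly sampled from the single-cell data. Finally, the modified cell mixtures and the spatial transcriptomic data are analyzed jointly to obtain spatial maps of the cell subpopulations.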

spSeudoMap: spatial mapping of the cell subpopulation transcriptome

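To estimate spatial maps of cell types from subpopulation single-cell data (for example, sorted or enriched single-cell datasets), synthetic cell mixtures that contain all cell types in the spatial data are defined. The aim is to create a reference dataset that closely resembles the spatial transcriptome. For each mixture, the proportion of the cell types exclusive to the spatial data is assigned and their synthetic gene expression profile is created. The rest of the cell mixture is generated from the single-cell data. The procedure is implemented with Scanpy (v.1.5.1) and NumPy in Python (v.3.7).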

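Cell types in a subpopulation single-cell transcriptome obtained from a cell sorting experiment can be spatially mapped to the tissue with spSeudoMap. The subpopulation single-cell data consist of cells sorted from the tissue, so their cell types cover only part of the cells in the spatial transcriptome. To create a reference dataset that mimics the spatial data, virtual cell mixtures that include all cell types from the tissue are defined. First, the cell types present only in the spatial data are aggregated and named the pseudotype. Virtual markers for the pseudotype are selected from the top genes that are highly expressed in the spatial pseudobulk compared with the single-cell pseudobulk. Then the pseudotype fraction in each spatial spot is estimated from the module score of the top 20 pseudotype markers (sc.tl.score_genes in Scanpy). The fraction and gene expression of the pseudotype are assigned based on the presumed pseudotype fraction and the expression of a randomly selected spatial spot. Next, the non-pseudotype proportion of each modified pseudospot is filled with cells randomly sampled from the subpopulation single-cell data. Finally, the modified pseudospots serve as the reference dataset for the domain adaptation method CellDART.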

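First, a fixed number of cells (n; brain: 8, breast: 10) are randomly sampled from the single-cell data together with their cell type annotations, and combined with random weights to create cell mixtures called pseudospots, as in CellDART. A Wilcoxon rank-sum test is performed and the top $l$ markers of each cell type are pooled. The marker panel is curated by keeping the genes that intersect with the full gene list of the spatial data (total number of genes: $m$). For each pseudospot, the composite gene expression profile of the marker panel is computed.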
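The pseudospot step can be sketched in a few lines of NumPy. This is only a minimal illustration, not the spSeudoMap or CellDART code: the function name `make_pseudospots`, sampling with replacement, and normalizing the random weights to sum to 1 are all my assumptions.

```python
import numpy as np

def make_pseudospots(expr, labels, n_types, n_pseudo, n_mix, seed=0):
    """Mix n_mix randomly chosen cells, with random weights, into each pseudospot.

    expr:   (cells x genes) expression of the marker panel
    labels: integer cell-type index of every cell
    Returns the mixed expression (n_pseudo x genes) and the
    cell-type fractions (n_pseudo x n_types).
    """
    rng = np.random.default_rng(seed)
    # Assumption: cells are drawn with replacement
    idx = rng.integers(0, expr.shape[0], size=(n_pseudo, n_mix))
    # Assumption: random weights are normalized to sum to 1 per pseudospot
    w = rng.random((n_pseudo, n_mix))
    w /= w.sum(axis=1, keepdims=True)
    # Weighted sum of the sampled cells' expression
    mixed = np.einsum('pm,pmg->pg', w, expr[idx])
    # Fraction of each cell type = total weight of the cells of that type
    frac = np.zeros((n_pseudo, n_types))
    for t in range(n_types):
        frac[:, t] = (w * (labels[idx] == t)).sum(axis=1)
    return mixed, frac

# Toy data: 50 cells, 6 marker genes, 3 cell types, 8 cells per pseudospot
expr = np.random.default_rng(1).poisson(2.0, size=(50, 6)).astype(float)
labels = np.arange(50) % 3
mixed, frac = make_pseudospots(expr, labels, n_types=3, n_pseudo=20, n_mix=8)
```

Each row of `frac` sums to 1, which is what lets the mixture fractions act as training labels for the spatial spots.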

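Next, the cell types that are present in the spatial data but absent from the single-cell data are aggregated, and the aggregate is named the "pseudotype". The pseudotype markers are inferred from a pseudobulk analysis of both transcriptomes: genes with higher pseudobulk expression in the spatial data than in the single-cell data are assumed to be pseudotype markers. The total count of each gene is therefore divided by the total count over all genes, and the normalized counts are compared between the two datasets. The ratio of normalized counts between spatial and single-cell data is log2-transformed, and genes are sorted in descending order of log fold change. The top $k$ genes that do not overlap with the single-cell cluster markers are selected as pseudotype markers, and the top 20 are used as predictors of the pseudotype fraction. The single-cell and pseudotype markers are combined and named the "composite marker panel".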
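The pseudobulk comparison described above might look roughly like this. It is a sketch under my own assumptions: the function name `pseudotype_markers` is hypothetical, and the small pseudocount `eps` (to avoid dividing by zero) is not mentioned in the text.

```python
import numpy as np

def pseudotype_markers(sp_counts, sc_counts, genes, sc_markers, k, eps=1e-9):
    """Rank genes by spatial-vs-single-cell pseudobulk log2 fold change.

    sp_counts, sc_counts: (observations x genes) raw counts sharing the same gene order
    sc_markers: single-cell cluster markers to exclude
    """
    # Pseudobulk: total count of each gene divided by the total count of all genes
    sp_bulk = sp_counts.sum(axis=0) / sp_counts.sum()
    sc_bulk = sc_counts.sum(axis=0) / sc_counts.sum()
    # eps is an assumed pseudocount, not part of the described method
    log2fc = np.log2((sp_bulk + eps) / (sc_bulk + eps))
    order = np.argsort(-log2fc)  # descending log fold change
    excluded = set(sc_markers)
    ranked = [genes[i] for i in order if genes[i] not in excluded]
    return ranked[:k]

# Toy example: gene A is enriched in the spatial data, gene B in the single-cell data
genes = ['A', 'B', 'C', 'D']
sp_counts = np.array([[8, 1, 6, 5], [8, 1, 6, 5]])
sc_counts = np.array([[1, 8, 6, 5]])
top = pseudotype_markers(sp_counts, sc_counts, genes, sc_markers=['B'], k=1)
```

With these toy counts, gene A has the largest fold change and is not a single-cell marker, so it is the one selected.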

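The pseudotype fraction in the spatial data is then estimated by computing an enrichment score for the top 20 pseudotype markers (scanpy.tl.score_genes in Scanpy). The genes in the spatial data are divided into 25 bins according to their log-normalized expression level. For each marker gene, a total of 50 control genes are selected from the same bin and pooled. The enrichment score is the average expression of the pseudotype markers minus the average expression of the control gene pool. The distribution of the resulting module scores is scaled to a given mean (pseudof_m) and standard deviation (pseudof_std). Values greater than 1 or less than 0 are replaced with 1 and 0, respectively. Assuming a linear relationship between the two, the scaled module score is taken as the pseudotype fraction of each spatial spot.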
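The rescaling and clipping of the module score is easy to illustrate on its own. In this sketch the scores would come from `sc.tl.score_genes` (whose defaults `ctrl_size=50` and `n_bins=25` match the numbers quoted above); the helper name `scale_to_fraction` and the use of the population standard deviation are my assumptions, and the scores are assumed not to be all identical.

```python
import numpy as np

def scale_to_fraction(score, mean, std):
    """Rescale module scores to a target mean/std, then clip to [0, 1]."""
    # Standardize (assumes score.std() > 0), then map to the target distribution
    z = (score - score.mean()) / score.std()
    # Values above 1 or below 0 are replaced with 1 and 0
    return np.clip(z * std + mean, 0.0, 1.0)

# Toy module scores for five spots
scores = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
frac = scale_to_fraction(scores, mean=0.47, std=0.1)
# A high target mean pushes the largest value past 1, so it gets clipped
frac_clipped = scale_to_fraction(scores, mean=0.9, std=0.1)
```

Note that once clipping kicks in, the realized mean and standard deviation drift slightly from the requested values.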

To create a reference dataset, each pseudospot is combined with the pseudotype of a randomly chosen spot, and the combined gene expression for the composite marker panel is calculated.

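The resulting cell mixture is called the modified pseudospot. Because the pseudotype markers are selected by log fold change in the pseudobulk approach, the expression of pseudotype markers in the pseudotype is expected to be substantially higher than that of the single-cell markers. In addition, for simplicity, the expression of every pseudotype marker in a spot is assumed to be proportional to its expression in the pseudotype of that spot. The single-cell marker expression of the pseudotype is therefore set to 0, and the pseudotype marker expression is assigned by multiplying the normalized counts of the selected spot by the presumed pseudotype fraction.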

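Finally, the gene expression profiles of the pseudotype and the pseudospot are added together, weighted by the pseudotype and non-pseudotype proportions respectively, to obtain the integrated expression of the modified pseudospot.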
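The final weighted sum can be written down directly. The sketch below takes the pseudotype profile as an input that has already been built, and only shows the combination step. The function name `combine` is hypothetical, and rescaling the cell-type fractions by the non-pseudotype proportion (so that all fractions still sum to 1) is my inference rather than something stated in the text.

```python
import numpy as np

def combine(pseudospot_expr, pseudotype_expr, pseudospot_frac, pseudo_frac):
    """Weighted sum of a pseudospot and a pseudotype profile.

    pseudo_frac: presumed pseudotype fraction of the randomly chosen spot
    """
    # Expression: pseudotype weighted by f, pseudospot weighted by (1 - f)
    expr = pseudo_frac * pseudotype_expr + (1.0 - pseudo_frac) * pseudospot_expr
    # Assumption: cell-type fractions shrink by (1 - f); the pseudotype takes f
    frac = np.append((1.0 - pseudo_frac) * pseudospot_frac, pseudo_frac)
    return expr, frac

# Panel of two single-cell markers followed by one pseudotype marker
pseudospot_expr = np.array([4.0, 2.0, 0.0])   # from sampled single cells
pseudotype_expr = np.array([0.0, 0.0, 10.0])  # single-cell markers set to 0
expr, frac = combine(pseudospot_expr, pseudotype_expr,
                     pseudospot_frac=np.array([0.25, 0.75]), pseudo_frac=0.4)
```

Here the modified pseudospot ends up with two single-cell type fractions plus one extra entry for the pseudotype.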

Exploration of an optimal parameter range

The key parameters for spSeudoMap are the number of markers per single-cell cluster (N_markers), the ratio of the total number of single-cell markers to pseudotype markers ($m/k$ ratio), and the mean and standard deviation of the pseudotype fraction in spatial spots (pseudof_m and pseudof_std). The performance of spSeudoMap was tested across various parameter values in the human DLPFC dataset (slide number: 151676). Since the proportion of the 10 layer-specific excitatory neurons was 0.53 among the single-cell data, pseudof_m was set to 0.47. The cortical layer annotation in the spatial data was used as a reference, and the layer discriminative accuracy of the predicted neuronal fraction was assessed by the area under the receiver operating characteristic curve (AUROC). In general, spSeudoMap stably predicted the spatial distribution of neuron subpopulations, with a median AUROC above 0.5, when N_markers was larger than 20, $m/k$ larger than 1, and pseudof_std larger than 0.05. The corresponding parameter ranges were selected for the downstream analyses. For the human brain (slide 151673) and mouse brain tissues, N_markers was set to 80, the $m/k$ ratio to 4, and pseudof_std to 0.1. In human breast cancer tissue, N_markers was set to 40, the $m/k$ ratio to 2, and pseudof_std to 0.1. Other parameters for the domain adaptation followed the user guidelines of CellDART.
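For readers who want to run this kind of check on their own data, here is a small NumPy-only AUROC sketch. It scores how well a predicted cell fraction separates the spots annotated to one layer from all other spots. The function name `auroc` and the toy numbers are mine, and how the authors grouped layers and cell types for their evaluation is not described here.

```python
import numpy as np

def auroc(scores, is_positive):
    """AUROC as the probability that a positive spot outscores a negative one."""
    pos = scores[is_positive]
    neg = scores[~is_positive]
    # Compare every positive with every negative; ties count as half
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)

# Toy example: predicted fraction of one neuron type in four spots
in_layer = np.array([True, True, False, False])
auc_good = auroc(np.array([0.9, 0.8, 0.3, 0.2]), in_layer)    # perfect separation
auc_chance = auroc(np.array([0.9, 0.2, 0.8, 0.3]), in_layer)  # no better than chance
```

The pairwise comparison uses memory proportional to positives times negatives, which should be fine for a single Visium slide but is worth keeping in mind for larger inputs.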

Finally, let's look at the example code.

Setup

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0" # run on GPU
#run on CPU: os.environ["CUDA_VISIBLE_DEVICES"]="-1"

import scanpy as sc
import pandas as pd
import numpy as np
import seaborn as sns

Load example spatial data (10X Genomics: V1_Adult_Mouse_Brain_Coronal_Section_1)

sc.set_figure_params(facecolor="white", figsize=(8, 8))
sc.settings.verbosity = 3
if not os.path.exists('./data'): os.mkdir('./data')

# adata_spatial = sc.datasets.visium_sge(
#     sample_id="V1_Adult_Mouse_Brain_Coronal_Section_1"
# )
adata_spatial = sc.read_visium('./data/V1_Adult_Mouse_Brain_Coronal_Section_1/')

Load single-nucleus data (Mouse coronal section: Sanger institute)

import urllib.request as req

url = 'https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/regression_model/RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes/sc.h5ad'
req.urlretrieve(url, './data/adata_single.h5ad')
adata_single = sc.read('./data/adata_single.h5ad', cache=True)
adata_single = adata_single.raw.to_adata()

# Assign SYMBOL names to the index
adata_single.var.index = adata_single.var['SYMBOL']

Simulation of the subpopulation single-cell dataset: select the excitatory neuron types

adata_sc_sub = adata_single[adata_single.obs['annotation_1'].str.contains('Ext_')].copy()

Preparation of the parameters for the training

Optimal parameter choices
Number of marker genes per cluster: 40 (>20)
m/k ratio = 2 (> 1)
pseudo_frac_m = average fraction of negative non-sorted population (literature evidence or cell sorting experiment)
pseudo_frac_std = 0.1 (> 0.05)
Number of pseudospots = 5 to 10 times the number of real spots (20,000~40,000 per Visium slide)
Number of sampled cells in a pseudospot (virtual mixture of single-cell data) = 8 (brain), 10 (breast cancer)
Iteration number = 3,000
Mini-batch size = 512
Loss weights between source and domain classifier (alpha) = 0.6
Learning rate = 0.001 * alpha_lr = 0.005

# column name for single-cell annotation data in metadata(.obs)
celltype = 'annotation_1'
# number of selected marker genes in each cell-type
num_markers=40
# Total number of cell mixture (modified pseudospots) to generate
npseudo = adata_spatial.shape[0]*5
npseudo
# ratio of number of single-cell markers to virtual pseudotype markers
mk_ratio = 4
# Average of presumed fraction of the pseudotype (cell types exclusively present in spatial data) across all spatial spots
# -> Presumed average non-excitatory neuron fraction from simulation dataset
pseudo_frac_m = 1 - ((adata_sc_sub.shape[0])/(adata_single.shape[0]))
pseudo_frac_m
# pseudo_frac_std: standard deviation of the distribution of presumed pseudotype fraction across all spatial spots (default = 0.1)
pseudo_frac_std = 0.1
# Number of cells sampled from single-cell data when making a pseudospot
nmix = 8
# Output directory to save models and results
out_dir = './output'

######### Run spSeudoMap

from spSeudoMap.pred_cellf_spSeudoMap import pred_cellf_spSeudoMap
adata_spatial_cellf = pred_cellf_spSeudoMap(adata_sp=adata_spatial, adata_sc=adata_sc_sub, count_from_raw=False, 
                                            gpu=True, celltype=celltype, num_markers=num_markers,
                                            mixture_mode='pseudotype', seed_num=0, 
                                            mk_ratio_fix=True, mk_ratio=mk_ratio,
                                            pseudo_frac_m=pseudo_frac_m, pseudo_frac_std=pseudo_frac_std,
                                            nmix=nmix, npseudo=npseudo, alpha=0.6, alpha_lr=5, emb_dim=64, 
                                            batch_size=512, n_iterations=3000, init_train_epoch=10, 
                                            outdir=out_dir, return_format='anndata')

######### Load AnnData with predicted cell fraction data

adata = sc.read_h5ad(os.path.join(out_dir,'model','sp_data.h5ad'))
# Spatial composition of cell types
cell_types_interest = ['Ext_L25','Ext_L23','Ext_L56','Ext_L5_1',
                       'Ext_L6B','Ext_Hpc_CA1','Ext_Hpc_CA3',
                       'Ext_Hpc_DG1','Ext_Thal_1','Ext_Pir','Ext_Amy_1','Others']

for cell_type in cell_types_interest:
    sc.pl.spatial(adata, color=cell_type+'_cellf', size=1, color_map='jet')

Life is good, and even better with you.

Reposting is prohibited. To repost, please contact the author by direct message or comment.
