STAMP使用小技巧 | 空转的可解释性空间感知降维

STAMP (Spatial Transcriptomics Analysis with topic Modeling to uncover spatial Patterns) 是一种可解释的空间感知降维方法，专为空间转录组学数据设计。它结合了深度生成模型和主题建模 (topic modeling)，能够在保留空间信息的前提下，提取出生物学上有意义的低维“空间主题”和对应的基因模块。软件可以分析从单个样本到多个样本、多种技术平台到时间序列的各种数据，并能够返回与已知生物学功能相匹配的主题，以及相关的基因模块，其中包含高排名且已被验证的markers。

STAMP基于深度变分推断框架，在prodLDA (一种可解释的主题模型) 核心的基础上加以扩展，使用简化图卷积网络 (SGCN)，将空间邻接信息作为输入，预测主题比例，使得相邻细胞更可能有相似的主题组成。软件通过结构化 (horseshoe prior) 确保：每个基因只参与少数几个主题 (基因模块稀疏)；每个主题只包含少数关键基因 (主题内稀疏)。由此增强主题可解释性，避免冗余基因干扰主题定义。最终，输出细胞中各个主题的概率 (细胞中所有主题概率和为1) 及相应基因模块中各基因的权重。

STAMP虽然是用python编写，但使用起来还是有门槛的，并不是说代码相对复杂，而是硬件要求比较高，需要GPU加速，否则运行速度有点赶人。https://github.com/JinmiaoChenLab/scTM

import pandas as pd
import scanpy as sc
import sctm
import squidpy as sq

adata = sc.read_h5ad("adata.h5ad")
adata
AnnData object with n_obs × n_vars = 1296 × 500
    obs: 'n_counts', 'n_genes', 'percent_mito'
    uns: 'log1p', 'spatial_neighbors'
    obsm: 'nsfac', 'spatial', 'spfac'
    layers: 'counts'
    obsp: 'spatial_connectivities', 'spatial_distances'

sctm.seed.seed_everything(0)
sq.gr.spatial_neighbors(adata, n_neighs=8)
model = sctm.stamp.STAMP(adata, 5, layer = "counts", gene_likelihood ='nb')
model.train(learning_rate = 0.01, batch_size = 1296, shuffle = True, min_epochs = 800)

topic_prop = model.get_cell_by_topic()
topic_prop
        Topic1    Topic2    Topic3    Topic4    Topic5
529   0.305135  0.393314  0.234077  0.025817  0.041657
478   0.227322  0.531160  0.199549  0.015966  0.026004
234   0.267190  0.677573  0.019460  0.012246  0.023531
64    0.048493  0.612150  0.028875  0.018580  0.291902
1087  0.071748  0.345348  0.215291  0.029976  0.337637
...        ...       ...       ...       ...       ...
956   0.182585  0.749644  0.024070  0.014768  0.028933
734   0.015100  0.956518  0.009517  0.005950  0.012915
1104  0.050223  0.286376  0.291203  0.022936  0.349261
398   0.032253  0.647701  0.019077  0.013508  0.287462
1190  0.044051  0.288146  0.025481  0.355528  0.286794

beta = model.get_feature_by_topic()
beta
       Topic1    Topic2    Topic3    Topic4    Topic5
0   -0.693157 -0.703132 -0.693157 -0.693112 -0.693158
1    5.983160 -6.095079  4.213624 -4.866315 -3.211226
2    5.704674 -4.986088 -0.916948  4.661296 -4.170444
3   -6.008399 -3.022099 -5.358349 -5.072857  6.178818
4    1.000405  4.313294 -4.824944 -4.764755 -5.297919
..        ...       ...       ...       ...       ...
495  0.933249  3.708975 -3.150552 -2.673802  4.618324
496 -0.693034 -0.693114 -0.693115 -0.692851 -0.693901
497 -5.764855 -3.545012 -4.852396 -5.152513  6.229382
498 -0.693079 -0.693174 -0.692793 -0.693191 -0.693124
499 -0.693323 -0.689726 -0.693281 -0.693377 -0.693443

train这一步默认使用GPU，如果没有则会出现下面的错误提示：

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

当然，如果没有GPU可以加速计算，还是可以选择cpu模式：

model.train(learning_rate = 0.01, batch_size = 1296, shuffle = True, min_epochs = 800, device='cpu')
/software/lib/python3.8/site-packages/pyro/primitives.py:478: UserWarning: encoder.norm_topic.0.weight was not registered in the param store because requires_grad=False. You can silence this warning by calling my_module.train(False)
  warnings.warn(
Epoch Loss:802.026: 100%|███████████████████████████████████████| 800/800 [45:54<00:00,  3.44s/it]

不得不说，没有GPU的加持计算速度真的有点赶人，1296 × 500矩阵用时四十多分钟，这个测试数据量跟正常的单细胞样本相比少了很多，可想而知正常样本需要更多的时间。

此外，网络还需要给力能够访问www.immunesinglecell.org (单细胞组学数据库)，否则也会抛出下面的错误：

requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.immunesinglecell.org', port=80): Max retries exceeded with url: /api/vishuo/getToolkitUrl (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f33dd4fb820>: Failed to establish a new connection: [Errno -2] Name or service not known'))

软件还内置了以下五种富集方式，其中disco即将数据与上面的数据框相比较，看数据的富集情况，其他富集功能来自于GSEAPY包。

get_topic_disco()
get_topic_enrichr()
get_topic_gsea()
get_topic_ora()

有了结果后，可以很方便地查看模块的富集情况：

sctm.analysis.get_enrichr_geneset() # ["Human", "Mouse", "Yeast", "Fly", "Fish", "Worm"]
kegg_results = sctm.analysis.get_topic_enrichr(beta, geneset = "KEGG_2019_Human")
kegg_results.head()
    Gene_set    Term    Overlap P-value Adjusted P-value    Old P-value Old Adjusted P-value    Odds Ratio  Combined Score  Genes
0   KEGG_2019_Human Estrogen signaling pathway  4/137   0.000009    0.000103    0   0   37.306391   431.915099  KRT17;KRT15;KRT14;TFF1
1   KEGG_2019_Human IL-17 signaling pathway 3/93    0.000105    0.000576    0   0   39.000000   357.384969  S100A9;S100A8;S100A7
2   KEGG_2019_Human African trypanosomiasis 2/37    0.000620    0.002272    0   0   63.317460   467.693511  HBB;HBA2
3   KEGG_2019_Human Malaria 2/49    0.001086    0.002987    0   0   47.122931   321.618214  HBB;HBA2
4   KEGG_2019_Human Nicotine addiction  1/40    0.039267    0.086388    0   0   26.910931   87.120572   GABRP

STAMP使用小技巧 | 空转的可解释性空间感知降维

推荐阅读更多精彩内容