放一张效果图,划分区域相当不错,优于Leiden
Abstract
空间转录组学现在是一项成熟的技术,可以在复杂组织的组织学背景下分析基因表达的变化。典型分析工作流程从识别具有相似表达谱的组织区域开始,然后检测高度可变或空间可变的基因。空间转录组数据集规模和复杂性的快速增加要求以一致和集成的方式进行这些分析步骤,当前很多方法无法满足这一要求。为了解决这个问题,作者开发了SpatialDE2,它将组织区域的映射和空间高变基因检测统一为集成软件框架,同时推进这两个步骤的当前算法。该模型在贝叶斯框架中制定,考虑了泊松计数噪声,同时与以前的方法相比提供了卓越的计算速度。使用模拟数据验证了 SpatialDE2,并证明了其在实际中的运用价值。
Introduction
空间转录组学技术目前获得了极大的关注,因为其能够探索细胞局部环境中的细胞身份和功能。 在快速技术发展的推动下,空间转录组现在允许并行研究成百上千个基因。 基于多重成像的技术,如 SeqFISH+ 或 MERFISH,允许同时检测数百个基因参数,提供亚细胞分辨率。 基于 RNA 测序 (RNAseq) 的方法使用空间条形码引物,并允许对少数细胞进行分辨率并接近单细胞分辨率。 这些方法包括空间转录组学、HDST 和 Slide-seq,分辨率为少数细胞,接近单细胞分辨率。 特别是,空间转录组学的改进版本 10x Visium 由于其商业可用性和易用性而在社区中得到广泛采用(大家用到的空间转录组大部分都是10X公司的)。
虽然最初的研究集中在相对均匀的小组织切片中的概念验证应用,但这些技术越来越多地应用于具有不同结构和区域的复杂组织或器官,包括人脑、子宫内膜和肿瘤微环境。这些数据允许解决一系列不同的问题,空间组学分析工作流程的典型起点是识别空间可变基因。此步骤产生与下游分析最相关的基因,例如组织中局部生态位的定义支持细胞分化和功能(这也是我特别强调大家重视空间高变基因的原因),但关于空间变异性的知识本身通常可以提供生物学见解,例如变成癌症。然而,现有的空间可变基因检测方法通常考虑空间转录组学数据集中的整个视野。随着空间转录组学的视野不断扩大,并且这些方法被应用于由不同细胞类型组成的区域组成的日益复杂的组织,单纯地应用空间可变基因检测不再能产生相关的见解,因为已识别的空间可变基因组主要是包含并非固有空间可变的细胞类型标记。因此,空间方差分析需要结合合适的计算方法来识别组织区域(说的很有道理)。
空间可变基因检测和组织区域识别都在该领域受到关注。例如,SpatialDE1是最早检测空间可变基因的计算解决方案之一,最近已通过Leiden 模型对其进行了改进。两种模型都基于非参数高斯过程 (GP) 回归,SPARK 还提供基于计数的可能性和更强大的统计测试。同样基于 GP 回归的 SVCA 和基于距离加权非参数回归的 scHOT 扩展了空间变量基因检测的原理,通过考虑相邻细胞或voxels之间的相互作用,提供精度更高的空间基因表达变异分解。然而,所有这些方法的共同点是它们不能扩展到大型数据集,部分原因是受 CPU 限制的实现无法利用现代高度并行化的 GPU 架构,部分原因是算法效率低下。同样,存在一系列聚类方法,但它们不太适合识别空间基因表达数据中的组织区域。 Scanpy 和 Seurat 是两个广泛使用的 scRNA-seq 分析框架,推荐使用 Leiden 聚类从空间组学中识别组织区域,这是一种不知道空间关系的算法。 Giotto 为此任务提出了一种基于隐马尔可夫随机场的方法,从而强加了空间平滑性约束。然而,该模型要求用户预先指定组织区域的数量,并且它采用了对计数数据来说不是最佳的高斯似然模型。最后,注意到缺乏能够结合空间变量基因选择和组织区域识别的集成软件和工作流程。
基于这方面的确实,作者开发了SpatialDE2,一个用于建模空间转录组学数据的灵活框架。 SpatialDE2 实现了两个主要模块,这两个模块共同提供了用于分析空间转录组学数据的end-to-end工作流程:组织区域分割模块和用于检测空间可变基因的模块。与以前的方法相比,组织区域分割速度快,并提供了改进的可用性。特别地,该模块能够在采用适当的基于计数的可能性的同时自动确定组织区域的数量。用于检测空间可变基因的模块通过提供技术创新和计算加速扩展了先前的方法,例如 SpatialDE 和 SVCA。文章中使用模拟数据和对两个真实世界数据集的应用来验证 SpatialDE2 的分割和空间变量基因检测模块,其中包含小鼠大脑和人类子宫内膜的 10x Visium 数据。在这些应用中,软件展示了与以前的方法相比,两个模块的速度和鲁棒性都有所提高。 SpatialDE2 为评估组织区域内的空间表达异质性提供了一个集成的解决方案,从而为处理复杂组织和大样本提供了原则性的策略。
SpatialDE2算法原理
SpatialDE2 通过实现两个无缝集成的分析模块:组织区域分割模块和空间可变基因检测模块,实现了用于表征子区域空间异质性的end-to-end工作流程(下图)。
- 注:Top: SpatialDE2 accepts input spatial transcriptome profiles from a tissue sample, either in the form of a raw gene count matrix, or using cell type counts as provided from a computational deconvolution step (e.g. cell2location). Bottom: input data processing workflow.
这两个模块都直接对原始 mRNA 计数进行建模,这些计数是从多路分解的空间转录组学工作流程中获得的,或作为输入的成像技术。 可选地,SpatialDE2 还可以对从附加解卷积步骤获得的细胞计数估计值进行操作(这里用到的是cell2location)。
简而言之,空间组织区域分割模块基于贝叶斯隐马尔可夫随机场,将组织分割成不同的组织学区域,同时明确考虑相邻位置之间的空间平滑度(下图)
- 注:SpatialDE2 segments the input tissue into connected and transcriptionally similar regions. Top: Schematic output of the segmentation with colours denoting identified tissue regions. Bottom: implementation of the spatial segmentation based on a Poisson Hidden Markov Random field to encode the assumption of spatial smoothness. Nodes correspond to locations with colour denoting the region label. The number of regions is determined by the model.
与当前最广泛使用的基于Leiden聚类的算法相比,SpatialDE2方法的主要优势是双重的。 首先,与以前的方法不同,SpatialDE2 在分割步骤中明确说明了空间信息。 其次,SpatialDE2 为使用原则性贝叶斯方法实现的组织分割提供了一个连贯模型,该模型需要一个用户定义的参数,该参数对空间平滑度的先验假设进行编码。 相比之下, Leiden 聚类方法的结果会受到手动定义处理步骤相互作用的非直观影响,每个处理步骤都由多个参数决定。 Leiden聚类工作流包括原始数据归一化、降维、k-最近邻搜索和 Leiden 聚类算法本身的应用。
The spatially variable gene detection module models variance components of individual genes within identified regions using an appropriate count-based likelihood.。 虽然 SpatialDE2 建立在高斯过程回归的基础上,也被 SpatialDE、SVCA 和 SPARK 等方法使用,但该模型通过提供新颖的特征、改进的可扩展性、支持多个方差分量和 GPU 计算来概括这些方法。 核心部分,该模型将表达变异分解为不同的结构化成分和一个模拟随机变异性的噪声项(下图,我还是特别推荐有能力的童鞋多多研究数学算法,不然最根本的原理完全不明白)。
- 注:SpatialDE2 models spatial variance components of individual genes within tissue regions. Top: Schematic for the identification of spatial variance components of individual genes in specific tissue regions. Bottom: SpatialDE2 models expression variation within tissue regions by partitioning gene expression variation into one or multiple functional components (U1,..,Un). Each component is characterized by a covariance matrix that is parametrized by spatial or non-spatial covariates. The special case of spatially variable gene selection corresponds to a functional component parametrized by distance between locations.
根据一个或多个协方差矩阵的设计,可以将不同的测试形式化,以量化空间可变基因,或识别受细胞-细胞相互作用调节的基因。
最后,注意到 SpatialDE2 提供了下游分析工具来帮助解释。 这包括用于识别空间共变的基因。 AEH 将特定基因的表达建模为来自定义数量的平滑空间模式之一,并尝试使用贝叶斯框架估计模式和基因对模式的分配。
- 注:(D) Run time of SpatialDE2’s tissue region segmentation module and a Leiden clustering workflow for semi-synthetic dataset of increasing size and when using alternative compute environments. Clustering/segmentation was based on 2,000 genes. Leiden denotes the scanpy workflow. Only SpatialDE2 supports GPU computations. (E) Run time of SpatialDE2’s spatially variable gene selection module versus alternative methods. Considered is a dataset consisting of 200 genes for increasing numbers of locations and alternative computing environments. Only SpatialDE2 supports GPU computations.
Model validation using simulated data
首先,在无空间变量基因表达的零假设下使用模拟数据来确认空间变量基因检测模块的统计校准。 简而言之,与SpatialDE1和SPARK 类似,SpatialDE2 increases the sensitivity of its spatially variable gene detection by testing multiple kernel matrices for each gene. SpatialDE2 implements two strategies to estimate statistical significance: Each kernel matrix is tested separately and the p-values are combined using the Cauchy combination,or all kernel matrices are tested simultaneously using an omnibus test。 后一种选择更快,因为只进行了一次测试。 分析确认这两种策略都产生了校准结果。
接下来,模拟了真实空间可变基因,调整了来自 10x Visium 小鼠大脑数据集的经验参数。 我们评估了 SpatialDE2、SpatialDE和SPARK 的灵敏度(统计性能)以检测真正的空间可变基因。SpatialDE的灵敏度最低,而 SpatialDE2 和SPARK 的结果相当。 默认情况下,SPARK 通过执行内核矩阵的特征分解并将负特征值设置为零来强制内核的正定性。 由于其默认的周期内核不是正定的,这会导致内核定义不明确且无法解释。 因此,还在基准测试中包含了没有特征值裁剪的SPARK ,分析得到一致的结果。
Tissue region segmentation recovers known histological features along a continuous resolution gradient
接下来,测试来自小鼠大脑的 10x Visium 数据,以评估 SpatialDE2 和组织分割的替代方法。小鼠大脑非常适合此类基准测试目的,因为它具有明确定义和注释良好的不同区域。 使用空间平滑度的默认设置,SpatialDE2 正确解析了主要的大脑区域(下图),特别是将海马锥体/颗粒细胞与周围的海马结构(区域 0)分开并识别了侧脑室(区域 3)。 为了进行比较,还考虑了应用于相同输入数据的Leiden聚类工作流程(方法)。 从Leiden聚类获得的分割解决方案未能解决一些预期的大脑区域,特别是海马锥体/颗粒细胞和侧脑室都没有被识别。
- 注:(A) Clustering of mouse brain Visium data with SpatialDE2 using default parameters (left), SpatialDE2 with adjusted parameters and clustering results obtained using the Leiden workflow (right).Identified tissue regions are annotated in colour. To avoid clutter, only selected regions that are referred to in the main text are labeled in the legend (out of 18 in total) . (B) Corresponding reference annotation of the mouse brain regions obtained from the Allen Brain Atlas
为了提高空间分辨率,调整了 SpatialDE2 的空间平滑度参数和Leiden聚类的分辨率参数。 值得注意的是,SpatialDE2 在广泛的平滑参数设置中产生了一致的分割,在低精度和高精度解决方案之间提供了有意义的interpolation。 正如预期的那样,具有较低平滑度参数的分割解决了等皮质层以及胼胝体(上图A),并进一步将海马锥体和颗粒细胞层分成不同的区域。 另一方面,当改变Leiden工作流程的分辨率参数时,结果变化很大并且缺乏直观的过渡。 此外,不管分辨率参数的设置如何,Leiden算法既不能分辨胼胝体,也不能分辨海马锥体/颗粒细胞。
对 SpatialDE2 在转录组距离背景下识别的区域的分析表明,该模型获得的解决方案在空间平滑度和转录相似性之间取得了适当的平衡。 特别是,如果这些区域在转录上非常相似,则模型将相同的聚类标签分配给空间不同的区域。 这在海马结构和等皮质的第 1 层的情况下尤为明显,它们被 SpatialDE2 分配到同一cluster。
还通过在各个位置引导 RNA-seq 读数来评估 SpatialDE2 和Leiden聚类的稳健性(下图C),并且评估了两种方法对测序深度变化的敏感性(下图D)。 总的来说,这些实验表明 SpatialDE2 至少与Leiden 工作流程一样强大。
- 注:(C) Assessment of the robustness of tissue region segmentations obtained using SpatialDE2 or Leiden clustering. Shown is the concordance across 100 bootstrap experiments, sampling sequencing reads with replacement from the full dataset (N=4.5·107 reads total). Bar height denotes the mean split/join distance compared to the results obtained on the full dataset; error bars denote plus or minus one standard deviation. (D) Assessment of the robustness of the SpatialDE2 tissue region segmentation to downsampling of sequencing reads (out of N=45x10^6 reads total). The total number of sequencing reads in the mouse brain dataset was downsampled to the indicated amount. Shown are the segmentation results.
最后,使用 SpatialDE2 空间可变基因检测模块来测试已识别组织区域内的空间可变基因。 这鉴定了多达 867 个空间可变基因(FDR<0.1%;下图E),这些基因丰富了与神经元和神经组织相关的 GO terms,例如“神经系统发育”或“突触”。 几个单独的空间可变基因在特定组织区域中具有已知的功能,例如突触结合蛋白、突触蛋白、促甲状腺激素释放激素和丘脑中的血管紧张素原(下图F)。 总的来说,这些结果表明,即使在看似同质的组织区域内,在个体基因和通路水平上也存在表达变异的空间模式,这可能有助于在更精细的空间尺度上理解组织功能。
- 注:(E) Results from the identification of spatially variable genes within the tissue regions identified from the segmentation step as in (A). Bar heights denote the number of spatially variable genes within each region (FDR<0.1%; out of 12,682 genes assessed). (F) Selected examples of spatially variable genes identified in the thalamus. Shown is normalized relative expression of the indicated genes.
SpatialDE2 enables fine-grained analyses in complex human tissue
接下来,为了在更复杂的数据中测试 SpatialDE2,我们将该模型应用于来自人类子宫内膜的 Visium 载玻片,这是一种高度分隔的组织,其中细胞身份取决于刚刚开始全面表征的空间环境。 最初,再次应用 SpatialDE2 和Leiden聚类来分割组织区域(下图)
- 注:Comparison of tissue segmentation obtained using SpatialDE2 (left) and Leiden clustering (right). SpatialDE2 was used with default settings, and the Leiden resolution parameter was selected to yield a biologically meaningful clustering for results using alternative parameters). Myometrium LQ denotes a region of the myometrium with low data quality。
为了注释使用 SpatialDE2 识别的区域,结合了来自组织组织学的线索和从参考细胞类型的计算分配中获得的细胞类型注释(使用 cell2location);这确定了高精度组织亚结构的生理学上有意义的注释。例如,SpatialDE2 识别了区域对应于腺体及其周围区域。在这两个区域中,腺上皮细胞和成纤维细胞都存在,但比例不同。这些观察到的细胞类型组成的逐渐变化是预期的,反映了 Visium 平台的低分辨率(10 到 50每个位置的细胞。手动检查腺细胞的典型标志物,包括 EPCAM(一种上皮标志物)和 PAEP(一种腺体细胞的典型分泌标志物),证实了这些细胞类型注释。同样,肌肉细胞标志物 ACTA2 和 MYLK 主要是在子宫肌层中表达,从而提供额外的信心 SpatiaIDE2 鉴定了具有生理意义的组织区域。
与 SpatialDE2 相比,Leiden聚类工作流程产生的结果与生物学预期不一致。 例如,识别出的子宫肌层区域远不如 SpatialDE2 的平滑,并且功能不同的子宫区域被分配到同一簇:子宫内膜基底层的一部分与子宫肌层的一部分聚集在一起,基底层的另一部分聚集在一起 部分功能层,腺上皮与管腔上皮聚集在一起(上图),至关重要的是,虽然调整Leiden分辨率参数导致某些区域的部分改善 - 提高分辨率会产生更细粒度的基底层和功能性子宫内膜层聚集,并分离管腔和腺上皮 - 这种调整同时导致子宫肌层过度聚集.
接下来,着手研究单个组织区域内的空间基因表达。 SpatialDE2 测试确定了每个区域 53 到 409 个空间可变基因(FDR<0.001,下图)
- 注:Number of spatially variable genes within each region (FDR<0.01%).
GO 富集分析确定了几个与金属离子相关的 GO terms,这些terms用于上皮中的空间可变基因
- 注:GO enrichment analysis of spatially variable genes in the named regions of C). GO terms containing metal-binding genes are highlighted. To avoid clutter, only the GO IDs are shown for the other terms.
这些terms中包含的八个空间可变基因中有七个是金属硫蛋白,它们几乎只在一小群管腔上皮spot中表达
- 注:Normalized relative expression of metallothioneins in the functional layer.
管腔上皮富含纤毛细胞,其特性取决于卵巢激素雌激素和孕激素。 金属硫蛋白是多种组织中管腔上皮的一个特征,先前已被证明受孕酮调节。
然后,尝试使用 SpatialDE2 的 AEH 模块将基因分组为表达模式。 这种方法是对将组织分割成离散组织区域的补充,特别是允许识别平滑过渡。 将 AEH 应用于除被归类为子宫肌层的那些点之外的所有点。 排除子宫肌层有两个原因:它是一个相对均质的组织,低质量区域可能会干扰分析。 这确定了子宫内膜中的五种空间模式。 其中四种模式似乎概括了功能层中腺体的分布,但一种模式(模式 1)明显不同,并且包含在功能层最内腔部分高表达的基因,就在上皮下方
- 注:Normalized relative expression of the progesterone receptor PGRMC1 and other genes belonging to the same spatial pattern (top right)
这些基因中有孕酮受体 PGRMC1,这与子宫内膜的功能层响应激素而分化(蜕膜化)的事实一致。
SpatialDE2 设计将空间组学概况作为输入,从而使模型适用于一系列不同的技术,这些技术产生的表达估计可以通过基于计数的可能性来解释。这既包括基于测序的分析,也包括成像技术。然而,特别是对于 Visium 或其他无法实现真正单细胞分辨率的技术,相对粗略的分割可能会导致将特定组织分配给集群的限制,如具有“腺体(边界)”集群的子宫内膜数据集中所见.受此限制的启发,因此着手探索如何将 SpatialDE2 与从利用参考 scRNA-seq 数据集估计细胞类型丰度的计算反卷积工作流程中获得的细胞类型计数估计相结合。具体来说,将 SpatialDE2 应用于通过 cell2location 在子宫内膜数据集上获得的细胞类型丰度估计。与将 SpatialDE2 直接应用于基因计数相比,这导致了更高精度的分割。
- 注:Comparison of tissue segmentation obtained using SpatialDE2 (left) and Leiden clustering (right) based on cell2location cell type abundance estimates. Number of clusters was determined automatically for SpatialDE2, whereas the Leiden resolution parameter was tuned by hand to yield a biologically meaningful clustering
例如,SpatialDE2 现在将基底层分成三个不同的区域(图中编号为 1,5 和 8)。 属于这些区域的spot也以基因表达空间中的不同cluster为特征
- 注:UMAP embedding of individual locations based on their expression profiles. Color denotes the cluster labels assigned to one of three basal layer clusters by SpatialDE2.
表明这些区域之间存在相关的表型差异。 一致地,观察到这些区域之间细胞类型组成的显着差异。 例如,区域 8 富含上皮腺细胞,而区域 1 具有更多的血管周围 STEAP4 细胞
- 注:Cell type composition of the three basal layer clusters as in (B). Shown are estimates of cell type fractions from cell2location across individual locations assigned to either cluster 1, 5 or 8
一般来说,大多数区域显示出独特的细胞类型组成,允许对组织组织学进行细粒度解剖。 最后注意到,如果应用于 cell2location 输出并且错误地将上皮和腺体以及部分功能层和基底层组合在一起,Leiden 聚类工作流程再次表现不佳,同时严重过度聚类子宫肌层。
Discussion
在这里,展示了 SpatialDE2,一个用于分析空间转录组学数据的多功能工具箱。 该方法的核心是两个分析模块——区域分割和空间可变基因识别。 与现有方法相比,这两个模块都具有重要的优势。 此外,这些模块是兼容的,可以在一个连贯的软件工作流程中有效组合。
区域分割模块独特地考虑了组织的空间结构,直接对 RNA 计数数据进行操作,并且可以通过单个直观参数进行调整。 同时,该模块与忽略空间坐标的传统 Leiden 聚类一样有效。 此外,SpatialDE2 提供 GPU 加速计算选项。 类似地,空间可变基因方向模块提供比以前的方法(如 SpatialDE 或 SPARK)快几个数量级的计算,同时提供可比较或更好的统计能力。 SpatialDE2 还保留了这些基于内核的方法的显着特征,例如通过设计相应的内核来测试特定类型的表达模式的能力
这两个模块同时进行,可以有效实施工作流程,该工作流程包括首先分割组织切片,然后检测每个区域内的空间可变基因。 在小鼠大脑上展示了这个工作流程,这是一种适合评估性能的特征明确的组织。 在这个数据集上,组织区域分割的表现与Leiden 聚类工作流程相当,能够恢复已知的生物学,例如主要的大脑区域。 还将 SpatialDE2 工作流程应用于人类子宫内膜数据集,这是一种表征较少的组织。 在这里,组织区域分割证明优于Leiden 聚类。 除了概括已知的生物学之外,SpatialDE2 还对这种复杂组织产生了新的见解,例如金属硫蛋白的表达模式和与孕酮受体在空间上共表达的几个基因。
最后,展示了如何将 SpatialDE2 应用于基于参考表达谱的空间转录组学计算反卷积的下游,从而产生具有更细粒度结果的组织区域分割,从而实现组织组织学的详细解剖。
最后看一下示例代码
import SpatialDE
import numpy as np
import scipy
import pandas as pd
import scanpy as sc
import anndata as ad
from tqdm.auto import trange, tqdm
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context
from IPython.display import set_matplotlib_formats
%matplotlib inline
set_matplotlib_formats('png')
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['svg.fonttype'] = 'none'
mpl.rc('font', **{'family':'sans-serif','sans-serif':['Helvetica']})
from plotnine import *
import mizani
theme_set(theme_bw())
theme_update(legend_key=element_blank(), text=element_text(family="Helvetica"))
data = sc.read_visium("mouse_brain_visium_wo_cloupe_data/rawdata/ST8059048")
data.var_names_make_unique()
data.var["mt"] = data.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(data, qc_vars=["mt"], inplace=True)
sc.pp.filter_cells(data, min_counts=4000)
data = data[data.obs["pct_counts_mt"] < 20]
sc.pp.filter_genes(data, min_cells=100)
svg_full, _ = SpatialDE.test(data, omnibus=True)
svg_full["total_counts"] = np.asarray(data.X.sum(axis=0)).squeeze()
svg_full.to_pickle("ST8059048_svg_full.pkl")
svg_full = pd.read_pickle("ST8059048_svg_full.pkl")
vargenes = svg_full[svg_full.padj < 0.001].sort_values("total_counts", ascending=False).gene[:2000]
segm_default, _ = SpatialDE.tissue_segmentation(data, genes=vargenes, rng=np.random.default_rng(seed=42))
data.obs["segmentation_labels_default"] = data.obs.segmentation_labels
segm_adj, _ = SpatialDE.tissue_segmentation(data, genes=vargenes, rng=np.random.default_rng(seed=42), params=SpatialDE.TissueSegmentationParameters(smoothness_factor=5))
with rc_context({"figure.figsize": (2, 2)}):
sc.pl.spatial(data, color="segmentation_labels_default", title="SpatialDE2", return_fig=True).savefig("figures/3a_default.svg")
sc.pl.spatial(data, color="segmentation_labels", title="SpatialDE2 (adjusted)", return_fig=True).savefig("figures/3a_adjusted.svg")
data_normalized = data.copy()
sc.pp.normalize_total(data_normalized, target_sum=1e4, key_added="scaling_factor")
data_norm_reg = data_normalized.copy()
sc.pp.log1p(data_norm_reg)
variable = sc.pp.highly_variable_genes(data, flavor="seurat_v3", inplace=False, n_top_genes=2000)
data_norm_reg = data_normalized.copy()
sc.pp.log1p(data_norm_reg)
data_norm_reg = data_norm_reg[:, vargenes]
sc.pp.regress_out(data_norm_reg, ['total_counts'])
sc.pp.scale(data_norm_reg, max_value=10)
sc.tl.pca(data_norm_reg, n_comps=100)
sc.pp.neighbors(data_norm_reg, n_neighbors=20, n_pcs=80)
sc.tl.leiden(data_norm_reg, random_state=42)
sc.tl.umap(data_norm_reg)
with rc_context({"figure.figsize": (2, 2)}):
sc.pl.spatial(data_norm_reg, color="leiden", title="Leiden", return_fig=True).savefig("figures/3a_leiden.svg")
sc.pl.umap(data_norm_reg, size=3, color=["segmentation_labels_default", "segmentation_labels", "leiden"], title=["SpatialDE2", "SpatialDE2 (adjusted)", "Leiden"], return_fig=True, ncols=3).savefig("figures/s3c.svg")
resolutions = (0.1, 0.5, 1., 1.5, 2., 3., 4.)
smoothnesses = (0.5, 1., 2., 3., 5., 8., 10.)
data_norm_reg.obs["leiden_1.0"] = data_norm_reg.obs.leiden
with rc_context({"figure.figsize": (2, 2)}):
for res in resolutions:
sc.tl.leiden(data_norm_reg, random_state=42, resolution=res, key_added=f"leiden_{res}")
sc.pl.spatial(data_norm_reg, color=f"leiden_{res}", return_fig=True, title=f"Leiden ({res})").savefig(f"figures/s3b_{res}.svg")
testdata = data.copy()
with rc_context({"figure.figsize": (2, 2)}):
for smoothness in smoothnesses:
segm_adj, _ = SpatialDE.tissue_segmentation(testdata, genes=vargenes, rng=np.random.default_rng(seed=42), params=SpatialDE.TissueSegmentationParameters(smoothness_factor=smoothness))
sc.pl.spatial(testdata, color=f"segmentation_labels", return_fig=True, title=f"SpatialDE2 ({smoothness})").savefig(f"figures/s3a_{smoothness}.svg")
nresamples = 100
totals = (20e6, 10e6, 5e6, 1e6)
obs = data.obs.copy()
for i in trange(nresamples):
bdata = data.copy()
bdata.X.data = np.random.default_rng(seed=i).multinomial(bdata.X.data.sum(), bdata.X.data.astype(np.uint16) / bdata.X.data.astype(np.uint16).sum())
bdata.obs.total_counts = bdata.X.sum(axis=1)
segm, _ = SpatialDE.tissue_segmentation(bdata, genes=vargenes, rng=np.random.default_rng(seed=42))
obs[f"segmentation_labels_bootstrap_{i}"] = bdata.obs.segmentation_labels
sc.pp.normalize_total(bdata, target_sum=1e4, key_added="scaling_factor")
bdata = bdata[:, vargenes]
sc.pp.regress_out(bdata, ['total_counts'])
sc.pp.scale(bdata, max_value=10)
sc.tl.pca(bdata, n_comps=100)
sc.pp.neighbors(bdata, n_neighbors=20, n_pcs=80)
sc.tl.leiden(bdata, random_state=42)
obs[f"leiden_bootstrap_{i}"] = bdata.obs.leiden
for total in totals:
cdata = sc.pp.downsample_counts(data, total_counts=total, copy=True, random_state=i)
cdata.obs.total_counts = cdata.X.sum(axis=1)
for smoothness in smoothnesses:
segm, _ = SpatialDE.tissue_segmentation(cdata, genes=vargenes, rng=np.random.default_rng(seed=42), params=SpatialDE.TissueSegmentationParameters(smoothness_factor=smoothness))
obs[f"segmentation_labels_{total:.0e}_{i}_{smoothness}"] = cdata.obs.segmentation_labels
sc.pp.normalize_total(cdata, target_sum=1e4, key_added="scaling_factor")
cdata = cdata[:, vargenes]
sc.pp.regress_out(cdata, ['total_counts'])
sc.pp.scale(cdata, max_value=10)
sc.tl.pca(cdata, n_comps=100)
sc.pp.neighbors(cdata, n_neighbors=20, n_pcs=80)
for res in resolutions:
sc.tl.leiden(cdata, random_state=42, resolution=res)
obs[f"leiden_{total:.0e}_{i}_{res}"] = cdata.obs.leiden
obs.to_pickle("data_obs.pkl")
data.write_h5ad("data.h5ad")
data = ad.read_h5ad("data.h5ad")
obs = pd.read_pickle("data_obs.pkl")
data.obs = obs
def split_join_metric(obs1, obs2):
cl1 = np.unique(obs1)
cl2 = np.unique(obs2)
cp1 = {c: 0 for c in cl1}
cp2 = {c: 0 for c in cl2}
for ccl1 in cl1:
cobs1 = obs1 == ccl1
for ccl2 in cl2:
intersect = np.sum(cobs1 & (obs2 == ccl2))
cp1[ccl1] = max(cp1[ccl1], intersect)
cp2[ccl2] = max(cp2[ccl2], intersect)
p1 = sum(cp1.values())
p2 = sum(cp2.values())
return 2 * len(obs1) - p1 - p2
data.obs["leiden"] = data_norm_reg.obs.leiden
dists = (pd.concat([
pd.DataFrame.from_records([(i, split_join_metric(data.obs.segmentation_labels_default, data.obs[f"segmentation_labels_bootstrap_{i}"]), "SpatialDE") for i in range(nresamples)], columns=["resample", "dist", "method"]),
pd.DataFrame.from_records([(i, split_join_metric(data.obs.leiden, data.obs[f"leiden_bootstrap_{i}"]), "Leiden") for i in range(nresamples)], columns=["resample", "dist", "method"])
], axis=0).groupby("method", as_index=False).
agg(y=("dist", np.mean),
ymin=("dist", lambda x: x.mean() - x.std()),
ymax=("dist", lambda x: x.mean() + x.std()))
)
(
ggplot(
dists,
aes(x="method", y="y", ymin="ymin", ymax="ymax")) +
geom_bar(stat='identity') +
geom_errorbar() +
labs(x=None, y="split / join distance") +
theme(figure_size=(2,2))
).save("figures/3c.svg")
dists
dists_spatialde = pd.DataFrame.from_records([(total, i, sm, split_join_metric(data.obs.segmentation_labels_default, data.obs[f"segmentation_labels_{total:.0e}_{i}_{sm}"])) for total in totals for i in range(nresamples) for sm in smoothnesses], columns=["depth", "resample", "smoothness", "dist"])
mean_dists = dists_spatialde.groupby(["depth", "smoothness"], as_index=False).agg(mean_dist=("dist", np.mean))
spatialde_optimal = mean_dists.loc[mean_dists.groupby("depth")["mean_dist"].idxmin()]
dists_leiden = pd.DataFrame.from_records([(total, i, sm, split_join_metric(data.obs.leiden, data.obs[f"leiden_{total:.0e}_{i}_{sm}"])) for total in totals for i in range(nresamples) for sm in resolutions], columns=["depth", "resample", "res", "dist"])
mean_dists = dists_leiden.groupby(["depth", "res"], as_index=False).agg(mean_dist=("dist", np.mean))
leiden_optimal = mean_dists.loc[mean_dists.groupby("depth")["mean_dist"].idxmin()]
def label(x):
if x == 0:
return "0.0"
exp = np.floor(np.log10(x))
factor = x / (10 ** exp)
return f"${factor:.1f}\\! \\times\\! 10^{{{int(exp)}}}$"
dists_spatialde = dists_spatialde.set_index(["depth", "smoothness"]).loc[spatialde_optimal.set_index(["depth", "smoothness"]).index].reset_index()
dists_leiden = dists_leiden.set_index(["depth", "res"]).loc[leiden_optimal.set_index(["depth", "res"]).index].reset_index()
dists_spatialde["method"] = "SpatialDE2"
dists_leiden["method"] = "Leiden"
(ggplot(
pd.concat((dists_spatialde, dists_leiden), axis=0).groupby(["depth", "method"], as_index=False).agg(y=("dist", np.mean),
ymin=("dist", lambda x: x.mean() - x.std()),
ymax=("dist", lambda x: x.mean() + x.std())),
aes("depth", "y", ymin="ymin", ymax="ymax", color="method")) + geom_line() + geom_errorbar(width=5e5) +
labs(x="sequencing depth", y="split/join distance") +
scale_x_continuous(labels=lambda x: [label(i) for i in x]) +
theme(figure_size=(2,2), axis_text_x=element_text(rotation=20), legend_title=element_blank())
).save("figures/s3e.svg")
spatialde_optimal
leiden_optimal
with rc_context({"figure.figsize": (2, 2)}):
for total, optimal in zip(totals, (3., 3., 3., 0.5)):
sc.pl.spatial(data, color=f"segmentation_labels_{total:.0e}_0_{optimal}", return_fig=True, title=f"SpatialDE2 ({total:.0e} reads)").savefig(f"figures/3d_{total:.0e}.svg")
with rc_context({"figure.figsize": (2, 2)}):
for total, optimal in zip(totals, (1., 1., 1., 0.5)):
sc.pl.spatial(data, color=f"leiden_{total:.0e}_0_{optimal}", return_fig=True, title=f"Leiden ({total:.0e} reads)").savefig(f"figures/s3d_{total:.0e}.svg")
ulabels, lcounts = np.unique(data.obs.segmentation_labels_default, return_counts=True)
svg_default = []
for l in ulabels[lcounts > 10]:
csvg, _ = SpatialDE.test(data[data.obs.segmentation_labels_default == l, :], omnibus=True)
csvg["label"] = l
svg_default.append(csvg)
svg_default = pd.concat(svg_default, axis=0, ignore_index=True)
svg_default.to_pickle("ST8059048_svg_default.pkl")
ulabels, lcounts = np.unique(data.obs.segmentation_labels, return_counts=True)
svg = []
for l in ulabels[lcounts > 10]:
csvg, _ = SpatialDE.test(data[data.obs.segmentation_labels == l, :], omnibus=True)
csvg["label"] = l
svg.append(csvg)
svg = pd.concat(svg, axis=0, ignore_index=True)
svg.to_pickle("ST8059048_svg.pkl")
svg = pd.read_pickle("ST8059048_svg.pkl")
(
ggplot(svg.assign(region=lambda x: pd.Categorical(x.label))[svg.padj < 0.001], aes("region")) +
geom_bar() +
ylab("spatially variable genes") +
theme(axis_text_x=element_text(rotation=20, ha="right"), figure_size=(4, 2))
).save("figures/3e.svg")
from bioservices import BioMart
import goatools
from io import StringIO
s = BioMart()
s.add_dataset_to_xml("mmusculus_gene_ensembl")
s.add_attribute_to_xml("ensembl_gene_id")
s.add_attribute_to_xml("external_gene_name")
s.add_attribute_to_xml("entrezgene_accession")
s.add_attribute_to_xml("entrezgene_id")
s.add_filter_to_xml("ensembl_gene_id", ",".join(data.var.gene_ids.to_numpy()))
res = s.query(s.get_xml())
res = pd.read_table(StringIO(res), names=["ensembl_gene_id", "gene_name", "entrezgene_accession", "entrezgene_id"], dtype={"entrezgene_id": pd.Int32Dtype()})
def filter_entrez(group):
group = group.drop("ensembl_gene_id", axis=1)
if group.shape[0] == 1:
ret = group
else:
matched = group[group.gene_name == group.entrezgene_accession]
if matched.shape[0] > 0:
ret = matched.iloc[[0],:] # no clear way to choose if there are multiple matches, so I just choose the first one
else:
ret = matched
return ret.drop("entrezgene_accession", axis=1)
res = res.dropna().reset_index(drop=True).groupby("ensembl_gene_id").apply(filter_entrez)
data.var["entrez_id"] = (
data.var.
reset_index().
rename(columns={"index":"gene_name", "gene_ids": "ensembl_gene_id"}).
merge(
res.assign(entrezgene_id=res.entrezgene_id.astype(pd.Int32Dtype())),
on=["gene_name", "ensembl_gene_id"]).
set_index("gene_name").
rename_axis(index=None)["entrezgene_id"]
from goatools.base import download_go_basic_obo, download_ncbi_associations
from goatools.obo_parser import GODag
from goatools.anno.genetogo_reader import Gene2GoReader
obo_fname = download_go_basic_obo()
fin_gene2go = download_ncbi_associations()
obodag = GODag(obo_fname)
objanno = Gene2GoReader(fin_gene2go, taxids=[10090])
ns2assoc = objanno.get_ns2assc()
background_genes = data.var.entrez_id.to_numpy()
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS
goeaobj = GOEnrichmentStudyNS(background_genes, ns2assoc, obodag, propagate_counts=False, alpha=0.01, methods=['fdr_by'])
go_enrichment = []
for label, g in svg.groupby("label"):
enrichment = goeaobj.run_study(data.var.entrez_id[g.gene[g.padj < 0.001]].to_numpy().tolist(), prt=None)
res = []
for term in enrichment:
res.append({
"ont": term.NS,
"go": term.GO,
"desc": term.name,
"study_count": term.study_count,
"population_count": term.pop_count,
"study_ratio": np.divide(*term.ratio_in_study),
"population_ratio": np.divide(*term.ratio_in_pop),
"pval": term.p_uncorrected,
"padj": term.p_fdr_by,
"label": label,
"genes": data.var.query("entrez_id in @term.study_items").index.to_numpy()
})
go_enrichment.append(pd.DataFrame(res))
go_enrichment = pd.concat(go_enrichment, axis=0, ignore_index=True).assign(label=lambda x: pd.Categorical(x.label))
(
ggplot(go_enrichment[(go_enrichment.padj < 1e-3) & (go_enrichment.ont == "CC")], aes("label", "desc", color="-np.log10(pval)", size="study_ratio")) +
geom_point() +
facet_wrap("ont", ncol=1, scales="free") +
scale_size_area(breaks=[0.1, 0.2, 0.3, 0.4, 0.5], limits=(0.1, 0.5), oob=mizani.bounds.squish, name="fraction of sample") +
scale_color_distiller(palette="YlGnBu", direction=1, limits=(0, 25), oob=mizani.bounds.squish, name=r"$-\log_{10}(p)$") +
theme(figure_size=(2,7), subplots_adjust={'hspace': 0.1})
).save("figures/s3f_1.svg")
(
ggplot(go_enrichment[(go_enrichment.padj < 1e-3) & (go_enrichment.ont != "CC")], aes("label", "desc", color="-np.log10(pval)", size="study_ratio")) +
geom_point() +
facet_wrap("ont", ncol=1, scales="free") +
scale_size_area(breaks=[0.1, 0.2, 0.3, 0.4, 0.5], limits=(0.1, 0.5), oob=mizani.bounds.squish, name="fraction of sample") +
scale_color_distiller(palette="YlGnBu", direction=1, limits=(0, 25), oob=mizani.bounds.squish, name=r"$-\log_{10}(p)$") +
theme(figure_size=(2,7), subplots_adjust={'hspace': 0.2})
).save("figures/s3f_2.svg")
with rc_context({"figure.figsize": (2, 2)}):
for gene in ("Syt7", "Syngr3", "Trh", "Agt"):
sc.pl.spatial(data_normalized[data.obs.segmentation_labels == 5, :], color=gene, return_fig=True).savefig(f"figures/3f_{gene}.svg")
with pd.ExcelWriter("figures/table_s_mouse_brain.xlsx") as ew:
svg_full[["gene", "pval", "padj"]].to_excel(ew, sheet_name="SV genes whole slice", index=False)
svg[["gene", "pval", "padj", "label"]].rename(columns={"label": "region"}).to_excel(ew, sheet_name="SV genes per region", index=False)
go_enrichment.rename(columns={"region": "region_name", "label": "region"}).to_excel(ew, sheet_name="SV genes GO enrichment", index=False)
data.obs[["segmentation_labels_default", "segmentation_labels"]].reset_index().rename(columns={"index": "spatial_barcode", "segmentation_labels": "region", "segmentation_labels_names": "region_name"}).to_excel(ew, sheet_name="segmentation (counts)", index=False)
import SpatialDE
import numpy as np
import scipy
import pandas as pd
import scanpy as sc
import anndata as ad
from tqdm.auto import trange, tqdm
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context
from IPython.display import set_matplotlib_formats
%matplotlib inline
set_matplotlib_formats('png')
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['svg.fonttype'] = 'none'
mpl.rc('font', **{'family':'sans-serif','sans-serif':['Helvetica']})
from plotnine import *
import mizani
theme_set(theme_bw())
theme_update(legend_key=element_blank(),
text=element_text(family="Helvetica"),
axis_text=element_text(color="black"),
panel_border=element_rect(color="black"),
strip_background=element_rect(color="black")
)
from io import StringIO
from itertools import chain
from bioservices import BioMart, UniProt
import goatools
a30_0 = sc.read_visium("152807")
a30_0.var_names_make_unique()
a30_0.var["mt"] = a30_0.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(a30_0, qc_vars=["mt"], inplace=True)
sc.pp.filter_cells(a30_0, min_counts=800)
a30_0 = a30_0[a30_0.obs["pct_counts_mt"] < 20]
sc.pp.filter_genes(a30_0, min_cells=100)
s = BioMart()
s.add_dataset_to_xml("hsapiens_gene_ensembl")
s.add_attribute_to_xml("ensembl_gene_id")
s.add_attribute_to_xml("external_gene_name")
s.add_attribute_to_xml("entrezgene_accession")
s.add_attribute_to_xml("entrezgene_id")
s.add_filter_to_xml("ensembl_gene_id", ",".join(a30_0.var.gene_ids.to_numpy()))
res = s.query(s.get_xml())
res = pd.read_table(StringIO(res), names=["ensembl_gene_id", "gene_name", "entrezgene_accession", "entrezgene_id"], dtype={"entrezgene_id": pd.Int32Dtype()})
def filter_entrez(group):
group = group.drop("ensembl_gene_id", axis=1)
if group.shape[0] == 1:
ret = group.iloc[0,:]
else:
matched = group[group.gene_name == group.entrezgene_accession]
ret = matched.iloc[0,:] # no clear way to choose if there are multiple matches, so I just choose the first one
return ret.drop("entrezgene_accession")
res = res.dropna().reset_index(drop=True).groupby("ensembl_gene_id").apply(filter_entrez)
a30_0.var["entrez_id"] = (
a30_0.var.
reset_index().
rename(columns={"index":"gene_name", "gene_ids": "ensembl_gene_id"}).
merge(
res.assign(entrezgene_id=res.entrezgene_id.astype(pd.Int32Dtype())),
on=["gene_name", "ensembl_gene_id"]).
set_index("gene_name").
rename_axis(index=None)["entrezgene_id"]
svg_full, _ = SpatialDE.test(a30_0, omnibus=True)
svg_full["total_counts"] = np.asarray(a30_0.X.sum(axis=0)).squeeze()
svg_full.to_pickle("152807_svg_full.pkl")
svg_full = pd.read_pickle("152807_svg_full.pkl")
vargenes = svg_full[svg_full.padj < 0.001].sort_values("total_counts", ascending=False).gene[:2000]
segm, _ = SpatialDE.tissue_segmentation(a30_0, genes=vargenes, rng=np.random.default_rng(seed=42))
region_mapping = {0:"basal layer", 1: "myometrium LQ", 2: "myometrium", 3: "epithelium", 4: "functional layer", 7: "glands (border)", 11: "glands"}
a30_0.obs["segmentation_labels_names"] = a30_0.obs.segmentation_labels.cat.rename_categories(region_mapping)
with rc_context({"figure.figsize": (2,2)}):
sc.pl.spatial(a30_0, color="segmentation_labels_names", title="SpatialDE2", return_fig=True).savefig("figures/4a_spatialde.svg", dpi=1000)
a30_0_normalized = a30_0.copy()
sc.pp.normalize_total(a30_0_normalized, target_sum=1e4, key_added="scaling_factor")
a30_0_norm_reg = a30_0_normalized.copy()
sc.pp.log1p(a30_0_norm_reg)
a30_0_norm_reg = a30_0_norm_reg[:, vargenes]
sc.pp.regress_out(a30_0_norm_reg, ['total_counts'])
sc.pp.scale(a30_0_norm_reg, max_value=10)
sc.tl.pca(a30_0_norm_reg, svd_solver='arpack')
sc.pp.neighbors(a30_0_norm_reg, n_neighbors=10, n_pcs=50)
sc.tl.leiden(a30_0_norm_reg, random_state=42, resolution=0.6)
with rc_context({"figure.figsize": (2,2)}):
sc.pl.spatial(a30_0_norm_reg, color="leiden", title="Leiden", return_fig=True).savefig("figures/4a_leiden.svg", dpi=1000)
resolutions = (0.1, 0.3, 0.6, 0.8, 1., 1.5, 2.)
smoothnesses = (0.5, 1., 2., 3., 5., 8., 10.)
a30_0_norm_reg.obs["leiden_0.6"] = a30_0_norm_reg.obs.leiden
with rc_context({"figure.figsize": (2, 2)}):
for res in resolutions:
sc.tl.leiden(a30_0_norm_reg, random_state=42, resolution=res, key_added=f"leiden_{res}")
sc.pl.spatial(a30_0_norm_reg, color=f"leiden_{res}", return_fig=True, title=f"Leiden ({res})").savefig(f"figures/s4d_{res}.svg", dpi=1000)
testdata = a30_0.copy()
with rc_context({"figure.figsize": (2, 2)}):
for smoothness in smoothnesses:
segm_adj, _ = SpatialDE.tissue_segmentation(testdata, genes=vargenes, rng=np.random.default_rng(seed=42), params=SpatialDE.TissueSegmentationParameters(smoothness_factor=smoothness))
sc.pl.spatial(testdata, color=f"segmentation_labels", return_fig=True, title=f"SpatialDE2 ({smoothness})").savefig(f"figures/s4c_{smoothness}.svg", dpi=1000)
ulabels, lcounts = np.unique(segm.labels, return_counts=True)
svg_byregion = []
for l in ulabels[lcounts > 100]:
res, _ = SpatialDE.test(a30_0[segm.labels == l, :], omnibus=True)
res["label"] = l
svg_byregion.append(res)
svg_byregion = pd.concat(svg_byregion, axis=0, ignore_index=True)
svg_byregion.to_pickle("15807_svg_byregion.pkl")
svg_byregion = pd.read_pickle("15807_svg_byregion.pkl")
(
ggplot(svg_byregion.
assign(region=lambda x: pd.Categorical(x.label).rename_categories(region_mapping))[svg_byregion.padj < 0.001],
aes("region")) +
geom_bar() +
ylab("spatially variable genes") +
theme(axis_text_x=element_text(rotation=20, ha="right"), figure_size=(3, 2))
).save("figures/4b.svg"
from goatools.base import download_go_basic_obo, download_ncbi_associations
from goatools.obo_parser import GODag
from goatools.anno.genetogo_reader import Gene2GoReader
obo_fname = download_go_basic_obo()
fin_gene2go = download_ncbi_associations()
obodag = GODag(obo_fname)
objanno = Gene2GoReader(fin_gene2go, taxids=[9606])
ns2assoc = objanno.get_ns2assc()
background_genes = a30_0.var.entrez_id.to_numpy()
from goatools.goea.go_enrichment_ns import GOEnrichmentStudyNS
goeaobj = GOEnrichmentStudyNS(background_genes, ns2assoc, obodag, propagate_counts=False, alpha=0.01, methods=['fdr_by'])
go_enrichment = []
for label, g in svg_byregion.groupby("label"):
enrichment = goeaobj.run_study(a30_0.var.entrez_id[g.gene[g.padj < 0.001]].to_numpy().tolist(), prt=None)
res = []
for term in enrichment:
res.append({
"ont": term.NS,
"go": term.GO,
"desc": term.name,
"study_count": term.study_count,
"population_count": term.pop_count,
"study_ratio": np.divide(*term.ratio_in_study),
"population_ratio": np.divide(*term.ratio_in_pop),
"pval": term.p_uncorrected,
"padj": term.p_fdr_by,
"label": label,
"genes": a30_0.var.query("entrez_id in @term.study_items").index.to_numpy()
})
go_enrichment.append(pd.DataFrame(res))
go_enrichment = pd.concat(go_enrichment, axis=0, ignore_index=True).assign(label=lambda x: pd.Categorical(x.label))
go_enrichment["region"] = go_enrichment.label.cat.rename_categories(region_mapping)
(
ggplot(go_enrichment[(go_enrichment.padj < 1e-2) & (go_enrichment.ont == "BP")].assign(
label=lambda x: np.where(x.go.isin(("GO:0010273", "GO:0006882", "GO:0071294", "GO:0071280", "GO:0071276")), x.desc, x.go)),
aes("region", "label", color="-np.log10(pval)", size="study_ratio")) +
geom_point() +
facet_wrap("ont", ncol=1, scales="free") +
labs(size="fraction of sample", color=r"$-\log_{10}(p)$", y=None, x=None) +
theme(figure_size=(2.5,5), subplots_adjust={'hspace': 0.2}, axis_text_x=element_text(rotation=30, ha="right"))
).save("figures/4c.svg")
with rc_context({"figure.figsize": (2, 1)}):
for gene in ("MT2A", "MT1E", "MT1M", "MT1F", "MT1G", "MT1H", "MT1X"):
sc.pl.spatial(a30_0_normalized, color=gene, size=1.5, crop_coord=[0, 1000, 1500, 1200], return_fig=True).savefig(f"figures/4d_{gene}.svg", dpi=1000)
allsignifgenes = svg_full[svg_full.padj < 0.001].gene.to_numpy()
upper_patterns, upper_a30_0 = SpatialDE.spatial_patterns(a30_0[np.isin(segm.labels, (0, 3,4,7,8,9,11)),:], genes=allsignifgenes, rng=np.random.default_rng(seed=42), copy=True)
with rc_context({"figure.figsize": (2,1)}):
for i in range(upper_patterns.patterns.shape[1]):
sc.pl.spatial(upper_a30_0, color=f"spatial_pattern_{i}", title=f"spatial pattern {i}", crop_coord=[0, 1000, 1500, 1000], size=1.5, return_fig=True).savefig(f"figures/4e_pattern_{i}.svg", dpi=1000)
upper_a30_0_normalized = a30_0_normalized[np.isin(segm.labels, (0, 3,4,7,8,9,11)),:]
with rc_context({"figure.figsize": (2,1)}):
for gene in allsignifgenes[upper_patterns.labels == 1]:
sc.pl.spatial(upper_a30_0_normalized, color=gene, size=1.5, crop_coord=[0, 1000, 1500, 1000], return_fig=True).savefig(f"figures/4e_{gene}.svg", dpi=1000)
with rc_context({"figure.figsize":(2,2)}):
for gene, vmax in zip(["EPCAM", "SCGB2A2", "PAEP", "ACTA2", "MYLK"], (15, 15, None, None, None)):
sc.pl.spatial(a30_0_normalized, return_fig=True, color=gene, size=1.5, vmax=vmax).savefig(f"figures/s4b_{gene}.svg", dpi=1000)
rnakeys = ("q05_nUMI_factors", "mean_nUMI_factors", "q95_nUMI_factors")
weightkeys = ("q05_spot_factor", "mean_spot_factor", "q95_spot_factor")
cell2locationdata = ad.read_h5ad("20201207_LocationModelLinearDependentWMultiExperiment_19clusters_20952locations_19980genes/sp.h5ad")
cell2locationdata = cell2locationdata[cell2locationdata.obs["sample"] == "152807"]
a30_0_unfiltered = sc.read_visium("152807")
for key in chain(rnakeys, weightkeys):
a30_0_unfiltered.obs = a30_0_unfiltered.obs.join(cell2locationdata.obs.set_index("spot_id").filter(like=key), sort=False)
a30_0_celltypernas = tuple(a30_0_unfiltered.obs.filter(like=k) for k in rnakeys)
a30_0_celltypeads = tuple(ad.AnnData(X=ctr.to_numpy().round().astype(np.int32), obs=pd.DataFrame(index=a30_0_unfiltered.obs_names), var=pd.DataFrame(index=ctr.columns), uns=a30_0_unfiltered.uns, obsm=a30_0_unfiltered.obsm) for ctr in a30_0_celltypernas)
slabs = a30_0.obs.segmentation_labels_names[a30_0.obs.segmentation_labels.isin((0, 4,7, 11))]
obs = a30_0_celltypeads[1].obs.loc[slabs.index,:]
obs["segmentation_labels_names"] = slabs
(
ggplot(obs.
join(pd.DataFrame(a30_0_celltypeads[1].X, columns=a30_0_celltypeads[1].var_names.str[17:], index=a30_0_celltypeads[1].obs_names)).
reset_index().
melt(id_vars=["segmentation_labels_names", "index"], var_name="celltype", value_name="mean_spot_factor").
assign(segmentation_labels_names=lambda df: df.segmentation_labels_names.cat.remove_unused_categories()).
groupby("index").
apply(lambda g: g.assign(rel_mean_spot_factor=lambda g: g.mean_spot_factor / g.mean_spot_factor.sum())),
aes("celltype", "rel_mean_spot_factor", color="segmentation_labels_names")) +
geom_sina(scale="width", shape=".", alpha=0.6, size=0.01) +
labs(x=None, y="fraction of transcripts") +
guides(color=guide_legend(title="", override_aes={"alpha": 1, "shape": 'o', "size": 2})) +
theme(figure_size=(10, 2), axis_text_x=element_text(rotation=30, ha="right"))
).save("figures/s4a.svg")
c2l_segm, _ = SpatialDE.tissue_segmentation(a30_0_celltypeads[1], rng=np.random.default_rng(seed=42))
with rc_context({"figure.figsize": (2,2)}):
sc.pl.spatial(a30_0_celltypeads[1], color="segmentation_labels", title="SpatialDE2", return_fig=True).savefig("figures/5a_spatialde.svg", dpi=1000)
a30_0_celltypeweights = tuple(a30_0_unfiltered.obs.filter(like=k) for k in weightkeys)
a30_0_celltypeweightads = tuple(ad.AnnData(X=ctr.to_numpy(), obs=a30_0_unfiltered.obs, var=pd.DataFrame(index=ctr.columns), uns=a30_0_unfiltered.uns, obsm=a30_0_unfiltered.obsm) for ctr in a30_0_celltypeweights)
sc.pp.neighbors(a30_0_celltypeweightads[1], n_neighbors=10, n_pcs=0, random_state=42, metric="cosine")
sc.tl.leiden(a30_0_celltypeweightads[1], random_state=42, resolution=0.4)
with rc_context({"figure.figsize": (2,2)}):
sc.pl.spatial(a30_0_celltypeweightads[1], color="leiden", title="Leiden", return_fig=True).savefig("figures/5a_leiden.svg", dpi=1000)
resolutions = (0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1, 1.2)
with rc_context({"figure.figsize": (2,2)}):
for r in resolutions:
sc.tl.leiden(a30_0_celltypeweightads[1], random_state=42, resolution=r, key_added=f"leiden_{r}")
sc.pl.spatial(a30_0_celltypeweightads[1], color=f"leiden_{r}", title=f"Leiden ({r:.1f})", return_fig=True).savefig(f"figures/s5_leiden_resolution_{r:.1f}.svg", dpi=1000)
a30_0_umap = a30_0.copy()
a30_0_umap.obs["segmentation_labels_c2l"] = a30_0_celltypeads[1].obs.segmentation_labels
sc.pp.highly_variable_genes(a30_0_umap, flavor="seurat_v3", n_top_genes=2000)
sc.pp.normalize_total(a30_0_umap, target_sum=1e4, inplace=True)
sc.pp.log1p(a30_0_umap)
sc.pp.regress_out(a30_0_umap, ['total_counts'])
sc.pp.scale(a30_0_umap, max_value=10)
sc.pp.pca(a30_0_umap, n_comps=10)
sc.pp.neighbors(a30_0_umap, n_neighbors=15)
sc.tl.umap(a30_0_umap, min_dist=0.5)
with rc_context({"figure.figsize": (2,2)}):
sc.pl.umap(a30_0_umap, color="segmentation_labels_c2l", groups=(1,5,8), size=10, title="", return_fig=True).savefig("figures/5b.svg")
(
ggplot(a30_0_celltypeads[1].obs.
query("segmentation_labels in (1,5,8)").
join(pd.DataFrame(a30_0_celltypeads[1].X, columns=a30_0_celltypeads[1].var_names.str[17:], index=a30_0_celltypeads[1].obs_names)).
reset_index().
melt(id_vars=["segmentation_labels", "index"], var_name="celltype", value_name="mean_spot_factor").
assign(segmentation_labels=lambda df: df.segmentation_labels.cat.remove_unused_categories()).
groupby("index").
apply(lambda g: g.assign(rel_mean_spot_factor=lambda g: g.mean_spot_factor / g.mean_spot_factor.sum())),
aes("celltype", "rel_mean_spot_factor", color="segmentation_labels")) +
geom_sina(scale="width", shape=".", alpha=0.6, size=0.01) +
labs(x=None, y="fraction of transcripts") +
guides(color=guide_legend(title="", override_aes={"alpha": 1, "shape": 'o', "size": 2})) +
theme(figure_size=(6, 2), axis_text_x=element_text(rotation=30, ha="right"))
).save("figures/5c.svg")
(
ggplot(a30_0_celltypeads[1].obs.
join(pd.DataFrame(a30_0_celltypeads[1].X, columns=a30_0_celltypeads[1].var_names.str[17:], index=a30_0_celltypeads[1].obs_names)).
reset_index().
melt(id_vars=["segmentation_labels", "index"], var_name="celltype", value_name="mean_spot_factor").
assign(segmentation_labels=lambda df: df.segmentation_labels.cat.remove_unused_categories()).
groupby("index").
apply(lambda g: g.assign(rel_mean_spot_factor=lambda g: g.mean_spot_factor / g.mean_spot_factor.sum())),
aes("celltype", "rel_mean_spot_factor", color="segmentation_labels")) +
geom_sina(scale="width", shape=".", alpha=0.6, size=0.01) +
labs(x=None, y="fraction of transcripts") +
guides(color=guide_legend(title="", override_aes={"alpha": 1, "shape": 'o', "size": 2}, nrow=1)) +
theme(figure_size=(15, 2), axis_text_x=element_text(rotation=30, ha="right"), legend_position="top")
).save("figures/5d.svg")
with pd.ExcelWriter("figures/table_s_endometrium.xlsx") as ew:
svg_full[["gene", "pval", "padj"]].to_excel(ew, sheet_name="SV genes whole slice", index=False)
svg_byregion[["gene", "pval", "padj", "label"]].rename(columns={"label": "region"}).to_excel(ew, sheet_name="SV genes per region", index=False)
go_enrichment.rename(columns={"region": "region_name", "label": "region"}).to_excel(ew, sheet_name="SV genes GO enrichment", index=False)
a30_0.obs[["segmentation_labels", "segmentation_labels_names"]].reset_index().rename(columns={"index": "spatial_barcode", "segmentation_labels": "region", "segmentation_labels_names": "region_name"}).to_excel(ew, sheet_name="segmentation (counts)", index=False)
a30_0_celltypeads[1].obs.reset_index().rename({"index": "spatial_barcode", "segmentation_labels": "region"}).to_excel(ew, sheet_name="segmentation (cell2location)", index=False)
有点难,生活很好,有你更好