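Hello! Yesterday we shared the example code for VECTOR in the post "10X single-cell (10X spatial transcriptomics) trajectory analysis (pseudotime analysis): VECTOR". The paper was published in Cell Reports in August 2020, and its underlying principle deserves a careful summary. In this short post we walk through the paper, pick out the key points, and look at the tool's characteristics and how it is applied, so that we know what to expect when using it.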
SUMMARY
A key step in trajectory inference is the determination of starting cells (everyone has probably felt this first-hand, which is why cell annotation is needed before any customized analysis), which is typically done by using manually selected marker genes (most cell annotation methods today still rely on manually chosen markers; similarity-mapping approaches currently have too many problems). In this study, we find that the quantile polarization (a puzzling term at first sight) of a cell's principal-component values is strongly associated with their respective states in development hierarchy (that is, principal-component values are related to the developmental state of cells), and therefore provides an unsupervised solution for determining the starting cells (this point deserves a closer look). Based on this finding, we developed a tool named VECTOR that infers vectors of developmental directions for cells in Uniform Manifold Approximation and Projection (UMAP). In seven datasets of different developmental scenarios, VECTOR correctly identifies the starting cells and successfully infers the vectors of developmental directions. VECTOR is freely available for academic use at https://github.com/jumphone/Vector. (The application examples look good, but then every paper says that.)
INTRODUCTION
Let us distill the key points here.
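The algorithms behind trajectory inference (TI) methods (Monocle, PAGA, Slingshot and so on, which everyone should know well) share two common design components:
- the use of dimensional reduction, clustering, or graph-building techniques to convert scRNA-seq data into a simplified representation of trajectory, and the ordering of cells along the trajectory. (Dimensional reduction and clustering, all very routine.)
- there may be many alternative trajectories to choose from, most TI methods require the use of prior information, such as a set of known marker genes, to determine the starting cells (SCs) of the correct trajectory. (Put plainly, cell annotation is needed to decide where development starts; trajectory analysis without cell annotation should not be taken seriously.)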
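Subjective manual selection of markers can indeed introduce large errors. Recently, a new study found that RNA velocity (velocyto does well in this respect and reduces manual intervention), the time derivative of gene expression states, could be estimated by modeling the relationship between unspliced and spliced mRNAs, making it possible to deduce the future transcriptional states of cells and consequently the developmental trajectories without the need of prior information for determining SCs (that is, developmental trajectories are inferred from the relationship between unspliced and spliced mRNA; high-impact papers use this method often). Without using any prior information, RNA velocity was used to identify a novel developmental model of neural crest lineage cells, which demonstrated its usefulness in developmental lineage analysis.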
Now for the drawback of RNA velocity:
- It needs to reanalyze raw sequencing data to determine intron reads for quantifying unspliced mRNAs, which is time-consuming and sometimes may not be possible because of the limitation of the sequencing platforms. (This hardly counts as a drawback.)
PCA is indeed a required step in single-cell analysis today. Cells at different developmental states have been shown to have distinct patterns of PC values. However, the patterns of a cell's PC values have not yet been fully explored in the current TI methods. (I have reservations about this claim.) In this study, we observed that the averaged polarization of a cell's PC values across a large number of PC subspaces is strongly correlated with their developmental states, with SCs having the most polarized PC values. (Worth noting: has anyone noticed this before? Are the PC values of starting cells really that special? We will check the Methods shortly.) We thus provided an unsupervised solution for determining the SCs based on the averaged polarization of a cell's PC values. (Determining the developmental start from PC values cannot really be called unsupervised; it has to be semi-supervised.) The authors' examples certainly look good, but we need to take some care when using the tool on our own data.
RESULTS
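The first step validates the tool's reliability on two well-annotated single-cell datasets.
When we run PCA, we usually keep the first ten or so PCs for downstream analysis, and Seurat itself computes 50 PCs by default. The authors use 150 PCs here; the rationale for this needs to be checked in the Methods.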
From the analysis of these datasets: For both oligodendrocyte and enterocyte lineages, we found that cells at earlier developmental stages tend to have more extreme PC values (either very small or very large, i.e., highly polarized; so that is what "polarization" means), while those at later developmental stages tend to have more intermediate PC values. (I had never noticed this pattern; it is worth trying on our own data.) Such patterns were more obvious if we inspected the density of the PC value quantiles at all 150 PC subspaces for cells at different developmental stages. (In the figure the pattern is indeed quite clear.)
To quantify the polarization of the PC value quantiles, we next defined a quantile polarization (QP) score that averages the polarization of the PC value quantile of a given cell across all 150 PC subspaces. (This is the definition of QP; honestly, it is the first time I have seen this approach.) The QP score then correlates strongly with the developmental hierarchy, with cells at the earliest developmental stages having the greatest QP scores.
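To make the QP idea concrete, here is a minimal sketch in plain Python. It is not the VECTOR implementation (VECTOR is an R tool, and the exact formula is not quoted in this post). The assumption made here is that the "polarization" of a quantile q is its distance from the median, rescaled to [0, 1] as 2 * |q - 0.5|. The function name `quantile_polarization` is hypothetical.

```python
def quantile_polarization(pc_values):
    """Return one QP-like score per cell.

    pc_values: list of rows, one row per cell, each row a list of PC values.
    Assumption (not from the paper): polarization of quantile q is 2 * |q - 0.5|.
    """
    n_cells = len(pc_values)
    n_pcs = len(pc_values[0])
    scores = [0.0] * n_cells
    for k in range(n_pcs):
        # Rank the cells along PC k; rank 0 is the smallest value.
        order = sorted(range(n_cells), key=lambda c: pc_values[c][k])
        for rank, cell in enumerate(order):
            quantile = rank / (n_cells - 1)  # quantile in [0, 1]
            scores[cell] += abs(quantile - 0.5) * 2.0  # 0 at median, 1 at extremes
    # Average the polarization over all PC subspaces.
    return [s / n_pcs for s in scores]


# Three cells, two PCs: cells 0 and 2 sit at the extremes of both PCs.
cells = [[-3.0, 5.0], [0.0, 0.0], [3.0, -5.0]]
print(quantile_polarization(cells))
```

Under this assumption, a cell that sits at the extreme of every PC scores 1 and a cell that sits at the median of every PC scores 0, which matches the verbal description that early cells are "highly polarized".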
We further experimented with using a different number of PCs, and found that such correlations were robust if the number of PCs used could explain ~20%–80% of the total variance.
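A quick way to check whether a chosen number of PCs falls in that 20%–80% band is sketched below. It assumes you already have the standard deviation of each PC and the total variance of the matrix that PCA was run on; both inputs, and the function name `pcs_in_variance_band`, are illustrative assumptions rather than part of VECTOR.

```python
def pcs_in_variance_band(pc_stdev, total_variance, low=0.2, high=0.8):
    """Return the PC counts whose cumulative explained variance lies in [low, high].

    pc_stdev: standard deviation of each PC, in order (PC1 first).
    total_variance: total variance of the data PCA was computed on (assumed known).
    """
    counts = []
    cumulative = 0.0
    for k, sd in enumerate(pc_stdev, start=1):
        cumulative += sd * sd / total_variance  # variance fraction of PC k
        if low <= cumulative <= high:
            counts.append(k)
    return counts


# Toy example: four PCs explaining 40%, 10%, 10% and 10% of the variance.
print(pcs_in_variance_band([2.0, 1.0, 1.0, 1.0], total_variance=10.0))
```

If the returned list is empty, the PCs supplied never reach the lower bound (or jump straight past the upper one), and the number of PCs is worth revisiting.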
Inferring trajectories directly on UMAP is something Monocle 3 also does.
In essence, VECTOR treats a two-dimensional UMAP representation of cells as an image and splits it into a number of pixels. After removing those pixels that do not include any cells, VECTOR focuses on the largest connected pixel (LCP) network in UMAP to infer developmental directions. (So the tool infers the trajectory on the UMAP plot itself.) By averaging the QP scores of cells inside each pixel, VECTOR identifies the high-scoring pixels that have the greatest QP scores (top 10% by default). (The polarization of PC values is what identifies the cells at the developmental start.) The authors also mention that this approach may produce false positives. Here, VECTOR considers not only QP scores but also the connectivity of cells in UMAP; from the high-scoring pixels, it selects the largest connected high-scoring pixels as the starting point of development. (The UMAP result is combined with the QP scores to obtain the starting cells.) Those isolated high-scoring pixels that are likely false positives are then filtered out. (There is actually a weakness here.) For each pixel in the LCP network, VECTOR computes a pseudotime score defined as its network distance to the starting point of development. (Most tools compute it this way.) Finally, for a given target pixel VECTOR computes a vector (with arrow and length) by taking into consideration the information of all pixels in the LCP network, including the direction of the unit vector pointing from a selected pixel to the target pixel, the relative pseudotime score between the target pixel and the selected pixel, and the closeness of the selected pixel to the target pixel in the LCP network, and so on. (The result is a plot similar to an RNA velocity plot.) The arrow direction is the direction of development; arrows are shorter near the start and the middle of development and longer near the end.
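The pixel-level steps described above can be sketched in plain Python as follows. This is an illustration under several assumptions, not the VECTOR code: pixels are connected through their 8 neighbours, the top 10% is taken among LCP pixels, and the vector weighting (pseudotime difference damped by Euclidean pixel distance) is a simple stand-in for the weighting the paper describes only in words. All function names are hypothetical.

```python
import math
from collections import deque


def neighbors(p):
    """The 8 pixels surrounding pixel p (assumed connectivity)."""
    i, j = p
    return [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)]


def components(pixels):
    """Connected components of a collection of pixels, found by breadth-first search."""
    pixels, seen, found = set(pixels), set(), []
    for start in sorted(pixels):
        if start in seen:
            continue
        seen.add(start)
        comp, queue = {start}, deque([start])
        while queue:
            for q in neighbors(queue.popleft()):
                if q in pixels and q not in seen:
                    seen.add(q)
                    comp.add(q)
                    queue.append(q)
        found.append(comp)
    return found


def pseudotime(pixel_qp, top_frac=0.1):
    """pixel_qp: {pixel: mean QP}. Returns ({pixel: network distance}, start pixels)."""
    lcp = max(components(pixel_qp), key=len)  # largest connected pixel network
    k = max(1, int(round(top_frac * len(lcp))))
    high = sorted(lcp, key=lambda p: pixel_qp[p], reverse=True)[:k]
    start = max(components(high), key=len)  # isolated high scorers are dropped
    dist = {p: 0 for p in start}
    queue = deque(start)
    while queue:  # multi-source BFS over the LCP network
        p = queue.popleft()
        for q in neighbors(p):
            if q in lcp and q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist, start


def direction_vector(target, dist):
    """Sum of unit vectors pointing at the target, weighted by relative pseudotime."""
    tx, ty = target
    vx = vy = 0.0
    for (sx, sy), pt in dist.items():
        dx, dy = tx - sx, ty - sy
        length = math.hypot(dx, dy)
        if length == 0:
            continue
        # Assumed weighting: pseudotime difference, damped for far-away pixels.
        weight = (dist[target] - pt) / (1.0 + length)
        vx += weight * dx / length
        vy += weight * dy / length
    return vx, vy


# A row of ten pixels with QP falling from left to right, plus one isolated pixel.
pixel_qp = {(i, 0): 1.0 - 0.1 * i for i in range(10)}
pixel_qp[(20, 20)] = 5.0  # high QP but isolated, so it is outside the LCP
dist, start = pseudotime(pixel_qp)
print(start, dist[(9, 0)], direction_vector((5, 0), dist))
```

In this toy layout the isolated high-QP pixel never becomes the starting point because it is not part of the LCP network, which is the false-positive filtering the paragraph describes.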
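Application examples
The two annotated datasets above perform well: the developmental starting point and the trajectory are both identified successfully.
Applied to other example datasets, the results are also good.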
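VECTOR compared with RNA velocity (velocyto)
VECTOR performs better; the RNA velocity result is truncated, which may be caused by the lack of intron reads in these cells. Velocyto also has difficulty identifying the developmental starting point.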
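Next, the tool is applied to data with multiple developmental branches.
The results are good. The tool also provides a way to select the developmental starting point manually.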
METHODS
The workflow of VECTOR
Given a two-dimensional UMAP representation of cells, VECTOR treats it as an image, and then splits it into a number of pixels. We provide a parameter called "N" for defining the number of pixels in UMAP.
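As an illustration of the "UMAP as an image" step, the sketch below bins cells into a grid and averages a per-cell score inside each occupied pixel. It assumes that N is the number of bins along each axis, which the quoted text does not spell out; the function name `pixelate` and the parameter `n` are hypothetical, not VECTOR's API.

```python
def pixelate(umap_xy, qp_scores, n=30):
    """Bin cells into an n x n grid and return {pixel: mean score} for occupied pixels.

    umap_xy: list of (x, y) UMAP coordinates, one per cell.
    qp_scores: one score per cell, in the same order.
    n: bins per axis (assumed meaning of the paper's "N").
    """
    xs = [p[0] for p in umap_xy]
    ys = [p[1] for p in umap_xy]
    x0, y0 = min(xs), min(ys)
    x_span = (max(xs) - x0) or 1.0  # avoid division by zero on a flat axis
    y_span = (max(ys) - y0) or 1.0
    sums, counts = {}, {}
    for (x, y), score in zip(umap_xy, qp_scores):
        # Map the coordinate to a bin index; the maximum falls into the last bin.
        i = min(int((x - x0) / x_span * n), n - 1)
        j = min(int((y - y0) / y_span * n), n - 1)
        sums[(i, j)] = sums.get((i, j), 0.0) + score
        counts[(i, j)] = counts.get((i, j), 0) + 1
    # Pixels with no cells never appear in the result, so they are dropped implicitly.
    return {p: sums[p] / counts[p] for p in sums}


coords = [(0.0, 0.0), (0.1, 0.1), (10.0, 10.0)]
print(pixelate(coords, [0.2, 0.4, 0.9], n=2))
```

A larger `n` gives finer pixels but more empty ones, so the connected network of occupied pixels can break apart; a smaller `n` merges distinct cell groups into the same pixel.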
The method involves not only data processing but also image-processing concepts.
Give it a try.
Life is good, and even better with you.