PHATE for Dimensionality Reduction of 10X Single-Cell Data

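Many dimensionality-reduction methods are now available for single-cell data (PCA, t-SNE, UMAP). There is no need to try them one by one. It is better to master a few key tools, understand their principles and code in depth, and let the tools complement one another so that the analysis goal is met.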

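Today I am sharing one method, from the paper "Visualizing structure and transitions in high-dimensional biological data". The journal's impact factor is a little over 36, which is quite high. Our task today is to work through the paper and share the code. Please study it carefully and grasp the essentials rather than simply copying the code.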

The paper:

1. Abstract:

The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data. (Not much to see in this part; the authors are simply praising their own software.)

2. Introduction

To begin with, single-cell data really do need good visualization tools. Existing ones include principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP); UMAP is probably the most widely used today. However, these methods are suboptimal for exploring high-dimensional biological data. The reasons are:
(1) Such methods tend to be sensitive to noise. (Have you looked into denoising and droplet analysis for single-cell data?) Methods like PCA and Isomap fail to explicitly remove this noise for visualization, rendering fine-grained local structure impossible to recognize. (Take note of this; PCA really does have this problem.)
(2) Nonlinear visualization methods such as t-SNE often scramble the global structure in data. (The global structure is not accurate enough, which is why UMAP is now used more often.)
(3) Many dimensionality-reduction methods (for example, PCA and diffusion maps) fail to optimize for two-dimensional (2D) visualization as they are not specifically designed for visualization. (Those who have taken my course should find this familiar! 😄)
(4) Common implementations of dimensionality-reduction methods often lack computational scalability. State-of-the-art methods such as multidimensional scaling (MDS) and t-SNE were originally presented as proofs-of-concept with somewhat naive implementations, which do not scale well to datasets with hundreds of thousands, let alone millions, of data points owing to speed or memory constraints. (Have you looked into this? Once again: do not just copy code; think critically.)
(5) Some methods try to alleviate visualization challenges by directly imposing a fixed geometry or intrinsic structure on the data. However, methods that impose a structure on the data generally have no way of alerting the user whether the structural assumption is correct. (Many newer tools have already addressed this.) The authors give an example: any data will be transformed to fit a tree with Monocle2 (ref. 12) or clusters with t-SNE. While such methods are useful for data that fit their prior assumptions, they can generate misleading results otherwise, and are often ill suited for hypothesis generation or data exploration. (This should sound familiar: now you see why clustering and Monocle2 results are so often unsatisfactory!)
Next come the advantages of PHATE, which we only skim: PHATE provides an accurate, denoised representation of both local and global structure of a dataset in the required number of dimensions without imposing any strong assumptions on the structure of the data, and is highly scalable both in memory and runtime.

[Figure]

3. Results

Let us first go over some basics:
(1)t-SNE focuses on preserving local structure, often at the expense of the global structure
(2)PCA focuses on preserving global structure at the expense of the local structure
(3) Although PCA is often used for denoising as a preprocessing step, both PCA and t-SNE provide noisy visualizations when the data is noisy, which can obscure the structure of the data. (Make sure you understand this point; otherwise you will finish an analysis without knowing whether it is right or wrong.)
(4) By contrast, diffusion maps effectively denoise data and learn the local and global structure. However, diffusion maps typically encode this information in higher dimensions, which are not amenable to visualization, and can introduce distortions in the visualization under certain conditions. (Diffusion maps were covered in an earlier lesson.)


[Figure]

Here is the key point: PHATE is designed to overcome these weaknesses and provide a visualization that preserves the local and global structure of the data, denoises the data and presents as much information as possible into low dimensions.


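[Figure]

Let us look at the main steps:

(1) Encode local data information via local similarities (local structure). The distance used here is still Euclidean distance. (I covered how distances are defined in R in class; you should know these basics.)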


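[Figure]

(2) Encode global relationships in data using the potential distance. This step builds on the same diffusion process as diffusion maps, which I also covered in class: the local similarities are normalized into a Markov transition matrix, the matrix is powered t times, and potential distances are computed between the log-transformed transition probabilities.
(3) Embed potential distance information into low dimensions for visualization (low-dimensional visualization). This ensures that all variability is squeezed into two dimensions for a maximally informative embedding.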
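To make the three steps above concrete, here is a minimal NumPy sketch on toy data. It is an illustration under stated assumptions, not the phate package's implementation: the function names `pairwise_euclidean` and `phate_sketch` are made up for this example, the diffusion time t is fixed by hand, everything is dense, and classical MDS stands in for the final embedding step.

```python
import numpy as np

def pairwise_euclidean(A):
    """Dense Euclidean distance matrix between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0))

def phate_sketch(X, knn=5, decay=15, t=10, n_components=2):
    """Toy PHATE-like embedding of a small dense matrix X (cells x features)."""
    # Step 1: local similarities from Euclidean distances (alpha-decay kernel).
    D = pairwise_euclidean(X)
    eps = np.sort(D, axis=1)[:, knn]          # distance to the knn-th neighbor (column 0 is the cell itself)
    K = np.exp(-(D / eps[:, None]) ** decay)
    K = (K + K.T) / 2                         # make the affinities symmetric

    # Step 2: diffusion and potential distance.
    P = K / K.sum(axis=1, keepdims=True)      # row-stochastic Markov matrix
    Pt = np.linalg.matrix_power(P, t)         # t-step transition probabilities
    U = -np.log(Pt + 1e-7)                    # log potential (small offset avoids log(0))
    pot = pairwise_euclidean(U)               # potential distance between cells

    # Step 3: embed the potential distances (classical MDS as a stand-in).
    n = pot.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (pot ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 5))              # hypothetical toy data, not the EB dataset
Y_toy = phate_sketch(X_toy, knn=5, decay=15, t=5)
print(Y_toy.shape)
```

For real data, use the phate package itself; the sketch is only meant to show where the Euclidean distances, the diffusion step and the final embedding fit together.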


[Figure]

Analysis strategy recommended by the paper

Here we present new methods that provide suggested end points, branch points and branches on the basis of the information from higher-dimensional PHATE embeddings. (This is an analysis of the data structure; as you can see, the structure resembles the tree produced by Monocle2.)
(1) Branch-point identification with local intrinsic dimensionality. See the figure below for how branch points are defined. Branch points often encapsulate switch-like decisions where cells sharply veer towards one of a small number of fates.


[Figure]

[Figure]

(2) End-point identification with diffusion extrema. (This tool even identifies end points, which puts it on a par with URD.) We identify end points in the PHATE embedding as those that are least central and most distinct by computing the eigenvector centrality and the distinctness of a cellular state relative to the general data by considering the minima and maxima of diffusion eigenvectors as motivated by ref. If you are interested, this part is worth studying closely. Identifying branch points and end points, and assigning cells to the trajectory, demands a lot of prior knowledge, which in turn should mean more accurate results. Let us look at the result of the assignment:


[Figure]

It looks much like a force-directed layout.
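The "diffusion extrema" idea behind end-point identification can be sketched roughly as follows. This is a simplified illustration, not the paper's full procedure: it leaves out the eigenvector-centrality part, uses a plain Gaussian kernel with a hand-picked bandwidth `sigma`, and the function name `diffusion_extrema` is made up for this example. On points laid out along a line, the extrema of the first nontrivial diffusion eigenvector fall on the two ends.

```python
import numpy as np

def diffusion_extrema(X, sigma=0.1, n_vectors=1):
    """Indices of the minima and maxima of the leading nontrivial diffusion eigenvectors."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    K = np.exp(-D2 / (2 * sigma ** 2))           # Gaussian affinities (assumed kernel)
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))              # symmetric matrix with the same spectrum as D^-1 K
    vals, vecs = np.linalg.eigh(A)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]               # largest eigenvalue first
    psi = vecs[:, order] / np.sqrt(d)[:, None]   # right eigenvectors of the Markov matrix
    ends = set()
    for j in range(1, n_vectors + 1):            # j = 0 is the constant eigenvector, skip it
        ends.add(int(np.argmin(psi[:, j])))
        ends.add(int(np.argmax(psi[:, j])))
    return sorted(ends)

# Hypothetical toy trajectory: 50 cells evenly spaced along a straight line.
x = np.linspace(0, 1, 50)
X_line = np.column_stack([x, np.zeros_like(x)])
print(diffusion_extrema(X_line))
```

With branching data, more eigenvectors (`n_vectors` > 1) would be needed to pick up the tips of the additional branches.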

Comparison between tools

A quick look at this part is enough.


[Figure]

Looking at the results, PHATE is of course the most accurate. In my view this is to be expected in theory, because PHATE calls for more manual supervision. PHATE had the highest DEMaP score in 22 of 24 comparisons and was the top-performing method overall. Uniform manifold approximation and projection (UMAP) was the second best performing method overall but had the highest DEMaP score in only two of the comparisons, one of which is equal with PHATE. (This is where UMAP holds its own.)
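For readers wondering what a DEMaP score measures: in the paper it is the Spearman correlation between geodesic distances on the noiseless data and Euclidean distances in the embedding obtained from the noisy data. The sketch below is a small-scale approximation under my own assumptions: geodesics come from a kNN graph with Floyd-Warshall (fine only for small n), ties in the ranking are ignored, and the function names are made up for this example.

```python
import numpy as np

def pairwise_dist(A):
    """Dense Euclidean distance matrix between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0))

def geodesic_dist(X, knn=10):
    """Shortest-path distances on a symmetric kNN graph (Floyd-Warshall, small n only)."""
    D = pairwise_dist(X)
    n = D.shape[0]
    G = np.full((n, n), np.inf)
    nbrs = np.argsort(D, axis=1)[:, :knn + 1]    # each point plus its knn nearest neighbors
    rows = np.repeat(np.arange(n), knn + 1)
    G[rows, nbrs.ravel()] = D[rows, nbrs.ravel()]
    G = np.minimum(G, G.T)                       # keep an edge if either endpoint selected it
    np.fill_diagonal(G, 0.0)
    for k in range(n):                           # relax all paths through node k
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    return G

def spearman(a, b):
    """Spearman correlation via ranks (ties are not averaged in this sketch)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def demap_sketch(X_clean, embedding, knn=10):
    """Correlate geodesic distances on clean data with distances in the embedding."""
    iu = np.triu_indices(X_clean.shape[0], k=1)  # each pair of points once
    return spearman(geodesic_dist(X_clean, knn)[iu], pairwise_dist(embedding)[iu])

# Hypothetical toy data: points on a half circle, scored against two candidate embeddings.
rng = np.random.default_rng(0)
theta = np.sort(rng.uniform(0, np.pi, 60))
X_clean = np.column_stack([np.cos(theta), np.sin(theta)])
unrolled = theta[:, None]                        # embedding that unrolls the arc
scrambled = rng.normal(size=(60, 2))             # embedding unrelated to the data
print(demap_sketch(X_clean, unrolled, knn=5), demap_sketch(X_clean, scrambled, knn=5))
```

A score near 1 means the embedding preserves the distances along the underlying manifold; in the paper's benchmark the embedding being scored is the one computed from the noisy data.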

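Comparison of embeddings produced by different methods
[Figure]

PHATE provides a clean and relatively denoised visualization of the data that highlights both the local and global structure. The paper goes on to present further analysis results, but these follow the usual pattern and a quick read is enough.

To sum up in one sentence: the problem PHATE addresses is making the low-dimensional visualization correspond to the intrinsic relationships among the cells themselves. By the paper's benchmark, PHATE does this best and UMAP comes second.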

Next, let us look at the code:

Load modules

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import phate
import scprep
import sklearn.decomposition # PCA
import sklearn.manifold # t-SNE
import umap

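Data loading, quality control and the like are not covered here; we look only at the PHATE embedding and visualization:

phate_operator = phate.PHATE()  # create the operator with default settings
phate_operator.set_params(knn=4, decay=15, t=12)  # set neighbors, kernel decay and diffusion time
Y_phate = phate_operator.fit_transform(EBT_counts)  # EBT_counts: the preprocessed cells-by-genes matrix
Let us take a closer look at the parameters: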
    knn : Number of nearest neighbors (default: 5). Increase this (e.g. to 20) if your PHATE embedding appears very disconnected. You should also consider increasing knn if your dataset is extremely large (e.g. >100k cells)
    decay : Alpha decay (default: 15). Decreasing decay increases connectivity on the graph, increasing decay decreases connectivity. This rarely needs to be tuned. Set it to None for a k-nearest neighbors kernel.
    t : Number of times to power the operator (default: 'auto'). This is equivalent to the amount of smoothing done to the data. It is chosen automatically by default, but you can increase it if your embedding lacks structure, or decrease it if the structure looks too compact.
    gamma : Informational distance constant (default: 1). gamma=1 gives the PHATE log potential, but other informational distances can be interesting. If most of the points seem concentrated in one section of the plot, you can try gamma=0.
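To get a feel for what decay does, the sketch below evaluates the alpha-decay kernel from the PHATE paper, exp(-(d/eps)^decay), where eps is the distance to the knn-th neighbor. The function name `alpha_decay_affinity` and the chosen distance ratios are made up for this illustration; the phate package computes its graph internally and does more than this.

```python
import numpy as np

def alpha_decay_affinity(dist, eps, decay):
    """Affinity exp(-(dist/eps)**decay); eps is the distance to the knn-th neighbor."""
    return np.exp(-(np.asarray(dist, dtype=float) / eps) ** decay)

# Distances expressed as multiples of the knn-th neighbor distance (eps = 1).
ratios = np.array([0.5, 1.0, 1.5, 2.0])
for decay in (2, 15, 40):
    print(decay, np.round(alpha_decay_affinity(ratios, 1.0, decay), 4))
```

Beyond the knn-th neighbor (ratio above 1) a small decay still leaves noticeable affinity, so the graph is more connected, while a large decay cuts the affinity off almost immediately and behaves more like a plain k-nearest-neighbors kernel. Increasing knn enlarges eps and therefore widens the kernel for every cell.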

If, as the paper claims, PHATE can learn and maintain local and global distances in low-dimensional space, then its visualization would be preferable to UMAP's and the most suitable choice.


[Figure]