[Literature Reading] REViGO

Paper: REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms

Publication dates: Received March 2, 2011; Accepted June 7, 2011; Published July 18, 2011

Journal: PLoS ONE

Citations: 1980

DOI: https://doi.org/10.1371/journal.pone.0021800

Background / Abstract

Original text:

Outcomes of high-throughput biological experiments are typically interpreted by statistical testing for enriched gene functional categories defined by the Gene Ontology (GO). The resulting lists of GO terms may be large and highly redundant, and thus difficult to interpret. REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures. Furthermore, REVIGO visualizes this non-redundant GO term set in multiple ways to assist in interpretation: multidimensional scaling and graph-based visualizations accurately render the subdivisions and the semantic relationships in the data, while treemaps and tag clouds are also offered as alternative views. REVIGO is freely available at http://revigo.irb.hr/.

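Key points:

The results of GO enrichment analysis are often highly redundant, which makes them hard to interpret.
REViGO picks a representative subset of the GO terms using a simple clustering algorithm that relies on semantic similarity measures, thereby removing redundancy from GO enrichment results.
REViGO visualizes this non-redundant set to help interpret the subdivisions and the semantic relationships in the data.
Treemaps and tag clouds are also offered as alternative views.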

Introduction

Para 1: general background on GO enrichment analysis; skipped here.

Para 2: how semantic redundancy affects the results

As high-throughput techniques become cheaper and more accurate, they detect even slight changes in gene expression or other measured properties. The lists of relevant genes will grow in size, and so will the derived lists of GO terms. Additionally, the redundancy in the resulting set of GO terms confounds interpretation and inflates the perceived number of biologically relevant results. This is frequently the case when analyzing terms in a parent-child relationship, e.g. the parent term ‘‘GO:0009058 biosynthetic process’’ fully encompasses its child term ‘‘GO:0008610 lipid biosynthetic process’’. In a list of terms enriched with overexpressed genes, if the child term has highly statistically significant enrichment, the parent term might appear significantly enriched purely as a consequence of including all the genes from the child term.

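The most common example: GO:0009058 biosynthetic process is the parent of GO:0008610 lipid biosynthetic process. If the genes in the lipid biosynthetic process term are very significantly enriched, the parent term biosynthetic process may come out as significantly enriched purely because it contains all of the child term's genes.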
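To make the parent-child effect concrete, here is a minimal sketch using a one-sided hypergeometric test. The choice of test and all gene counts are assumptions made up for illustration; the paper does not prescribe them. The parent term contains every gene of the child term plus 40 others, and almost all of its hits come from the child.

```python
from math import comb


def enrichment_pvalue(universe, term_size, hits, hits_in_term):
    """P(X >= hits_in_term) under the hypergeometric distribution."""
    upper = min(term_size, hits)
    favourable = sum(
        comb(term_size, k) * comb(universe - term_size, hits - k)
        for k in range(hits_in_term, upper + 1)
    )
    return favourable / comb(universe, hits)


# Made-up numbers: 1000 genes in total, 30 of them overexpressed.
UNIVERSE, HITS = 1000, 30

# Child term: 20 genes, 15 of them overexpressed.
p_child = enrichment_pvalue(UNIVERSE, 20, HITS, 15)
# Parent term: the child's 20 genes plus 40 others, only 1 of which is a hit.
p_parent = enrichment_pvalue(UNIVERSE, 60, HITS, 16)
# The 40 parent-only genes considered on their own.
p_parent_only = enrichment_pvalue(UNIVERSE, 40, HITS, 1)

print(f"child: {p_child:.2e}  parent: {p_parent:.2e}  "
      f"parent-only genes: {p_parent_only:.2f}")
```

In this toy setup the parent looks significant even though the genes it does not share with the child show no enrichment, which is the kind of redundancy REVIGO tries to remove.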

Para 3: introduces some existing redundancy-reduction tools (GOrilla, RedundancyMiner); not covered here.
Para 4: GO slims and their pros and cons

In the same vein, researchers may attempt to simplify long GO term lists by replacing the full Gene Ontology with ‘‘GO Slims’’, cut-down versions of the Gene Ontology. The GO slims are, however, limited to general (high-level) GO terms which are typically less interesting than the more fine-grained terms – the ones that have been removed from the GO slims. Thus, the problem of weeding out the redundant GO terms is not easily solved by removing the GO terms’ descendants (or ancestors) in this manner. The complex structure of the GO warrants a solution that takes into account the terms’ proximity in the GO graph, quantified by the GO term ‘semantic similarity’ measures [8].

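GO slims can be thought of as cut-down versions of the Gene Ontology. For details, see: GO slim, QuickGO, geneontology-go-subset

GO slims do reduce redundancy to some extent, but extracting a subset from the full GO drops many fine-grained terms, and those are typically more interesting than the high-level terms that remain.
In short, GO slims cut the workload considerably on the surface but discard much of the more important information. CJ has also discussed this in our group chat and does not blindly recommend GO slims.
The authors argue that the complex structure of GO calls for a solution that accounts for how close terms are in the GO graph, and they propose quantifying this with semantic similarity measures.

Para 5: what REViGO does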

We have implemented a computational approach that (a) summarizes long GO lists by reducing functional redundancies, and (b) visualizes the remaining GO terms in two-dimensional plots, interactive graphs, treemaps or tag clouds. Both the summarization and the visualization step draw on the concept of GO term semantic similarity, reviewed in [8]. In particular, several common measures of semantic similarity [9] that employ the ‘most informative common ancestor’ approach are supported. The implementation is freely available as the REVIGO Web server at http://revigo.irb.hr/.

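First, it summarizes long, complex GO term lists by reducing functional redundancy.
Second, it visualizes the remaining GO terms in several ways (two-dimensional plots, interactive graphs, treemaps, tag clouds).
Website: http://revigo.irb.hr/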
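As an illustration of a 'most informative common ancestor' measure, here is a minimal sketch of SimRel as I understand its definition (Schlicker et al.): the maximum over common ancestors c of 2·log p(c) / (log p(a) + log p(b)) · (1 − p(c)). The mini-ontology and the annotation frequencies below are invented for the example; only the two GO term names come from the paper.

```python
from math import log

# Hypothetical mini-ontology: term -> set of its ancestors, including itself.
ANCESTORS = {
    "root": {"root"},
    "biosynthetic process": {"biosynthetic process", "root"},
    "lipid biosynthetic process": {
        "lipid biosynthetic process", "biosynthetic process", "root",
    },
    "unrelated process": {"unrelated process", "root"},
}

# Made-up annotation frequencies p(term); the root covers everything.
FREQ = {
    "root": 1.0,
    "biosynthetic process": 0.1,
    "lipid biosynthetic process": 0.01,
    "unrelated process": 0.05,
}


def sim_rel(a, b):
    """SimRel-style similarity of two terms in the toy ontology above."""
    denom = log(FREQ[a]) + log(FREQ[b])
    if denom == 0:  # both terms are the root, which carries no information
        return 0.0
    common = ANCESTORS[a] & ANCESTORS[b]
    return max(2 * log(FREQ[c]) / denom * (1 - FREQ[c]) for c in common)


print(sim_rel("lipid biosynthetic process", "biosynthetic process"))
print(sim_rel("lipid biosynthetic process", "unrelated process"))
```

Terms whose only shared ancestor is the root get a similarity of zero, while a parent-child pair scores high because the shared ancestor is itself informative.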

Results and Discussion

A simple algorithm to reduce redundancy within lists of GO terms

To mitigate the problem of large and redundant lists, we aim to find a single representative GO term for each of these clusters. REVIGO performs a simple clustering procedure which is in concept similar to the hierarchical (agglomerative) clustering methods such as the neighbor joining approach [10]. A flowchart of the steps in the algorithm is given in Fig. 1.

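REViGO first clusters the redundant list of GO terms and then picks one representative GO term for each cluster. The simple clustering procedure is conceptually similar to hierarchical (agglomerative) methods such as the neighbor joining approach.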

The intuition behind this procedure is to form groups of highly similar GO terms, where the choice of the groups’ representatives is guided by the p-values, enrichments or similar values that the user supplies alongside the GO terms (Fig. 1).

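In other words, highly similar GO terms are grouped together, and the choice of each group's representative is guided by the p-values, enrichments or similar values supplied by the user.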

If the p-values are quite close and one term is a child node of the other, REVIGO will tend to choose the parent term, with a possible exception when the terms are deemed to be de facto equivalent (Fig. 1, see caption). Note that REVIGO generally does not prioritize higher-level or lower-level GO terms as cluster representatives – instead, the user-supplied p-values/enrichments are used to guide the selection, if possible.

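If the p-values are close and one term is a child node of the other, REVIGO tends to choose the parent term, with a possible exception when the two terms are deemed de facto equivalent.
REVIGO generally does not prioritize higher-level or lower-level GO terms as cluster representatives; where possible, the user-supplied p-values or enrichments guide the selection.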

Very general GO terms, however, are always avoided as cluster representatives (Fig. 1) as they tend to be uninformative. It is also possible to manually override the choice of the representative GO term using the ‘pin’ option in case the default solution is not satisfactory for the user e.g. when a more general, higher-level term is desired to represent the group.
The user does not necessarily need to provide previously determined p-values or another numerical value alongside the GO terms. In that case, REVIGO will prioritize the terms with higher ‘uniqueness’, the negative of average similarity of a term to all other terms.

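Very general GO terms are always avoided as cluster representatives because they are uninformative (for example, something like "catalytic reaction"). The 'pin' option lets you override the choice of representative manually, e.g. when you want a more general, higher-level GO term to represent the cluster.
P-values are not required either: without them, REVIGO prioritizes terms with higher uniqueness.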
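The uniqueness fallback can be sketched directly from the definition quoted above (the negative of a term's average similarity to all other terms). The three term labels and their similarity values are hypothetical placeholders, and REVIGO's own implementation may scale the value differently.

```python
def uniqueness(term, all_terms, similarity):
    """Negative of the average similarity of `term` to every other term."""
    others = [t for t in all_terms if t != term]
    return -sum(similarity(term, t) for t in others) / len(others)


# Hypothetical pairwise similarities between three placeholder terms.
TOY_SIMILARITY = {
    frozenset({"A", "B"}): 0.8,
    frozenset({"A", "C"}): 0.1,
    frozenset({"B", "C"}): 0.1,
}


def toy_similarity(a, b):
    return TOY_SIMILARITY[frozenset({a, b})]


TERMS = ["A", "B", "C"]
scores = {t: uniqueness(t, TERMS, toy_similarity) for t in TERMS}
# Terms with higher uniqueness would be preferred as representatives.
ranked = sorted(TERMS, key=scores.get, reverse=True)
print(ranked, scores)
```

Here the term "C" ranks first because it is the least similar to everything else, so it would be preferred when no p-values are supplied.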

The terms that remain in the list after the algorithm has finished are the cluster representatives, where it is guaranteed that no two representatives will be more similar than a user-provided cutoff value C. In other words, a lower (more stringent) value of C will result in a shorter, but also a more semantically diverse list. To offer some bearing on the relationship of C to statistical significance, we conducted a simulation where we drew random pairs of GO terms and recorded the distribution of the SimRel semantic similarity measure [11] (default in REVIGO). One percent of randomly generated GO term pairs have SimRel > 0.53. Therefore, at C = 0.53 there is a 99% chance an above-background similarity exists between each pair of terms in a cluster. REVIGO offers four pre-defined values of C (0.9, 0.7, 0.5 and 0.4) to the user. The lowest value of C = 0.4 – corresponding to the ‘‘tiny’’ list size – should be used with caution, as many GO terms might be removed from the list without strong statistical support for their redundancy with respect to other terms. The values of C = 0.7 (default) and 0.9 are much more conservative in this respect, but may not shorten the list enough.

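This introduces a user-provided cutoff C: once the algorithm has finished, no two representative GO terms are more similar to each other than C. In other words, the lower (more stringent) C is, the shorter and more semantically diverse the list.
A simulation over random pairs of GO terms shows that 1% of random pairs have SimRel > 0.53, so at C = 0.53 there is a 99% chance that the similarity between each pair of terms in a cluster is above background.
C = 0.4 (the "tiny" list size) should be used with caution: many GO terms may be removed without strong statistical support for their redundancy.
C = 0.9 and C = 0.7 (default) are more conservative choices, but may not shorten the list enough.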
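The 0.53 figure is an empirical quantile: the similarity value that only 1% of random term pairs exceed. A minimal sketch of that computation follows. The function name is my own, and the sample values are evenly spaced placeholders rather than real SimRel scores.

```python
def similarity_exceeded_by(samples, top_percent=1):
    """Return the value that roughly `top_percent` % of the samples exceed.

    Assumes 0 < top_percent < 100 and a non-empty list of samples.
    """
    ordered = sorted(samples)
    n_above = len(ordered) * top_percent // 100
    return ordered[len(ordered) - n_above - 1]


# Placeholder "similarities of random pairs": 0.00, 0.01, ..., 0.99.
samples = [i / 100 for i in range(100)]
cutoff = similarity_exceeded_by(samples, top_percent=1)
n_above = sum(s > cutoff for s in samples)
print(cutoff, n_above)
```

With real data, the samples would be SimRel values for randomly drawn GO term pairs, and the returned value would play the role of the 0.53 reported in the paper.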

Figure1 A flowchart describing the REVIGO algorithm to remove redundant GO terms from the provided GO term list.

Figure1

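For all GO terms in the input list, compute the pairwise semantic similarities.
Find the two most similar terms, t_i and t_j.
If the similarity of t_i and t_j is below the cutoff C, stop.
If the similarity of t_i and t_j is above the cutoff C, one of them has to be removed according to the following rules:
- If one of the terms has a very broad interpretation (frequency > 5%), reject that term, go back to the first step, and look for the next most similar pair among the terms that remain.
- Otherwise (neither is a very general term), compare the p-values of the two terms: if they differ clearly, reject the term with the less significant p-value; the other term stays in the list and goes back into the loop.
- If the p-values are close, check whether t_i and t_j are in a parent-child relationship. If so, reject the child and keep the parent (with a possible exception when the two terms are de facto equivalent); the parent goes back into the loop.
- If not, reject one of the two at random; the other goes back into the loop.
- The terms that are left when the loop stops are the cluster representatives.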
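The steps above can be sketched as a greedy loop. This is only an approximation of the flowchart and rests on several assumptions: the rule for "clearly different" p-values (here a ratio `p_ratio`) is my own placeholder, the exception for de facto equivalent parent and child terms is left out, and all p-values and frequencies in the toy data are made up. The three similarity values 0.72, 0.74 and 0.62 are the ones quoted in the paper's example; every pair the text does not mention defaults to 0.0, although the paper reports that the highest remaining similarity is 0.40.

```python
import random
from itertools import combinations


def reduce_redundancy(terms, similarity, is_parent, cutoff=0.7,
                      broad_freq=5.0, p_ratio=2.0, seed=0):
    """Return the set of terms left after greedy redundancy removal.

    terms      -- dict: term name -> {"p": p-value, "freq": frequency in %}
    similarity -- function (a, b) -> semantic similarity in [0, 1]
    is_parent  -- function (a, b) -> True if a is a parent of b
    """
    rng = random.Random(seed)
    remaining = set(terms)
    while len(remaining) > 1:
        # Most similar pair among the terms still in the list.
        a, b = max(combinations(sorted(remaining), 2),
                   key=lambda pair: similarity(*pair))
        # Stop once no pair is at least as similar as the cutoff C.
        if similarity(a, b) < cutoff:
            break
        broad_a = terms[a]["freq"] > broad_freq
        broad_b = terms[b]["freq"] > broad_freq
        p_a, p_b = terms[a]["p"], terms[b]["p"]
        if broad_a != broad_b:
            rejected = a if broad_a else b       # drop the very general term
        elif max(p_a, p_b) > p_ratio * min(p_a, p_b):
            rejected = a if p_a > p_b else b     # drop the less significant
        elif is_parent(a, b):
            rejected = b                         # keep the parent
        elif is_parent(b, a):
            rejected = a
        else:
            rejected = rng.choice([a, b])        # no other criterion applies
        remaining.discard(rejected)
    return remaining


# Toy data: p-values and frequencies are invented for illustration.
TERMS = {
    "telencephalon development": {"p": 1e-5, "freq": 0.1},
    "cerebral cortex neuron differentiation": {"p": 1e-3, "freq": 0.1},
    "negative regulation of glial cell differentiation": {"p": 1e-6, "freq": 0.1},
    "astrocyte differentiation": {"p": 1e-3, "freq": 0.1},
    "negative regulation of neuron differentiation": {"p": 1e-2, "freq": 0.1},
    "regulation of dopamine metabolism": {"p": 1e-4, "freq": 0.1},
    "sensory perception of chemical stimulus": {"p": 1e-4, "freq": 0.1},
}
STATED = {
    frozenset({"cerebral cortex neuron differentiation",
               "telencephalon development"}): 0.72,
    frozenset({"astrocyte differentiation",
               "negative regulation of glial cell differentiation"}): 0.74,
    frozenset({"negative regulation of neuron differentiation",
               "negative regulation of glial cell differentiation"}): 0.62,
}


def toy_similarity(a, b):
    return STATED.get(frozenset({a, b}), 0.0)


representatives = reduce_redundancy(
    TERMS, toy_similarity, is_parent=lambda a, b: False, cutoff=0.5)
print(sorted(representatives))
```

With C = 0.5 this leaves four representatives, two of which are singletons, which matches the outcome the paper describes for its seven-term example.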

Visualization in scatterplots and interactive graphs

In drawing scatterplots (Fig. 2), the challenge lies in assigning x and y coordinates to each term so that more semantically similar GO terms are also closer in the plot. Here, we employ a multidimensional scaling procedure which initially places the terms using an eigenvalue decomposition of the terms’ pairwise distance matrix. This is followed by a stress minimization step which iteratively improves the agreement between the GO terms’ semantic similarities and their closeness in the displayed two-dimensional space. The GO terms’ and associated data (term descriptions, p-values/enrichments, uniqueness, etc.) can be exported to a convenient text table and downloaded.

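Scatterplots first: each GO term is assigned x and y coordinates such that semantically more similar GO terms end up closer together in the plot. The paper does this with multidimensional scaling (an initial placement from an eigenvalue decomposition of the pairwise distance matrix, then iterative stress minimization); I could not follow the details ಥ_ಥ, but the goal is exactly the one just described. The GO terms and their associated values can be exported from the website as a text table and downloaded.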
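To demystify the stress minimization step a little, here is a small sketch that places points in 2D by plain gradient descent on the raw stress, i.e. the sum of squared differences between plotted and target distances. It departs from the paper in two labeled ways: the starting layout is random rather than derived from an eigenvalue decomposition, and the target distances (taken as 1 minus a made-up similarity) are placeholders.

```python
import math
import random


def stress(coords, dist):
    """Sum of squared differences between plotted and target distances."""
    n = len(coords)
    return sum(
        (math.dist(coords[i], coords[j]) - dist[i][j]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )


def minimize_stress(dist, steps=500, lr=0.05, seed=0):
    """Gradient descent on the stress, starting from a random 2D layout."""
    rng = random.Random(seed)
    n = len(dist)
    coords = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = coords[i][0] - coords[j][0]
                dy = coords[i][1] - coords[j][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                g = 2 * (d - dist[i][j]) / d
                grads[i][0] += g * dx
                grads[i][1] += g * dy
        for i in range(n):
            coords[i][0] -= lr * grads[i][0]
            coords[i][1] -= lr * grads[i][1]
    return coords


# Placeholder distances for three terms: 0 and 1 are similar, 2 is far away.
DIST = [
    [0.0, 0.28, 0.8],
    [0.28, 0.0, 0.8],
    [0.8, 0.8, 0.0],
]
coords = minimize_stress(DIST)
print(coords, stress(coords, DIST))
```

After the iterations, the two similar terms should sit close together and the third one apart from both, which is the property the scatterplot relies on.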

figure2

REVIGO also allows the user to make a graph-based visualization (Fig. 3). Each of the GO terms is a node in the graph, and 3% of the strongest GO term pairwise similarities are designated as edges in the graph. The threshold value of 3% was derived empirically; we found it strikes a good balance between over-connected graphs with no visible subgroups on the one hand, and very fragmented graphs with too many small groups on the other hand. The placement of the nodes is determined by the ForceDirected layout algorithm as implemented in Cytoscape Web [12]. In addition to being viewed in the Web browser, the graph may be exported to a XGMML file, or opened in the standalone Cytoscape program [13] via Java Web Start to produce high resolution, publication-quality images. Both visualizations indicate the generality of the GO terms by the bubble radius, where smaller bubbles imply more specific terms; the user-supplied p-values/ enrichments are shown using color shading.

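Nodes: all of the GO terms.
Edges: the strongest 3% of the pairwise similarities between GO terms; the 3% threshold was derived empirically.
Bubble size reflects the generality of the GO term: the more specific the term, the smaller the bubble; the more general, the larger.
Color reflects the user-supplied p-values or enrichments.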
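The edge rule can be sketched as follows: rank all term pairs by similarity and keep the top 3% as edges. The rounding rule and the minimum of one edge are my own assumptions, and the integer "terms" with a distance-based similarity are placeholders.

```python
from itertools import combinations


def strongest_edges(terms, similarity, fraction=0.03):
    """Keep the given fraction of term pairs with the highest similarity."""
    pairs = sorted(combinations(terms, 2),
                   key=lambda pair: similarity(*pair), reverse=True)
    keep = max(1, round(len(pairs) * fraction))
    return pairs[:keep]


def toy_similarity(a, b):
    # Placeholder: terms with neighbouring indices count as more similar.
    return 1 / (1 + abs(a - b))


terms = list(range(20))  # 20 placeholder terms -> 190 possible pairs
edges = strongest_edges(terms, toy_similarity)
print(len(edges), edges)
```

For 20 terms this keeps 6 of the 190 possible pairs, so the graph stays sparse enough for subgroups to remain visible.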

figure3

Two additional views of the user’s data are supported in REVIGO. Treemaps (Fig. 4) show a two-level hierarchy of GO terms – the cluster representatives from the scatterplot and the graph are here joined into several very high-level groups. Tag clouds show (a) keywords which are overrepresented in the GO terms’ descriptions in the GO term list provided by the user (Fig. 5)

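Treemaps show a two-level hierarchy of GO terms: the cluster representatives from the scatterplot and the graph are joined into several very high-level groups.

Tag clouds work like ordinary word clouds: they highlight keywords that are overrepresented in the descriptions of the GO terms in the user's list.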
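A rough idea of what feeds a tag cloud can be had by counting words across term descriptions. This sketch only counts raw frequencies, whereas REVIGO looks for keywords that are overrepresented, so treat it as a simplification. The seven descriptions are the term names from the paper's example; the stopword list is my own.

```python
from collections import Counter

STOPWORDS = {"of"}

DESCRIPTIONS = [
    "cerebral cortex neuron differentiation",
    "telencephalon development",
    "astrocyte differentiation",
    "negative regulation of neuron differentiation",
    "negative regulation of glial cell differentiation",
    "regulation of dopamine metabolism",
    "sensory perception of chemical stimulus",
]


def keyword_counts(descriptions):
    """Count how often each non-stopword appears across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(w for w in text.lower().split() if w not in STOPWORDS)
    return counts


counts = keyword_counts(DESCRIPTIONS)
print(counts.most_common(3))
```

The most frequent keyword here is "differentiation", which would be drawn largest in a cloud built from these counts.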

figure4

figure5

An example use-case: summarizing the putative targets of a transcription factor

table1

To illustrate how REVIGO’s redundancy elimination algorithm (Fig. 1) works, we turn to a ‘toy example’ which has seven GO categories with associated p-values (Fig. 6). This dataset [14] lists gene functional categories co-expressed with the human gene coding for the transcription factor ZNF417, but not with the highly related protein ZNF587, measured using Affymetrix U133plus2 microarrays. The ZNF417 is an evolutionarily recent, great ape-specific transcription factor of which the ZNF587 is a more ancient homolog [14]; gene functions associated specifically to ZNF417 were found to be associated with brain development.

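The example data: these gene functional categories are co-expressed with the transcription factor ZNF417 but not with ZNF587. ZNF587 is the more ancient homolog, while ZNF417 is evolutionarily recent and great ape-specific. Gene functions associated specifically with ZNF417 were found to be associated with brain development.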

A casual inspection reveals subgroups of redundant gene functions. For instance, the GO term ‘‘cerebral cortex neuron differentiation’’ has a high semantic similarity (SimRel = 0.72) to ‘‘telencephalon development’’ and is therefore removed by merging it into the cluster represented by the term having a more significant p-value (Fig. 6). The removed term is assigned a ‘dispensability’ value of 0.72, a relatively high value reflecting the removed term’s strong redundancy with respect to the chosen representative. In the next group of terms, ‘‘astrocyte differentiation’’ and ‘‘negative regulation of neuron differentiation’’ are similar (0.74 and 0.62, respectively) to ‘‘negative regulation of glial cell differentiation’’. Due to a weaker p-value, the first two terms are merged into a cluster represented by the last term (Fig. 6). Note how the choice of cluster representatives is unaffected by whether terms are more general or more specific. The highest remaining pairwise similarity (here, 0.40) is below the user-defined threshold C, here set to 0.5, and the clustering algorithm stops. In other words, after having removed the redundant terms, the ones that remain as the cluster representatives are those terms having dispensability values below C. The example list of seven GO terms has been reduced to four clusters, of which two are singletons

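cerebral cortex neuron differentiation has a semantic similarity of 0.72 (SimRel = 0.72) to telencephalon development, so it is removed from the list and merged into the cluster represented by telencephalon development, which has the more significant p-value. The removed term is assigned a 'dispensability' value of 0.72, a relatively high value that reflects its strong redundancy with respect to the chosen representative. That cluster therefore contains telencephalon development itself plus cerebral cortex neuron differentiation.

Likewise, astrocyte differentiation and negative regulation of neuron differentiation are similar to negative regulation of glial cell differentiation (SimRel 0.74 and 0.62, respectively); because their p-values are weaker, both are merged into the cluster represented by that term. After these merges the highest remaining pairwise similarity is 0.40, which is below the cutoff C = 0.5, so the loop stops.

So the GO terms are clustered following the flowchart in Figure 1, using the SimRel values, to remove semantic redundancy; in this example seven GO terms are reduced to four clusters.

If C were set higher, e.g. 0.7 or 0.9, fewer of these merges would happen and less redundancy would be removed.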
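The effect of the cutoff on this example can be checked with a few lines. The sketch simplifies the algorithm by treating each removed term's stated similarity to its representative as the only criterion, and the function name is my own; the three similarity values are the ones quoted in the paper.

```python
# Similarity of each removed term to its cluster representative (from the paper).
STATED_SIMILARITY = {
    "cerebral cortex neuron differentiation": 0.72,
    "astrocyte differentiation": 0.74,
    "negative regulation of neuron differentiation": 0.62,
}


def removed_at(cutoff):
    """Terms that would still be merged away at the given cutoff C."""
    return sorted(t for t, s in STATED_SIMILARITY.items() if s >= cutoff)


for c in (0.5, 0.7, 0.9):
    print(c, removed_at(c))
```

Under this simplification, C = 0.5 removes all three terms, C = 0.7 removes two, and C = 0.9 removes none.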

A possible alternative for REVIGO’s summarization procedure are the frequently used ‘‘GO slims’’. Here, the seven terms are quite specific and consequently none of them is in the ‘‘generic’’ or ‘‘PIR’’ GO slims (http://www.geneontology.org/GO.slims.shtml). Therefore, the GO slim approach would not apply to this dataset, illustrating the general principle of how summarizing the list by filtering out the more specific (or equivalently, higher information content) GO terms results in a loss of the potentially more interesting results.

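GO slims are called out here: the seven terms are so specific that none of them is in the GO slims, so the GO slim approach cannot be applied to this example.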

In addition to the ‘dispensability’ values, REVIGO provides ‘uniqueness’ values. These two values are anticorrelated, though not perfectly, since ‘uniqueness’ measures whether the term is an outlier when compared semantically to the whole list (without regard for the p-values), while the ‘dispensability’ compares a term to other semantically close terms and is assigned based both on the semantic distance and the supplied p-values.
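Besides dispensability, the paper introduces uniqueness. The two values are anticorrelated, though not perfectly: uniqueness indicates whether a GO term is an outlier compared semantically to the whole list (ignoring the p-values), whereas dispensability compares a term to semantically close terms and depends on both the semantic distance and the supplied p-values.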

To demonstrate the multidimensional scaling-based visualization in REVIGO, we visualize these terms in Fig. 7; for illustrative purposes, all seven terms are visible in this instance, instead of only the four cluster representatives. Here, it can be seen how two terms are quite distinct from the rest and also from each other: ‘‘regulation of dopamine metabolism’’ and ‘‘sensory perception of chemical stimulus’’ – these terms were not assigned to any of the clusters in the redundancy elimination procedure described above. The remaining five terms are more closely related, where the ‘‘telencephalon development’’ and ‘‘negative regulation of glial cell differentiation’’ have more significant p-values than the three other terms and were thus chosen as cluster representatives.

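Reading of the result figure (Fig. 7), summarizing the paragraph above:

For illustration, all seven terms are plotted rather than only the four cluster representatives. Two terms, regulation of dopamine metabolism and sensory perception of chemical stimulus, are clearly distinct from the rest and from each other, and were not assigned to any cluster during redundancy elimination. The other five terms are more closely related; telencephalon development and negative regulation of glial cell differentiation have more significant p-values than the other three and were therefore chosen as cluster representatives.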

figure7

The last part compares REVIGO with other software; I have not used those tools, so it is not discussed here.

=========END===========