Gremlin performance

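Gremlin is the query language of the Apache TinkerPop framework. It uses a functional programming style and is very flexible.

To learn Gremlin syntax, I recommend the blog post below, which covers it very systematically.

This post is updated on an ongoing basis with performance problems I have run into when querying with Gremlin.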

1. When using match, avoid g.V().match(__.as("x")...)

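Because Gremlin is functional in style, the same query can be written in several ways. With match in particular, the execution engine usually optimizes the traversal on its own, but if a query turns out to be slow, check the execution plan.

For the example below, I measured a 100x performance difference in my own tests.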

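Pattern 1: g.V().match(__.as("a").hasId("123").out("create").as("b"), __.as("b").out("know").as("c"))

Pattern 2: g.V("123").match(__.as("a").out("create").as("b"), __.as("b").out("know").as("c"))

Pattern 1 starts from the set of all vertices, while Pattern 2 starts the traversal from the given vertex.

On my dataset, Pattern 1 takes about 5 s, while Pattern 2 takes only 50 ms.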

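2. match and as/select incur heavy compute and storage costs

For queries that use as and select, the execution engine stores intermediate results and turns on path tracking.

I had a very complex query, roughly a-b, b-c, c-d, followed by selecting a.

I first wrote it as g.V("a").match(a-b, b-c, c-d).select(a). It was extremely slow and never returned.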

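Looking at the execution with profile, every path for each pattern such as a-b is recorded: if a-b has 10 paths, all 10 are kept. The later b-c and c-d patterns also contain duplicate vertices, so you have to dedup them yourself. At select time, the result set is very likely a Cartesian product over the a, b, c, d bindings, because only that guarantees any label can still be selected afterwards. This makes the whole process very expensive.

For this kind of matching, I recommend where filters instead, i.e. g.V(a).where(a-b).where(b-c).where(c-d)

The following example applies the same idea with and() filters: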

g.V().hasLabel("Product").and(bothE("Produce_Company_Product").otherV().hasLabel("Company").has("ticker"))

.and(bothE("RELATE_TO_Indicator_Product").otherV().as("Indicator872139507").hasLabel("Indicator")

.and(bothE("RELATE_TO_Indicator_Dimension").otherV().hasLabel("Dimension").has("display_name",neq("分产品" )))

.and(bothE("RELATE_TO_Indicator_IndicatorType").otherV().hasLabel("IndicatorType"))

.and(bothE("RELATE_TO_Indicator_Card").otherV().hasLabel("Card").has("is_pub","true"))

).dedup()
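To see why the filter form is cheaper, here is a toy model in plain Python (not Gremlin). It is only a sketch of the reasoning above, not of how any particular engine is implemented. The layered graph with a fan-out of 10 per hop and the function names are assumptions made for illustration.

```python
# Toy layered graph: a -> 10 b's -> 10 c's -> 10 d's (the c's and d's are shared).
# Vertex names and the fan-out of 10 are hypothetical.

def build_layers(fan_out=10):
    bs = [f"b{i}" for i in range(fan_out)]
    cs = [f"c{i}" for i in range(fan_out)]
    ds = [f"d{i}" for i in range(fan_out)]
    graph = {"a": bs}
    for b in bs:
        graph[b] = cs
    for c in cs:
        graph[c] = ds
    return graph


def enumerate_bindings(graph, start):
    """match/select style: materialize every (a, b, c, d) binding."""
    return [(start, b, c, d)
            for b in graph.get(start, [])
            for c in graph.get(b, [])
            for d in graph.get(c, [])]


def has_chain(graph, vertex, hops):
    """where style: stop as soon as one chain of `hops` edges is found."""
    if hops == 0:
        return True
    return any(has_chain(graph, n, hops - 1) for n in graph.get(vertex, []))


graph = build_layers()
bindings = enumerate_bindings(graph, "a")
selected_a = [binding[0] for binding in bindings]       # "a" repeated per binding
filtered_a = [v for v in ["a"] if has_chain(graph, v, 3)]

print(len(bindings), len(set(selected_a)), filtered_a)
```

With a fan-out of 10, the match-style version builds 1000 bindings just to report the single vertex a, and the result still needs a dedup. The filter version keeps a once and stops at the first chain it finds.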

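3. Querying path: vertex properties along the path are not returned

With a query like g.V('341561777106063360').inE('Company_Produce_Product').otherV().path(), the properties of the last vertex are not returned.

Gremlin best practice is to state explicitly which properties a query should return, rather than doing the equivalent of select *.

So if you want the properties of every vertex on the path, append .by(identity()):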

g.V('341561777106063360').inE('Company_Produce_Product').otherV().path().by(identity())

To be continued as I run into more Gremlin performance problems.