4. Power Graph模型

1. 和上一代有哪些区别?

GraphLab主要并发解决计算问题, 但是对于highly skewed power-law degree
distributions 解决的不太好, 这一点在上一代的基础上进一步尝试解决严重的数据偏斜.

论文地址: http://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/powergraph.pdf

为了应对这个情况, 把update操作变成 Gather Apply Scatter三个步骤
进一步优化切分, 把切分从edge cut 变成 vertex cut, 让热点存在于多个机器上, 每个机器保存这个热点的一部分关联edge以分散数据压力

2. 计算模型GAS

Gather
用户定义一个操作⊕,这个操作必须满足交换律和结合律
针对某个vertex u, 它所有关联的edge和vertex的data搜集到一起执行这个操作.
在有向图里这里只会计算入度的边

Gather

Apply
把上一步计算的结果进行汇总, 然后把汇总结果∑更新到执行的vertex

Apply
Scatter
根据执行中心点vertex在apply之后的值去更新和它关联的所有edge的值
在有向图里这里只会计算出度的边

我读论文说是更新edge的值, 国内好多技术博客说是更新edge和vertex的值, 我对自己的阅读能力产生了深深的怀疑...不过都不影响后续讲GraphX, 那边是直接有代码的

Scatter

interface GASVertexProgram(u) {
  // Run on gather_nbrs(u)
  gather(Du, D(u,v), Dv) → Accum
  sum(Accum left, Accum right) → Accum
  apply(Du,Accum) → Du
  // Run on scatter_nbrs(u)
  scatter(Du ,D(u,v),Dv) → (D(u,v), Accum)
}

partition切割模型

看完上面的介绍后, 立马产生一个问题就是Gather和Apply为啥是分开的, 算完之后直接更新上去不就行了么?

答案是, Gather过程不一定是在一个机器上运算的, 上文提到Power Graph是在Vertex处切割, 这意味着一个Vertex可能被切到多台机器上分别计算, 然后再汇总到一起!

按vertex切割

如上图所示, 在vertex切割的情况下, 我们在绿色的点上执行Gather操作后, 实际上CPU1上的进程和CPU2上的进程各持有了一部分数据, 这个时候想要执行Apply就需要两个进程通信后汇总结果, 然后分别更新绿色点的值.

在实际切分中有两种切分策略 Random切分和 Greedy切分, 感兴趣的可以看论文附送的Slide里面用动图讲了Greedy的策略:

A(v): set of machines that vertex v spans.
Case 1: If A(u) ∩ A(v) != ∅, then the edge should be assigned to a
machine in the intersection.
Case 2: If A(u)∩A(v) = ∅, then the edge should be assigned to one
of the machines from the vertex with the most unassigned edges.
Case 3: If only one of the two vertices has been assigned, then
choose a machine from the assigned vertex.
Case 4: If A(u) = A(v) = ∅, then assign the edge to the least
loaded machine

4. Power Graph模型

1. 和上一代有哪些区别?

2. 计算模型GAS

partition切割模型

推荐阅读更多精彩内容