推荐系统 Implementation（2）IMM 算法参考

Reverse Influence Sampling in Python

反向影响力采样（RIS）的python实现
翻译自Hautahi Kingi博客

The influence maximization (IM) problem seeks a set of seed nodes in a network to maximize the expected number of nodes activated via an influence cascade initiated at that seed set. A previous post compared two IM algorithms - the Greedy algorithm of Kempe et al. (2003) and the CELF algorithm of Leskovec et al. (2007). For years, CELF (and its modified CELF++ version by Goyal et al. 2011) were the fastest IM algorithms with theoretically guaranteed performance bounds. The subsequent literature focussed largely on improving computational efficiency with heuristics that sacrificed theoretical guarantees in favor of speed. More recently, however, a new approach to the problem - Reverse Influence Sampling (RIS) - has emerged which is both fast and theoretically guaranteed, and is now one of the state-of-the-art IM methods. This post walks through an implementation of the RIS algorithm in Python and compares its solutions and computational speed to the CELF algorithm described in the previous post.

影响力最大化（IM）问题寻求网络中的一组种子节点，以最大化通过在该种子集激活的影响级联模型下的预期节点数。先前的文章比较了两种IM算法：Kempe等人的Greedy算法和Leskovec等人的CELF算法。多年来，CELF（以及Goyal等人于2011年修改的CELF ++版本）是最快的IM算法，具有理论上可保证的性能范围。随后的文献主要侧重于通过启发式（heuristics）来提高计算效率，而启发式则牺牲了对速度的理论保证。然而，最近出现了一种新的解决方法：反向影响采样（RIS），该方法既快速又在理论上得到了保证，现在已成为最先进的IM方法之一。这篇文章逐步介绍了RIS算法在Python中的实现，并将其解决方案和计算速度与上一篇文章中描述的CELF算法进行了比较。

The Reverse Influence Sampling Algorithm

The key insight enabling the RIS family of algorithms is the concept of a random reverse reachable set introduced to the IM problem by Borgs et al. (2014). A reverse reachable (RR) set for an arbitrary node v is generated by first sampling a graph g from the distribution generated by removing each edge e according to its propagation probability 1−pe and then taking the set of nodes in g that can “reach” v. A random reverse reachable (RRR) set is simply a RR set for a node selected uniformly at random. A visual example is presented below. Edges are sampled from the original network G on the left according to each edge’s assigned probability (the code in this post assumes that all edges share the same propagation probability so that pe=p). This produces a sampled network g in the middle figure, in which the edges with high probability tend to be picked. A random node D is selected and the resulting RRR set consists of those nodes with a directed path to D, which are surrounded by dashed lines.

RIS系列算法的关键见解是Borgs等人将随机反向可达集的概念引入IM问题。生成任意节点的反向可达集首先要根据传播概率（propagation probability）移除每个边e后产生的分布，对图g进行采样。其次需要取g中可以到达v的节点集合。随机反向可达集（RRR set）只是针对随机、均匀选择的节点的反向可达集。
下面是一个可视示例。根据每个边缘分配的概率从左边的原始网络G中抽取边（本文中的代码假定所有边缘共享相同的传播概率，因此pe = p）。这将在中间图形中生成一个采样网络g，其中倾向于选择具有较高概率的边缘。选择一个随机节点D，得到的随机反向可达集由具有指向D的定向路径的那些节点组成，这些节点被虚线包围。

Intuitively, if a node u appears in a RR set of another node v, then there is a directed path from u to v. A diffusion process from a seed set containing u therefore has some probability of activating v. This connection between RR sets and node activations is formalized in a lemma, which states that the probability that a diffusion process from any seed set S will activate any node v is equal to the probability that at least one node in S is contained within a RR set for v (see Observation 3.2 in Borgs et al. 2014). Based on this lemma, the family of RIS algorithms proceed in two steps:

Generate a set R of many independent RRR sets.
Select k nodes to cover the maximum number of RRR sets in R using the standard greedy algorithm, which obtains a (1−1/e−ϵ)-approximate solution to the problem.

直观地说，如果一个节点u出现在另一个节点v的反向可达集中，那么就有一个从u到v的有向路径。因此，包含u的种子集在扩散过程有激活v的可能性。反向可达集和节点激活之间的这种联系在一个引理中形式化，这表明种子集合S的扩散过程将激活节点v的概率等于反向可达集中包含S中至少一个节点的概率。基于这个引理，RIS系列算法分为两个步骤：

生成许多独立的反向可达集构成的集合R。
使用标准贪婪算法选择k个节点以覆盖R中尽可能多的反向可达集，该算法可获得该问题的（1 - 1/e - ϵ）近似解。

This approach works because for any seed set S, the fraction of RRR sets in R covered by S is an unbiased estimator of the spread of S (due to the lemma described above). Therefore, a seed set that covers a large number of RR sets in R is likely to have a large expected influence, which makes it a good solution to the IM problem.

该方法之所以有效，是因为对于任何种子集S，由S覆盖的R中的反向可达集的比例都是S传播的无偏估计（由于上述引理）。因此覆盖R中大量反向可达集的种子集可能具有较大的预期影响，这使其成为IM问题的良好解决方案。

The key to the very high computational performance of the RIS algorithm is that, unlike Greedy or CELF, it doesn’t repeat the spread computation procedure to incrementally construct a solution. Instead, it performs all the Monte Carlo simulations/sampling up front to construct R, from which it then selects the entire seed set. The method therefore avoids the “wasted” spread computations of the greedy algorithm and most of its successors because the generated RRR sets are used to inform the spread of all nodes within the network.

RIS算法非常高的计算性能的关键是，不像贪婪或CELF，它不重复扩展的计算过程来增量地构造一个解决方案。取而代之的是，它执行所有的蒙特卡罗模拟/采样来构建R，然后从其中选择整个种子集。因此，该方法避免了贪婪算法及其大多数后继算法的“浪费的”扩展计算，因为生成的反向可达集于通知网络内所有节点的扩展。

Below, I describe the implementation of the RIS algorithm with two separate functions. The first, get_RRS(), takes a graph and propagation probability and spits out a random reverse reachable set. The second, ris(), uses the first function to generate a large number of random reverse reachable sets, which are then used to select the influential seed set. We start by loading some packages:

下面，我将用两个单独的函数描述RIS算法的实现。第一，RRS()：需要输入一个图表、传播概率，输出一个随机设置反向可达集。第二，ris() ：使用第一个函数来生成大量的随机反向可及集，然后用于选择有影响力的种子集。我们首先加载一些包：

%matplotlib inline
import matplotlib.pyplot as plt
from random import uniform, seed
import numpy as np
import pandas as pd
import time
from igraph import *
import random
from collections import Counter

Create Random Reverse Reachable Set

The get_RRS() function below takes a network object and produces a RRR set from that network. The network is defined within a dataframe G, where each row represents a directed edge in the network, and the two columns ['source','target'] describe the source node and target node of a given edge, respectively. There are many ways to represent a network in Python (see this post which uses the popular igraph package to implement the CELF algorithm and this post which compares four different network implementations) but this setup is transparent and doesn’t require understanding a group of methods specific to a given graph modelling package. The function has three steps:

Randomly select the source node from the set of all potential source nodes contained in the source column of G.
Sample an instance g from the network G by comparing the propagation parameter p with a uniform random draw to simulate the sampling procedure for each edge, and extract those edges that remain into a new dataframe that represents the sampled graph. This is very similar to the method used in the IC() function below to simulate a propagation, which hints at why RIS works.
Produce the RR set itself through an iterative procedure that first finds all nodes with edges leading into source, which are then added to the RRS list object. The loop then finds all the nodes with edges leading into the new nodes found in the previous step and so on. The dataframe representation of the network makes this process very easy because we simply filter based on the target column and then select “new nodes” from the source column. Alternative representations would require some form of neighboring function.

从G的节点集合中随机选择源节点。
生成活跃边图g。
通过迭代过程生成RR集。

该过程首先找到有边通向源节点的所有节点，然后将其添加到RRS列表。
然后，循环将找到所有边缘都通向上一步中找到的新节点的节点，依此类推。
网络的数据帧表示使此过程非常容易，因为我们仅基于目标列进行过滤，然后从源列中选择“新节点”。其他的表示将需要某种形式的邻近功能。

def get_RRS(G,p):   
    """
    Inputs: G:  Ex2 dataframe of directed edges. Columns: ['source','target']
            p:  Disease propagation probability
    Return: A random reverse reachable set expressed as a list of nodes
    """
    
    # Step 1. Select random source node
    source = random.choice(np.unique(G['source']))
    
    # Step 2. Get an instance of g from G by sampling edges  
    g = G.copy().loc[np.random.uniform(0,1,G.shape[0]) < p]

    # Step 3. Construct reverse reachable set of the random source node
    new_nodes, RRS0 = [source], [source]   
    while new_nodes:
        
        # Limit to edges that flow into the source node
        temp = g.loc[g['target'].isin(new_nodes)]

        # Extract the nodes flowing into the source node
        temp = temp['source'].tolist()

        # Add new set of in-neighbors to the RRS
        RRS = list(set(RRS0 + temp))

        # Find what new nodes were added
        new_nodes = list(set(RRS) - set(RRS0))

        # Reset loop variables
        RRS0 = RRS[:]

    return(RRS)

注：

numpy.unique Returns the sorted unique elements of an array.
numpy.random.uniform

Create RIS algorithm

The ris() function implements the actual RIS solution procedure in two steps:
Generate a large collection R of mc random reverse reachable sets by looping over the get_RRS() function.
Run the “maximum greedy coverage” algorithm which simply finds the node that appears in the most RRR sets. Once this is found, the RRR sets featuring that node are removed from the collection R, and the process is repeated until k nodes are chosen and stored in the list object SEED.

ris()函数分两步实现了实际的RIS解决方案过程：

通过遍历 get_RRS() 函数，生成 mc 个随机反向可达集的大集合R。
运行“最大贪婪覆盖率”算法，该算法查找在大集合 R 中各反向可达集中出现最多的节点。一旦找到此节点，将从集合 R 中删除具有该节点的反向可达集，然后重复该过程，直到选择了 k 个节点并将其存储在列表对象 SEED 中为止。

def ris(G,k,p=0.5,mc=1000):    
    """
    Inputs: G:  Ex2 dataframe of directed edges. Columns: ['source','target']
            k:  Size of seed set
            p:  Disease propagation probability
            mc: Number of RRSs to generate
    Return: A seed set of nodes as an approximate solution to the IM problem
    """
    
    # Step 1. Generate the collection of random RRSs
    start_time = time.time()
    R = [get_RRS(G,p) for _ in range(mc)]

    # Step 2. Choose nodes that appear most often (maximum coverage greedy algorithm)
    SEED, timelapse = [], []
    for _ in range(k):
        
        # Find node that occurs most often in R and add to seed set
        flat_list = [item for sublist in R for item in sublist]
        seed = Counter(flat_list).most_common()[0][0]
        SEED.append(seed)
        
        # Remove RRSs containing last chosen seed 
        R = [rrs for rrs in R if seed not in rrs]
        
        # Record Time
        timelapse.append(time.time() - start_time)
    
    return(sorted(SEED),timelapse)