推荐系统 Implementation(2)IMM 算法参考

Reverse Influence Sampling in Python

反向影响力采样(RIS)的python实现
翻译自Hautahi Kingi博客

The influence maximization (IM) problem seeks a set of seed nodes in a network to maximize the expected number of nodes activated via an influence cascade initiated at that seed set. A previous post compared two IM algorithms - the Greedy algorithm of Kempe et al. (2003) and the CELF algorithm of Leskovec et al. (2007). For years, CELF (and its modified CELF++ version by Goyal et al. 2011) were the fastest IM algorithms with theoretically guaranteed performance bounds. The subsequent literature focussed largely on improving computational efficiency with heuristics that sacrificed theoretical guarantees in favor of speed. More recently, however, a new approach to the problem - Reverse Influence Sampling (RIS) - has emerged which is both fast and theoretically guaranteed, and is now one of the state-of-the-art IM methods. This post walks through an implementation of the RIS algorithm in Python and compares its solutions and computational speed to the CELF algorithm described in the previous post.

影响力最大化(IM)问题寻求网络中的一组种子节点,以最大化通过在该种子集激活的影响级联模型下的预期节点数。 先前的文章比较了两种IM算法:Kempe等人的Greedy算法和Leskovec等人的CELF算法。 多年来,CELF(以及Goyal等人于2011年修改的CELF ++版本)是最快的IM算法,具有理论上可保证的性能范围。 随后的文献主要侧重于通过启发式(heuristics)来提高计算效率,而启发式则牺牲了对速度的理论保证。 然而,最近出现了一种新的解决方法:反向影响采样(RIS),该方法既快速又在理论上得到了保证,现在已成为最先进的IM方法之一。 这篇文章逐步介绍了RIS算法在Python中的实现,并将其解决方案和计算速度与上一篇文章中描述的CELF算法进行了比较。

The Reverse Influence Sampling Algorithm

The key insight enabling the RIS family of algorithms is the concept of a random reverse reachable set introduced to the IM problem by Borgs et al. (2014). A reverse reachable (RR) set for an arbitrary node v is generated by first sampling a graph g from the distribution generated by removing each edge e according to its propagation probability 1−pe and then taking the set of nodes in g that can “reach” v. A random reverse reachable (RRR) set is simply a RR set for a node selected uniformly at random. A visual example is presented below. Edges are sampled from the original network G on the left according to each edge’s assigned probability (the code in this post assumes that all edges share the same propagation probability so that pe=p). This produces a sampled network g in the middle figure, in which the edges with high probability tend to be picked. A random node D is selected and the resulting RRR set consists of those nodes with a directed path to D, which are surrounded by dashed lines.

RIS系列算法的关键见解是Borgs等人将随机反向可达集的概念引入IM问题。生成任意节点的反向可达集首先要根据传播概率(propagation probability)移除每个边e后产生的分布,对图g进行采样。其次需要取g中可以到达v的节点集合。随机反向可达集(RRR set)只是针对随机、均匀选择的节点的反向可达集。
下面是一个可视示例。 根据每个边缘分配的概率从左边的原始网络G中抽取边(本文中的代码假定所有边缘共享相同的传播概率,因此pe = p)。 这将在中间图形中生成一个采样网络g,其中倾向于选择具有较高概率的边缘。 选择一个随机节点D,得到的随机反向可达集由具有指向D的定向路径的那些节点组成,这些节点被虚线包围。

Intuitively, if a node u appears in a RR set of another node v, then there is a directed path from u to v. A diffusion process from a seed set containing u therefore has some probability of activating v. This connection between RR sets and node activations is formalized in a lemma, which states that the probability that a diffusion process from any seed set S will activate any node v is equal to the probability that at least one node in S is contained within a RR set for v (see Observation 3.2 in Borgs et al. 2014). Based on this lemma, the family of RIS algorithms proceed in two steps:

  1. Generate a set R of many independent RRR sets.
  2. Select k nodes to cover the maximum number of RRR sets in R using the standard greedy algorithm, which obtains a (1−1/e−ϵ)-approximate solution to the problem.

直观地说,如果一个节点u出现在另一个节点v的反向可达集中,那么就有一个从u到v的有向路径。因此,包含u的种子集在扩散过程有激活v的可能性。反向可达集和节点激活之间的这种联系在一个引理中形式化,这表明种子集合S的扩散过程将激活节点v的概率等于反向可达集中包含S中至少一个节点的概率。 基于这个引理,RIS系列算法分为两个步骤:

  1. 生成许多独立的反向可达集构成的集合R。
  2. 使用标准贪婪算法选择k个节点以覆盖R中尽可能多的反向可达集,该算法可获得该问题的(1 - 1/e - ϵ)近似解。

This approach works because for any seed set S, the fraction of RRR sets in R covered by S is an unbiased estimator of the spread of S (due to the lemma described above). Therefore, a seed set that covers a large number of RR sets in R is likely to have a large expected influence, which makes it a good solution to the IM problem.

该方法之所以有效,是因为对于任何种子集S,由S覆盖的R中的反向可达集的比例都是S传播的无偏估计(由于上述引理)。 因此覆盖R中大量反向可达集的种子集可能具有较大的预期影响,这使其成为IM问题的良好解决方案。

The key to the very high computational performance of the RIS algorithm is that, unlike Greedy or CELF, it doesn’t repeat the spread computation procedure to incrementally construct a solution. Instead, it performs all the Monte Carlo simulations/sampling up front to construct R, from which it then selects the entire seed set. The method therefore avoids the “wasted” spread computations of the greedy algorithm and most of its successors because the generated RRR sets are used to inform the spread of all nodes within the network.

RIS算法非常高的计算性能的关键是,不像贪婪或CELF,它不重复扩展的计算过程来增量地构造一个解决方案。取而代之的是,它执行所有的蒙特卡罗模拟/采样来构建R,然后从其中选择整个种子集。因此,该方法避免了贪婪算法及其大多数后继算法的“浪费的”扩展计算,因为生成的反向可达集于通知网络内所有节点的扩展。

Below, I describe the implementation of the RIS algorithm with two separate functions. The first, get_RRS(), takes a graph and propagation probability and spits out a random reverse reachable set. The second, ris(), uses the first function to generate a large number of random reverse reachable sets, which are then used to select the influential seed set. We start by loading some packages:

下面,我将用两个单独的函数描述RIS算法的实现。第一,RRS():需要输入一个图表、传播概率,输出一个随机设置反向可达集。第二,ris() :使用第一个函数来生成大量的随机反向可及集,然后用于选择有影响力的种子集。我们首先加载一些包:

%matplotlib inline
import matplotlib.pyplot as plt
from random import uniform, seed
import numpy as np
import pandas as pd
import time
from igraph import *
import random
from collections import Counter

Create Random Reverse Reachable Set

The get_RRS() function below takes a network object and produces a RRR set from that network. The network is defined within a dataframe G, where each row represents a directed edge in the network, and the two columns ['source','target'] describe the source node and target node of a given edge, respectively. There are many ways to represent a network in Python (see this post which uses the popular igraph package to implement the CELF algorithm and this post which compares four different network implementations) but this setup is transparent and doesn’t require understanding a group of methods specific to a given graph modelling package. The function has three steps:

  1. Randomly select the source node from the set of all potential source nodes contained in the source column of G.

  2. Sample an instance g from the network G by comparing the propagation parameter p with a uniform random draw to simulate the sampling procedure for each edge, and extract those edges that remain into a new dataframe that represents the sampled graph. This is very similar to the method used in the IC() function below to simulate a propagation, which hints at why RIS works.

  3. Produce the RR set itself through an iterative procedure that first finds all nodes with edges leading into source, which are then added to the RRS list object. The loop then finds all the nodes with edges leading into the new nodes found in the previous step and so on. The dataframe representation of the network makes this process very easy because we simply filter based on the target column and then select “new nodes” from the source column. Alternative representations would require some form of neighboring function.

  1. 从G的节点集合中随机选择源节点。
  2. 生成活跃边图g。
  3. 通过迭代过程生成RR集。
  • 该过程首先找到有边通向源节点的所有节点,然后将其添加到RRS列表。
  • 然后,循环将找到所有边缘都通向上一步中找到的新节点的节点,依此类推。
  • 网络的数据帧表示使此过程非常容易,因为我们仅基于目标列进行过滤,然后从源列中选择“新节点”。 其他的表示将需要某种形式的邻近功能。
def get_RRS(G,p):   
    """
    Inputs: G:  Ex2 dataframe of directed edges. Columns: ['source','target']
            p:  Disease propagation probability
    Return: A random reverse reachable set expressed as a list of nodes
    """
    
    # Step 1. Select random source node
    source = random.choice(np.unique(G['source']))
    
    # Step 2. Get an instance of g from G by sampling edges  
    g = G.copy().loc[np.random.uniform(0,1,G.shape[0]) < p]

    # Step 3. Construct reverse reachable set of the random source node
    new_nodes, RRS0 = [source], [source]   
    while new_nodes:
        
        # Limit to edges that flow into the source node
        temp = g.loc[g['target'].isin(new_nodes)]

        # Extract the nodes flowing into the source node
        temp = temp['source'].tolist()

        # Add new set of in-neighbors to the RRS
        RRS = list(set(RRS0 + temp))

        # Find what new nodes were added
        new_nodes = list(set(RRS) - set(RRS0))

        # Reset loop variables
        RRS0 = RRS[:]

    return(RRS)

注:

Create RIS algorithm

The ris() function implements the actual RIS solution procedure in two steps:
Generate a large collection R of mc random reverse reachable sets by looping over the get_RRS() function.
Run the “maximum greedy coverage” algorithm which simply finds the node that appears in the most RRR sets. Once this is found, the RRR sets featuring that node are removed from the collection R, and the process is repeated until k nodes are chosen and stored in the list object SEED.

ris()函数分两步实现了实际的RIS解决方案过程:

  • 通过遍历 get_RRS() 函数,生成 mc 个随机反向可达集的大集合R。
  • 运行“最大贪婪覆盖率”算法,该算法查找在大集合 R 中各反向可达集中出现最多的节点。 一旦找到此节点,将从集合 R 中删除具有该节点的反向可达集,然后重复该过程,直到选择了 k 个节点并将其存储在列表对象 SEED 中为止。
def ris(G,k,p=0.5,mc=1000):    
    """
    Inputs: G:  Ex2 dataframe of directed edges. Columns: ['source','target']
            k:  Size of seed set
            p:  Disease propagation probability
            mc: Number of RRSs to generate
    Return: A seed set of nodes as an approximate solution to the IM problem
    """
    
    # Step 1. Generate the collection of random RRSs
    start_time = time.time()
    R = [get_RRS(G,p) for _ in range(mc)]

    # Step 2. Choose nodes that appear most often (maximum coverage greedy algorithm)
    SEED, timelapse = [], []
    for _ in range(k):
        
        # Find node that occurs most often in R and add to seed set
        flat_list = [item for sublist in R for item in sublist]
        seed = Counter(flat_list).most_common()[0][0]
        SEED.append(seed)
        
        # Remove RRSs containing last chosen seed 
        R = [rrs for rrs in R if seed not in rrs]
        
        # Record Time
        timelapse.append(time.time() - start_time)
    
    return(sorted(SEED),timelapse)
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 219,928评论 6 509
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 93,748评论 3 396
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 166,282评论 0 357
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 59,065评论 1 295
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 68,101评论 6 395
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,855评论 1 308
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,521评论 3 420
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,414评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,931评论 1 319
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 38,053评论 3 340
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,191评论 1 352
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,873评论 5 347
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,529评论 3 331
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 32,074评论 0 23
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,188评论 1 272
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,491评论 3 375
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 45,173评论 2 357