graph algorithm- Centrality

Centrality algorithms are used to understand the roles of particular nodes in a graph and their impact on that network. They’re useful because they identify the most important nodes and help us understand group dynamics such as credibility, accessibility, the speed at which things spread, and bridges between groups. Although many of these algorithms were invented for social network analysis, they have since found uses in a variety of industries and fields.

We’ll cover the following algorithms:


Degree Centrality as a baseline metric of connectedness:

                                Measures the number of relationships a node has

example: Estimating a person’s popularity by looking at their in-degree and using their out-degree to estimate gregariousness


Closeness Centrality for measuring how central a node is to the group, including two variations for disconnected groups

                                   Calculates which nodes have the shortest paths to all other nodes

example: Finding the optimal location of new public services for maximum accessibility


Betweenness Centrality for finding control points, including an alternative for approximation

                                             Measures the number of shortest paths that pass through a node

example: Improving drug targeting by finding the control genes for specific diseases


PageRank for understanding the overall influence, including a popular option for personalization

                    which node is the most important? Estimates a current node’s importance from its linked neighbors and their neighbors 

example:Finding the most influential features for extraction in machine learning and ranking text for entity relevance in natural language processing.



Loading Data 从github 读取两个csv data

The following query imports nodes:

WITH "https://github.com/neo4j-graph-analytics/book/raw/master/data/" AS base

WITH base + "social-nodes.csv" AS uri

LOAD CSV WITH HEADERS FROM uri AS row

MERGE (:User {id: row.id});


And this query imports relationships:

WITH "https://github.com/neo4j-graph-analytics/book/raw/master/data/" AS base

WITH base + "social-relationships.csv" AS uri

LOAD CSV WITH HEADERS FROM uri AS row

MATCH (source:User {id: row.src})

MATCH (destination:User {id: row.dst})

MERGE (source)-[:FOLLOWS]->(destination);


Get some data 查看数据

MATCH (n1)-[r]->(n2) 

RETURN r, n1, n2


【Degree Centrality】

when to use:

1. analyze influence by looking at the number of incoming and outgoing relationships, 

2. find the “popularity” of individual nodes. 

It works well when you’re concerned with immediate connectedness or near-term probabilities. However, Degree Centrality is also 

3. applied to global analysis when you want to evaluate the minimum degree, maximum degree, mean degree, and standard deviation across the entire graph.


other related terminology:

the reach of a node :  fair measure of importance. How many other nodes can it touch right now?

the degree of a node:  the number of direct relationships it has, calculated for in-degree and out-degree

the average degree of a network=( total number of relationships / the total number of nodes)

                                                     it can be heavily skewed by high degree nodes.

the degree distribution:  the probability that a randomly selected node will have a certain number of relationships.

MATCH (u:User)RETURN u.id AS name,

size((u)-[:FOLLOWS]->()) AS follows,

size((u)<-[:FOLLOWS]-()) AS followers,

size((u)-[:FOLLOWS]->() + (u)<-[:FOLLOWS]-()) AS total

ORDER BY followers DESC





Closeness Centrality

a way of detecting nodes that are able to spread information efficiently through a subgraph.

The measure of a node’s centrality is its average farness (inverse distance) to all other nodes. Nodes with a high closeness score have the shortest distances from all other nodes.

terminology:

Closeness centrality:

1.calculates the sum of its distances to all other nodes, based on calculating the shortest paths between all pairs of nodes.

2.inverted to determine the closeness centrality score

normalized closeness centrality 

normalize this score so that it represents the average length of the shortest paths rather than their sum. This adjustment allows comparisons of the closeness centrality of nodes of graphs of different sizes.


when to use: 

when you need to know which nodes disseminate things the fastest. Using weighted relationships can be especially helpful in evaluating interaction speeds in communication and behavioral analyses.

Example use cases include:

• Uncovering individuals in very favorable positions to control and acquire vital

information and resources within an organization. 

• As a heuristic for estimating arrival time in telecommunications and package delivery, where content flows through the shortest paths to a predefined target. It is also used to shed light on propagation through all shortest paths simultaneously, such as infections spreading through a local community

• Evaluating the importance of words in a document, based on a graph-based key phrase extraction process


calculate the closeness centrality using the FOLLOWS relationship type for nodes with the User label:

CALL gds.alpha.closeness.stream({

nodeProjection: "User",

relationshipProjection: "FOLLOWS"

})

YIELD nodeId, centrality

RETURN gds.util.asNode(nodeId).id, centrality

ORDER BY centrality DESC;


If some subgraphs don't have connections: (improved formula)

detecting the importance of a node across the entire graph (包含不连接的subgroup)

1. Closeness Centrality Variation: Wasserman and Faust

an improved formula for calculating closeness for graphs with multiple subgraphs without connections between

those groups.

The result of this formula is a ratio of the fraction of nodes in the group that are reachable to the average distance from the reachable nodes.

Improved normalized centrality=  (reachable other nodes/ all other nodes) * original normalized centrality 



CALL gds.alpha.closeness.stream({

nodeProjection: "User",

relationshipProjection: "FOLLOWS",

improved: true

})

YIELD nodeId, centrality

RETURN gds.util.asNode(nodeId).id, centrality

ORDER BY centrality DESC;


Closeness Centrality Variation: Harmonic Centrality

use inversed distance --> sum--> to avoid the issue of having infinite distance due to disconnected subgrouops

When calculating the closeness score for each node, rather than summing the distancesof a node to all other nodes, it sums the inverse of those distances. This means that infinite values become irrelevant.

The raw harmonic centrality for a node is calculated using the following formula:

As with closeness centrality, we can also calculate a normalized harmonic centrality with the following formula:

CALL gds.alpha.closeness.harmonic.stream({

nodeProjection: "User",

relationshipProjection: "FOLLOWS"

})

YIELD nodeId, centrality

RETURN gds.util.asNode(nodeId).id, centrality

ORDER BY centrality DESC;


Betweenness Centrality

middleman/ bridge between groups and islands 最具传话连接能力的点


Sometimes it’s the middlemen that connect groups or the brokers who the most control over resources or the flow of information. Betweenness Centrality is a way of detecting the amount of influence a node has over the flow of information or resources in a graph. It is typically used to find nodes that serve as a bridge from one part of a graph to another.

计算方法:

The Betweenness Centrality algorithm first calculates the shortest (weighted) path between every pair of nodes in a connected graph. Each node receives a score, based on the number of these shortest paths that pass through the node. The more shortest paths that a node lies on, the higher its score.

terminology:

Bridges and control points

A bridge in a network can be a node or a relationship. In a very simple graph, you can find them by looking for the node or relationship that, if removed, would cause a section of the graph to become disconnected.

Pivotal nodes play an important role in connecting other nodes—if you remove a pivotal node, the new shortest path for the original node pairs will be longer or more costly. This can be a consideration for evaluating single points of vulnerability.

Pivotal nodes lie on every shortest path between two nodes. Creating more shortest paths can reduce the number of pivotal nodes for uses such as risk mitigation.

The betweenness centrality of a node is calculated by adding the results of the following formula for all shortest paths:

when to use:

We use it to find bottlenecks, control points, and vulnerabilities.

Example use cases include:

• Identifying influencers in various organizations. Powerful individuals are not necessarily in management positions, but can be found in “brokerage positions” using Betweenness Centrality. 

• Uncovering key transfer points in networks such as electrical grids. Counterintuitively, removal of specific bridges can actually improve overall robustness by “islanding” disturbances. 

• Helping microbloggers spread their reach on Twitter, with a recommendation engine for targeting influencers. 


caution:

Betweenness Centrality makes the assumption that all communication between nodes happens along the shortest path and with the same frequency, which isn’t always the case in real life. 


CALL gds.alpha.betweenness.stream({

nodeProjection: "User",

relationshipProjection: "FOLLOWS"

})

YIELD nodeId, centrality

RETURN gds.util.asNode(nodeId).id AS user, centrality

ORDER BY centrality DESC;


新版中,改动后如下:betweeness centrality 改为score 

CALL gds.betweenness.stream({

  nodeProjection:  "User",

  relationshipProjection: "FOLLOWS"})

YIELD nodeId, score

RETURN gds.util.asNode(nodeId).id AS user, score

ORDER BY score DESC;


PageRank

PageRank is the best known of the centrality algorithms. It measures the transitive (or directional) influence of nodes. All the other centrality algorithms we discuss measure the direct influence of a node, whereas PageRank considers the influence of a node’s neighbors, and their neighbors. For example, having a few very powerful friends can make you more influential than having a lot of less powerful friends. 

Influence

The intuition behind influence is that relationships to more important nodes contribute more to the influence of the node in question than equivalent connections to less important nodes. Measuring influence usually involves scoring nodes, often with weighted relationships, and then updating the scores over many iterations. Sometimes all nodes are scored, and sometimes a random selection is used as a representative distribution


Rank Sinks


There are two strategies used to avoid rank sinks. First, when a node is reached that

has no outgoing relationships, PageRank assumes outgoing relationships to all nodes.

Traversing these invisible links is sometimes called teleportation. Second, the damping

factor provides another opportunity to avoid sinks by introducing a probability for

direct link versus random node visitation. When you set d to 0.85, a completely random

node is visited 15% of the time.


When Should I Use PageRank?

PageRank is now used in many domains outside web indexing. Use this algorithm whenever you’re looking for broad influence over a network. For instance, if you’re looking to target a gene that has the highest overall impact to a biological function, it may not be the most connected one. It may, in fact, be the gene with the most relationships with other, more significant functions.

Example use cases include:

• Presenting users with recommendations of other accounts that they may wish to follow (Twitter uses Personalized PageRank for this). The algorithm is run over a graph that contains shared interests and common connections.

• Predicting traffic flow and human movement in public spaces or streets. The algorithm is run over a graph of road intersections, where the PageRank score reflects the tendency of people to park, or end their journey, on each street. 

• As part of anomaly and fraud detection systems in the healthcare and insurance industries. PageRank helps reveal doctors or providers that are behaving in an unusual manner, and the scores are then fed into a machine learning algorithm.


Which node is the most important?( important, influential, have more influential connections)-

-considering all the nodes


CALL gds.pageRank.stream({

nodeProjection: "User",

relationshipProjection: "FOLLOWS",

maxIterations: 20,

dampingFactor: 0.85

})

YIELD nodeId, score

RETURN gds.util.asNode(nodeId).id AS page, score

ORDER BY score DESC;


Personalized PageRank

: consider me as "Doug" , who are my most important connections 


//have not yet exclue those Doug already knew

MATCH (A:User {id:"Doug"} )

CALL gds.pageRank.stream({

    nodeProjection: "User",

relationshipProjection: "FOLLOWS",

maxIterations: 20,

dampingFactor: 0.85,

sourceNodes:[A]

})

YIELD nodeId,score

WHERE gds.util.asNode(nodeId).id <>"Doug"

RETURN gds.util.asNode(nodeId).id  AS name, round(score*100)/100 as score

ORDER BY score DESC;

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 216,193评论 6 498
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,306评论 3 392
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 162,130评论 0 353
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,110评论 1 292
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,118评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,085评论 1 295
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,007评论 3 417
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,844评论 0 273
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,283评论 1 310
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,508评论 2 332
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,667评论 1 348
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,395评论 5 343
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,985评论 3 325
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,630评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,797评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,653评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,553评论 2 352

推荐阅读更多精彩内容