Centrality algorithms are used to understand the roles of particular nodes in a graph and their impact on that network. They’re useful because they identify the most important nodes and help us understand group dynamics such as credibility, accessibility, the speed at which things spread, and bridges between groups. Although many of these algorithms were invented for social network analysis, they have since found uses in a variety of industries and fields.
We’ll cover the following algorithms:
• Degree Centrality as a baseline metric of connectedness:
Measures the number of relationships a node has
example: Estimating a person’s popularity by looking at their in-degree and using their out-degree to estimate gregariousness
• Closeness Centrality for measuring how central a node is to the group, including two variations for disconnected groups
Calculates which nodes have the shortest paths to all other nodes
example: Finding the optimal location of new public services for maximum accessibility
• Betweenness Centrality for finding control points, including an alternative for approximation
Measures the number of shortest paths that pass through a node
example: Improving drug targeting by finding the control genes for specific diseases
• PageRank for understanding the overall influence, including a popular option for personalization
which node is the most important? Estimates a current node’s importance from its linked neighbors and their neighbors
example:Finding the most influential features for extraction in machine learning and ranking text for entity relevance in natural language processing.
Loading Data 从github 读取两个csv data
The following query imports nodes:
WITH "https://github.com/neo4j-graph-analytics/book/raw/master/data/" AS base
WITH base + "social-nodes.csv" AS uri
LOAD CSV WITH HEADERS FROM uri AS row
MERGE (:User {id: row.id});
And this query imports relationships:
WITH "https://github.com/neo4j-graph-analytics/book/raw/master/data/" AS base
WITH base + "social-relationships.csv" AS uri
LOAD CSV WITH HEADERS FROM uri AS row
MATCH (source:User {id: row.src})
MATCH (destination:User {id: row.dst})
MERGE (source)-[:FOLLOWS]->(destination);
Get some data 查看数据
MATCH (n1)-[r]->(n2)
RETURN r, n1, n2
【Degree Centrality】
when to use:
1. analyze influence by looking at the number of incoming and outgoing relationships,
2. find the “popularity” of individual nodes.
It works well when you’re concerned with immediate connectedness or near-term probabilities. However, Degree Centrality is also
3. applied to global analysis when you want to evaluate the minimum degree, maximum degree, mean degree, and standard deviation across the entire graph.
other related terminology:
the reach of a node : fair measure of importance. How many other nodes can it touch right now?
the degree of a node: the number of direct relationships it has, calculated for in-degree and out-degree
the average degree of a network=( total number of relationships / the total number of nodes)
it can be heavily skewed by high degree nodes.
the degree distribution: the probability that a randomly selected node will have a certain number of relationships.
MATCH (u:User)RETURN u.id AS name,
size((u)-[:FOLLOWS]->()) AS follows,
size((u)<-[:FOLLOWS]-()) AS followers,
size((u)-[:FOLLOWS]->() + (u)<-[:FOLLOWS]-()) AS total
ORDER BY followers DESC
Closeness Centrality
a way of detecting nodes that are able to spread information efficiently through a subgraph.
The measure of a node’s centrality is its average farness (inverse distance) to all other nodes. Nodes with a high closeness score have the shortest distances from all other nodes.
terminology:
Closeness centrality:
1.calculates the sum of its distances to all other nodes, based on calculating the shortest paths between all pairs of nodes.
2.inverted to determine the closeness centrality score
normalized closeness centrality
normalize this score so that it represents the average length of the shortest paths rather than their sum. This adjustment allows comparisons of the closeness centrality of nodes of graphs of different sizes.
when to use:
when you need to know which nodes disseminate things the fastest. Using weighted relationships can be especially helpful in evaluating interaction speeds in communication and behavioral analyses.
Example use cases include:
• Uncovering individuals in very favorable positions to control and acquire vital
information and resources within an organization.
• As a heuristic for estimating arrival time in telecommunications and package delivery, where content flows through the shortest paths to a predefined target. It is also used to shed light on propagation through all shortest paths simultaneously, such as infections spreading through a local community
• Evaluating the importance of words in a document, based on a graph-based key phrase extraction process
calculate the closeness centrality using the FOLLOWS relationship type for nodes with the User label:
CALL gds.alpha.closeness.stream({
nodeProjection: "User",
relationshipProjection: "FOLLOWS"
})
YIELD nodeId, centrality
RETURN gds.util.asNode(nodeId).id, centrality
ORDER BY centrality DESC;
If some subgraphs don't have connections: (improved formula)
detecting the importance of a node across the entire graph (包含不连接的subgroup)
1. Closeness Centrality Variation: Wasserman and Faust
an improved formula for calculating closeness for graphs with multiple subgraphs without connections between
those groups.
The result of this formula is a ratio of the fraction of nodes in the group that are reachable to the average distance from the reachable nodes.
Improved normalized centrality= (reachable other nodes/ all other nodes) * original normalized centrality
CALL gds.alpha.closeness.stream({
nodeProjection: "User",
relationshipProjection: "FOLLOWS",
improved: true
})
YIELD nodeId, centrality
RETURN gds.util.asNode(nodeId).id, centrality
ORDER BY centrality DESC;
Closeness Centrality Variation: Harmonic Centrality
use inversed distance --> sum--> to avoid the issue of having infinite distance due to disconnected subgrouops
When calculating the closeness score for each node, rather than summing the distancesof a node to all other nodes, it sums the inverse of those distances. This means that infinite values become irrelevant.
The raw harmonic centrality for a node is calculated using the following formula:
As with closeness centrality, we can also calculate a normalized harmonic centrality with the following formula:
CALL gds.alpha.closeness.harmonic.stream({
nodeProjection: "User",
relationshipProjection: "FOLLOWS"
})
YIELD nodeId, centrality
RETURN gds.util.asNode(nodeId).id, centrality
ORDER BY centrality DESC;
Betweenness Centrality
middleman/ bridge between groups and islands 最具传话连接能力的点
Sometimes it’s the middlemen that connect groups or the brokers who the most control over resources or the flow of information. Betweenness Centrality is a way of detecting the amount of influence a node has over the flow of information or resources in a graph. It is typically used to find nodes that serve as a bridge from one part of a graph to another.
计算方法:
The Betweenness Centrality algorithm first calculates the shortest (weighted) path between every pair of nodes in a connected graph. Each node receives a score, based on the number of these shortest paths that pass through the node. The more shortest paths that a node lies on, the higher its score.
terminology:
Bridges and control points
A bridge in a network can be a node or a relationship. In a very simple graph, you can find them by looking for the node or relationship that, if removed, would cause a section of the graph to become disconnected.
Pivotal nodes play an important role in connecting other nodes—if you remove a pivotal node, the new shortest path for the original node pairs will be longer or more costly. This can be a consideration for evaluating single points of vulnerability.
Pivotal nodes lie on every shortest path between two nodes. Creating more shortest paths can reduce the number of pivotal nodes for uses such as risk mitigation.
The betweenness centrality of a node is calculated by adding the results of the following formula for all shortest paths:
when to use:
We use it to find bottlenecks, control points, and vulnerabilities.
Example use cases include:
• Identifying influencers in various organizations. Powerful individuals are not necessarily in management positions, but can be found in “brokerage positions” using Betweenness Centrality.
• Uncovering key transfer points in networks such as electrical grids. Counterintuitively, removal of specific bridges can actually improve overall robustness by “islanding” disturbances.
• Helping microbloggers spread their reach on Twitter, with a recommendation engine for targeting influencers.
caution:
Betweenness Centrality makes the assumption that all communication between nodes happens along the shortest path and with the same frequency, which isn’t always the case in real life.
CALL gds.alpha.betweenness.stream({
nodeProjection: "User",
relationshipProjection: "FOLLOWS"
})
YIELD nodeId, centrality
RETURN gds.util.asNode(nodeId).id AS user, centrality
ORDER BY centrality DESC;
新版中,改动后如下:betweeness centrality 改为score
CALL gds.betweenness.stream({
nodeProjection: "User",
relationshipProjection: "FOLLOWS"})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS user, score
ORDER BY score DESC;
PageRank
PageRank is the best known of the centrality algorithms. It measures the transitive (or directional) influence of nodes. All the other centrality algorithms we discuss measure the direct influence of a node, whereas PageRank considers the influence of a node’s neighbors, and their neighbors. For example, having a few very powerful friends can make you more influential than having a lot of less powerful friends.
Influence
The intuition behind influence is that relationships to more important nodes contribute more to the influence of the node in question than equivalent connections to less important nodes. Measuring influence usually involves scoring nodes, often with weighted relationships, and then updating the scores over many iterations. Sometimes all nodes are scored, and sometimes a random selection is used as a representative distribution
Rank Sinks
There are two strategies used to avoid rank sinks. First, when a node is reached that
has no outgoing relationships, PageRank assumes outgoing relationships to all nodes.
Traversing these invisible links is sometimes called teleportation. Second, the damping
factor provides another opportunity to avoid sinks by introducing a probability for
direct link versus random node visitation. When you set d to 0.85, a completely random
node is visited 15% of the time.
When Should I Use PageRank?
PageRank is now used in many domains outside web indexing. Use this algorithm whenever you’re looking for broad influence over a network. For instance, if you’re looking to target a gene that has the highest overall impact to a biological function, it may not be the most connected one. It may, in fact, be the gene with the most relationships with other, more significant functions.
Example use cases include:
• Presenting users with recommendations of other accounts that they may wish to follow (Twitter uses Personalized PageRank for this). The algorithm is run over a graph that contains shared interests and common connections.
• Predicting traffic flow and human movement in public spaces or streets. The algorithm is run over a graph of road intersections, where the PageRank score reflects the tendency of people to park, or end their journey, on each street.
• As part of anomaly and fraud detection systems in the healthcare and insurance industries. PageRank helps reveal doctors or providers that are behaving in an unusual manner, and the scores are then fed into a machine learning algorithm.
Which node is the most important?( important, influential, have more influential connections)-
-considering all the nodes
CALL gds.pageRank.stream({
nodeProjection: "User",
relationshipProjection: "FOLLOWS",
maxIterations: 20,
dampingFactor: 0.85
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS page, score
ORDER BY score DESC;
Personalized PageRank
: consider me as "Doug" , who are my most important connections
//have not yet exclue those Doug already knew
MATCH (A:User {id:"Doug"} )
CALL gds.pageRank.stream({
nodeProjection: "User",
relationshipProjection: "FOLLOWS",
maxIterations: 20,
dampingFactor: 0.85,
sourceNodes:[A]
})
YIELD nodeId,score
WHERE gds.util.asNode(nodeId).id <>"Doug"
RETURN gds.util.asNode(nodeId).id AS name, round(score*100)/100 as score
ORDER BY score DESC;