Network analysis of Game of Thrones with tidygraph and ggraph

This article is excerpted from Shirin's playgRound. It is a translated repost, and the views in it do not represent the translator's own.

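Not long ago, I did a network analysis of the character relationships in A Song of Ice and Fire. In that analysis, I found that House Stark (especially Ned and Sansa) and House Lannister (especially Tyrion) are the most important family connections in Game of Thrones. They tie together many of the storylines and sit at the center of the whole narrative.

In that earlier post, I used igraph to plot the network and calculate its properties.

But there are now two newer packages that cover the whole analysis in tidyverse style: tidygraph and ggraph.

So in this post I use these two packages to build a character network for A Song of Ice and Fire / Game of Thrones. (The data is based on the books, not the TV show.)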

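What can social network analysis tell us?

Network analysis can uncover relationships in social or professional networks. Typical questions are: how many connections (edges) does each person (node) in the network have?

Who is the most connected (that is, the most important or influential) person?

Do tightly connected people form larger clusters?

Are there key players who connect those clusters to each other?

The answers can help us understand how people communicate and interact in a society.

So how do we find the most important characters in a network? Put simply, a person with the most relationships, or the most neighbors, is clearly important. Other properties can also help identify key players, for example node centrality.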

The A Song of Ice and Fire character network

library(readr) # fast reading of csv files

library(tidyverse) # tidy data analysis

library(tidygraph) # tidy graph analysis

library(ggraph) # for plotting

Data

The data comes from Andrew Beveridge's GitHub repository:

Character Interaction Networks for George R. R. Martin’s “A Song of Ice and Fire” saga These networks were created by connecting two characters whenever their names (or nicknames) appeared within 15 words of one another in one of the books in “A Song of Ice and Fire.” The edge weight corresponds to the number of interactions. You can use this data to explore the dynamics of the Seven Kingdoms using network science techniques. For example, community detection finds coherent plotlines. Centrality measures uncover the multiple ways in which characters play important roles in the saga.

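Andrew has already analyzed the character relationships in A Song of Ice and Fire. If you are interested, you can read his conclusions on his site: https://networkofthrones.wordpress.com

I do not want to repeat his analysis here. Instead, I want to show how to use tidygraph and ggraph, so I will not use all of his data.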

path <- "/Users/shiringlander/Documents/Github/Data/asoiaf/data/"

files <- list.files(path = path, full.names = TRUE)

files

## [1] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-all-edges.csv"

## [2] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-all-nodes.csv"

## [3] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book1-edges.csv"

## [4] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book1-nodes.csv"

## [5] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book2-edges.csv"

## [6] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book2-nodes.csv"

## [7] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book3-edges.csv"

## [8] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book3-nodes.csv"

## [9] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book4-edges.csv"

## [10] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book4-nodes.csv"

## [11] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book45-edges.csv"

## [12] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book45-nodes.csv"

## [13] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book5-edges.csv"

## [14] "/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book5-nodes.csv"

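Characters across all books

The first data set I use is the character interactions across all books. I am not using the node files here, because I find the names in the edge file sufficient for labelling. If you want nicer labels, you can use the node files.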

cooc_all_edges <- read_csv(files[1])

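Because the books have so many characters, many of them minor, I keep only the 100 characters with the most interactions. The edges are undirected, so there are no redundant Source-Target combinations.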

main_ch <- cooc_all_edges %>%

select(-Type) %>%

gather(x, name, Source:Target) %>%

group_by(name) %>%

summarise(sum_weight = sum(weight)) %>%

ungroup()

main_ch_l <- main_ch %>%

arrange(desc(sum_weight)) %>%

top_n(100, sum_weight)

main_ch_l

## # A tibble: 100 x 2

## name sum_weight

## 1 Tyrion-Lannister 2873

## 2 Jon-Snow 2757

## 3 Cersei-Lannister 2232

## 4 Joffrey-Baratheon 1762

## 5 Eddard-Stark 1649

## 6 Daenerys-Targaryen 1608

## 7 Jaime-Lannister 1569

## 8 Sansa-Stark 1547

## 9 Bran-Stark 1508

## 10 Robert-Baratheon 1488

## # ... with 90 more rows

cooc_all_f <- cooc_all_edges %>%

filter(Source %in% main_ch_l$name & Target %in% main_ch_l$name)
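The selection step above can be sketched on a toy edge list. This is only an illustration: the characters A to D and their weights are made up, and `pivot_longer()` and `slice_max()` are swapped in for `gather()` and `top_n()`, which still work but are marked as superseded in current tidyr and dplyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy edge list using the same column names as the book data
toy_edges <- tibble(
  Source = c("A", "A", "B", "C"),
  Target = c("B", "C", "C", "D"),
  weight = c(10, 5, 3, 1)
)

# Stack Source and Target into one column, then sum the weights per character
toy_totals <- toy_edges %>%
  pivot_longer(c(Source, Target), names_to = "role", values_to = "name") %>%
  group_by(name) %>%
  summarise(sum_weight = sum(weight)) %>%
  arrange(desc(sum_weight))

# Keep the 3 characters with the highest totals ...
top_names <- toy_totals %>%
  slice_max(sum_weight, n = 3) %>%
  pull(name)

# ... and only the edges where both ends are among them
toy_filtered <- toy_edges %>%
  filter(Source %in% top_names & Target %in% top_names)

toy_filtered
```

Because both ends of an edge must be in the kept set, the edge between C and D is dropped here even though C itself is kept.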

tidygraph and ggraph

Both packages were written by Thomas Lin Pedersen:

With tidygraph I set out to make it easier to get your data into a graph and perform common transformations on it, but the aim has expanded since its inception. The goal of tidygraph is to empower the user to formulate complex questions regarding relational data as simple steps, thus enabling them to retrieve insights directly from the data itself. The central idea this all boils down to is this: you don’t have to plot a network to understand it. While I absolutely love the field of network visualisation, it is in many ways overused in data science — especially when it comes to extracting knowledge from a network. Just as you don’t need a plot to tell you which car in a dataset is the fastest, you don’t need a plot to tell you which pair of friends are the closest. What you do need, instead of a plot, is a tool that allow you to formulate your question into a logic sequence of operations. For many people in the world of rectangular data, this tool is increasingly dplyr (and friends), and I do hope that tidygraph can take on the same role in the world of relational data. https://www.data-imaginist.com/2018/tidygraph-1-1-a-tidy-hope/

First, I convert the edge table into a tbl_graph object. This uses the as_tbl_graph() function from tidygraph, which accepts a data.frame, matrix, dendrogram, igraph object, and more.

Underneath the hood of tidygraph lies the well-oiled machinery of igraph, ensuring efficient graph manipulation. Rather than keeping the node and edge data in a list and creating igraph objects on the fly when needed, tidygraph subclasses igraph with the tbl_graph class and simply exposes it in a tidy manner. This ensures that all your beloved algorithms that expects igraph objects still works with tbl_graph objects. Further, tidygraph is very careful not to override any of igraphs exports so the two packages can coexist quite happily. https://www.data-imaginist.com/2018/tidygraph-1-1-a-tidy-hope/
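A minimal sketch of that conversion, on a made-up three-node edge list rather than the book data. The `from` and `to` column names are an assumption of this example; in the post's data the Source and Target columns show up as `from` and `to` in the printed edge table.

```r
library(tidygraph)

# Hypothetical toy edge list; any extra columns become edge attributes
toy_edges <- data.frame(
  from   = c("A", "A", "B"),
  to     = c("B", "C", "C"),
  weight = c(2, 1, 4)
)

toy_graph <- as_tbl_graph(toy_edges, directed = FALSE)
toy_graph

# A tbl_graph is still an igraph object, so igraph functions accept it directly
igraph::vcount(toy_graph)  # number of nodes
igraph::ecount(toy_graph)  # number of edges
```

This is the coexistence the quote describes: the same object can go through tidygraph verbs or be handed to igraph functions.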

There are many options for node ranking (see ?node_rank for the full list). Here I try node_rank_traveller(), which minimizes Hamiltonian path length using a travelling salesperson solver:

as_tbl_graph(cooc_all_f, directed = FALSE) %>%

activate(nodes) %>%

mutate(n_rank_trv = node_rank_traveller()) %>%

arrange(n_rank_trv)

## # A tbl_graph: 100 nodes and 798 edges

## #

## # An undirected simple graph with 1 component

## #

## # Node Data: 100 x 2 (active)

## name n_rank_trv

## 1 Janos-Slynt 1

## 2 Aemon-Targaryen-(Maester-Aemon) 2

## 3 Jeor-Mormont 3

## 4 Samwell-Tarly 4

## 5 Qhorin-Halfhand 5

## 6 Ygritte 6

## # ... with 94 more rows

## #

## # Edge Data: 798 x 5

## from to Type id weight

## 1 2 75 Undirected 43 7

## 2 2 76 Undirected 44 4

## 3 2 73 Undirected 52 3

## # ... with 795 more rows

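Centrality

Centrality, in its simplest (degree) form, counts the edges going into or out of a node. In a highly centralized network, only a few nodes hold a large number of edges. A network with low centralization has many nodes with similarly small numbers of edges. The centrality of a node measures how important that node is in the network.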

This version adds 19(!) new ways to define the notion of centrality along with a manual version where you can mix and match different distance measures and summation strategies opening up the world to even more centrality scores. All of this wealth of centrality comes from the netrankr package that provides a framework for defining and calculating centrality scores. If you use centrality measures somewhere in your analysis I cannot recommend the vignettes provided by netrankr enough as they provide a fundamental intuition about the nature of such measures and how they can/should be used. https://www.data-imaginist.com/2018/tidygraph-1-1-a-tidy-hope/

Use ?centrality to see all available centrality measures. Here we use centrality_degree().
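The code for this step seems to be missing from this copy of the post. The listing that follows has a `neighbors` column sorted in descending order, and the combined pipeline later in the post uses `neighbors = centrality_degree()`, so the call was presumably shaped like the sketch below. It is shown on a made-up toy edge list, not on `cooc_all_f`.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy network: A is connected to everyone else
toy_edges <- data.frame(
  from = c("A", "A", "A", "B"),
  to   = c("B", "C", "D", "C")
)

toy_neighbors <- as_tbl_graph(toy_edges, directed = FALSE) %>%
  activate(nodes) %>%
  mutate(neighbors = centrality_degree()) %>%  # number of edges per node
  arrange(-neighbors)                          # most connected node first

toy_neighbors
```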

## # A tbl_graph: 100 nodes and 798 edges

## #

## # An undirected simple graph with 1 component

## #

## # Node Data: 100 x 2 (active)

## name neighbors

## 1 Tyrion-Lannister 54.

## 2 Cersei-Lannister 49.

## 3 Joffrey-Baratheon 49.

## 4 Robert-Baratheon 47.

## 5 Jaime-Lannister 45.

## 6 Sansa-Stark 44.

## # ... with 94 more rows

## #

## # Edge Data: 798 x 5

## from to Type id weight

## 1 41 42 Undirected 43 7

## 2 41 60 Undirected 44 4

## 3 41 63 Undirected 52 3

## # ... with 795 more rows

Grouping and clustering

Another common operation is to group nodes based on the graph topology, sometimes referred to as community detection based on its commonality in social network analysis. All clustering algorithms from igraph is available in tidygraph using the group_* prefix. All of these functions return an integer vector with nodes (or edges) sharing the same integer being grouped together. https://www.data-imaginist.com/2017/introducing-tidygraph/

See ?group_graph for all grouping and clustering methods. Here I use group_infomap(), which groups nodes by minimizing description length.

as_tbl_graph(cooc_all_f, directed = FALSE) %>%

activate(nodes) %>%

mutate(group = group_infomap()) %>%

arrange(-group)

## # A tbl_graph: 100 nodes and 798 edges

## #

## # An undirected simple graph with 1 component

## #

## # Node Data: 100 x 2 (active)

## name group

## 1 Arianne-Martell 7

## 2 Doran-Martell 7

## 3 Davos-Seaworth 6

## 4 Melisandre 6

## 5 Selyse-Florent 6

## 6 Stannis-Baratheon 6

## # ... with 94 more rows

## #

## # Edge Data: 798 x 5

## from to Type id weight

## 1 32 33 Undirected 43 7

## 2 32 34 Undirected 44 4

## 3 32 36 Undirected 52 3

## # ... with 795 more rows

Querying node types

We can also query node types; see ?node_types for the full list.

These functions all lets the user query whether each node is of a certain type. All of the functions returns a logical vector indicating whether the node is of the type in question. Do note that the types are not mutually exclusive and that nodes can thus be of multiple types.

Here I use node_is_center() and node_is_keyplayer() to identify the top 10 key players in the network. You can read more about node_is_keyplayer() in the documentation of the influenceR package.

The “Key Player” family of node importance algorithms (Borgatti 2006) involves the selection of a metric of node importance and a combinatorial optimization strategy to choose the set S of vertices of size k that maximize that metric. This function implements KPP-Pos, a metric intended to identify k nodes which optimize resource diffusion through the net … https://www.data-imaginist.com/2017/introducing-tidygraph/

as_tbl_graph(cooc_all_f, directed = FALSE) %>%

activate(nodes) %>%

mutate(dist_to_center = node_distance_to(node_is_center()))

## # A tbl_graph: 100 nodes and 798 edges

## #

## # An undirected simple graph with 1 component

## #

## # Node Data: 100 x 2 (active)

## name dist_to_center

## 1 Aemon-Targaryen-(Maester-Aemon) 1.

## 2 Aeron-Greyjoy 2.

## 3 Aerys-II-Targaryen 1.

## 4 Alliser-Thorne 1.

## 5 Arianne-Martell 2.

## 6 Arya-Stark 1.

## # ... with 94 more rows

## #

## # Edge Data: 798 x 5

## from to Type id weight

## 1 1 4 Undirected 43 7

## 2 1 13 Undirected 44 4

## 3 1 28 Undirected 52 3

## # ... with 795 more rows
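To make the center query easier to read, here is a small sketch on a made-up path graph A - B - C. B has the smallest eccentricity, so it is the only center, and the other two nodes are one step away from it.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy path: A - B - C
toy_path <- data.frame(from = c("A", "B"), to = c("B", "C"))

toy_center <- as_tbl_graph(toy_path, directed = FALSE) %>%
  activate(nodes) %>%
  mutate(
    center         = node_is_center(),                   # TRUE for nodes with minimal eccentricity
    dist_to_center = node_distance_to(node_is_center())  # shortest-path distance to the center
  ) %>%
  as_tibble()

toy_center
```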

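Edge betweenness

As with nodes, we can compute all kinds of metrics on edges. Edge betweenness, for example, counts the shortest paths between pairs of nodes that run through a given edge. (Translator's aside: small-world theory?) See ?edge_types in tidygraph for more on what you can query about edges.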

as_tbl_graph(cooc_all_f, directed = FALSE) %>%

activate(edges) %>%

mutate(centrality_e = centrality_edge_betweenness())

## # A tbl_graph: 100 nodes and 798 edges

## #

## # An undirected simple graph with 1 component

## #

## # Edge Data: 798 x 6 (active)

## from to Type id weight centrality_e

## 1 1 4 Undirected 43 7 1.00

## 2 1 13 Undirected 44 4 30.2

## 3 1 28 Undirected 52 3 42.1

## 4 1 32 Undirected 53 20 0.

## 5 1 34 Undirected 54 5 35.2

## 6 1 41 Undirected 56 5 18.9

## # ... with 792 more rows

## #

## # Node Data: 100 x 1

## name

## 1 Aemon-Targaryen-(Maester-Aemon)

## 2 Aeron-Greyjoy

## 3 Aerys-II-Targaryen

## # ... with 97 more rows
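A worked toy case may help with reading these numbers. On a made-up path A - B - C - D, the middle edge lies on four shortest paths (A-C, A-D, B-C, B-D) and each outer edge on three, so the middle edge gets the highest score.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy path: A - B - C - D
toy_path <- data.frame(
  from = c("A", "B", "C"),
  to   = c("B", "C", "D")
)

toy_betweenness <- as_tbl_graph(toy_path, directed = FALSE) %>%
  activate(edges) %>%
  mutate(centrality_e = centrality_edge_betweenness()) %>%
  as_tibble()

toy_betweenness
```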

Putting it all together

Let's combine everything from above:

cooc_all_f_graph <- as_tbl_graph(cooc_all_f, directed = FALSE) %>%

mutate(n_rank_trv = node_rank_traveller(),

neighbors = centrality_degree(),

group = group_infomap(),

center = node_is_center(),

dist_to_center = node_distance_to(node_is_center()),

keyplayer = node_is_keyplayer(k = 10)) %>%

activate(edges) %>%

filter(!edge_is_multiple()) %>%

mutate(centrality_e = centrality_edge_betweenness())

We can also convert the node and edge tables into tibbles:

cooc_all_f_graph %>%

activate(nodes) %>% # %N>%

as_tibble()

## # A tibble: 100 x 7

## name n_rank_trv neighbors group center dist_to_center keyplayer

## 1 Aemon-Targa… 45 13. 2 FALSE 1. FALSE

## 2 Aeron-Greyj… 21 5. 5 FALSE 2. FALSE

## 3 Aerys-II-Ta… 11 12. 1 FALSE 1. FALSE

## 4 Alliser-Tho… 48 13. 2 FALSE 1. FALSE

## 5 Arianne-Mar… 29 4. 7 FALSE 2. FALSE

## 6 Arya-Stark 79 37. 1 FALSE 1. FALSE

## 7 Asha-Greyjoy 20 7. 5 FALSE 1. FALSE

## 8 Balon-Greyj… 18 11. 5 FALSE 2. FALSE

## 9 Barristan-S… 54 23. 3 FALSE 1. FALSE

## 10 Belwas 52 6. 3 FALSE 2. FALSE

## # ... with 90 more rows

cooc_all_f_graph %>%

activate(edges) %>% # %E>%

as_tibble()

## # A tibble: 798 x 6

## from to Type id weight centrality_e

## 1 1 4 Undirected 43 7 1.00

## 2 1 13 Undirected 44 4 30.2

## 3 1 28 Undirected 52 3 42.1

## 4 1 32 Undirected 53 20 0.

## 5 1 34 Undirected 54 5 35.2

## 6 1 41 Undirected 56 5 18.9

## 7 1 42 Undirected 57 25 0.

## 8 1 48 Undirected 58 110 0.

## 9 1 58 Undirected 60 5 24.5

## 10 1 71 Undirected 62 5 17.0

## # ... with 788 more rows
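The `# %N>%` and `# %E>%` comments in the code above point at tidygraph's shorthand pipes, which activate the nodes or the edges and pipe in one step. A small sketch on a made-up graph:

```r
library(tidygraph)

# Hypothetical toy graph: A - B - C
toy_graph <- as_tbl_graph(
  data.frame(from = c("A", "B"), to = c("B", "C")),
  directed = FALSE
)

toy_nodes <- toy_graph %N>% as_tibble()  # like activate(nodes) %>% as_tibble()
toy_edges <- toy_graph %E>% as_tibble()  # like activate(edges) %>% as_tibble()

toy_nodes
toy_edges
```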

Plotting with ggraph

ggraph is an extension of ggplot2 aimed at supporting relational data structures such as networks, graphs, and trees. While it builds upon the foundation of ggplot2 and its API it comes with its own self-contained set of geoms, facets, etc., as well as adding the concept of layouts to the grammar. https://github.com/thomasp85/ggraph

First, I need to define a layout. There are many layout options; here I use the Fruchterman-Reingold algorithm.

layout <- create_layout(cooc_all_f_graph,

layout = "fr")
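It can help to look at what `create_layout()` returns: a data frame with one row per node and `x` and `y` columns holding the computed positions, which `ggraph()` then draws. The sketch below uses a made-up triangle graph. The seed is an assumption added only so the randomly initialised layout is repeatable.

```r
library(tidygraph)
library(ggraph)

# Hypothetical toy graph: a triangle A - B - C
toy_graph <- as_tbl_graph(
  data.frame(from = c("A", "B", "C"), to = c("B", "C", "A")),
  directed = FALSE
)

set.seed(1)  # Fruchterman-Reingold starts from random positions
toy_layout <- create_layout(toy_graph, layout = "fr")

# One row per node, with the x/y coordinates next to the node attributes
head(toy_layout)

# The layout object can be passed straight to ggraph()
p <- ggraph(toy_layout) +
  geom_edge_link() +
  geom_node_point()
```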

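The rest works much like ggplot2, except that we use special geoms for networks: geom_edge_density() draws shaded areas where edges are dense, geom_edge_link() connects the nodes, geom_node_point() draws the nodes, and geom_node_text() labels them. The manual has more detail.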

ggraph(layout) +

geom_edge_density(aes(fill = weight)) +

geom_edge_link(aes(width = weight), alpha = 0.2) +

geom_node_point(aes(color = factor(group)), size = 10) +

geom_node_text(aes(label = name), size = 8, repel = TRUE) +

scale_color_brewer(palette = "Set1") +

theme_graph() +

labs(title = "A Song of Ice and Fire character network",

subtitle = "Nodes are colored by group")

Interestingly, many of the groups reflect the narrative perfectly: the men from the Night’s Watch are grouped together with the Wildlings, Stannis, Davos, Selyse and Melisandre form another group, the Greyjoys, Bran’s group in Winterfell before they left for the North, Dany and her squad and the Martells (except for Quentyn, who “belongs” to Dany – just like in the books ;-)). The big group around the remaining characters is the only one that’s not split up very well.

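The next plot uses the RColorBrewer palette “Set1”:

In this plot I highlight the most central characters, and node size represents each node's distance to the center. The two most central characters in the books are Robert and Tyrion. (Martin has, after all, said that the Imp is his favorite character.)

Characters in each book

The author then also built network graphs of the characters in each individual book.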

The second data set I am going to use is a comparison of character interactions in the five books.

A little node on the side: My original plan was to loop over the separate edge files for each book, concatenate them together with the information from which book they are and then plot them via faceting. This turned out to be a bad solution because I wanted to show the different key-players in each of the five books. So, instead of using one joined graph, I created separate graphs for every book and used the bind_graphs() and facet_nodes() functions to plot them together.
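The `bind_graphs()` plus `facet_nodes()` approach described here can be sketched on two made-up graphs. The `book` labels and node names are hypothetical and only stand in for the per-book graphs built in the post.

```r
library(dplyr)
library(tidygraph)
library(ggraph)

# Two hypothetical per-book graphs, each tagged with a book label on its nodes
g1 <- as_tbl_graph(data.frame(from = c("A", "B"), to = c("B", "C")),
                   directed = FALSE) %>%
  mutate(book = "Book 1")

g2 <- as_tbl_graph(data.frame(from = c("A", "C"), to = c("C", "D")),
                   directed = FALSE) %>%
  mutate(book = "Book 2")

# bind_graphs() keeps the graphs disjoint: nodes with the same name are not merged
toy_books <- bind_graphs(g1, g2)

# facet_nodes() then puts each book's nodes (and the edges between them) in its own panel
p <- ggraph(toy_books, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  facet_nodes(~ book)
```

Keeping the graphs separate is what lets each book have its own node attributes, such as a per-book key-player flag, which a single joined graph would not allow.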

for (i in 1:5) {

cooc <- read_csv(paste0("/Users/shiringlander/Documents/Github/Data/asoiaf/data//asoiaf-book", i, "-edges.csv")) %>%

mutate(book = paste0("book_", i)) %>%

filter(Source %in% main_ch_l$name & Target %in% main_ch_l$name)

assign(paste0("coocs_book_", i), cooc)

}

The concepts are the same as above, here I want to know the key-players in each book:

cooc_books_1_graph <- as_tbl_graph(coocs_book_1, directed = FALSE) %>%

mutate(book = "Book 1: A Game of Thrones",