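Quickly Importing a Billion Records into the hugegraph Graph Database

After working through "Quick Start with the hugegraph Graph Database" and "hugegraph Graph Database Concepts Explained", you probably want to practice by importing a realistically sized dataset into hugegraph. This article uses a public dataset from Stanford to show how to quickly import more than one billion records into the hugegraph graph database.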

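1. Environment Setup

Before importing data into hugegraph, two things need to be in place: a running hugegraph-server and the hugegraph-loader import tool. Install hugegraph-server and download hugegraph-loader by following the documentation. Running both on the same machine is enough.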

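2. Data Preparation

2.1 Downloading the Raw Data

This article uses Friendster, one of Stanford's public datasets. The file is about 31G; please download it yourself.

Once the download finishes, let's look at what the file contains.

First 10 lines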

$ head -10 com-friendster.ungraph.txt
# Undirected graph: ../../data/output/friendster.txt
# Friendster
# Nodes: 65608366 Edges: 1806067135
# FromNodeId    ToNodeId
101 102
101 104
101 107
101 125
101 165
101 168

Last 10 lines

$ tail -10 com-friendster.ungraph.txt
124802963   124804978
124802963   124814064
124804978   124805174
124804978   124805533
124804978   124814064
124805174   124805533
124806381   124806684
124806596   124809830
124814064   124829667
124820359   124826374

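As you can see, the file structure is simple. Each line represents one edge and has two columns separated by \t: the first is the source vertex Id and the second is the target vertex Id. The first few lines of the file are summary information, stating that the file contains 65608366 vertices and 1806067135 edges (lines). Both the first 10 lines and the vertex count show that many vertices appear repeatedly across these 1806067135 lines. This is unavoidable, because a flat edge file has no other way to describe the graph structure.

Readers familiar with hugegraph-loader will know that it does not yet support importing both vertices and edges in a single pass over a file. We therefore need to preprocess the edge file: deduplicate all vertex Ids and write them to a separate vertex file, so that hugegraph-loader can import vertices and edges separately.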

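2.2 Data Processing

The key to the processing is deduplication. Ignoring the data volume for a moment, we can deduplicate and write to a new file with the following steps:

Define an in-memory set so that we can check whether an Id has already been seen
Read the source file line by line, parsing two integer Ids from each line
For each Id, check whether the set contains it; if not, add it to the set and write it to the new file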
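
The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the script used in this article: the function name unique_vertex_ids and the sample lines are made up for the example, and it assumes every non-comment line holds exactly two whitespace-separated integers.

```python
def unique_vertex_ids(lines):
    """Yield each vertex Id the first time it appears in an edge list."""
    seen = set()  # in-memory set used to detect Ids we have already emitted
    for line in lines:
        # Skip header/comment lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        source_id, target_id = (int(part) for part in line.split())
        for vertex_id in (source_id, target_id):
            if vertex_id not in seen:
                seen.add(vertex_id)
                yield vertex_id


# Hypothetical sample in the same layout as the Friendster edge file
sample = ["# FromNodeId\tToNodeId\n", "101\t102\n", "101\t104\n", "102\t104\n"]
print(list(unique_vertex_ids(sample)))
```

Because the function is a generator, the caller can stream the result straight into an output file, but the set itself still has to fit in memory.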

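Relying on an in-memory set is how we deduplicate, and it is the core idea of the data processing. One question remains, though: can the set hold every distinct vertex Id? We can estimate: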

// 65608366 vertex Ids
// Budget 32 bytes per Id (a raw 32-bit integer takes only 4 bytes, so this leaves room for container overhead)
(65608366 * 32) / (1024 * 1024 * 1024) ≈ 1.9G
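
To make the estimate easy to repeat for other datasets, here is a small Python sketch of the same arithmetic. The helper name estimate_gib is hypothetical, and the two per-Id sizes are assumptions: 4 bytes is the size of a raw 32-bit integer, while 32 bytes is the per-Id budget used in the calculation above. The real footprint of a set depends on the language and runtime.

```python
VERTEX_COUNT = 65608366  # taken from the "# Nodes:" line of the file header


def estimate_gib(id_count, bytes_per_id):
    """Return the memory needed for id_count Ids, in GiB."""
    return id_count * bytes_per_id / (1024 ** 3)


# Compare a raw 4-byte integer with the 32-byte budget used above
for bytes_per_id in (4, 32):
    size = estimate_gib(VERTEX_COUNT, bytes_per_id)
    print("%d bytes per Id: %.2f GiB" % (bytes_per_id, size))
```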

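Fortunately, the vast majority of machines today can hold 1.9G of data in memory, unless you have not replaced your computer in more than a decade. You can therefore write a quick script of your own that deduplicates using the logic above.

Still, I will introduce a more general approach below, in case your next dataset has vertex Ids that need 3.9G, 5.9G, or 7.9G of memory, at which point some readers' machines would no longer be able to hold them.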

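The approach described below is common in large-scale data processing. Its core idea is divide and conquer:

Split all the original vertex Ids into several reasonably even parts, such that no Id appears in more than one part, while duplicates within a part are allowed;
Apply the deduplication method above to each part.

How do we split all the vertex Ids into reasonably even parts? Because the vertex Ids are essentially consecutive numbers, we can hash by remainder and write all Ids with the same remainder to the same file. For example, to split into 10 parts, create 10 files numbered 0 to 9 and take each vertex Id modulo 10: Ids with remainder 0 go to file 0, Ids with remainder 1 go to file 1, and so on.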
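
The remainder-based split can be illustrated in memory before moving to files. The sketch below is only a toy version of the idea: the names shard_index and split_into_shards and the sample Ids are made up for the example, and real data would be written to shard files rather than kept in lists.

```python
from collections import defaultdict


def shard_index(vertex_id, shard_count):
    """Pick the shard for an Id by taking the remainder."""
    return vertex_id % shard_count


def split_into_shards(vertex_ids, shard_count):
    """Group Ids by shard; duplicates of an Id always land in the same shard."""
    shards = defaultdict(list)
    for vertex_id in vertex_ids:
        shards[shard_index(vertex_id, shard_count)].append(vertex_id)
    return dict(shards)


# Hypothetical Ids with duplicates, split into 10 shards
ids = [101, 102, 101, 111, 125, 102]
shards = split_into_shards(ids, 10)

# Deduplicating each shard independently gives the global distinct count
distinct_total = sum(len(set(shard)) for shard in shards.values())
print(shards, distinct_total)
```

Since equal Ids always have equal remainders, no Id can appear in two shards, which is what makes per-shard deduplication safe.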

I have already written a script that follows the logic above. The code is as follows:

#!/usr/bin/env python3
# coding=utf-8
import os


def ensure_file_exist(shard_file_dict, shard_file_path, shard_prefix, index):
    # Open the shard file for this index the first time it is needed
    if index not in shard_file_dict:
        name = os.path.join(shard_file_path, shard_prefix + str(index))
        shard_file_dict[index] = open(name, "w")


if __name__ == '__main__':

    raw_file_path = "path/raw_file.txt"
    output_file_path = "path/de_dup.txt"
    shard_file_path = "path/shard/"
    shard_prefix = "shard_"
    shard_count = 100
    shard_file_dict = {}

    # Make sure the shard directory exists before writing into it
    os.makedirs(shard_file_path, exist_ok=True)

    # Split into many shard files
    with open(raw_file_path, "r") as raw_file:
        # Read next line
        for raw_line in raw_file:
            # Skip comment line
            if raw_line.startswith('#'):
                continue
            # Split on whitespace (the columns are tab-separated)
            parts = raw_line.split()
            assert len(parts) == 2

            source_node_id = int(parts[0])
            target_node_id = int(parts[1])
            # Calculate the residue by shard_count
            source_node_residue = source_node_id % shard_count
            target_node_residue = target_node_id % shard_count

            # Create new file if it doesn't exist
            ensure_file_exist(shard_file_dict, shard_file_path, shard_prefix,
                              source_node_residue)
            ensure_file_exist(shard_file_dict, shard_file_path, shard_prefix,
                              target_node_residue)

            # Append to file with corresponding index
            shard_file_dict[source_node_residue].write(str(source_node_id) + '\n')
            shard_file_dict[target_node_residue].write(str(target_node_id) + '\n')

    print("Split original file into %s shard files" % shard_count)

    # Close all files
    for shard_file in shard_file_dict.values():
        shard_file.close()

    print("Prepare to deduplicate and merge shard files into %s" % output_file_path)
    line_count = 0

    # Deduplicate and merge into another file, one shard at a time
    with open(output_file_path, "w") as merge_file:
        for index in sorted(shard_file_dict):
            name = os.path.join(shard_file_path, shard_prefix + str(index))
            with open(name, "r") as shard_file:
                # Only the Ids of the current shard are held in memory
                elems = set()
                # Read next line
                for raw_line in shard_file:
                    # Filter duplicate elems
                    if raw_line not in elems:
                        elems.add(raw_line)
                        merge_file.write(raw_line)
                        line_count += 1
            print("Processed shard file %s" % name)

    print("Processed all shard files and merged into %s" % output_file_path)
    print("%s lines after processing the file" % line_count)

    print("Finished")

Before using this script, change raw_file_path, output_file_path, and shard_file_path to your own paths.

After processing, let's look at the deduplicated vertex file:

$ head -10 com-friendster.ungraph.vertex.txt
1007000
310000
1439000
928000
414000
1637000
1275000
129000
2537000
5356000

Check how many lines the file has:

$ wc -l com-friendster.ungraph.vertex.txt
65608366 com-friendster.ungraph.vertex.txt

As you can see, the line count matches the vertex count given in the file header.
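
The same check can be scripted by reading the expected counts from the header line of the edge file. The sketch below is only an illustration: the names parse_counts and count_lines are hypothetical, and the regular expression assumes the header keeps the "# Nodes: N Edges: M" layout shown earlier.

```python
import re

# Matches the summary line shown in the edge file header
HEADER_PATTERN = re.compile(r"#\s*Nodes:\s*(\d+)\s+Edges:\s*(\d+)")


def parse_counts(header_line):
    """Return (vertex_count, edge_count), or None if the line is not the summary."""
    match = HEADER_PATTERN.match(header_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def count_lines(lines):
    """Count lines lazily, so a large file is never loaded into memory."""
    return sum(1 for _ in lines)


expected_vertices, expected_edges = parse_counts("# Nodes: 65608366 Edges: 1806067135")
print(expected_vertices, expected_edges)
```

In practice count_lines would be given an open file object for the vertex file, and its result compared with expected_vertices.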

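There are certainly other ways to do this besides the one I described, such as MapReduce, a workhorse of big data processing. Pick whichever you prefer, as long as it extracts the vertex Ids and deduplicates them.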

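3. Import Preparation

3.1 Building the Graph Model

Since neither the vertices nor the edges have any properties other than the Id, the graph schema is very simple.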

schema.propertyKey("id").asInt().ifNotExist().create();
// Use id as the primary key
schema.vertexLabel("person").primaryKeys("id").properties("id").ifNotExist().create();
schema.edgeLabel("friend").sourceLabel("person").targetLabel("person").ifNotExist().create();

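3.2 Writing the Input Source Mapping File

There is only one vertex file and one edge file here, and both use \t as the delimiter, so set input.format to TEXT and leave input.delimiter at its default.

Vertices have a single property, id, and the vertex file has no header row naming its columns, so we must explicitly set input.header to ["id"]. The purpose of input.header is to tell hugegraph-loader the name of each column in the file. Note, however, that a column name is not necessarily the property name of a vertex or edge; the description file has a mapping field that maps column names to property names.

Edges have no properties, and the edge file contains only the Ids of the source and target vertices. We first set input.header to ["source_id", "target_id"], which gives the two Id columns different names. We then set source and target to ["source_id"] and ["target_id"] respectively. The purpose of source and target is to tell hugegraph-loader which columns in the file the source and target vertex Ids are related to.

Note what "related to" means here. When the vertex Id strategy is PRIMARY_KEY, the columns named by source and target are the primary key columns (after applying mapping), which are concatenated to generate the vertex Id. When the vertex Id strategy is CUSTOMIZE_STRING or CUSTOMIZE_NUMBER, the columns named by source and target are the Id columns themselves (after applying mapping).

Because the vertex Id strategy here is PRIMARY_KEY, the columns ["source_id"] and ["target_id"] named by source and target are treated as primary key columns. The mapping field then maps both source_id and target_id to id. This way, when hugegraph-loader parses a value from the source_id column, it interprets it as id:value and uses the vertex Id concatenation algorithm to generate the source vertex Id (the target vertex is handled in the same way).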

{
  "vertices": [
    {
      "label": "person",
      "input": {
        "type": "file",
        "path": "path/com-friendster.ungraph.vertex.txt",
        "format": "TEXT",
        "header": ["id"],
        "charset": "UTF-8"
      }
    }
  ],
  "edges": [
    {
      "label": "friend",
      "source": ["source_id"],
      "target": ["target_id"],
      "input": {
        "type": "file",
        "path": "path/com-friendster.ungraph.txt",
        "format": "TEXT",
        "header": ["source_id", "target_id"],
        "comment_symbols": ["#"]
      },
      "mapping": {
        "source_id": "id",
        "target_id": "id"
      }
    }
  ]
}

Because the first few lines of the edge file are comments, "comment_symbols": ["#"] is used to make hugegraph-loader ignore lines that start with #.
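
To make the roles of header, mapping, and comment_symbols more concrete, here is a small Python sketch of how one edge line could be turned into named values. It is purely illustrative and is not hugegraph-loader's implementation: the functions parse_row and to_properties are hypothetical, and the sketch stops before vertex Id generation, because the concatenation algorithm is not described in this article.

```python
def parse_row(line, header, comment_symbols=("#",), delimiter="\t"):
    """Name the columns of one line using header; return None for comment lines."""
    if line.startswith(tuple(comment_symbols)):
        return None
    values = line.rstrip("\n").split(delimiter)
    return dict(zip(header, values))


def to_properties(row, columns, mapping):
    """Rename the selected columns to property names using mapping."""
    return {mapping.get(column, column): row[column] for column in columns}


header = ["source_id", "target_id"]
mapping = {"source_id": "id", "target_id": "id"}

row = parse_row("101\t102\n", header)
source_key = to_properties(row, ["source_id"], mapping)  # the "source" columns
target_key = to_properties(row, ["target_id"], mapping)  # the "target" columns
print(source_key, target_key)
```

Both endpoints end up keyed by the property name id, which is why two differently named columns can refer to the same primary key of the person label.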

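For more about the mapping file, see the section on writing the input source mapping file in the official hugegraph-loader documentation.

4. Running the Import

Go to the hugegraph-loader directory and run the following command (remember to adjust the paths):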

$ bin/hugegraph-loader -g hugegraph -f ../data/com-friendster/struct.json -s ../data/com-friendster/schema.groovy --check-vertex false

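hugegraph-loader then starts importing the data and prints its progress to the console. Once all vertices and edges have been imported, you will see the following statistics: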

Vertices has been imported: 65608366
Edges has been imported: 1806067135
---------------------------------------------
vertices results:
    parse failure vertices   :  0
    insert failure vertices  :  0
    insert success vertices  :  65608366
---------------------------------------------
edges results:
    parse failure edges      :  0
    insert failure edges     :  0
    insert success edges     :  1806067135
---------------------------------------------
time results:
    vertices loading time    :  200
    edges loading time       :  8089
    total loading time       :  8289

The import rates for vertices and edges are therefore 65608366 / 200 = 328041.83 vertices per second and 1806067135 / 8089 = 223274.46 edges per second, respectively.
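
The rates can be recomputed for your own run with a couple of lines of Python. This sketch assumes, as the calculation above does, that the loading times in the statistics are in seconds; the function name throughput is made up for the example.

```python
def throughput(count, seconds):
    """Return the number of elements imported per second."""
    return count / seconds


# Counts and times copied from the statistics printed above
vertex_rate = throughput(65608366, 200)
edge_rate = throughput(1806067135, 8089)
print("%.2f vertices/s, %.2f edges/s" % (vertex_rate, edge_rate))
```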