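Several Solutions for Syncing Data Between MySQL and Elasticsearch

Elasticsearch (ES) is often used to solve MySQL query performance problems on large data sets, which makes keeping the two in sync critical. There are many MySQL-to-ES sync approaches out there; here I summarize the three I have actually used.

Canal, an open-source tool from Alibaba, can also solve this problem. It works much like go-mysql-elasticsearch below: both sync by watching the MySQL binlog. I haven't used it myself, so I won't cover it here; look it up if you're interested.

Option 1: Implement it in application code

My project uses the Laravel framework, so I implemented the sync with a Laravel Redis queue plus the ES API.

How it works: after the code inserts a row into MySQL, it dispatches an asynchronous job onto the Laravel Redis queue, and the job calls the ES API to write the data into ES.

The advantage of this approach is that it is simple to implement and maintain; the drawback is that it is tightly coupled to the business code.

PS: I won't cover how to connect Laravel to ES here. If you're not familiar with it, see my earlier article: "Installing ES with Docker and Integrating It into Laravel".

First create the index in ES (this is an online store project, so I use adding a product as the example).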

PUT /products/
{
  "mappings": {
    "properties": {
      "name":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "long_name":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "brand_id":{
        "type": "integer"
      },
      "category_id":{
        "type":"integer"
      },
      "shop_id":{
        "type":"integer"
      },
      "price":{
        "type":"scaled_float",
        "scaling_factor":100
      },
      "sold_count":{
        "type":"integer"
      },
      "review_count":{
        "type":"integer"
      },
      "status":{
        "type":"integer"
      },
      "create_time" : {
          "type" : "date"
      },
      "last_time" : {
          "type" : "date"
      }
    }
  }
}

Switch the Laravel queue driver to Redis

# Edit this in the .env file
QUEUE_CONNECTION=redis
# To change other defaults, edit config/queue.php

Add the following to the product model (App\Models\Product.php)

/**
* Pick the fields to sync to ES.
* Requires `use Illuminate\Support\Arr;` at the top of the model file.
* @return array
*/
public function toESArray()
{
    $arr = Arr::only($this->toArray(), [
        'id',
        'name',
        'long_name',
        'brand_id',
        'category_id',
        'shop_id',
        'price',
        'sold_count',
        'review_count',
        'status',
        'create_time',
        'last_time'
    ]);

    return $arr;
}

Create the queue job
```
php artisan make:job SyncProductToES
```

Write the job code

<?php

namespace App\Jobs;

use App\Models\Product;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class SyncProductToES implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    protected $product;

    /**
     * Create a new job instance.
     *
     * @return void
     */
    public function __construct(Product $product)
    {
        $this->product = $product;
    }

    /**
     * Execute the job.
     *
     * @return void
     */
    public function handle()
    {
        $data = $this->product->toESArray();
        app('es')->index([
            'index' => 'products',
            'type'  => '_doc',
            'id'    => $data['id'],
            'body'  => $data,
        ]);
    }
}

Dispatch the job wherever data needs to be synced

$form->saved(function (Form $form) {
    $product = $form->model();
    dispatch(new SyncProductToES($product));
});

Start the queue worker
```
php artisan queue:work
```

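Importing existing MySQL data into ES

The steps above give you incremental sync: every newly added row is written to ES. For a full sync of the existing data, I create an Artisan command.

Create the command

php artisan make:command Elasticsearch/SyncProducts

Write the code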

<?php

namespace App\Console\Commands\Elasticsearch;

use App\Models\Product;
use Illuminate\Console\Command;

class SyncProducts extends Command
{
    /**
     * The name and signature of the console command.
     *
     * @var string
     */
    protected $signature = 'es:sync-products';

    /**
     * The console command description.
     *
     * @var string
     */
    protected $description = 'Sync product data to Elasticsearch';

    /**
     * Create a new command instance.
     *
     * @return void
     */
    public function __construct()
    {
        parent::__construct();
    }

    /**
     * Execute the console command.
     */
    public function handle()
    {
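        // Get the ES client
        $es = app('es');

        Product::query()
            // Use chunkById to avoid loading too much data at once
            ->chunkById(100, function ($products) use ($es) {
                $this->info(sprintf('Syncing products with IDs %s to %s', $products->first()->id, $products->last()->id));
                // Initialize the request body
                $req = ['body' => []];
                // Loop over the products
                foreach ($products as $product) {
                    // Convert the product model to the array used by ES
                    $data = $product->toESArray();

                    $req['body'][] = [
                        'index' => [
                            '_index' => 'products',
                            '_type'  => '_doc',
                            '_id'    => $data['id'],
                        ],
                    ];
                    $req['body'][] = $data;
                }
                try {
                    // Bulk-index the whole batch in one request
                    $response = $es->bulk($req);
                    // bulk() does not throw on per-document failures, so check the errors flag
                    if (!empty($response['errors'])) {
                        $this->error('Some documents in this batch failed to index');
                    }
                } catch (\Exception $e) {
                    $this->error($e->getMessage());
                }
            });
        $this->info('Sync complete');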
    }
}
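
For readers who want to see what a bulk request actually looks like on the wire, here is a minimal Python sketch, independent of Laravel. The ES bulk API takes newline-delimited JSON in which every document is preceded by an action line. The function name `build_bulk_body` and the sample rows are made up for illustration, and the sketch leaves out `_type`, which is deprecated in ES 7.

```python
import json


def build_bulk_body(products, index="products"):
    """Return an NDJSON bulk payload: one action line plus one source line per product."""
    lines = []
    for product in products:
        # Action line: tells ES which index and document ID the next line belongs to.
        lines.append(json.dumps({"index": {"_index": index, "_id": product["id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(product, ensure_ascii=False))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample rows standing in for Product::toESArray() output.
rows = [
    {"id": 1, "name": "phone", "price": 1999.0},
    {"id": 2, "name": "laptop", "price": 5999.0},
]
print(build_bulk_body(rows), end="")
```

Because each action line carries the product ID as `_id`, re-running the import overwrites existing documents instead of creating duplicates.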

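Test the command

php artisan es:sync-products

Deploying to production

In production you generally want to install Horizon, the queue management tool, and Supervisor, the process monitor, to manage the queues better and improve stability. For installing and configuring both, the Laravel documentation is very detailed: https://learnku.com/docs/laravel/7.x/horizon/7514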

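Option 2: Use the go-mysql-elasticsearch tool

go-mysql-elasticsearch is an open-source, high-performance tool for syncing MySQL data to ES. It is written in Go and is very easy to build and use.

How it works: it uses mysqldump to take a snapshot of the current MySQL data, then reads incremental changes starting from the binlog name and position at that moment, and turns the binlog events into RESTful API requests that write the data to ES.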
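
The incremental half of that flow can be pictured as translating row-change events into ES bulk actions. The Python sketch below is only an illustration under simplifying assumptions: the event dictionaries (`action`, `row`) and the function `event_to_bulk_lines` are invented here, and the real tool parses binlog row events in Go and handles more cases.

```python
def event_to_bulk_lines(event, index):
    """Translate one simplified row-change event into ES bulk action dicts."""
    # The primary key becomes the document _id, which is how later
    # updates and deletes find the same document again.
    doc_id = str(event["row"]["id"])
    meta = {"_index": index, "_id": doc_id}
    if event["action"] == "insert":
        return [{"index": meta}, event["row"]]
    if event["action"] == "update":
        return [{"update": meta}, {"doc": event["row"]}]
    if event["action"] == "delete":
        return [{"delete": meta}]  # a delete action has no source line
    raise ValueError(f"unknown action: {event['action']}")


# Hypothetical events, in the order they would appear in the binlog.
events = [
    {"action": "insert", "row": {"id": 1, "name": "phone"}},
    {"action": "update", "row": {"id": 1, "name": "smartphone"}},
    {"action": "delete", "row": {"id": 1}},
]
for event in events:
    print(event_to_bulk_lines(event, "products"))
```

The sketch also shows why a primary key matters: without a stable `_id`, the update and delete actions would have nothing to target.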

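The advantages are very high sync performance and complete decoupling from the business code. The drawbacks are higher development cost and relatively complex usage: you need to install a Go environment, and syncing multi-table joins is tedious.

Notes (important, be sure to read)

The GitHub documentation states the version requirements as MySQL < 8.0 and ES < 6.0.

In my testing, however, with MySQL 8.0.26 and ES 7.12.1, incremental sync still works. What doesn't work is using mysqldump to sync existing data: MySQL 8.0 changed quite a lot from earlier versions, and the current go-mysql-elasticsearch release doesn't support mysqldump on MySQL 8.0 yet.

The MySQL binlog format must be ROW.

You must set this parameter to row in the MySQL configuration file: binlog_format=row

Every MySQL table to be synced must have a primary key; tables without one are skipped. Without a primary key, UPDATE and DELETE operations cannot be synced because the corresponding document cannot be located in ES.

Do not change the MySQL table structure while go-mysql-elasticsearch is running.

Install Go

Official download page, pick whichever version you need: https://golang.google.cn/dl/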

[root@VM-0-8-centos]# wget https://golang.google.cn/dl/go1.15.5.linux-amd64.tar.gz
[root@VM-0-8-centos]# tar -C /usr/local -zxvf go1.15.5.linux-amd64.tar.gz

Or, on CentOS, install it directly from the yum repository

yum install -y go

Configure the environment variables (GOPATH is the directory where Go project code lives)

[root@VM-0-8-centos go]# vim /etc/profile

export GOROOT=/usr/local/go
export GOPATH=/usr/local/app/go
export PATH=$PATH:/usr/local/go/bin

[root@VM-0-8-centos go]# source /etc/profile

Test by checking the Go version

[root@VM-0-8-centos]# go version
go version go1.15.5 linux/amd64
[root@VM-0-8-centos]#

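Install go-mysql-elasticsearch

Install the dependencies

yum install -y gettext-devel openssl-devel perl-CPAN perl-devel zlib-devel

Install go-mysql-elasticsearch

PS: GitHub is often unreachable from mainland China, so if this command fails to fetch, use a proxy and download the source package from GitHub instead.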

go get github.com/siddontang/go-mysql-elasticsearch

Once downloaded, the source is placed under the GOPATH configured above. Change into it and run make

[root@VM-0-8-centos ~]# cd $GOPATH/src/github.com/siddontang/go-mysql-elasticsearch
[root@VM-0-8-centos go-mysql-elasticsearch]# ls
clear_vendor.sh  cmd  Dockerfile  elastic  etc  go.mod  go.sum  LICENSE  Makefile  README.md  river
[root@VM-0-8-centos go-mysql-elasticsearch]# make

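After the build finishes, edit the configuration file, which lives in the etc directory of the downloaded source tree.

I have annotated every setting that needs changing; the rest can stay at their defaults.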

[root@VM-0-8-centos go-mysql-elasticsearch]# vim etc/river.toml

# MySQL address, user and password
# user must have replication privilege in MySQL.
my_addr = "127.0.0.1:3306"  # MySQL address and port
my_user = "root"         # MySQL user name
my_pass = ""             # MySQL password
my_charset = "utf8"          # MySQL character set

# Set true when elasticsearch use https
#es_https = false
# Elasticsearch address  
es_addr = "127.0.0.1:9200"  # ES address and port
# Elasticsearch user and password, maybe set by shield, nginx, or x-pack
es_user = ""                # ES user name; leave empty if there is none
es_pass = ""             # ES password; leave empty if there is none

# Path to store data, like master.info, if not set or empty,
# we must use this to support breakpoint resume syncing. 
# TODO: support other storage, like etcd. 
data_dir = "./var"           # Data storage directory

# Inner Http status address
stat_addr = "127.0.0.1:12800"
stat_path = "/metrics"

# pseudo server id like a slave 
server_id = 1001

# mysql or mariadb
flavor = "mysql"

# mysqldump execution path
# if not set or empty, ignore mysqldump.
mysqldump = "mysqldump"      # If left empty, existing MySQL data is not synced

# if we have no privilege to use mysqldump with --master-data,
# we must skip it.
#skip_master_data = false

# minimal items to be inserted in one bulk
bulk_size = 128

# force flush the pending requests if we don't have enough items >= bulk_size
flush_bulk_time = "200ms"

# Ignore table without primary key
skip_no_pk_table = false

# MySQL data source
[[source]]
schema = "test"      # MySQL database to sync

# Only below tables will be synced into Elasticsearch.
# "t_[0-9]{4}" is a wildcard table format, you can use it if you have many sub tables, like table_0000 - table_1023
# I don't think it is necessary to sync all tables in a database.
tables = ["t", "t_[0-9]{4}", "tfield", "tfilter"] # MySQL tables to sync

# Below is for special rule mapping

# Very simple example
# 
# desc t;
# +-------+--------------+------+-----+---------+-------+
# | Field | Type         | Null | Key | Default | Extra |
# +-------+--------------+------+-----+---------+-------+
# | id    | int(11)      | NO   | PRI | NULL    |       |
# | name  | varchar(256) | YES  |     | NULL    |       |
# +-------+--------------+------+-----+---------+-------+
# 
# The table `t` will be synced to ES index `test` and type `t`.
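# Define the mapping between MySQL and ES; add one [[rule]] per table and delete the extra ones below
[[rule]]
schema = "test"      # MySQL database to sync
table = "t"          # MySQL table to sync
index = "test"       # Target ES index
type = "t"           # Target ES type; since ES 7 there is only one type, so it must be _doc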

# Wildcard table rule, the wildcard table must be in source tables 
# All tables which match the wildcard format will be synced to ES index `test` and type `t`.
# In this example, all tables must have same schema with above table `t`;
[[rule]]
schema = "test"
table = "t_[0-9]{4}"
index = "test"
type = "t"

# Simple field rule 
#
# desc tfield;
# +----------+--------------+------+-----+---------+-------+
# | Field    | Type         | Null | Key | Default | Extra |
# +----------+--------------+------+-----+---------+-------+
# | id       | int(11)      | NO   | PRI | NULL    |       |
# | tags     | varchar(256) | YES  |     | NULL    |       |
# | keywords | varchar(256) | YES  |     | NULL    |       |
# +----------+--------------+------+-----+---------+-------+
#
[[rule]]
schema = "test"
table = "tfield"
index = "test"
type = "tfield"

# This section maps MySQL columns to ES fields; delete it if the names are identical
[rule.field]
# Map column `id` to ES field `es_id`
id="es_id"       # MySQL column id maps to ES field es_id; the entries below work the same way
# Map column `tags` to ES field `es_tags` with array type 
tags="es_tags,list"
# Map column `keywords` to ES with array type
keywords=",list"

# Filter rule 
#
# desc tfilter;
# +-------+--------------+------+-----+---------+-------+
# | Field | Type         | Null | Key | Default | Extra |
# +-------+--------------+------+-----+---------+-------+
# | id    | int(11)      | NO   | PRI | NULL    |       |
# | c1    | int(11)      | YES  |     | 0       |       |
# | c2    | int(11)      | YES  |     | 0       |       |
# | name  | varchar(256) | YES  |     | NULL    |       |
# +-------+--------------+------+-----+---------+-------+
#
[[rule]]
schema = "test"
table = "tfilter"
index = "test"
type = "tfilter"

# Only sync following columns
filter = ["id", "name"]      # Which MySQL columns to sync

# id rule
#
# desc tid_[0-9]{4};
# +----------+--------------+------+-----+---------+-------+
# | Field    | Type         | Null | Key | Default | Extra |
# +----------+--------------+------+-----+---------+-------+
# | id       | int(11)      | NO   | PRI | NULL    |       |
# | tag      | varchar(256) | YES  |     | NULL    |       |
# | desc     | varchar(256) | YES  |     | NULL    |       |
# +----------+--------------+------+-----+---------+-------+
#
[[rule]]
schema = "test"
table = "tid_[0-9]{4}"
index = "test"
type = "t"
# The es doc's id will be `id`:`tag`
# It is useful for merge muliple table into one type while theses tables have same PK 
id = ["id", "tag"]
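
To make the `[rule.field]` and `id` rules above easier to picture, here is a small Python sketch of how I understand them from the comments in the sample file: a value like `"es_tags,list"` renames the column and splits a comma-separated string into an array, an empty name keeps the column name, and a multi-column `id` is joined with `:`. The function `apply_rule` and its arguments are hypothetical and only mimic the behavior; this is not the tool's actual code.

```python
def apply_rule(row, field_rules=None, id_columns=None):
    """Return (doc_id, document) for one MySQL row under simplified rule semantics."""
    field_rules = field_rules or {}
    id_columns = id_columns or ["id"]
    doc = {}
    for column, value in row.items():
        rule = field_rules.get(column, "")
        # "es_name,modifier": either part may be empty.
        es_name, _, modifier = rule.partition(",")
        es_name = es_name or column  # empty name keeps the MySQL column name
        if modifier == "list" and isinstance(value, str):
            value = value.split(",")  # "a,b,c" becomes ["a", "b", "c"]
        doc[es_name] = value
    # Composite IDs are the listed column values joined with ":".
    doc_id = ":".join(str(row[c]) for c in id_columns)
    return doc_id, doc


rules = {"id": "es_id", "tags": "es_tags,list", "keywords": ",list"}
print(apply_rule({"id": 7, "tags": "a,b", "keywords": "x,y"}, rules))
print(apply_rule({"id": 1, "tag": "red", "desc": "demo"}, id_columns=["id", "tag"]))
```

The composite ID is what lets several sharded tables with overlapping primary keys land in one index without overwriting each other.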

Here is the configuration file I used for this test, with all comments removed so it reads more cleanly

my_addr = "172.17.0.4:3306"  
my_user = "root"
my_pass = "root"
my_charset = "utf8"

es_addr = "172.17.0.7:9200"
es_user = ""
es_pass = ""

data_dir = "/docker/data"

stat_addr = "127.0.0.1:12800"
stat_path = "/metrics"
server_id = 1001
flavor = "mysql"
mysqldump = ""
bulk_size = 128
flush_bulk_time = "200ms"
skip_no_pk_table = false

[[source]]
schema = "lmrs"
tables = ["lmrs_products"]

[[rule]]
schema = "lmrs"
table = "lmrs_products"
index = "products"
type = "_doc"
filter = ["id","name","long_name","brand_id","shop_id","price","sold_count","review_count","status","create_time","last_time","three_category_id"]

[rule.field]
three_category_id = "category_id"

Start go-mysql-elasticsearch; output like the following means it is working

[root@VM-0-8-centos go-mysql-elasticsearch]# ./bin/go-mysql-elasticsearch -config=./etc/river.toml
[2021/08/01 13:37:06] [info] binlogsyncer.go:141 create BinlogSyncer with config {1001 mysql 127.0.0.1 3306 root   utf8mb4 false false <nil> false UTC false 0 0s 0s 0 false 0}
[2021/08/01 13:37:06] [info] dump.go:180 skip dump, use last binlog replication pos (mysql-bin.000001, 2606) or GTID set <nil>
[2021/08/01 13:37:06] [info] binlogsyncer.go:362 begin to sync binlog from position (mysql-bin.000001, 2606)
[2021/08/01 13:37:06] [info] binlogsyncer.go:211 register slave for master server 127.0.0.1:3306
[2021/08/01 13:37:06] [info] sync.go:25 start sync binlog at binlog file (mysql-bin.000001, 2606)
[2021/08/01 13:37:06] [info] binlogsyncer.go:731 rotate to (mysql-bin.000001, 2606)
[2021/08/01 13:37:06] [info] sync.go:71 rotate binlog to (mysql-bin.000001, 2606)
[2021/08/01 13:37:06] [info] master.go:54 save position (mysql-bin.000001, 2606)

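If the two installation steps above feel like too much trouble, you can run go-mysql-elasticsearch with Docker instead; the image ships with the Go environment.

Pull the image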
```
docker pull gozer/go-mysql-elasticsearch
```
Create the container; the river.toml configuration file has the same content as above
```
docker run -p 12345:12345 -d --name go-mysql-es -v /docker/go-mysql-es/river.toml:/config/river.toml --privileged=true gozer/go-mysql-elasticsearch
```

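Option 3: Use Logstash

Logstash is a free and open server-side data processing pipeline that ingests data from many sources, transforms it, and sends it to your favorite "stash". It integrates with a wide range of deployments and offers a large set of plugins to help you parse, enrich, transform, and buffer data from various sources. If your data needs processing that Beats does not provide, you add Logstash to the deployment.

It is not limited to MySQL-to-ES sync. Other use cases include log search (Logstash collects, processes, and forwards logs to Elasticsearch for storage, and Kibana displays them) and ELK log analysis (Elasticsearch + Logstash + Kibana).

It can do both a full sync of existing data and an incremental sync of new data, and it places no version restrictions on MySQL or ES as long as you use the matching versions.

Install

Official download page: https://www.elastic.co/cn/downloads/past-releases#logstash

PS: the Logstash version must match the ES version. My ES is 7.12.1, so I downloaded Logstash 7.12.1 as well.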
```
wget https://artifacts.elastic.co/downloads/logstash/logstash-7.12.1-linux-x86_64.tar.gz
```
You can also install it with Docker, which is more convenient
```
docker pull logstash:7.12.1
```

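Install two plugins

logstash-input-jdbc: the plugin that connects to MySQL and reads data (bundled since 6.0; installing it again reports an error)

logstash-output-elasticsearch: the plugin that writes data to ES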

[root@localhost]# tar -C /usr/local -zxvf logstash-7.12.1-linux-x86_64.tar.gz
[root@localhost]# cd /usr/local/logstash-7.12.1/bin
[root@localhost bin]# ./logstash-plugin install logstash-input-jdbc
...
ERROR: Installation aborted, plugin 'logstash-input-jdbc' is already provided by 'logstash-integration-jdbc'
[root@localhost bin]# ./logstash-plugin install logstash-output-elasticsearch
...
Installation successful
[root@localhost bin]#

Download the MySQL JDBC driver (the mysql-connector-java JAR); its version should match your MySQL version

[root@localhost logstash-7.12.1]# mkdir pipeline
[root@localhost logstash-7.12.1]# cd pipeline/
[root@localhost pipeline]# wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.26/mysql-connector-java-8.0.26.jar

Edit the configuration files

[root@localhost logstash-7.12.1]# vi config/logstash.yml
# Add the following; the second line is the ES address, adjust it to your setup
http.host: "0.0.0.0"
xpack.monitoring.elasticsearch.hosts: ["http://172.17.0.2:9200"]

[root@localhost logstash-7.12.1]# vi config/pipelines.yml
# Add the following; again, adjust the path to your setup
- pipeline.id: table1
  path.config: "/usr/local/logstash-7.12.1/pipeline/logstash.config"

Create the logstash.config file referenced in the configuration above

vi pipeline/logstash.config

input {
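    stdin {}
    # You can define multiple jdbc blocks to sync different tables
    jdbc {
        # Type, used to tell the jdbc blocks apart so the output can branch on it
        type => "product"
        # Note: the MySQL address must be an IP; do not use localhost or similar
        jdbc_connection_string => "jdbc:mysql://172.17.0.4:3306/lmrs"
        jdbc_user => "root"
        jdbc_password => "root"
        # Number of database reconnection attempts
        connection_retry_attempts => "3"
        # Database connection validation timeout, default 3600s
        jdbc_validation_timeout => "3600"
        # The JAR downloaded above; absolute or relative path, just make sure it is correct
        jdbc_driver_library => "/usr/local/logstash-7.12.1/pipeline/mysql-connector-java-8.0.26.jar"
        # Driver class name (Connector/J 8.x uses com.mysql.cj.jdbc.Driver; com.mysql.jdbc.Driver is the deprecated legacy name)
        jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
        # Enable paging, default false
        jdbc_paging_enabled => "true"
        # Rows per page (default 100000; lower it if the rows have many columns)
        jdbc_page_size => "50000"
        # The SQL to run; the rows it returns are synced to ES.
        # Filtering on :sql_last_value makes each run fetch only rows newer than the last tracked id.
        statement => "select id,`name`,long_name,brand_id,three_category_id as category_id,shop_id,price,status,sold_count,review_count,create_time,last_time from lmrs_products where id > :sql_last_value order by id asc"
        # Path to a SQL file to run; use either this or statement above, not both
        # statement_filepath => "/usr/local/logstash-7.12.1/pipeline/products.sql"
        # Whether to lowercase column names, default true (set to false if the data is serialized or deserialized elsewhere)
        lowercase_column_names => false
        # Set to true to track the value of a column from the query results; when false, sql_last_value is the timestamp of the last run
        use_column_value => true
        # The column to track for incremental sync; it must be a column in the query results
        tracking_column => id
        # Data type of the tracked column
        tracking_column_type => numeric
        # Whether to persist the last tracked value
        record_last_run => true
        # Path where the last sql_last_value is stored; Logstash needs read and write permission on it.
        # To start from a specific value, create the file by hand with that value.
        last_run_metadata_path => "/usr/local/logstash-7.12.1/pipeline/products.txt"
        # Whether to discard the state saved in last_run_metadata_path; must be false for incremental sync
        clean_run => false
        # Schedule in cron syntax: minute, hour, day of month, month, day of week. All * means run once per minute
        schedule => "* * * * *"
    }
}
output {
    # Branch on the type
    if [type] == "product" {
        # ES settings
        elasticsearch {
            hosts => "172.17.0.2:9200"
            index => "products"
            document_type => "_doc"
            document_id => "%{id}"
        }
    }

    # Log output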
    stdout {
        codec => json_lines
    }
}
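
Incremental sync with `tracking_column` depends on the query filtering by `:sql_last_value`, the value Logstash saves after each run. The Python sketch below imitates that loop so the mechanism is easier to follow. It is an illustration only: it uses an in-memory SQLite database as a stand-in for MySQL, and `poll_once` is a made-up helper, not part of Logstash.

```python
import sqlite3


def poll_once(conn, last_value):
    """Fetch rows with id greater than last_value; return (rows, new_last_value)."""
    rows = conn.execute(
        "SELECT id, name FROM lmrs_products WHERE id > ? ORDER BY id ASC",
        (last_value,),
    ).fetchall()
    if rows:
        # Remember the highest id seen, like the persisted sql_last_value.
        last_value = rows[-1][0]
    return rows, last_value


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lmrs_products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO lmrs_products (id, name) VALUES (?, ?)",
    [(1, "phone"), (2, "laptop")],
)

last_value = 0  # initial tracked value before the first run
first_run, last_value = poll_once(conn, last_value)

conn.execute("INSERT INTO lmrs_products (id, name) VALUES (3, 'tablet')")
second_run, last_value = poll_once(conn, last_value)

print(first_run)   # every existing row on the first run
print(second_run)  # only the row added since the first run
```

Note that tracking an auto-increment `id` only picks up newly inserted rows. To also catch updates, you would track a last-modified timestamp column instead, which is a different design choice from the configuration shown here.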

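Start Logstash (the --config.reload.automatic option enables automatic config reloading, so you don't have to stop and restart Logstash every time you edit the configuration file). Note that when -f is passed on the command line, Logstash uses that file and ignores pipelines.yml.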

[root@localhost logstash-7.12.1]# ./bin/logstash -f pipeline/logstash.config --config.reload.automatic

Open ip:9600 in a browser; if it returns the following, Logstash started successfully

{"host":"localhost.localdomain","version":"7.12.1","http_address":"0.0.0.0:9600","id":"15320442-569b-4bfd-a0d6-4c71619bc06d","name":"localhost.localdomain","ephemeral_id":"f6868c4c-fff1-4b6a-89d9-4ca7ea469c6e","status":"green","snapshot":false,"pipeline":{"workers":4,"batch_size":125,"batch_delay":50},"build_date":"2021-04-20T19:51:54Z","build_sha":"a0a95c823ae2da19a75f44a01784665e7ad23d15","build_snapshot":false}

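Summary

Both go-mysql-elasticsearch and Logstash can be run under Supervisor to improve stability.

Each of the three options has trade-offs. My personal take: if the project is not very large, the data does not grow quickly, and performance requirements are modest, consider the first option because it is simple to implement and easy to maintain. If you need high sync performance or have a large amount of data, consider the latter two.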
