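Getting Started with ClickHouse

Introduction to ClickHouse
ClickHouse is a column-oriented database built for OLAP workloads. Typical characteristics of OLAP are:

Data is written infrequently, and when it is written it arrives in batches, unlike OLTP, which writes row by row
The vast majority of requests are reads
Query concurrency is low, so ClickHouse is a poor fit for high-concurrency online workloads; ClickHouse itself suggests expecting at most about 100 queries per second
Transactions are not required

Advantages of ClickHouse

To improve the compression ratio, ClickHouse stores the values of a column with a fixed length, so it does not need to store a length field next to each value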

It uses a vectorized execution engine (vector engine). For an explanation of what a vector engine is, see:
https://www.infoq.cn/article/columnar-databases-and-vectorization/?itm_source=infoq_en&itm_medium=link_on_en_item&itm_campaign=item_in_other_langs

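Disadvantages of ClickHouse

No full transaction support
Cannot modify or delete data at high throughput
Because the index is sparse, it is not well suited to fetching a single row by key

Performance tuning

To improve insert performance, insert in batches of at least 1,000 rows. Running several inserts concurrently also raises insert throughput noticeably.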
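
The batching advice can be sketched on the client side as follows. This is a minimal sketch rather than a ClickHouse API: the helper name batched_rows and the 1,000-row default are illustrative choices made here, and the actual insert call is deliberately left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched_rows(rows: Iterable[T], batch_size: int = 1000) -> Iterator[List[T]]:
    """Yield lists of up to batch_size rows; only the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# 2,500 buffered rows become three insert-sized batches.
sizes = [len(batch) for batch in batched_rows(range(2500))]
print(sizes)
```

Each yielded batch would be sent as one INSERT, so the number of insert requests drops from one per row to one per thousand rows.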

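Access interfaces

Like Elasticsearch, ClickHouse exposes two ports: a native TCP port (default 9000) and an HTTP port (default 8123). In practice you rarely talk to these ports directly; you go through a client, such as:

Command-line client: connects to ClickHouse over the TCP port, runs basic CRUD statements, and can also import data
HTTP interface: as with Elasticsearch, statements written in ClickHouse's own syntax are submitted over HTTP in a REST-like way
JDBC driver
ODBC driver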

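Input and output formats

ClickHouse can read many formats as input (INSERT) and can emit a chosen format as output (SELECT).

For example, when inserting data, declare that the incoming data is in JSONEachRow format: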

INSERT INTO UserActivity FORMAT JSONEachRow {"PageViews":5, "UserID":"4324182021466249494", "Duration":146,"Sign":-1} {"UserID":"4324182021466249494","PageViews":6,"Duration":185,"Sign":1}

When reading data, request JSONEachRow as the output format:

SELECT * FROM UserActivity FORMAT JSONEachRow

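Note that these are the formats ClickHouse parses or generates, not the format it finally stores. Internally it should still store data in its own columnar layout. Many formats are supported; see the documentation for the full list: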
https://clickhouse.yandex/docs/en/interfaces/formats/#native
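
To make the shape of JSONEachRow concrete, here is a small sketch that builds and parses such a payload in plain Python. It assumes the common convention of one JSON object per line; the helper names to_json_each_row and from_json_each_row are made up for this illustration and are not part of any ClickHouse client.

```python
import json
from typing import Dict, Iterable, List

def to_json_each_row(rows: Iterable[Dict]) -> str:
    """Serialize rows as one compact JSON object per line."""
    return "\n".join(json.dumps(row, separators=(",", ":")) for row in rows)

def from_json_each_row(text: str) -> List[Dict]:
    """Parse newline-delimited JSON objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

rows = [
    {"UserID": "4324182021466249494", "PageViews": 5, "Duration": 146, "Sign": -1},
    {"UserID": "4324182021466249494", "PageViews": 6, "Duration": 185, "Sign": 1},
]
payload = to_json_each_row(rows)
print(payload)
```

The payload is what would follow FORMAT JSONEachRow in an INSERT, and the parser handles what a SELECT with the same format returns.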

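Database engines

ClickHouse lets you create a database whose actual storage is MySQL, so that you can run CRUD statements against that database's tables through ClickHouse. This resembles an external table in Hive, except that here the whole database is attached.

Suppose MySQL contains the following data: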

mysql> USE test;
Database changed

mysql> CREATE TABLE `mysql_table` (
    ->   `int_id` INT NOT NULL AUTO_INCREMENT,
    ->   `float` FLOAT NOT NULL,
    ->   PRIMARY KEY (`int_id`));
Query OK, 0 rows affected (0,09 sec)

mysql> insert into mysql_table (`int_id`, `float`) VALUES (1,2);
Query OK, 1 row affected (0,00 sec)

mysql> select * from mysql_table;
+--------+-------+
| int_id | float |
+--------+-------+
|      1 |     2 |
+--------+-------+
1 row in set (0,00 sec)

Create a database in ClickHouse that connects to the MySQL database above:

CREATE DATABASE mysql_db ENGINE = MySQL('localhost:3306', 'test', 'my_user', 'user_password')

You can then operate on the tables of the MySQL database from within ClickHouse.

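Table engines: the MergeTree family

The table engine, chosen when a table is created, determines how the table is stored. It controls:

How data is written and read, and where it is stored
Which queries are supported
How concurrent data access works
How data is replicated

The MergeTree engine

Engine-related clauses are specified when creating the table: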

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr]
[SETTINGS name=value, ...]

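This engine stores data in partitions.
On insert, rows that belong to different partitions go into different data parts. ClickHouse merges these data parts in the background, and parts from different partitions are never merged together.
A data part consists of many granules, the smallest indivisible unit of data.

Example settings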

ENGINE MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity=8192

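Granules

A granule is a run of adjacent rows, in primary-key order, that is never split further. The primary key of the first row of each granule serves as the mark for that granule. In the example from the ClickHouse MergeTree documentation the primary key is (CounterID, Date): the first granule starts at the row whose key is (a, 1), and the rows inside a single granule may share the same key value.

For convenience, ClickHouse also assigns every mark a mark number, because a number is easier to work with than the actual key value.

This index structure closely resembles a skip list. It is called a sparse index because it indexes ranges of sorted data instead of every row.

Query example: for CounterID in ('a', 'h'), the server uses this structure to read only the mark ranges [0, 3) and [6, 8).

The index_granularity setting, given at table creation, sets the number of rows stored between two marks, which is the granule size (the rows between two adjacent marks form one granule).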
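
The mark lookup can be modeled in a few lines. This is a toy model under stated assumptions, not ClickHouse's implementation: it indexes only the first key column, and the mark values below are hypothetical, chosen so that they reproduce the ranges [0, 3) and [6, 8) quoted above. The function name candidate_mark_range is invented for this sketch.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

def candidate_mark_range(marks: List[str], key: str) -> Tuple[int, int]:
    """Return the half-open range of mark numbers whose granules may hold `key`.

    marks[i] is the key of the first row of granule i, so granule i can
    contain `key` only if marks[i] <= key <= marks[i + 1].
    """
    # The granule just before the first mark >= key may still end with `key`.
    first = max(bisect_left(marks, key) - 1, 0)
    # Every granule whose mark is <= key is a candidate.
    end = bisect_right(marks, key)
    return (first, end)

# Hypothetical first-column values of marks 0..10.
marks = ["a", "a", "a", "b", "e", "e", "g", "h", "i", "i", "l"]
for key in ("a", "h"):
    print(key, candidate_mark_range(marks, key))
```

Only the granules inside the returned ranges have to be read from disk; every row in them is then filtered against the actual condition, which is why a point lookup still reads at least one whole granule.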

TTL

An expiration time (TTL) can be set on both tables and individual columns.

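MergeTree summary

MergeTree acts as the base class of the MergeTree family of table engines. It defines the storage characteristics shared by the whole family:

Data parts are merged in the background
Data is located through a sparse index, a structure similar to a skip list
Data can be partitioned

On top of this foundation, several MergeTree variants target specific use cases:

ReplacingMergeTree: removes rows that duplicate an existing sorting key value. Deduplication happens during background merges, so duplicates can remain visible until a merge has run
SummingMergeTree: sums the numeric columns of rows with the same key and stores the result as a single row. This is a form of pre-aggregation that saves storage and speeds up queries
AggregatingMergeTree: aggregates rows with the same key according to defined rules. This is also pre-aggregation: it reduces stored data and speeds up aggregate queries
CollapsingMergeTree: collapses pairs of rows that agree on most columns and carry opposite values, 1 and -1, in a sign column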
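
The effect of collapsing can be illustrated with the sign-weighted aggregation that is commonly used to read such tables (sum each column multiplied by Sign, and keep keys whose Sign total is positive). The sketch below does this in plain Python. It is a model of the idea, not the engine's merge algorithm; the function name collapse is invented here, and the first row (the old state with Sign 1) is an assumption added so that the -1 row has something to cancel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

def collapse(rows: Iterable[Dict], key: str = "UserID", sign: str = "Sign") -> List[Dict]:
    """Sum every numeric column weighted by the sign column, per key.

    Keys whose signs sum to zero or less are dropped: their state was cancelled.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        acc = totals[row[key]]
        acc[sign] += row[sign]
        for col, value in row.items():
            if col not in (key, sign):
                acc[col] += value * row[sign]
    result = []
    for k, acc in totals.items():
        if acc[sign] > 0:
            out = {key: k}
            out.update({c: v for c, v in acc.items() if c != sign})
            result.append(out)
    return result

rows = [
    {"UserID": "4324182021466249494", "PageViews": 5, "Duration": 146, "Sign": 1},   # old state (assumed)
    {"UserID": "4324182021466249494", "PageViews": 5, "Duration": 146, "Sign": -1},  # cancels the old state
    {"UserID": "4324182021466249494", "PageViews": 6, "Duration": 185, "Sign": 1},   # new state
]
print(collapse(rows))
```

The old state and its cancelling row contribute zero, so only the new state (6 page views, duration 185) survives.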

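The ReplicatedMergeTree engines

Tables created in ClickHouse are not replicated by default. To improve availability you need replication, which ClickHouse implements by integrating with ZooKeeper.

Each of the engines above has a corresponding Replicated variant: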

ReplicatedMergeTree
ReplicatedSummingMergeTree
ReplicatedReplacingMergeTree
ReplicatedAggregatingMergeTree
ReplicatedCollapsingMergeTree
ReplicatedVersionedCollapsingMergeTree
ReplicatedGraphiteMergeTree

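Table engines: the Log family

These engines target log-like scenarios in which many small tables keep being created and each table holds only a modest amount of data. Their characteristics:

Data is stored on disk
New data is appended
Writes lock the table and reads have to wait, so query performance is limited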

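Table engines: external data sources

ClickHouse also supports many table engines backed by external data sources. They appear to work like Hive external tables: the table is only a link in table form, and the data stays in the source system. (This still needs to be confirmed.)

The external data source table engines include: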

Kafka
MySQL
JDBC
ODBC
HDFS

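SQL syntax

The SAMPLE clause

When creating a table you can declare sampling on the hash of a column (hashing keeps the sample uniform and random). A query can then skip processing the whole table and compute its result from a sampled fraction of the data. For example, if the table holds 100 people and you want their total score, sample 10 of them, sum their scores, and multiply by 10. Sampling suits workloads that do not need exact results and are very sensitive to query latency.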
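
The estimate-and-scale idea can be sketched as follows. This is a minimal sketch under stated assumptions: zlib.crc32 stands in for a hash such as intHash32, and the function name sampled_total and the generated scores are made up for the illustration.

```python
import zlib
from typing import Dict

def sampled_total(scores: Dict[str, float], fraction: float) -> float:
    """Estimate sum(scores) from the keys whose hash falls in the first `fraction` of hash space."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    threshold = fraction * 2**32  # crc32 returns an unsigned 32-bit value
    partial = sum(
        score for user, score in scores.items()
        if zlib.crc32(user.encode("utf-8")) < threshold
    )
    # Scale the partial sum back up to the full population.
    return partial / fraction

scores = {f"user{i}": 60 + i % 40 for i in range(100)}
print(sum(scores.values()), sampled_total(scores, 0.1))
```

Because membership in the sample depends only on the hash of the key, the same rows are selected on every run, so repeated queries give the same estimate.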

Installation notes

Some tips

Disable the swap file in production environments.

Tables that record how the cluster is running

The system.metrics, system.events, and system.asynchronous_metrics tables.

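Environment configuration

CPU frequency scaling

Linux lowers and raises the CPU frequency according to load, and these transitions hurt ClickHouse performance. The following command sets the scaling governor to performance, which keeps the CPU at its highest frequency: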

echo 'performance' | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

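The other possible governor values are described in the Arch Linux wiki page on CPU frequency scaling listed in the references.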

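Allowing memory overcommit

Linux can grant applications more memory than the machine physically has. Not all of the granted memory is touched at once, and data that is not in use can be swapped out to disk to make room for data that is. To the application it looks as if a very large amount of memory is available. Allowing allocations beyond physical memory in this way is called memory overcommit.

Overcommit behavior is controlled by a setting with three possible values: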

0: The Linux kernel is free to overcommit memory (this is the default), a heuristic algorithm is applied to figure out if enough memory is available.
1: The Linux kernel will always overcommit memory, and never check if enough memory is available. This increases the risk of out-of-memory situations, but also improves memory-intensive workloads.
2: The Linux kernel will not overcommit memory, and only allocate as much memory as defined in overcommit_ratio.

ClickHouse wants as much memory as it can get, so overcommit has to stay enabled. Set it as follows:

 echo 0 | sudo tee /proc/sys/vm/overcommit_memory

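Disabling transparent huge pages

Huge pages are memory pages much larger than the default page size. With them the processor's address-translation cache (TLB) covers more memory with fewer entries, which speeds up memory access. To make this transparent to applications, Linux runs the khugepaged kernel thread, which takes care of it automatically. This automatic mode is called transparent huge pages (THP). The automation carries a risk of memory bloat that behaves like a leak; see the references for the reasons.

ClickHouse therefore expects this option to be turned off: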

echo 'never' | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
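
To verify the setting afterwards, the same file can be read back. The kernel lists the available modes and marks the active one with square brackets, for example "always madvise [never]". The sketch below extracts the active mode; the function name active_thp_mode is made up for this illustration.

```python
import re
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")

def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from text such as 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)

# The file is absent on non-Linux systems and on kernels built without THP.
if THP_PATH.exists():
    print("THP mode:", active_thp_mode(THP_PATH.read_text()))
```

A result of never confirms that the echo command above took effect. The setting does not survive a reboot unless it is also applied at boot time.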

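Use as much network bandwidth as possible

If you use IPv6, increase the size of the route cache

Do not install ZooKeeper and ClickHouse on the same machine

ClickHouse takes as many resources as it can to maximize performance. If it shares a machine with ZooKeeper, it affects ZooKeeper, lowering its throughput and raising its latency.

Enable ZooKeeper log cleanup

By default ZooKeeper does not delete expired snapshot and log files. Over time they pile up into a time bomb, so change the ZooKeeper configuration to enable autopurge. The configuration used at Yandex is:

ZooKeeper configuration, zoo.cfg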

# http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=30000
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=10

maxClientCnxns=2000

maxSessionTimeout=60000000
# the directory where the snapshot is stored.
dataDir=/opt/zookeeper/{{ cluster['name'] }}/data
# Place the dataLogDir to a separate physical disc for better performance
dataLogDir=/opt/zookeeper/{{ cluster['name'] }}/logs

autopurge.snapRetainCount=10
autopurge.purgeInterval=1


# To avoid seeks ZooKeeper allocates space in the transaction log file in
# blocks of preAllocSize kilobytes. The default block size is 64M. One reason
# for changing the size of the blocks is to reduce the block size if snapshots
# are taken more often. (Also, see snapCount).
preAllocSize=131072

# Clients can submit requests faster than ZooKeeper can process them,
# especially if there are a lot of clients. To prevent ZooKeeper from running
# out of memory due to queued requests, ZooKeeper will throttle clients so that
# there is no more than globalOutstandingLimit outstanding requests in the
# system. The default limit is 1,000. ZooKeeper logs transactions to a
# transaction log. After snapCount transactions are written to a log file a
# snapshot is started and a new transaction log file is started. The default
# snapCount is 10,000.
snapCount=3000000

# If this option is defined, requests will be logged to a trace file named
# traceFile.year.month.day.
#traceFile=

# Leader accepts client connections. Default value is "yes". The leader machine
# coordinates updates. For higher update throughput at the slight expense of
# read throughput the leader can be configured to not accept clients and focus
# on coordination.
leaderServes=yes

standaloneEnabled=false
dynamicConfigFile=/etc/zookeeper-{{ cluster['name'] }}/conf/zoo.cfg.dynamic

Corresponding JVM parameters

NAME=zookeeper-{{ cluster['name'] }}
ZOOCFGDIR=/etc/$NAME/conf

# TODO this is really ugly
# How to find out, which jars are needed?
# seems, that log4j requires the log4j.properties file to be in the classpath
CLASSPATH="$ZOOCFGDIR:/usr/build/classes:/usr/build/lib/*.jar:/usr/share/zookeeper/zookeeper-3.5.1-metrika.jar:/usr/share/zookeeper/slf4j-log4j12-1.7.5.jar:/usr/share/zookeeper/slf4j-api-1.7.5.jar:/usr/share/zookeeper/servlet-api-2.5-20081211.jar:/usr/share/zookeeper/netty-3.7.0.Final.jar:/usr/share/zookeeper/log4j-1.2.16.jar:/usr/share/zookeeper/jline-2.11.jar:/usr/share/zookeeper/jetty-util-6.1.26.jar:/usr/share/zookeeper/jetty-6.1.26.jar:/usr/share/zookeeper/javacc.jar:/usr/share/zookeeper/jackson-mapper-asl-1.9.11.jar:/usr/share/zookeeper/jackson-core-asl-1.9.11.jar:/usr/share/zookeeper/commons-cli-1.2.jar:/usr/src/java/lib/*.jar:/usr/etc/zookeeper"

ZOOCFG="$ZOOCFGDIR/zoo.cfg"
ZOO_LOG_DIR=/var/log/$NAME
USER=zookeeper
GROUP=zookeeper
PIDDIR=/var/run/$NAME
PIDFILE=$PIDDIR/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME
JAVA=/usr/bin/java
ZOOMAIN="org.apache.zookeeper.server.quorum.QuorumPeerMain"
ZOO_LOG4J_PROP="INFO,ROLLINGFILE"
JMXLOCALONLY=false
JAVA_OPTS="-Xms{{ cluster.get('xms','128M') }} \
    -Xmx{{ cluster.get('xmx','1G') }} \
    -Xloggc:/var/log/$NAME/zookeeper-gc.log \
    -XX:+UseGCLogFileRotation \
    -XX:NumberOfGCLogFiles=16 \
    -XX:GCLogFileSize=16M \
    -verbose:gc \
    -XX:+PrintGCTimeStamps \
    -XX:+PrintGCDateStamps \
    -XX:+PrintGCDetails \
    -XX:+PrintTenuringDistribution \
    -XX:+PrintGCApplicationStoppedTime \
    -XX:+PrintGCApplicationConcurrentTime \
    -XX:+PrintSafepointStatistics \
    -XX:+UseParNewGC \
    -XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled"

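Data backup

Besides storing data in ClickHouse, you can keep a copy in HDFS so that the data can be restored if ClickHouse loses it.

Configuration files

The default configuration file is /etc/clickhouse-server/config.xml, and every server setting can be specified there.

You can also split different settings into separate files, for example one file for user settings and one for quota settings. These extra files go in:

 /etc/clickhouse-server/config.d

ClickHouse finally merges all of them into one complete configuration, file-preprocessed.xml.

A separate file can override or delete the same setting in the main configuration by using the replace or remove attribute. The fragment below is an example of a setting kept in its own file, a query masking rule. Note that its <replace> element holds the masking replacement text and is not the replace attribute: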

<query_masking_rules>
    <rule>
        <name>hide SSN</name>
        <regexp>\b\d{3}-\d{2}-\d{4}\b</regexp>
        <replace>000-00-0000</replace>
    </rule>
</query_masking_rules>

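ClickHouse can also use ZooKeeper as a configuration source, in which case the final generated configuration uses the values stored in ZooKeeper.

By default:
users, access rights, settings profiles, and quotas are all configured in users.xml

Best practices

Configuration best practices:
1. Do not write through Distributed tables, to avoid data inconsistency
2. Set background_pool_size to speed up merges, because merge threads come from this pool
3. Set max_memory_usage and max_memory_usage_for_all_queries to cap the physical memory ClickHouse uses. If it uses too much memory, the operating system kills the ClickHouse process
4. Set max_bytes_before_external_sort and max_bytes_before_external_group_by so that sorts and GROUP BY aggregations that need more memory than the limits above do not fail and fall back to disk instead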

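Pitfalls and how to handle them:
1. "Too many parts(304). Merges are processing significantly slower than inserts": inserts are too frequent and outpace the background merges. The fix is to raise background_pool_size and lower the insert rate. The official advice is no more than one insert request per second; what this means in practice is that the writes of each second should not touch more than one file. If the written data spans several partitions, the problem is likely to appear anyway, so the partitioning must be chosen sensibly
2. "DB::NetException: Connection reset by peer, while reading from socket xxx": very likely max_memory_usage and max_memory_usage_for_all_queries were not configured, memory ran over the limit, and the operating system killed the ClickHouse server
3. "Memory limit (for query) exceeded:would use 9.37 GiB (attempt to allocate chunk of 301989888 bytes), maximum: 9.31 GiB": an upper limit on server memory usage is configured, and ClickHouse killed the query that exceeded it, while the server itself stayed up. In this case set max_bytes_before_external_sort and max_bytes_before_external_group_by so that the disk can be used
4. Replicas and shards depend on ZooKeeper, which makes ZooKeeper a major performance bottleneck. You need to understand and configure ZooKeeper well, and possibly run several ZooKeeper clusters to support one ClickHouse cluster
5. SSDs are recommended for both ZooKeeper and ClickHouse to improve performance
Source article: https://mp.weixin.qq.com/s/egzFxUOAGen_yrKclZGVag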
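
The link between partitioning and the "Too many parts" error in item 1 can be made concrete. Each insert writes at least one new part for every partition it touches, so a batch that spans many partitions produces many parts at once. The sketch below counts the partitions a batch would touch, assuming monthly partitioning like PARTITION BY toYYYYMM(EventDate); the function name partitions_touched and the sample dates are made up for this illustration.

```python
from datetime import date
from typing import Iterable, Set

def partitions_touched(event_dates: Iterable[date]) -> Set[int]:
    """Return the YYYYMM partition ids that one insert batch would write to."""
    return {d.year * 100 + d.month for d in event_dates}

same_month = [date(2020, 3, day) for day in range(1, 29)]
mixed = [date(2020, month, 1) for month in range(1, 13)]
print(len(partitions_touched(same_month)), len(partitions_touched(mixed)))
```

A batch confined to one month touches one partition, while a batch spread over a year touches twelve, so grouping rows by partition before inserting keeps the number of new parts per insert low.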

References

https://clickhouse.yandex/docs/en/operations/tips/

http://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/

https://blog.nelhage.com/post/transparent-hugepages/

https://wiki.archlinux.org/index.php/CPU_frequency_scaling

