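This post records how I migrated an old cluster. For convenience I packed up the original configuration files and deployed them directly on the new machines, leaving out the data and log directories. That is why some steps below create directories and files (a few of which do not actually need to be created by hand). Otherwise the procedure differs little from building a new cluster. I am keeping it as a record, and as a reference for others.
Note: perform the following configuration on all three nodes.
1. Install the packages: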
apt-get update
## Some of these packages depend on each other, so the list may contain duplicates:
apt-get -y install build-essential libpq-dev libxml2-dev libxslt1-dev libldap2-dev libsasl2-dev libffi-dev libssl-dev python-django-tagging python-simplejson python-memcache python-ldap python-cairo python-pysqlite2 python-support python-pip python-dev python-rrdtool zlib1g-dev gunicorn nodejs wget curl nginx supervisor git devscripts debhelper software-properties-common lftp
pip install Django==1.7.7
pip install Twisted==11.1.0
pip install whisper
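2. Extract the original configuration files and create the data and log directories (these were not copied because they are too large):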
tar -pzxvf grap-1.tar.gz -C /
tar -pzxvf grap-2.tar.gz -C /
tar -pzxf statsd.tar.gz -C /
tar -pzxvf supervisor.tar.gz -C /
mv /etc/supervisor/conf.d/{es.conf,es_monitor.conf} /etc/supervisor/conf.d/bak/
mkdir -p /data/graphite/storage/log/{carbon-cache,carbon-relay,webapp} /data/graphite/storage/whisper/b_statsd/{timers,counter} /data/graphite/storage/whisper/monitor /data/log/{statsd,graphite}
chown www-data:www-data /data/graphite/storage/log/
touch /data/graphite/storage/log/webapp/{exception.log,info.log}
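The directory layout above can also be rehearsed under a scratch prefix before touching the real machine. The following is a minimal sketch, not part of the original migration: the function name make_graphite_dirs and its optional prefix argument are hypothetical, and the chown step is left out because it needs root.

```shell
#!/bin/sh
# make_graphite_dirs [PREFIX]: create the data/log tree from step 2.
# PREFIX is a hypothetical argument for dry runs; with an empty PREFIX the
# paths are exactly the ones used in this post.
make_graphite_dirs() {
    root="${1:-}"
    for d in carbon-cache carbon-relay webapp; do
        mkdir -p "$root/data/graphite/storage/log/$d" || return 1
    done
    for d in timers counter; do
        mkdir -p "$root/data/graphite/storage/whisper/b_statsd/$d" || return 1
    done
    mkdir -p "$root/data/graphite/storage/whisper/monitor" \
             "$root/data/log/statsd" "$root/data/log/graphite" || return 1
    # Pre-create the two webapp log files, as the touch command above does.
    for f in exception.log info.log; do
        touch "$root/data/graphite/storage/log/webapp/$f" || return 1
    done
}
```

Calling make_graphite_dirs with a temporary directory shows the resulting tree without changing the host; calling it with no argument (as root) creates the real paths.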
3. Edit each project's configuration files:
- Edit the statsd configuration files (update the node IP addresses):
/data/statsd/statsd.js*
ls *statsd.js | xargs sed -i 's/172\.18\.20\.57/172.16.0.208/g'
ls *statsd.js | xargs sed -i 's/172\.18\.20\.58/172.16.33.238/g'
ls *statsd.js | xargs sed -i 's/172\.18\.20\.59/172.16.17.195/g'
## If the third node is not needed, run this instead of the previous command to delete its configuration line:
ls *statsd.js | xargs sed -i '/172\.18\.20\.59/d'
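The IP substitutions above treat the old address as a regular expression, where an unescaped dot matches any character. A small helper can escape the dots automatically. This is only a sketch under my own assumptions: replace_ip is a hypothetical name, and it rewrites each file through a temporary copy rather than relying on sed -i.

```shell
#!/bin/sh
# replace_ip OLD NEW FILE...: replace the literal IPv4 address OLD with NEW.
replace_ip() {
    old="$1"; new="$2"; shift 2
    # Escape dots so "172.18.20.57" cannot match e.g. "172x18y20z57".
    pattern=$(printf '%s' "$old" | sed 's/\./\\./g')
    for f in "$@"; do
        # Write back with cat so the file keeps its owner and mode.
        sed "s/$pattern/$new/g" "$f" > "$f.tmp" \
            && cat "$f.tmp" > "$f" \
            && rm -f "$f.tmp" || return 1
    done
}
```

For example, from the statsd directory: replace_ip 172.18.20.57 172.16.0.208 *statsd.js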
- Edit the carbon configuration files (tune the parameters and update the node IPs):
/opt/graphite/conf/carbon.conf
DESTINATIONS = 172.16.0.208:2014:a,172.16.33.238:2014:b,172.16.17.195:2014:c
/opt/graphite/conf/relay-rules.conf
destinations = 172.16.0.208:2004:a,172.16.33.238:2004:b,172.16.17.195:2004:c
- Edit the graphite-web configuration file:
/opt/graphite/webapp/graphite/local_settings.py
CLUSTER_SERVERS = ["172.16.0.208:80","172.16.33.238:86","172.16.17.195:86"]
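Because the same three node addresses are repeated across several files, it is easy for one of them to drift. A quick consistency check could look like the sketch below. The helper names list_ips and same_nodes are hypothetical, and skipping 0.0.0.0 and 127.0.0.1 is my assumption about which addresses are bind or loopback settings rather than cluster nodes.

```shell
#!/bin/sh
# list_ips FILE: print the unique IPv4 addresses mentioned in FILE, sorted.
# 0.0.0.0 and 127.0.0.1 are skipped (assumed to be bind/loopback settings).
list_ips() {
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" \
        | grep -vE '^(0\.0\.0\.0|127\.0\.0\.1)$' \
        | sort -u
}

# same_nodes FILE_A FILE_B: succeed only if both files mention the same IPs.
same_nodes() {
    [ "$(list_ips "$1")" = "$(list_ips "$2")" ]
}
```

A possible use is same_nodes /opt/graphite/conf/relay-rules.conf /opt/graphite/webapp/graphite/local_settings.py || echo "node lists differ". If a file legitimately contains other addresses, compare the list_ips output by eye instead.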
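4. Configure nginx as a reverse proxy for graphite:
Make sure www-data has the right permissions on the relevant directories; extracting the archives with their original attributes preserved takes care of this.
Note: the application machines send data through the SLB in front to the statsD cluster behind it.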
- cluster-node1:
cat /etc/nginx/conf.d/graphite.conf
server {
    server_name grap.bilibili.co 172.16.0.208;
    listen 80;
    charset utf-8;
    location / {
        proxy_pass http://127.0.0.1:8000;
    }
}
- cluster-node2:
cat /etc/nginx/conf.d/graphite.conf
server {
    server_name grap.bilibili.co 172.16.33.238;
    listen 86;
    charset utf-8;
    location / {
        proxy_pass http://127.0.0.1:8000;
    }
}
- cluster-node3:
cat /etc/nginx/sites-enabled/01-graphite
server {
    server_name grap.bilibili.co 172.16.17.195;
    listen 86;
    charset utf-8;
    location / {
        proxy_pass http://127.0.0.1:8000;
    }
}
5. The configuration migration is done; start the services:
systemctl restart supervisor.service
nginx -t
nginx -s reload
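Run as two separate commands, the reload above happens even when the configuration test has just failed. One way to tie them together is sketched below. reload_nginx and the NGINX_BIN override are hypothetical names of mine, with NGINX_BIN existing only so the function can be dry-run against a stand-in command.

```shell
#!/bin/sh
# reload_nginx: reload nginx only when "nginx -t" accepts the configuration.
# NGINX_BIN is a hypothetical override for dry runs; it defaults to "nginx".
reload_nginx() {
    bin="${NGINX_BIN:-nginx}"
    if "$bin" -t; then
        "$bin" -s reload
    else
        echo "nginx -t failed; not reloading" >&2
        return 1
    fi
}
```

The one-line equivalent is nginx -t && nginx -s reload.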
6. Troubleshooting and testing:
- supervisorctl fails to connect:
# supervisorctl
unix:///var/run/supervisor.sock no such file
supervisor>
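This error occurs because the supervisord parent process was not started beforehand.
Start supervisord first, then manage the child processes with supervisorctl. A successful start looks like this: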
# systemctl restart supervisor.service
# supervisorctl
activity-node1 RUNNING pid 31429, uptime 0:19:56
activity-node2 RUNNING pid 31417, uptime 0:19:56
activity_statsd RUNNING pid 31420, uptime 0:19:56
aso-node1 RUNNING pid 31419, uptime 0:19:56
aso-node2 RUNNING pid 31423, uptime 0:19:56
aso_statsd RUNNING pid 31418, uptime 0:19:56
carbon-cache RUNNING pid 31915, uptime 0:11:54
carbon-relay RUNNING pid 31919, uptime 0:11:48
graphite RUNNING pid 31974, uptime 0:10:29
supervisor>
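With this many programs, a failed one is easy to miss in the listing. The sketch below filters status output of the shape shown above (name in the first column, state in the second) down to the programs that are not RUNNING. The function name not_running is hypothetical.

```shell
#!/bin/sh
# not_running: read "supervisorctl status" output on stdin and print the
# names of programs whose state (second column) is anything but RUNNING.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1 }'
}
```

Typical use would be supervisorctl status | not_running, which prints nothing when every program is up.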
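- The graphite process fails to start:
Sometimes graphite fails to start because Django's dependencies are incomplete or at the wrong version. Run pip freeze on a working server to capture its package versions, then install them on this server with pip install:
pip freeze > /tmp/django.txt # run on the working server, then copy the file to this server
pip install -r /tmp/django.txt # reproduce the same Django environment here
## I ran into a django-tagging version that was too old (0.3.x), which kept graphite from starting
pip install django-tagging==0.4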
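Before reinstalling everything, it can help to see which pins actually differ between the two servers. The following is a rough sketch: missing_pins is a hypothetical helper that compares two pip freeze dumps line by line, so it reports a package both when it is absent and when its version differs.

```shell
#!/bin/sh
# missing_pins REFERENCE CURRENT: print the lines of REFERENCE (pip freeze
# output from the working server) that do not appear verbatim in CURRENT
# (pip freeze output from this server).
missing_pins() {
    reference="$1"; current="$2"
    # -x whole-line match, -F fixed strings, -f patterns from file, -v invert
    grep -vxFf "$current" "$reference"
}
```

For example, after pip freeze > /tmp/local.txt on the new server, missing_pins /tmp/django.txt /tmp/local.txt lists the pins still to be fixed.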
- Data transmission test:
echo "test.logstash.num:100|c" | nc -w 1 -u $IP $port
echo "test.logstash.num:100|c" | nc -w 1 -u 127.0.0.1 8921
echo "test.logstash.num:200|c" | nc -w 1 -u 127.0.0.1 7798
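If the installation and configuration are correct, the following path and series will appear in the tree on the left side of graphite: metrics->b_stats->counters->test->logstash->num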
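The test commands above all send the same kind of payload, a StatsD counter in the form name:value|c, over UDP. They can be wrapped as in the sketch below. statsd_counter and send_counter are hypothetical helper names, and the nc options are the same ones used in the commands above.

```shell
#!/bin/sh
# statsd_counter NAME VALUE: format one StatsD counter line, "NAME:VALUE|c".
statsd_counter() {
    printf '%s:%s|c\n' "$1" "$2"
}

# send_counter HOST PORT NAME VALUE: send that line over UDP with nc.
send_counter() {
    statsd_counter "$3" "$4" | nc -w 1 -u "$1" "$2"
}
```

For example: send_counter 127.0.0.1 8921 test.logstash.num 100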
That's all.