7. Object Storage: RadosGW
7.1 Concepts
Ceph is a distributed object storage system that provides an object storage interface through the Ceph Object Gateway, also called the RADOS Gateway (RGW) interface. It is built on top of Ceph RADOS. RGW uses librgw (the RADOS Gateway library) and librados, allowing applications to connect to the Ceph object store. RGW provides applications with a RESTful, S3/Swift-compatible interface for storing data as objects in the Ceph cluster. Ceph also supports multi-tenant object storage, accessible through the RESTful API. In addition, RGW supports the Ceph Admin API, so the storage cluster can be managed with native API calls. The librados library is very flexible and lets user applications access the Ceph storage cluster directly through C, C++, Java, Python, and PHP bindings. Ceph object storage also offers multi-site capability, which provides a disaster-recovery solution.
7.2 RGW Features
1. Object storage stores data as objects; each object contains the data itself plus the object's own metadata.
2. Objects are retrieved by Object ID. They cannot be accessed directly by file path and file name as in an ordinary file system, only through the API or through third-party clients (which are themselves wrappers around the API).
3. Objects are not organized into a directory tree; they are stored in a flat namespace. Amazon S3 calls this flat namespace a bucket, while Swift calls it a container.
4. Neither buckets nor containers can be nested.
5. A bucket must be authorized before it can be accessed; one account can be granted access to multiple buckets, each with different permissions.
6. Easy horizontal scaling and fast data retrieval.
7. Client-side mounting is not supported, and the client must specify the object name when accessing it.
8. Not well suited to scenarios where files are modified or deleted very frequently.
Ceph uses buckets as storage containers (storage spaces) to store object data and isolate users. Data is stored in buckets, and user permissions are also granted per bucket; a user can be given different permissions on different buckets, which is how access control is implemented.
Bucket characteristics
1. A bucket (storage space) is the container for storing objects; every object must belong to a bucket.
2. Bucket attributes such as region, access permissions, and lifecycle can be set and modified; these attributes apply to all objects in the bucket, so different buckets can be created flexibly to implement different management functions.
3. The interior of a bucket is flat, with no file-system concepts such as directories; all objects belong directly to their bucket.
4. Each user can own multiple buckets.
5. A bucket name must be globally unique within the object storage service (OSS) and cannot be changed after creation.
6. There is no limit on the number of objects inside a bucket.
Bucket naming rules
1. Only lowercase letters, digits, and hyphens (-) are allowed.
2. The name must begin and end with a lowercase letter or a digit.
3. The length must be between 3 and 63 bytes.
7.3 The Three Basic Logical Data Entities Exposed by RGW
7.3.1 User
RGW is compatible with AWS S3 and OpenStack Swift. An RGW User corresponds to an S3 User and to a Swift Account, while an RGW subuser corresponds to a Swift user.
User data includes:
- Authentication information: S3 (access key, secret key), Swift (secret key)
- Access-control information: operation permissions (read, write, delete, etc.) and access control lists
- User quota information: prevents individual users from consuming too much storage space; quotas can be configured according to what the user pays for (see the sketch below)
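As a hedged illustration of how the quota attributes above are managed in practice, the radosgw-admin commands below set and enable a per-user quota; the uid popuser1 is simply the example account created later in this document, and the limits are arbitrary.
# Limit the user to 1000 objects and 10 GiB (value in bytes), then enable the quota
radosgw-admin quota set --quota-scope=user --uid=popuser1 --max-objects=1000 --max-size=10737418240
radosgw-admin quota enable --quota-scope=user --uid=popuser1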
7.3.2 Bucket
A bucket is a container for objects; it is a first-level management unit introduced to make it easier to manage and operate on a class of objects with common attributes.
Bucket information includes:
Basic information (kept in the data section of the corresponding RADOS object): information RGW itself uses, including the bucket quota (maximum number of objects and maximum total object size), the bucket placement rule, the number of index objects in the bucket, and so on. The bucket placement rule specifies the index pool (bucket index objects), the data pool (object data), and the data extra pool (intermediate data for multipart uploads).
Extended information (kept in the extended attributes of the corresponding RADOS object): information that is transparent to RGW, such as user-defined metadata. The commands below show how this bucket metadata can be inspected.
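A minimal sketch of inspecting bucket metadata from the command line; the bucket name popdata matches the bucket created later in this document:
# List all buckets known to RGW
radosgw-admin bucket list
# Show a bucket's id, placement rule, quota settings, and usage statistics
radosgw-admin bucket stats --bucket=popdata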
7.3.3 Object
An application object in RGW maps onto RADOS objects. Application objects can be uploaded either whole or in parts (multipart upload), and the two upload styles map the application object onto RADOS objects differently.
Three concepts are involved here:
- rgw_max_chunk_size: the chunk size, i.e. the size of a single I/O that RGW issues to the RADOS cluster.
- rgw_obj_stripe_size: the stripe size, i.e. the size of the RADOS objects other than the first object of each part in a multipart upload.
- class RGWObjManifest: manages the mapping between application objects and RADOS objects.
Whole-object upload
Application object smaller than or equal to the chunk size: the uploaded object corresponds to exactly one RADOS object, named after the application object; the application object's metadata is stored in that RADOS object's extended attributes.
Application object larger than the chunk size: the application object is split into a head object whose size equals the chunk size, several intermediate objects whose size equals the stripe size, and a tail object smaller than the stripe size. The head object is named after the application object and is referred to as head_obj in RGW; its data section holds the first rgw_max_chunk_size bytes of the application object, and its extended attributes hold the application object's metadata and manifest information. The intermediate and tail objects hold the remaining data and are named "shadow_" + "." + "32-bit random string" + "_" + "stripe number", with stripe numbers starting from 1.
Multipart upload
RGW splits each part of the application object into multiple RADOS objects according to the stripe size. The first RADOS object of each part is named:
"multipart" + "uploaded object name" + "multipart upload ID" + "part number"
The remaining objects are named:
"shadow" + "uploaded object name" + "multipart upload ID" + "part number" + "_" + "stripe number"
When all parts have been uploaded, RGW reads the per-part information (mainly each part's manifest) from the temporary multipart-upload objects in the data_extra_pool, combines them into a single manifest, and then generates a new RADOS object, the head obj, which stores the metadata of the uploaded application object and the manifests of its parts.
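To observe this mapping on a running cluster, the RADOS objects behind an upload can be listed directly; this is only a sketch and assumes the default data pool default.rgw.buckets.data that appears later in this document:
# List the RADOS objects backing uploaded S3 objects; head objects carry the
# application object name, extra stripes appear as shadow/multipart objects
rados -p default.rgw.buckets.data ls
# Show the size and mtime of one of the listed RADOS objects (name taken from the output above)
rados -p default.rgw.buckets.data stat <rados-object-name>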
8. Object Storage Deployment
8.1 Install and Initialize
# Deploy the radosgw high-availability service on node1 and node2
root@ceph-node1:~# apt install radosgw
root@ceph-node2:~# apt install radosgw
# On the ceph-deploy server, initialize ceph-node1 and ceph-node2 as radosGW services
cephadm@ceph-deploy:~/ceph-cluster$ ceph-deploy rgw create ceph-node1
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/cephadm/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (2.0.1): /usr/bin/ceph-deploy rgw create ceph-node1
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] rgw : [('ceph-node1', 'rgw.ceph-node1')]
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] subcommand : create
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f1e8aa9bf00>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] func : <function rgw at 0x7f1e8b346750>
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.rgw][DEBUG ] Deploying rgw, cluster ceph hosts ceph-node1:rgw.ceph-node1
[ceph-node1][DEBUG ] connection detected need for sudo
[ceph-node1][DEBUG ] connected to host: ceph-node1
[ceph-node1][DEBUG ] detect platform information from remote host
[ceph-node1][DEBUG ] detect machine type
[ceph_deploy.rgw][INFO ] Distro info: Ubuntu 18.04 bionic
[ceph_deploy.rgw][DEBUG ] remote host will use systemd
[ceph_deploy.rgw][DEBUG ] deploying rgw bootstrap to ceph-node1
[ceph-node1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph-node1][WARNIN] rgw keyring does not exist yet, creating one
[ceph-node1][DEBUG ] create a keyring file
[ceph-node1][DEBUG ] create path recursively if it doesn't exist
[ceph-node1][INFO ] Running command: sudo ceph --cluster ceph --name client.bootstrap-rgw --keyring /var/lib/ceph/bootstrap-rgw/ceph.keyring auth get-or-create client.rgw.ceph-node1 osd allow rwx mon allow rw -o /var/lib/ceph/radosgw/ceph-rgw.ceph-node1/keyring
[ceph-node1][INFO ] Running command: sudo systemctl enable ceph-radosgw@rgw.ceph-node1
[ceph-node1][WARNIN] Created symlink /etc/systemd/system/ceph-radosgw.target.wants/ceph-radosgw@rgw.ceph-node1.service → /lib/systemd/system/ceph-radosgw@.service.
[ceph-node1][INFO ] Running command: sudo systemctl start ceph-radosgw@rgw.ceph-node1
[ceph-node1][INFO ] Running command: sudo systemctl enable ceph.target
[ceph_deploy.rgw][INFO ] The Ceph Object Gateway (RGW) is now running on host ceph-node1 and default port 7480
cephadm@ceph-deploy:~/ceph-cluster$ ceph-deploy rgw create ceph-node2
...
8.2 Verify the radosgw Service
cephadm@ceph-deploy:~/ceph-cluster$ ceph -s
cluster:
id: 06d842e1-95c5-442d-b7fe-618050963147
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph-node1,ceph-node2,ceph-node3 (age 9m)
mgr: ceph-node1(active, since 9m), standbys: ceph-node2
mds: 2/2 daemons up, 1 standby
osd: 5 osds: 5 up (since 9m), 5 in (since 2w)
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 9 pools, 297 pgs
objects: 290 objects, 183 MiB
usage: 725 MiB used, 99 GiB / 100 GiB avail
pgs: 297 active+clean
8.3 Verify the radosgw Processes Are Running
cephadm@ceph-deploy:~/ceph-cluster$ ssh ceph-node1 "sudo ps -ef|grep radosgw"
ceph 3916 1 0 23:23 ? 00:00:03 /usr/bin/radosgw -f --cluster ceph --name client.rgw.ceph-node1 --setuser ceph --setgroup ceph
cephadm 4774 4773 0 23:32 ? 00:00:00 bash -c sudo ps -ef|grep radosgw
cephadm 4776 4774 0 23:32 ? 00:00:00 grep radosgw
cephadm@ceph-deploy:~/ceph-cluster$ ssh ceph-node2 "sudo ps -ef|grep radosgw"
ceph 3553 1 0 23:25 ? 00:00:02 /usr/bin/radosgw -f --cluster ceph --name client.rgw.ceph-node2 --setuser ceph --setgroup ceph
cephadm 4319 4318 0 23:32 ? 00:00:00 bash -c sudo ps -ef|grep radosgw
cephadm 4321 4319 0 23:32 ? 00:00:00 grep radosgw
8.4 Test Access to the radosgw Service
cephadm@ceph-deploy:~/ceph-cluster$ curl http://192.168.1.100:7480
<?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyBucketsResult>cephadm@ceph-deploy:~/ceph-cluster$
8.5 Customize the radosgw HTTP Port
# Add the following to ceph.conf
cephadm@ceph-deploy:~/ceph-cluster$ vim ceph.conf
...
[client.rgw.ceph-node1]
rgw_host = ceph-node1
rgw_frontends = civetweb port=8888
[client.rgw.ceph-node2]
rgw_host = ceph-node2
rgw_frontends = civetweb port=8888
# Sync ceph.conf to the node1 and node2 nodes
cephadm@ceph-deploy:~/ceph-cluster$ scp ceph.conf root@192.168.1.100:/etc/ceph
The authenticity of host '192.168.1.100 (192.168.1.100)' can't be established.
ECDSA key fingerprint is SHA256:9kHyC5k68pyboHx6VtTk2Id+y5UEBN3P0ZyM0srTZBc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.100' (ECDSA) to the list of known hosts.
root@192.168.1.100's password:
ceph.conf 100% 505 286.4KB/s 00:00
cephadm@ceph-deploy:~/ceph-cluster$ scp ceph.conf root@192.168.1.101:/etc/ceph
The authenticity of host '192.168.1.101 (192.168.1.101)' can't be established.
ECDSA key fingerprint is SHA256:9kHyC5k68pyboHx6VtTk2Id+y5UEBN3P0ZyM0srTZBc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.101' (ECDSA) to the list of known hosts.
root@192.168.1.101's password:
ceph.conf 100% 505 279.9KB/s 00:00
# Restart the rgw services
cephadm@ceph-deploy:~/ceph-cluster$ ssh root@ceph-node1 'systemctl restart ceph-radosgw@rgw.ceph-node1.service'
root@ceph-node1's password:
cephadm@ceph-deploy:~/ceph-cluster$ ssh root@ceph-node2 'systemctl restart ceph-radosgw@rgw.ceph-node2.service'
root@ceph-node2's password:
# Check on the nodes that the listening port has changed to 8888
root@ceph-node1:~# netstat -naptl|grep "radosgw"|grep "LISTEN"
tcp 0 0 0.0.0.0:8888 0.0.0.0:* LISTEN 5316/radosg
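As a quick sanity check, the curl test from section 8.4 can be repeated against the new port (assuming the same node IP 192.168.1.100):
# Should return the anonymous ListAllMyBucketsResult XML on port 8888
curl http://192.168.1.100:8888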
8.6 Configure High Availability with HAProxy
# Install the haproxy service
root@ceph-client:~# apt install haproxy
# Edit the haproxy configuration file
root@ceph-client:~# cat /etc/haproxy/haproxy.cfg
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
stats timeout 30s
user haproxy
group haproxy
daemon
defaults
log global
mode http
option httplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
frontend http-rgw
bind *:80
mode http
option httplog
log global
default_backend httprgw
backend httprgw
balance leastconn ## least-connections balancing algorithm
server rgw1 192.168.1.100:8888 cookie 1 weight 5 check inter 2000 rise 2 fall 3
server rgw2 192.168.1.101:8888 cookie 1 weight 5 check inter 2000 rise 2 fall 3
# Start the haproxy service
root@ceph-client:~# systemctl start haproxy
# Check the haproxy processes
root@ceph-client:~# ps -ef|grep haproxy
root 1737 1 0 Sep02 ? 00:00:00 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy 1742 1737 0 Sep02 ? 00:00:00 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
root 20003 1577 0 00:02 pts/0 00:00:00 grep --color=auto haproxy
Access the page in a browser.
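Before relying on the load balancer, the configuration can be syntax-checked and the frontend probed; this sketch assumes the HAProxy host is reachable at 192.168.1.120, the address later used as the S3 endpoint:
# Validate the haproxy configuration file
haproxy -c -f /etc/haproxy/haproxy.cfg
# Request the RGW service through the port-80 frontend
curl http://192.168.1.120/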
8.7 Create an RGW User
cephadm@ceph-deploy:~/ceph-cluster$ radosgw-admin user create --uid="popuser1" --display-name="popuser1"
{
"user_id": "popuser1",
"display_name": "popuser1",
"email": "",
"suspended": 0,
"max_buckets": 1000,
"subusers": [],
"keys": [
{
"user": "popuser1",
"access_key": "TXKCHLKLKQ1P9RMDB05H",
"secret_key": "kpnQQY483ehwcnvS7tzb0Xa0RHiCApZIrbBSTCMB"
}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"default_storage_class": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"temp_url_keys": [],
"type": "rgw",
"mfa_ids": []
}
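If the keys printed above are lost, they can be re-displayed at any time; a minimal example:
# Re-print popuser1's access_key/secret_key and other attributes
radosgw-admin user info --uid="popuser1"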
8.8 Create a Bucket
RGW server configuration
cephadm@ceph-deploy:~/ceph-cluster$ cat ceph.conf
[global]
fsid = 06d842e1-95c5-442d-b7fe-618050963147
public_network = 172.16.1.0/24
cluster_network = 192.168.1.0/24
mon_initial_members = ceph-node1
mon_host = 172.16.1.100
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
[mds.ceph-node1]
mds_standby_for_name = ceph-node2
mds_standby_replay = true
[client.rgw.ceph-node1]
rgw_host = ceph-node1
rgw_frontends = civetweb port=8888
rgw_dns_name = rgw.pop.com
[client.rgw.ceph-node2]
rgw_host = ceph-node2
rgw_frontends = civetweb port=8888
rgw_dns_name = rgw.pop.com
Install the s3cmd client
cephadm@ceph-deploy:~/ceph-cluster$ sudo apt install s3cmd
Add the domain name resolution to /etc/hosts
root@ceph-deploy:~# echo "192.168.1.100 rgw.pop.com" >> /etc/hosts
Configure the s3cmd client environment
cephadm@ceph-deploy:~/ceph-cluster$ s3cmd --configure
Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.
Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables.
Access Key: TXKCHLKLKQ1P9RMDB05H # the access key generated when the account was created above
Secret Key: kpnQQY483ehwcnvS7tzb0Xa0RHiCApZIrbBSTCMB # the secret key generated above
Default Region [US]:
Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
S3 Endpoint [s3.amazonaws.com]: 192.168.1.120:80 # use a domain name if you have one; here the HAProxy IP is used as an example
Use "%(bucket)s.s3.amazonaws.com" to the target Amazon S3. "%(bucket)s" and "%(location)s" vars can be used
if the target S3 system supports dns based buckets.
DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: 192.168.1.120:80/%(bucket) # a domain name can be used here as well
Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password: # optional password used to encrypt files before upload
Path to GPG program [/usr/bin/gpg]: # if you use a separate encryption program, specify its path here
When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP, and can only be proxied with Python 2.7 or newer
Use HTTPS protocol [Yes]: No # whether to use HTTPS
On some networks all internet access must go through a HTTP proxy.
Try setting it here if you can't connect to S3 directly
HTTP Proxy server name: # whether to use an HTTP proxy
New settings:
Access Key: TXKCHLKLKQ1P9RMDB05H
Secret Key: kpnQQY483ehwcnvS7tzb0Xa0RHiCApZIrbBSTCMB
Default Region: US
S3 Endpoint: rgw.pop.com:8888
DNS-style bucket+hostname:port template for accessing a bucket: rgw.pop.com:8888/%(bucket)
Encryption password:
Path to GPG program: /usr/bin/gpg
Use HTTPS protocol: False
HTTP Proxy server name:
HTTP Proxy server port: 0
Test access with supplied credentials? [Y/n] y # test whether the settings entered above are correct
Please wait, attempting to list all buckets...
Success. Your access key and secret key worked fine :-)
Now verifying that encryption works...
Not configured. Never mind.
Save settings? [y/N] y # the test succeeded; save the configuration file
Configuration saved to '/home/cephadm/.s3cfg'
# Verify from the command line
cephadm@ceph-deploy:~/ceph-cluster$ s3cmd la # no errors means it works
cephadm@ceph-deploy:~/ceph-cluster$
Create a bucket
root@ceph-deploy:~# s3cmd mb s3://popdata
Bucket 's3://popdata/' created
8.9 Test Uploading and Downloading Files
Upload a file
root@ceph-deploy:/var/log# s3cmd put syslog s3://popdata
upload: 'syslog' -> 's3://popdata/syslog' [1 of 1]
6726035 of 6726035 100% in 2s 2.65 MB/s done
# View the uploaded file
root@ceph-deploy:/var/log# s3cmd la
2021-09-02 17:10 6726035 s3://popdata/syslog
root@ceph-deploy:/var/log# s3cmd ls
2021-09-02 17:07 s3://popdata
root@ceph-deploy:/var/log# s3cmd ls s3://popdata
2021-09-02 17:10 6726035 s3://popdata/syslog
root@ceph-deploy:/var/log# ceph osd lspools
1 device_health_metrics
2 popool
3 poprbd1
4 popcephfsmetadata
5 popcephfsdata
6 .rgw.root
7 default.rgw.log
8 default.rgw.control
9 default.rgw.meta
10 default.rgw.buckets.index
11 default.rgw.buckets.data // created automatically
Note: uploading a file with the same name to the same path overwrites the previous file.
Download a file
root@ceph-deploy:/var/log# s3cmd get s3://popdata/syslog /opt/
download: 's3://popdata/syslog' -> '/opt/syslog' [1 of 1]
6726035 of 6726035 100% in 0s 123.83 MB/s done
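A few additional s3cmd operations that are useful for cleaning up after the test; the object and bucket names follow the example above:
# Delete the uploaded object
s3cmd del s3://popdata/syslog
# Remove the (now empty) bucket
s3cmd rb s3://popdata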
8.10 Verify the PG and PGP Combinations
root@ceph-deploy:/var/log# ceph pg ls-by-pool default.rgw.buckets.data|awk '{print $1,$2,$15}'
PG OBJECTS ACTING
11.0 0 [1,2,5]p1
11.1 0 [2,0,5]p2
11.2 0 [0,3,5]p0
11.3 0 [5,3,1]p5
11.4 0 [2,5,1]p2
11.5 0 [3,1,5]p3
11.6 0 [0,3,5]p0
11.7 0 [2,1,5]p2
11.8 0 [0,5,3]p0
11.9 0 [1,3,5]p1
11.a 0 [5,3,0]p5
11.b 0 [0,2,5]p0
11.c 0 [1,2,5]p1
11.d 0 [1,3,5]p1
11.e 0 [5,1,3]p5
11.f 0 [3,5,0]p3
11.10 0 [0,5,3]p0
11.11 0 [1,2,5]p1
11.12 0 [2,0,5]p2
11.13 0 [5,3,0]p5
11.14 0 [3,1,5]p3
11.15 0 [0,2,5]p0
11.16 1 [3,5,1]p3
11.17 0 [5,2,1]p5
11.18 0 [1,2,5]p1
11.19 1 [0,3,5]p0
11.1a 0 [3,0,5]p3
11.1b 0 [2,5,0]p2
11.1c 0 [2,5,1]p2
11.1d 0 [3,0,5]p3
11.1e 0 [3,0,5]p3
11.1f 0 [1,5,3]p1
9. CRUSH Advanced Topics
9.1 Features
- Data distribution and load balancing
  - Data is distributed evenly across all nodes.
  - The read/write load of data access is balanced across nodes and disks.
- Flexible handling of cluster scaling
  - Nodes and devices can be added or removed conveniently, and node failures are handled.
  - After nodes or devices are added or removed, data is rebalanced automatically while migrating as little data as possible.
- Support for large-scale clusters
  - The metadata maintained by the data-placement algorithm must stay relatively small and the computation must not be too expensive; the algorithm's overhead should remain small as the cluster grows.
9.2 Introduction
- The full name of the CRUSH algorithm is Controlled Scalable Decentralized Placement of Replicated Data: a controlled, scalable, decentralized algorithm for placing replicated data.
- The algorithm that maps PGs to OSDs is the CRUSH algorithm. (An object that needs three replicas must be stored on three OSDs.)
- CRUSH is a pseudo-random process: it selects a pseudo-random set of OSDs out of all OSDs, but for a given PG the result of the selection is stable, i.e. the set of OSDs it maps to is fixed.
9.3 Cluster Maps
The monitor servers in a Ceph cluster maintain five kinds of maps:
monitor map # map of the monitors
osd map # map of the OSDs
pg map # map of the PGs
crush map # the CRUSH map; when a new pool is created, a new set of PG combinations is derived from the osd map to store data
mds map # the CephFS metadata (mds) map
obj --> PG: hash(oid) % pg_num = pgid
obj --> OSD: CRUSH returns the PG's current OSD set based on the latest maps from the monitors; the client then writes to the primary OSD, which replicates the data to the other OSDs in the set.
How CRUSH selects target nodes:
There are currently five bucket algorithms for node selection: Uniform, List, Tree, Straw, and Straw2. Early versions used Straw, invented by the founder of the Ceph project; it has since evolved into Straw2.
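The hash-then-CRUSH mapping can be observed directly with ceph osd map, which prints the PG id and acting OSD set that a given object name would map to; the pool and object names here are just examples taken from this document:
# Show which PG and which OSDs the object "syslog" in pool default.rgw.buckets.data maps to
ceph osd map default.rgw.buckets.data syslog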
9.4 Adjusting the PG-to-OSD Mapping
By default the CRUSH algorithm assigns OSDs to the PGs of a newly created pool on its own, but you can manually bias how CRUSH distributes data by setting weights: for example, a 1 TB disk gets weight 1 and a 2 TB disk gets weight 2. Using devices of the same size is recommended.
9.4.1 View the Current State
The weight expresses the relative capacity of a device: 1 TB corresponds to 1.00, so a 500 GB OSD should have a weight of 0.5. The weight drives how many PGs are assigned per disk, so CRUSH places more PGs on OSDs with larger capacity and fewer PGs on OSDs with smaller capacity.
The reweight parameter exists to rebalance the PGs that CRUSH has distributed pseudo-randomly. The default distribution is only balanced in a probabilistic sense, so even when all OSDs have identical capacity some PG imbalance can occur. Adjusting reweight makes the cluster immediately rebalance the PGs on the affected disks so that data is distributed evenly; reweight acts after the PGs have already been assigned and redistributes them across the cluster.
root@ceph-deploy:~# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.01949 1.00000 20 GiB 107 MiB 85 MiB 18 KiB 22 MiB 20 GiB 0.52 0.74 167 up
1 hdd 0.01949 1.00000 20 GiB 150 MiB 102 MiB 20 KiB 48 MiB 20 GiB 0.73 1.04 170 up
2 hdd 0.01949 1.00000 20 GiB 124 MiB 68 MiB 26 KiB 56 MiB 20 GiB 0.61 0.86 171 up
3 hdd 0.01949 1.00000 20 GiB 137 MiB 119 MiB 12 KiB 18 MiB 20 GiB 0.67 0.95 166 up
5 hdd 0.01949 1.00000 20 GiB 201 MiB 183 MiB 32 KiB 18 MiB 20 GiB 0.98 1.40 337 up
TOTAL 100 GiB 718 MiB 556 MiB 110 KiB 162 MiB 99 GiB 0.70
MIN/MAX VAR: 0.74/1.40 STDDEV: 0.16
9.4.2 Modify the Weight and Verify
Modify the CRUSH weight of the OSD with the specified ID
root@ceph-deploy:~# ceph osd crush reweight osd.5 1.5
reweighted item id 5 name 'osd.5' to 1.5 in crush map
# Verify the OSD weights
root@ceph-deploy:~# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.01949 1.00000 20 GiB 95 MiB 85 MiB 18 KiB 11 MiB 20 GiB 0.47 0.67 130 up
1 hdd 0.01949 1.00000 20 GiB 135 MiB 102 MiB 20 KiB 33 MiB 20 GiB 0.66 0.94 106 up
2 hdd 0.01949 1.00000 20 GiB 133 MiB 68 MiB 26 KiB 65 MiB 20 GiB 0.65 0.93 121 up
3 hdd 0.01949 1.00000 20 GiB 141 MiB 119 MiB 12 KiB 22 MiB 20 GiB 0.69 0.99 124 up
5 hdd 1.50000 1.00000 20 GiB 211 MiB 183 MiB 32 KiB 28 MiB 20 GiB 1.03 1.47 337 up
TOTAL 100 GiB 716 MiB 556 MiB 110 KiB 159 MiB 99 GiB 0.70
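The reweight value described in 9.4.1 (the REWEIGHT column, a value between 0 and 1) is adjusted separately from the CRUSH weight; a hedged example with an arbitrary value:
# Reduce the share of PGs placed on osd.5 without changing its CRUSH weight
ceph osd reweight osd.5 0.8
# Or let ceph compute reweight values automatically from current utilization
ceph osd reweight-by-utilization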
9.5 Managing the CRUSH Map
The CRUSH map can be exported with a tool, edited, and then imported back.
9.5.1 Export the CRUSH Map
The exported CRUSH map is in binary format and cannot be opened directly with a text editor; it must first be converted to text with the crushtool utility before it can be viewed and edited with vim or another text editor.
root@ceph-deploy:~# mkdir /data/ceph -p
root@ceph-deploy:~# ceph osd getcrushmap -o /data/ceph/crushmap
28
9.5.2 Convert the Map to Text
The exported map cannot be edited directly; it must be converted to text format first for viewing and editing.
# ceph-base needs to be installed
root@ceph-deploy:~# apt install ceph-base
root@ceph-deploy:~# crushtool -d /data/ceph/crushmap > /data/ceph/crushmap.txt
root@ceph-deploy:~# file /data/ceph/crushmap.txt
/data/ceph/crushmap.txt: ASCII text
9.5.3 Edit the File
root@ceph-deploy:~# vim /data/ceph/crushmap.txt
...
# devices # current device list
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 5 osd.5 class hdd
# types
type 0 osd # an OSD daemon, corresponding to one disk device
type 1 host # a host
type 2 chassis # the chassis of a blade-server enclosure
type 3 rack # a rack containing several servers
type 4 row # a row made up of several racks
type 5 pdu # the power distribution unit a rack is plugged into
type 6 pod # a subdivision of a room containing a group of racks
type 7 room # a room containing several racks; a data center is made up of many such rooms
type 8 datacenter # a data center (IDC)
type 9 zone
type 10 region # a region, e.g. the AWS Ningxia Zhongwei data center
type 11 root # the top of the bucket hierarchy, the root
# buckets
host ceph-node1 { # a bucket of type host, named ceph-node1
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 0.039
alg straw2 # the CRUSH bucket algorithm used to select OSD items
hash 0 # rjenkins1 # which hash algorithm to use; 0 selects rjenkins1
item osd.0 weight 0.019 # weight of osd.0; CRUSH computes it automatically from the disk capacity, so disks of different sizes have different weights
item osd.1 weight 0.019
}
...
# rules
rule replicated_rule { # default rule for replicated pools
id 0
type replicated
min_size 1
max_size 10 # the default maximum number of replicas is 10
step take default # allocate OSDs from the hosts under the "default" root
step chooseleaf firstn 0 type host # choose leaves by host; the failure domain type is host
step emit # emit the result, i.e. return it to the client
}
9.5.4 Convert the Text Back to CRUSH Format
After changing the default maximum replica count (max_size) to 8 in the text file, compile it back into binary format:
root@ceph-deploy:~# crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap
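Before importing, the compiled map can be dry-run tested with crushtool to confirm that the rule still produces valid mappings; a sketch:
# Simulate placements for rule 0 with 3 replicas and print the resulting OSD sets
crushtool -i /data/ceph/newcrushmap --test --rule 0 --num-rep 3 --show-mappings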
9.5.5 Import the New CRUSH Map
The imported map immediately replaces the existing map and takes effect at once.
root@ceph-deploy:~# ceph osd setcrushmap -i /data/ceph/newcrushmap
29
9.5.6 Verify the New CRUSH Map Took Effect
root@ceph-deploy:~# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 8, # 已经变成8个
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
9.6 Classifying Data
When the CRUSH algorithm assigns PGs, it can place them on OSDs of different hosts, giving host-level high availability; this is the default behavior. It cannot, however, guarantee that the PGs land on hosts in different racks or machine rooms, nor can it, for example, keep project A's data on SSDs and project B's data on HDDs. To achieve rack-level, IDC-level, or device-class-based placement, the CRUSH map must be exported, edited by hand, and then imported back to replace the existing map.
9.6.1 Edit the File
Put osd.1, osd.3, and osd.5 into an ssd group
root@ceph-deploy:~# vim /data/ceph/crushmap.txt
...
# pop ssd node
host ceph-ssdnode1 {
id -103
id -104 class hdd
alg straw2
hash 0
item osd.1 weight 0.098
}
host ceph-ssdnode2 {
id -105
id -106 class hdd
alg straw2
hash 0
item osd.3 weight 0.098
}
host ceph-ssdnode3 {
id -107
id -108 class hdd
alg straw2
hash 0
item osd.5 weight 0.098
}
# pop bucket
root ssd {
id -127
id -11 class hdd
alg straw
hash 0
item ceph-ssdnode1 weight 0.488
item ceph-ssdnode2 weight 0.488
item ceph-ssdnode3 weight 0.488
}
# pop ssd rules
rule pop_ssd_rule {
id 20
type replicated
min_size 1
max_size 5
step take ssd
step chooseleaf firstn 0 type host
step emit
}
9.6.2 Convert the Text Back to CRUSH Format
root@ceph-deploy:~# crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap20210906
9.6.3 Import the New CRUSH Map
root@ceph-deploy:~# ceph osd setcrushmap -i /data/ceph/newcrushmap20210906
30
9.6.4 Verify the New CRUSH Map Took Effect
root@ceph-deploy:~# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 8,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 20,
"rule_name": "pop_ssd_rule",
"ruleset": 20,
"type": 1,
"min_size": 1,
"max_size": 5,
"steps": [
{
"op": "take",
"item": -127,
"item_name": "ssd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
9.6.5 Create a Test Pool
root@ceph-deploy:~# ceph osd pool create pop-ssdpool 32 32 pop_ssd_rule
pool 'pop-ssdpool' created
9.6.6 Verify the PG Placement
All of the PGs are placed on osd.1, osd.3, and osd.5:
root@ceph-deploy:~# ceph pg ls-by-pool pop-ssdpool | awk '{print $1,$2,$15}'
PG OBJECTS ACTING
12.0 0 [5,3,1]p5
12.1 0 [5,3,1]p5
12.2 0 [1,3,5]p1
12.3 0 [3,5,1]p3
12.4 0 [3,1,5]p3
12.5 0 [3,1,5]p3
12.6 0 [3,5,1]p3
12.7 0 [5,3,1]p5
12.8 0 [5,3,1]p5
12.9 0 [1,5,3]p1
12.a 0 [3,5,1]p3
12.b 0 [1,5,3]p1
12.c 0 [5,3,1]p5
12.d 0 [1,3,5]p1
12.e 0 [3,5,1]p3
12.f 0 [3,1,5]p3
12.10 0 [3,5,1]p3
12.11 0 [3,1,5]p3
12.12 0 [3,5,1]p3
12.13 0 [5,1,3]p5
12.14 0 [1,5,3]p1
12.15 0 [3,1,5]p3
12.16 0 [5,1,3]p5
12.17 0 [3,5,1]p3
12.18 0 [5,1,3]p5
12.19 0 [1,5,3]p1
12.1a 0 [5,3,1]p5
12.1b 0 [1,3,5]p1
12.1c 0 [5,3,1]p5
12.1d 0 [1,3,5]p1
12.1e 0 [1,5,3]p1
12.1f 0 [3,1,5]p3
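An existing pool can also be switched onto the new rule instead of creating a new pool; a hedged example using the RGW data pool from section 8 (switching it would trigger data migration onto osd.1/3/5):
# Assign the pool to the ssd rule and confirm the change
ceph osd pool set default.rgw.buckets.data crush_rule pop_ssd_rule
ceph osd pool get default.rgw.buckets.data crush_rule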
10. Ceph Dashboard and Monitoring
The Ceph dashboard is a web interface for viewing the status of a running Ceph cluster and configuring its features.
10.1 Enable the Dashboard Plugin
On node1
root@ceph-node1:~# apt install ceph-mgr-dashboard
On the ceph-deploy node
# List all modules
cephadm@ceph-deploy:~/ceph-cluster$ ceph mgr module ls
# Enable the dashboard module
cephadm@ceph-deploy:~/ceph-cluster$ ceph mgr module enable dashboard
10.2 Configure the Dashboard Module
The Ceph dashboard is configured on the mgr node, where SSL can be enabled or disabled, as follows:
cephadm@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ssl false # disable SSL
cephadm@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ceph-node1/server_addr 192.168.1.100 # set the dashboard listen address
cephadm@ceph-deploy:~/ceph-cluster$ ceph config set mgr mgr/dashboard/ceph-node1/server_port 9009 # set the port
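The effective dashboard URL can be confirmed from the mgr itself:
# List the URLs of the enabled mgr services, including the dashboard
ceph mgr services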
10.3 Check the Cluster Status
cephadm@ceph-deploy:~/ceph-cluster$ ceph -s
cluster:
id: 06d842e1-95c5-442d-b7fe-618050963147
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph-node1,ceph-node2,ceph-node3 (age 98m)
mgr: ceph-node1(active, since 5m), standbys: ceph-node2
mds: 2/2 daemons up, 1 standby
osd: 5 osds: 5 up (since 2h), 5 in (since 93m); 1 remapped pgs
rgw: 2 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 369 pgs
objects: 340 objects, 189 MiB
usage: 952 MiB used, 99 GiB / 100 GiB avail
pgs: 6/1020 objects misplaced (0.588%)
368 active+clean
1 active+clean+remapped
10.4 Verify Access
# If the port is not listening on the mgr node, restart the mgr service
root@ceph-node1:/var/log/ceph# systemctl restart ceph-mgr@ceph-node1.service
root@ceph-node1:/var/log/ceph# lsof -i :9009
root@ceph-node1:/var/log/ceph# lsof -i :9009
root@ceph-node1:/var/log/ceph# lsof -i :9009
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ceph-mgr 5255 ceph 30u IPv4 74231 0t0 TCP ceph-node1:9009 (LISTEN)
The dashboard page looks like this:
10.4.1 Set the Dashboard Username and Password
cephadm@ceph-deploy:~/ceph-cluster$ touch pass
cephadm@ceph-deploy:~/ceph-cluster$ echo "12345678" > pass
cephadm@ceph-deploy:~/ceph-cluster$ ceph dashboard set-login-credentials pop -i pass
******************************************************************
*** WARNING: this command is deprecated. ***
*** Please use the ac-user-* related commands to manage users. ***
******************************************************************
Username and password updated
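As the warning notes, set-login-credentials is deprecated; on recent releases the ac-user-* commands are the replacement. A hedged equivalent (the username, password file, and administrator role are only examples):
# Create an administrator account, reading the password from the file created above
ceph dashboard ac-user-create popadmin -i pass administrator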
After setting the credentials, the login page looks like this:
10.5 Monitoring the Ceph Nodes with Prometheus
Software download: https://prometheus.io/download/
10.5.1 Deploy Prometheus
root@ceph-node1:/apps# tar -xvf prometheus-2.29.2.linux-amd64.tar.gz
prometheus-2.29.2.linux-amd64/
prometheus-2.29.2.linux-amd64/consoles/
prometheus-2.29.2.linux-amd64/consoles/index.html.example
prometheus-2.29.2.linux-amd64/consoles/node-cpu.html
prometheus-2.29.2.linux-amd64/consoles/node-disk.html
prometheus-2.29.2.linux-amd64/consoles/node-overview.html
prometheus-2.29.2.linux-amd64/consoles/node.html
prometheus-2.29.2.linux-amd64/consoles/prometheus-overview.html
prometheus-2.29.2.linux-amd64/consoles/prometheus.html
prometheus-2.29.2.linux-amd64/console_libraries/
prometheus-2.29.2.linux-amd64/console_libraries/menu.lib
prometheus-2.29.2.linux-amd64/console_libraries/prom.lib
prometheus-2.29.2.linux-amd64/prometheus.yml
prometheus-2.29.2.linux-amd64/LICENSE
prometheus-2.29.2.linux-amd64/NOTICE
prometheus-2.29.2.linux-amd64/prometheus
prometheus-2.29.2.linux-amd64/promtool
# Create a symlink
root@ceph-node1:/apps# ln -sv /apps/prometheus-2.29.2.linux-amd64 /apps/prometheus
'/apps/prometheus' -> '/apps/prometheus-2.29.2.linux-amd64'
# Configure the service to start on boot
root@ceph-node1:/apps# cat <<"EOF">>/etc/systemd/system/prometheus.service
> [Unit]
> Description=Prometheus Server
> Documentation=https://prometheus.io/docs/introduction/overview/
> After=network.target
> [Service]
> Restart=on-failure
> WorkingDirectory=/apps/prometheus/
> ExecStart=/apps/prometheus/prometheus --config.file=/apps/prometheus/prometheus.yml
> [Install]
> WantedBy=multi-user.target
> EOF
root@ceph-node1:/apps# systemctl daemon-reload
root@ceph-node1:/apps# systemctl start prometheus
root@ceph-node1:/apps# systemctl enable prometheus
Created symlink /etc/systemd/system/multi-user.target.wants/prometheus.service → /etc/systemd/system/prometheus.service.
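A quick check that the server came up; Prometheus listens on port 9090 by default and exposes a built-in health endpoint:
# Confirm the listener and query the health endpoint
ss -tnl | grep 9090
curl -s http://192.168.1.100:9090/-/healthy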
10.5.2 Access Prometheus
10.5.3 Deploy node_exporter
root@ceph-node1:/apps# tar -xvf node_exporter-1.2.2.linux-amd64.tar.gz
node_exporter-1.2.2.linux-amd64/
node_exporter-1.2.2.linux-amd64/LICENSE
node_exporter-1.2.2.linux-amd64/NOTICE
node_exporter-1.2.2.linux-amd64/node_exporter
# Create a symlink
root@ceph-node1:/apps# ln -sv /apps/node_exporter-1.2.2.linux-amd64 /apps/node_exporter
'/apps/node_exporter' -> '/apps/node_exporter-1.2.2.linux-amd64'
# Configure the service to start on boot
root@ceph-node1:/apps# cat <<"EOF">>/etc/systemd/system/node-exporter.service
> [Unit]
> Description=Prometheus Node Exporter
> After=network.target
> [Service]
> ExecStart=/apps/node_exporter/node_exporter
> [Install]
> WantedBy=multi-user.target
> EOF
root@ceph-node1:/apps# systemctl daemon-reload
root@ceph-node1:/apps# systemctl restart node-exporter
root@ceph-node1:/apps# systemctl enable node-exporter
Created symlink /etc/systemd/system/multi-user.target.wants/node-exporter.service → /etc/systemd/system/node-exporter.service.
Verify the node_exporter data on node1 by visiting: http://192.168.1.100:9100/metrics
Note: install node_exporter on node2 and node3 in the same way as on node1.
10.5.4 Configure the Prometheus Server Targets and Verify
root@ceph-node1:/apps# vim /apps/prometheus/prometheus.yml
...
- job_name: "ceph-node-data"
static_configs:
- targets: ['192.168.1.100:9100','192.168.1.101:9100','192.168.1.102:9100']
root@ceph-node1:/apps# systemctl restart prometheus
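promtool, shipped in the same tarball, can validate the edited configuration before restarting the service:
# Check prometheus.yml for syntax errors
/apps/prometheus/promtool check config /apps/prometheus/prometheus.yml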
10.6 Monitoring the Ceph Services with Prometheus
The Ceph manager includes a built-in Prometheus module that listens on port 9283 of every manager node; this port exposes the collected metrics to Prometheus over HTTP.
Official plugin documentation: Prometheus plugin — Ceph Documentation
10.6.1 Enable the Prometheus Module
cephadm@ceph-deploy:~/ceph-cluster$ ceph mgr module enable prometheus
# Verify on node1 that port 9283 is now listening
root@ceph-node1:/apps# ss -tnl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 5 192.168.1.100:9009 0.0.0.0:*
LISTEN 0 128 192.168.1.100:6801 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6801 0.0.0.0:*
LISTEN 0 128 192.168.1.100:6802 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6802 0.0.0.0:*
LISTEN 0 128 192.168.1.100:6803 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6803 0.0.0.0:*
LISTEN 0 128 192.168.1.100:6804 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6804 0.0.0.0:*
LISTEN 0 128 192.168.1.100:6805 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6805 0.0.0.0:*
LISTEN 0 128 127.0.0.53%lo:53 0.0.0.0:*
LISTEN 0 128 192.168.1.100:6806 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6806 0.0.0.0:*
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 128 192.168.1.100:6807 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6807 0.0.0.0:*
LISTEN 0 128 0.0.0.0:8888 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6808 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6809 0.0.0.0:*
LISTEN 0 128 172.16.1.100:3300 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6789 0.0.0.0:*
LISTEN 0 128 192.168.1.100:6800 0.0.0.0:*
LISTEN 0 128 172.16.1.100:6800 0.0.0.0:*
LISTEN 0 128 [::]:22 [::]:*
LISTEN 0 128 *:9090 *:*
LISTEN 0 5 *:9283 *:*
LISTEN 0 128 *:9100 *:*
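The exporter output can also be fetched directly to confirm that the module is serving metrics (192.168.1.100 is node1, as elsewhere in this document):
# Print the first few ceph_* metrics exported by the mgr prometheus module
curl -s http://192.168.1.100:9283/metrics | head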
10.6.2 Configure Prometheus Data Collection
root@ceph-node1:/apps/prometheus# vim prometheus.yml
...
- job_name: "ceph-cluster-data"
static_configs:
- targets: ['192.168.1.101:9283']
root@ceph-node1:/apps/prometheus# systemctl restart prometheus
10.7 Install and Configure Grafana to Display the Metrics
10.7.1 Install Grafana
Download the deb package: grafana_7.5.10_amd64
root@ceph-node1:/usr/local/src# wget https://dl.grafana.com/oss/release/grafana_7.5.10_amd64.deb
root@ceph-node1:/usr/local/src# apt-get install -y adduser libfontconfig1
root@ceph-node1:/usr/local/src# dpkg -i grafana_7.5.10_amd64.deb
# Enable start on boot
root@ceph-node1:/usr/local/src# systemctl enable grafana-server
Synchronizing state of grafana-server.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable grafana-server
Created symlink /etc/systemd/system/multi-user.target.wants/grafana-server.service → /usr/lib/systemd/system/grafana-server.service.
# Start the grafana service
root@ceph-node1:/usr/local/src# systemctl restart grafana-server
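Grafana listens on port 3000 by default; a quick check before opening it in the browser:
# Confirm grafana-server is listening
ss -tnl | grep 3000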
10.7.2 Access Grafana
10.7.3 Configure the Prometheus Data Source
10.7.4 Import an Open-Source Dashboard Template
Official site: https://grafana.com/grafana/dashboards
Example: Ceph - Cluster dashboard for Grafana | Grafana Labs (ID 2842)
Note: add templates according to your needs.