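Concepts
- Chassis cluster cluster-id (a chassis cluster can contain many redundancy groups)
- Node ID (node id)
- Redundancy group
- Three factors decide which node is primary for a redundancy group: the priority configured for each node, the node ID (if the priorities tie, the node with the lowest ID, node 0, wins), and the order in which the nodes come up. If the lower-priority node comes up first, it becomes primary for the redundancy group, and it stays primary unless preempt is enabled.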
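As an illustration of these factors, priority and preempt are set per redundancy group in configuration mode. This is only a minimal sketch: redundancy group 1 and the priority values 200 and 100 are assumptions made for the example and are not part of the setup described in these notes, where both nodes run with priority 1. Lines starting with # are annotations for the reader, not commands.

```
# Assumed values: the higher priority wins, so node 0 is preferred in both groups
set chassis cluster redundancy-group 0 node 0 priority 200
set chassis cluster redundancy-group 0 node 1 priority 100
set chassis cluster redundancy-group 1 node 0 priority 200
set chassis cluster redundancy-group 1 node 1 priority 100
# With preempt, the higher-priority node takes redundancy group 1 back once it is up again
set chassis cluster redundancy-group 1 preempt
```

Without the preempt statement, whichever node became primary first stays primary even after a higher-priority node comes up.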
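Summary
Once a device joins a cluster, it becomes a node of that cluster. Apart from node-specific settings and the management IP addresses, the nodes in a cluster share the same configuration.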
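Node-specific settings are typically kept in the node0 and node1 configuration groups, which each node applies to itself. The sketch below assumes the host names srx-a and srx-b used in these notes; the fxp0 addresses are placeholders from the documentation range 192.0.2.0/24, not values from a real setup. Lines starting with # are annotations, not commands.

```
# Settings that apply only to node 0
set groups node0 system host-name srx-a
set groups node0 interfaces fxp0 unit 0 family inet address 192.0.2.1/24
# Settings that apply only to node 1
set groups node1 system host-name srx-b
set groups node1 interfaces fxp0 unit 0 family inet address 192.0.2.2/24
# Each node applies only the group that matches its own node name
set apply-groups "${node}"
```

Everything outside these two groups is shared and synchronized between the nodes.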
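Chassis cluster overview
Control plane
- Synchronizes configuration and kernel state between the nodes
- The nodes are connected through control ports (check which port acts as the control port on your platform)
- Runs in active/passive mode: the two nodes back each other up, one as primary and one as secondary. If the primary fails, the secondary takes over traffic processing
Data plane
- The nodes are connected through fabric ports to form a single data plane (check which port acts as the fabric port)
- Synchronizes session information for the traffic flowing through each node, so that established sessions are not dropped during a failover
- The data plane software runs in active/active mode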
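Fabric ports are usually defined by assigning one physical interface on each node to the fab0 and fab1 interfaces. The sketch below is an assumption-heavy example: ge-0/0/2 and ge-5/0/2 are hypothetical interface names, and the slot number used for node 1 interfaces depends on the SRX model, so check the numbering for your platform before using it. Lines starting with # are annotations, not commands.

```
# fab0 is the fabric interface of node 0 (interface name is an assumption)
set interfaces fab0 fabric-options member-interfaces ge-0/0/2
# fab1 is the fabric interface of node 1 (interface name and slot offset are assumptions)
set interfaces fab1 fabric-options member-interfaces ge-5/0/2
```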
States of a cluster node
- hold (waiting)
- primary
- secondary-hold (secondary, waiting)
- secondary
- ineligible
- disabled
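Things to check before configuring a chassis cluster
- The hardware and software of the nodes must match
- Set the root-authentication password on each node first, and use the same password on both
- The management and control ports must not carry any configuration; otherwise the control link may be unable to communicate because the port is in use, and the cluster fails to form. If you do not know which ports these are, restore the factory defaults first (in configuration mode:
load factory-default
), then run ( run show configuration | display set | match interface ) to check whether any interface statements remain, and remove any that do with the delete command
The error below is caused by not setting a password: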
root# commit
[edit]
'system'
Missing mandatory statement: 'root-authentication'
error: commit failed: (missing statements)
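The message says that root authentication is not configured, because the password is empty at the first login to Junos. Set the root password and then commit again: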
[edit]
root# set system root-authentication plain-text-password
New password:
Retype new password:
[edit]
root# commit
commit complete
[edit]
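Configuration steps
- First go through "Things to check before configuring a chassis cluster" above and make sure every item is satisfied
1. Set root-authentication
There is no need to set a host name at this step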
- srx-a
root# set system root-authentication plain-text-password
New password:
Retype new password:
[edit]
root# commit
commit complete
- srx-b
root# set system root-authentication plain-text-password
New password:
Retype new password:
[edit]
root# commit
commit complete
2. Set up the chassis cluster
Run the following commands in operational mode on each node. Note that the node number differs between the two commands
srx-a
root> set chassis cluster cluster-id 1 node 0 reboot
srx-b
root> set chassis cluster cluster-id 1 node 1 reboot
Verify the cluster
- show chassis cluster status
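When the status output does not show both nodes, a few other operational-mode commands can help narrow down whether the control link or the fabric link is the problem. This is a suggested sketch rather than part of the original procedure, and their output is not reproduced here. Lines starting with # are annotations, not commands.

```
# Control link and fabric link state
show chassis cluster interfaces
# Heartbeat counters on the control link
show chassis cluster control-plane statistics
# Synchronization counters between the nodes
show chassis cluster statistics
```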
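Abnormal case
If the management or control port is in use, or the nodes cannot synchronize for some other reason, the output looks like this.
- Node srx-a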
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 primary no no None
node1 0 lost n/a n/a n/a
{primary:node0}
- Node srx-b cannot see its peer's state either
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 0 lost n/a n/a n/a
node1 1 primary no no None
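Look at the status of both nodes. A lost status like this usually means that the control port is in use and the two nodes cannot synchronize; deleting the configuration on the control port resolves it
Normal case
- Node node0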
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 primary no no None
node1 1 secondary no no None
{primary:node0}
- Node node1
root@srx-b> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 0
node0 1 primary no no None
node1 1 secondary no no None
{secondary:node1}
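Test failover
- Reboot the primary node; the secondary takes over immediately
After node0 finishes rebooting it does not automatically become primary again. It first sits in "hold" and then becomes "secondary"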
root@srx-b> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 hold no no None
node1 1 primary no no None
root@srx-b> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 secondary no no None
node1 1 primary no no None
Manual failover
- Command in operational mode
request chassis cluster failover
You must specify which redundancy-group to fail over and which node becomes primary
root> request chassis cluster failover ?
Possible completions:
node Node identifier of the new primary (0..1)
redundancy-group Redundancy-group identifier (0..63)
reset Undo the previous failover command
root> request chassis cluster failover node 0 redundancy-group 0
node0:
--------------------------------------------------------------------------
Initiated manual failover for redundancy group 0
{secondary:node0}
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 255 primary no yes None
node1 1 secondary-hold no yes None
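Note that node0's priority is now 255. The following command resets it to 1 (in general, after a manual failover like this one, reset restores the configured primary and secondary priorities)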
root> request chassis cluster failover reset redundancy-group 0
node0:
--------------------------------------------------------------------------
Successfully reset manual failover for redundancy group 0
node1:
--------------------------------------------------------------------------
No reset required for redundancy group 0.
{primary:node0}
root> show chassis cluster status
Monitor Failure codes:
CS Cold Sync monitoring FL Fabric Connection monitoring
GR GRES monitoring HW Hardware monitoring
IF Interface monitoring IP IP monitoring
LB Loopback monitoring MB Mbuf monitoring
NH Nexthop monitoring NP NPC monitoring
SP SPU monitoring SM Schedule monitoring
CF Config Sync monitoring
Cluster ID: 1
Node Priority Status Preempt Manual Monitor-failures
Redundancy group: 0 , Failover count: 1
node0 1 primary no no None
node1 1 secondary no no None
{primary:node0}
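How to remove the chassis cluster configuration
There are two ways, both run in operational mode.
- set chassis cluster disable reboot: disables clustering directly
- set chassis cluster cluster-id 0 node 1 reboot: a cluster-id of 0 also disables clustering. The change takes effect after the reboot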
Problems encountered during configuration
Problem 1
root@srx-a# set groups node0 system host-name srx-A
{primary:node0}[edit]
root@srx-a# set groups node1 system host-name srx-B
{primary:node0}[edit]
root@srx-a# commit
[edit interfaces]
'ge-0/0/0'
HA management port cannot be configured
error: configuration check-out failed
{primary:node0}[edit]
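Solution:
When clustering is enabled, ge-0/0/0 becomes fxp0 (the management interface) and ge-0/0/1 becomes fxp1 (the control link), so neither can be configured as a regular interface. Delete the ge-0/0/0 configuration and commit again.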
https://kb.juniper.net/InfoCenter/index?page=content&id=KB15356
Problem 2
root@srx-a# commit
[edit security zones security-zone untrust]
'interfaces ge-0/0/0.0'
Interface ge-0/0/0.0 must be configured under interfaces
error: configuration check-out failed
The cause is that interface ge-0/0/0.0 does not exist. Check show interfaces to see whether unit 0 of ge-0/0/0 is configured there.
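Both problems come from leftover configuration that still refers to ge-0/0/0 after it has become the management port. A minimal cleanup sketch in configuration mode, using the interface and zone names from the error messages above, might look like this. It assumes nothing else in the configuration references the interface. Lines starting with # are annotations, not commands.

```
# Problem 1: remove the interface stanza for the port that is now fxp0
delete interfaces ge-0/0/0
# Problem 2: remove the zone binding that points at the interface unit that no longer exists
delete security zones security-zone untrust interfaces ge-0/0/0.0
commit
```

If the commit still fails, the error message names the next statement that references the interface, and that statement can be removed the same way.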