Symptom
Several identically configured servers run ESXi. Only one of them shows CRC error counts, on its vmnic3 port, as shown below:
vmnic3
Total receive errors: 531
Receive CRC errors: 531
Total transmit errors: 15
Transmit carrier errors: 15
The NIC is a Mellanox CX-4. vmnic1 and vmnic3 together carry the vSAN and vMotion traffic.
PortGroup Name  VLAN ID  Used Ports  Uplinks
vSAN            888      1           vmnic1,vmnic3
vMotion         999      1           vmnic3,vmnic1
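Checks already performed
1. The site reported that the fiber cable on this NIC port had been replaced as a test, with no effect.
2. The site reported that the optical module on this NIC port had been replaced as a test, with no effect.
3. The site reported that the cluster has four machines and only vmnic3 on this one machine has the problem.
4. The site reported that the optical transmit level of the switch port connected to this NIC port had been checked and was normal.
Analysis
1. Compared the configuration of the different NICs, such as link negotiation. No differences or anomalies were found.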
NIC: vmnic2
vmnic2 0000:32:00.0 nmlx5_core Up Up 10000 Full 58:a2:e1:5d:ea:5c 1500 Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
NICInfo:
Advertised Auto Negotiation: true
Advertised Link Modes: Auto, 1000BaseCX-SGMII/Full, 10000BaseKR/Full, 25000BaseTwinax/Full
NIC: vmnic3
vmnic3 0000:32:00.1 nmlx5_core Up Up 10000 Full 58:a2:e1:5d:ea:5d 1500 Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
NICInfo:
Advertised Auto Negotiation: true
Advertised Link Modes: Auto, 1000BaseCX-SGMII/Full, 10000BaseKR/Full, 25000BaseTwinax/Full
Auto Negotiation: true
2. vmnic3 does show CRC errors, while the other NICs do not.
NIC statistics for vmnic2:
Packets received: 12338613
Packets sent: 7015755
Bytes received: 9063176871
Bytes sent: 8771220503
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 337927
Broadcast packets received: 132596
Multicast packets sent: 12791
Broadcast packets sent: 1438
Total receive errors: 0
Receive length errors: 0
Receive over errors: 0
Receive CRC errors: 0
Receive frame errors: 0
Receive FIFO errors: 0
Receive missed errors: 0
Total transmit errors: 0
Transmit aborted errors: 0
Transmit carrier errors: 0
Transmit FIFO errors: 0
Transmit heartbeat errors: 0
Transmit window errors: 0
NIC statistics for vmnic3:
Packets received: 38529788
Packets sent: 4276632
Bytes received: 54529496295
Bytes sent: 43956122050
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 291699
Broadcast packets received: 51928
Multicast packets sent: 11898
Broadcast packets sent: 225
Total receive errors: 531
Receive length errors: 0
Receive over errors: 0
Receive CRC errors: 531
Receive frame errors: 0
Receive FIFO errors: 0
Receive missed errors: 0
Total transmit errors: 15
Transmit aborted errors: 0
Transmit carrier errors: 15
Transmit FIFO errors: 0
Transmit heartbeat errors: 0
Transmit window errors: 0
NIC statistics for vmnic1:
Packets received: 242602034
Packets sent: 52688663
Bytes received: 300501395238
Bytes sent: 277895433396
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 292354
Broadcast packets received: 52243
Multicast packets sent: 11901
Broadcast packets sent: 428
Total receive errors: 0
Receive length errors: 0
Receive over errors: 0
Receive CRC errors: 0
Receive frame errors: 0
Receive FIFO errors: 0
Receive missed errors: 0
Total transmit errors: 0
Transmit aborted errors: 0
Transmit carrier errors: 0
Transmit FIFO errors: 0
Transmit heartbeat errors: 0
Transmit window errors: 0
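3. Checked the system logs. Nothing abnormal was found for the NIC or its driver.
4. Compared the traffic on vmnic1 and vmnic3. vmnic3 carries less traffic (about 38.5 million packets received, versus about 242.6 million on vmnic1), so traffic load can largely be ruled out as the cause.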
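The comparison in steps 2 and 4 can be sketched in Python. This is only a minimal sketch, assuming the counters are available as text in the same "Name: value" layout as the output above. `parse_nic_stats` and `error_report` are hypothetical helper names, not part of any VMware tool, and the sample text holds only a subset of the counters quoted above.

```python
import re

# Subset of the counters quoted above, in the same "Name: value" layout.
SAMPLE = """\
NIC statistics for vmnic3:
Packets received: 38529788
Total receive errors: 531
Receive CRC errors: 531
Total transmit errors: 15
Transmit carrier errors: 15
NIC statistics for vmnic1:
Packets received: 242602034
Total receive errors: 0
Receive CRC errors: 0
Total transmit errors: 0
Transmit carrier errors: 0
"""

def parse_nic_stats(text):
    """Return {nic_name: {counter_name: int}} from 'NIC statistics for X:' blocks."""
    stats, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"NIC statistics for (\S+):$", line)
        if header:
            current = stats.setdefault(header.group(1), {})
            continue
        counter = re.match(r"(.+?):\s*(\d+)$", line)
        if counter and current is not None:
            current[counter.group(1)] = int(counter.group(2))
    return stats

def error_report(stats):
    """Per NIC: non-zero error counters and CRC errors per million received packets."""
    report = {}
    for nic, counters in stats.items():
        nonzero = {k: v for k, v in counters.items() if "errors" in k and v > 0}
        received = counters.get("Packets received", 0)
        crc = counters.get("Receive CRC errors", 0)
        per_million = crc / received * 1_000_000 if received else 0.0
        report[nic] = {"nonzero_errors": nonzero, "crc_per_million": per_million}
    return report

if __name__ == "__main__":
    for nic, info in error_report(parse_nic_stats(SAMPLE)).items():
        print(nic, info)
```

With the figures above, vmnic3 comes out at roughly 14 CRC errors per million received packets while vmnic1 shows none despite carrying far more traffic, which is why load alone is an unlikely explanation.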
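Plan
Given the findings above, and because CRC errors are strongly tied to the physical link, the following plan was drawn up using a cross-swap approach.
Swap the fiber cables of vmnic2 and vmnic3 at the server end and observe how the CRC error count changes:
If the CRC count keeps growing on vmnic3, the problem is unrelated to anything outside the server (the fault is on the server side).
If the CRC count moves to vmnic2 and grows there, the problem lies outside the server and is unrelated to the server.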
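The decision rule of the cross-swap can be written down as a small function. This is an illustrative sketch only: `locate_fault` is a hypothetical name, and the two "inconclusive" branches are assumptions added here for completeness, since the plan itself only describes the two clear-cut outcomes.

```python
def locate_fault(vmnic2_growth, vmnic3_growth):
    """Interpret CRC growth seen after swapping the vmnic2/vmnic3 cables at the server end.

    Each argument is the increase in "Receive CRC errors" on that port after the swap.
    """
    if vmnic3_growth > 0 and vmnic2_growth == 0:
        # Errors stayed with the port, so the cable/switch path is not the cause.
        return "server side"
    if vmnic2_growth > 0 and vmnic3_growth == 0:
        # Errors followed the cable, so the cause is outside the server.
        return "outside the server"
    if vmnic2_growth == 0 and vmnic3_growth == 0:
        # Assumption: no errors at all means the test should be repeated under load.
        return "inconclusive: no errors observed"
    # Assumption: errors on both ports do not fit either outcome of the plan.
    return "inconclusive: errors on both ports"

if __name__ == "__main__":
    print(locate_fault(vmnic2_growth=0, vmnic3_growth=12))
    print(locate_fault(vmnic2_growth=12, vmnic3_growth=0))
```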
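Actual results
1. First confirmed that the vmnic3 count kept growing after the optical module was replaced, so the optical module is not the cause.
2. Put the original optical module back and increased network traffic (to make the problem easier to reproduce). The CRC count could clearly be seen growing.
3. Because of on-site constraints, the fiber cables of vmnic2 and vmnic3 could not be swapped at the server end, so the following was done instead:
Swapped the vmnic3 fiber cable of this machine with the vmnic3 fiber cable of another machine, and swapped the switch ports as well. Result: the CRC count did not grow on either this machine's vmnic3 or the other machine's vmnic3.
Restored the original vmnic3 fiber cables on this machine and the other machine, and restored the original switch ports as well. Result: the CRC count again did not grow on either this machine's vmnic3 or the other machine's vmnic3.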
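Whether the count "grows" in each of these steps comes down to comparing successive readings of the cumulative counter. A minimal sketch, assuming the readings are collected by hand or by a script (for example from `esxcli network nic stats get -n vmnic3`); `crc_growth` and `is_growing` are hypothetical helpers, and every reading other than 531 is made up for illustration.

```python
def crc_growth(readings):
    """Return the increase between each pair of consecutive cumulative readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]

def is_growing(readings):
    """True if the cumulative counter increased in any interval."""
    return any(delta > 0 for delta in crc_growth(readings))

if __name__ == "__main__":
    under_load = [531, 560, 602]   # illustrative: counter climbing while traffic is raised
    after_reseat = [602, 602, 602] # illustrative: counter flat after re-cabling
    print(crc_growth(under_load), is_growing(under_load))
    print(crc_growth(after_reseat), is_growing(after_reseat))
```

Because the counter is cumulative, a non-zero value alone says nothing about the current state of the link. Only the deltas between readings do.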
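Conclusion
The connection at the switch-port end of the link for the original machine's vmnic3 was faulty. The CRC errors stopped once the cabling had been unplugged and reconnected, even in its original position, which is consistent with a poorly seated connection at that end.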