Q: What’s the difference between iperf2 and iperf3? Which one is preferred?
A: iperf3 is single-threaded, so it is better suited to single-stream tests; it is not designed for multi-threaded runs. For multi-threaded (multi-core) tests, use iperf2.
Though the names might imply iperf3 and iperf2 are related, they are implemented totally differently.
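For example (a sketch; the hostname server01 and the stream count are illustrative): iperf2 runs each parallel stream on its own thread, while iperf3's parallel streams all share a single thread:
$ iperf -s                          # iperf2 server
$ iperf -c server01 -P 8 -t 30      # iperf2: 8 streams, one thread per stream
$ iperf3 -s                         # iperf3 server
$ iperf3 -c server01 -P 8 -t 30     # iperf3: 8 streams, still a single thread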
Q: What is NUMA? How to map between a PCI device, port and NUMA node?
A: NUMA: Non-Uniform Memory Access. On a multi-socket system each CPU socket (NUMA node) has its own local memory and locally attached PCIe devices, so a NIC performs best when driven from cores on its own node.
Use the following as reference:
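A quick sketch of the mapping (the interface name and PCI address are placeholders):
$ cat /sys/class/net/<interface>/device/numa_node   # NUMA node of the port's PCI device
$ lspci -s <pci_address> -vv | grep "NUMA node"     # the same information via lspci
$ numactl --hardware                                # which cores belong to each NUMA node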
Q: How to calculate the theoretical throughput of the NIC?
A: The theoretical throughput is L × T × W, where L = line-code efficiency, T = transfer rate and W = link width. Example: with L = 128/130, T = 16 GT/s and W = 16 lanes we get ~252.06 Gb/s (the "T" for transfers cancels out when we multiply by the actual bits per transfer). For a dual-port device on PCIe Gen3 x16 (8 GT/s), the total is 128/130 × 8 × 16 = 126.03 Gb/s, so performance is limited to ~110 Gb/s per side once protocol overhead is taken into account.
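As a quick sanity check of the arithmetic (terms reordered so bc keeps full precision):
$ echo "scale=2; 128*16*16/130" | bc    # Gen4 x16: 252.06 Gb/s
$ echo "scale=2; 128*8*16/130" | bc     # Gen3 x16: 126.03 Gb/s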
Q: What are MPS and MRR?
A: MPS: PCIe’s Max Payload Size; MRR: Max Read Request
MPS is architecture dependent and determines the maximum number of bytes transferred over PCIe in a single operation (the actual number of bits on the wire is, of course, determined by the PCIe width). MRR determines the maximum number of bytes allowed in a single read request. For example, to read 8KB from memory with an MRR of 1KB, eight read requests of 1KB each are issued.
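The MPS and MRR currently in effect can be read with lspci (a sketch; the PCI address is a placeholder):
$ lspci -s <pci_address> -vvv | grep -E "MaxPayload|MaxReadReq"
# DevCap shows the MaxPayload the device supports; DevCtl shows the values in effect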
Q: Why is there packet loss in TCP?
A: 1. Check the physical counters using mlxlink (mlxlink -d <mst device> -c) to see if there are physical errors.
2. Confirm that the MTU of the SUT and the client servers are the same.
3. Check the card counters (ethtool -S) before and after running the test (see the sketch after this list).
To understand the mlx5 ethtool counters, see "Understanding mlx5 ethtool Counters" on the Mellanox Interconnect Community.
A common cause of packet loss is an undersized RX ring buffer; see "Chapter 35. Monitoring and tuning the RX ring buffer", Red Hat Enterprise Linux 8 documentation, Red Hat Customer Portal.
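A minimal sketch of the checks above (the interface name is a placeholder, and exact counter names vary by driver):
$ ip link show <interface> | grep mtu                     # step 2: compare MTU on both sides
$ ethtool -S <interface> | grep -E "drop|discard|error"   # step 3: run before and after the test
$ ethtool -g <interface>                                  # current vs. maximum RX ring size
$ ethtool -G <interface> rx 8192                          # grow the RX ring if drops are observed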
Q: How to debug when the performance bottleneck is the CPU?
A: CPU analysis is essential for understanding performance limitations. Performance behavior can be roughly mapped to CPU-related issues, system-related issues and NIC-related issues. Most networking limitations are CPU related, and debugging them requires proper CPU analysis.
1) mpstat: a tool for getting a general overview of CPU utilization.
2) htop: a tool for viewing CPU utilization per core.
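Typical invocations (a sketch):
$ mpstat -P ALL 1    # per-core utilization, refreshed every second
$ htop               # interactive per-core view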
Think about the following questions:
CPU utilization: What is the CPU utilization on the server side? Where are the CPU cycles spent? Which area is the most CPU-heavy?
Distribution across cores: Is the expected CPU working? How many CPUs are working? Why? (See the check below.)
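Checking which cores service the NIC's interrupts helps answer these questions (a sketch; the IRQ number is a placeholder):
$ grep mlx5 /proc/interrupts                     # per-core interrupt counts for the NIC
$ cat /proc/irq/<irq_number>/smp_affinity_list   # cores allowed to service a given IRQ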
Once we have concluded that CPU utilization is sub-optimal, we would like to analyze what is executed on the CPU and how well it executes. The CPU exposes hooks that can show the functions that run on each core and how much of the CPU time they consume.
1) The perf tool samples the CPU and provides analysis. 'perf top' can be used to show the currently executing functions.
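For example (a sketch; the core number and duration are illustrative):
$ perf top                          # live view of the hottest functions
$ perf record -C 6 -g -- sleep 20   # sample core 6 for 20 s, with call graphs
$ perf report                       # browse where the CPU time went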
Q: What are the common tools?
A:
1) lspci: a native Linux tool for querying PCIe devices.
Exp: $ lspci | grep Mellanox
2) ethtool: a generic tool for querying and setting network device information. Each network device driver (e.g. OFED's) needs to implement a set of callbacks in order to support the tool's API.
Exp: $ ethtool -i <interface>
3) MST tool: part of the MFT (Mellanox FW Tools) package. It may be used for querying properties of Mellanox devices.
Exp: $ mst status -v
4) mlnx_tune: part of the OFED package. It may be used for querying the system for performance-relevant information and for tuning the system for specific scenarios.
Exp: $ mlnx_tune -rc
CASE 1: dual-port bandwidth test on a PCIe Gen3 x16 device
Server:
run_perftest_multi_devices -d mlx5_2,mlx5_3 -c 6,7 -C "ib_write_bw --size 65536 --duration 20 --port 22222 --connection=RC --qp=1 --report_gbits --output=bandwidth"
Client:
run_perftest_multi_devices -d mlx5_2,mlx5_3 -c 6,7 -r gen-l-vrt-131 -C "ib_write_bw --size 65536 --duration 20 --port 22222 --connection=RC --qp=1 --report_gbits --output=bandwidth"
Result: ~110.5 Gb/s
Explanation: As calculated above for PCIe Gen3 x16 (8 GT/s), our dual-port device shares a single x16 link, so overall we can achieve a total of 126.03 Gb/s; once PCIe protocol overhead is considered, the ~110.5 Gb/s result makes sense.
Conclusion: Yes, in this case the PCIe link is the bottleneck.
CASE 2: bidirectional test on a single device
Server:
taskset -c 6-11 ib_write_bw -x 0 --ib-dev=mlx5_2 --ib-port=1 --port=56872 --size=65536 --duration=20 --connection=RC --qp=1 --report_gbits --output=bandwidth --bidirectional
Client:
taskset -c 6-11 ib_write_bw -x 0 --ib-dev=mlx5_2 --ib-port=1 --port=56872 --size=65536 --duration=20 --connection=RC --qp=1 --report_gbits --output=bandwidth --post_list=1 gen-l-vrt-131 --bidirectional
Results & Explanation: ~181 Gb/s. PCIe writes and reads are different operations and do not affect each other; in other words, RX and TX are separate flows that use separate resources, which is why a bidirectional run can exceed the 126.03 Gb/s unidirectional limit.