Notes about using software RDMA to communicate with hosts which have RDMA-capable network adapters.
Looked for documentation about how to use software RDMA.
The Networking Guide for Red Hat Enterprise Linux 7 has 13.3. Configuring Soft-RoCE, but no section about configuring Soft-iWARP.
Compared to the Red Hat Enterprise Linux 8 documentation, the Red Hat Enterprise Linux 7 documentation:
- Describes using the rxe_cfg script, rather than the rdma program.
- Doesn't have a warning about Soft-RoCE being provided as a Technology Preview only.
The Configuring InfiniBand and RDMA networks document for Red Hat Enterprise Linux 8 describes configuration steps:
Notes:
- Both the above are described as being provided as a Technology Preview only.
- Both the above describe using the rdma program, rather than scripts.
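The rdma-based workflow which the RHEL 8 documents describe can be sketched as below. This is a hedged sketch: the function name, the device names rxe0 / siw0 and the netdev eno1 are illustrative assumptions, and actually creating the links needs root plus the rdma_rxe or siw kernel modules, so the sketch just echoes the command when those prerequisites are missing.

```shell
# Sketch of creating a soft RDMA device with the rdma program.
# Assumes a netdev named eno1; needs root and the rdma_rxe / siw modules,
# otherwise the command is only echoed.
create_soft_rdma() {
    type=$1; dev=$2; netdev=$3
    if [ "$(id -u)" -eq 0 ] && command -v rdma >/dev/null 2>&1; then
        rdma link add "$dev" type "$type" netdev "$netdev" 2>/dev/null \
            || { echo "could not add $dev (module or netdev missing?)"; return 0; }
        rdma link show
    else
        echo "would run: rdma link add $dev type $type netdev $netdev"
    fi
}

create_soft_rdma rxe rxe0 eno1    # Soft-RoCE
create_soft_rdma siw siw0 eno1    # Soft-iWARP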
In Red Hat Enterprise Linux 9 Configuring and managing high-speed network protocols and RDMA hardware:
- Chapter 7. Configuring Soft-iWARP for Red Hat Enterprise Linux 9 contains:
The Soft-iWARP feature is deprecated and will be removed in RHEL 10.
Soft-iWARP is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production.
- There is no mention of configuring Soft-RoCE.
- 12.2. Removed hardware support for RHEL 9 says the Soft-RoCE (rdma_rxe) driver has been removed.
In AlmaLinux with Kernel 4.18.0-372.16.1.el8_6.x86_64 there are modules for:
- siw : Software iWARP Driver
- rdma_rxe : Soft RDMA transport
The real-time Kernel 4.18.0-372.16.1.rt7.173.el8_6.x86_64 has the same modules.
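The module availability on a given kernel can be checked with modinfo. A minimal sketch (the helper name is illustrative; modinfo legitimately reports the modules absent on kernels which don't provide them):

```shell
# Report whether the soft RDMA kernel modules exist on the running system.
soft_rdma_modules() {
    for m in rdma_rxe siw; do
        desc=$(modinfo -F description "$m" 2>/dev/null) || desc=""
        printf '%-10s %s\n' "$m" "${desc:-(not present)}"
    done
}

soft_rdma_modules
```
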
rdma-core 37.2 has user space providers for:
- libsiw: A software implementation of the iWarp protocol
- librxe: A software implementation of the RoCE protocol
[mr_halfword@haswell-alma ~]$ ls /usr/lib64/libibverbs
libbnxt_re-rdmav34.so libhns-rdmav34.so libqedr-rdmav34.so
libcxgb4-rdmav34.so libirdma-rdmav34.so librxe-rdmav34.so
libefa-rdmav34.so libmlx4-rdmav34.so libsiw-rdmav34.so
libhfi1verbs-rdmav34.so libmlx5-rdmav34.so libvmw_pvrdma-rdmav34.so
The rxe_cfg script has already been installed from libibverbs-utils-37.2-1.el8.x86_64 (Examples for the libibverbs library).
Ubuntu 18.04.6 LTS with Kernel 4.15.0-189-generic has a module for rdma_rxe, but not siw.
The user space providers:
mr_halfword@Haswell-Ubuntu:~$ ls /usr/lib/x86_64-linux-gnu/libibverbs
libbnxt_re-rdmav17.so libi40iw-rdmav17.so libnes-rdmav17.so
libcxgb3-rdmav17.so libipathverbs-rdmav17.so libocrdma-rdmav17.so
libcxgb4-rdmav17.so libmlx4-rdmav17.so libqedr-rdmav17.so
libhfi1verbs-rdmav17.so libmlx5-rdmav17.so librxe-rdmav17.so
libhns-rdmav17.so libmthca-rdmav17.so libvmw_pvrdma-rdmav17.so
The user space providers were installed from:
mr_halfword@Haswell-Ubuntu:~$ dpkg -S librxe-rdmav17.so
ibverbs-providers:amd64: /usr/lib/x86_64-linux-gnu/libibverbs/librxe-rdmav17.so
The description of that package is:
mr_halfword@Haswell-Ubuntu:~$ apt-cache show ibverbs-providers
Package: ibverbs-providers
Architecture: amd64
Version: 17.1-1ubuntu0.2
Multi-Arch: same
Priority: optional
Section: net
Source: rdma-core
Origin: Ubuntu
Maintainer: Ubuntu Developers <[email protected]>
Original-Maintainer: Benjamin Drung <[email protected]>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 586
Provides: libcxgb3-1, libipathverbs1, libmlx4-1, libmlx5-1, libmthca1, libnes1
Depends: libc6 (>= 2.14), libibverbs1 (>= 17)
Breaks: libcxgb3-1, libipathverbs1, libmlx4-1, libmlx5-1, libmthca1, libnes1
Replaces: libcxgb3-1, libipathverbs1, libmlx4-1, libmlx5-1, libmthca1, libnes1
Filename: pool/main/r/rdma-core/ibverbs-providers_17.1-1ubuntu0.2_amd64.deb
Size: 159800
MD5sum: 1e174747949bad7238634191306ad5be
SHA1: dd5a02453583e436805c062b9e5b835658c3c67f
SHA256: f43927d551a506be5ea8a8b88e576d723addc859d6c5e93e19b6dd95b2c874c3
SHA512: 646bfb69c53f5584de3837e5adab5820af127d4a7699f0af3d539b53147ff4e9307c8b7050c4db1ae953b7a457928fe4ac573d95bff8255affa504cab2e98061
Homepage: https://github.com/linux-rdma/rdma-core
Description-en: User space provider drivers for libibverbs
libibverbs is a library that allows userspace processes to use RDMA
"verbs" as described in the InfiniBand Architecture Specification and
the RDMA Protocol Verbs Specification. iWARP ethernet NICs support
RDMA over hardware-offloaded TCP/IP, while InfiniBand is a
high-throughput, low-latency networking technology. InfiniBand host
channel adapters (HCAs) and iWARP NICs commonly support direct
hardware access from userspace (kernel bypass), and libibverbs
supports this when available.
.
A RDMA driver consists of a kernel portion and a user space portion.
This package contains the user space verbs drivers:
.
- bnxt_re: Broadcom NetXtreme-E RoCE HCAs
- cxgb3: Chelsio T3 iWARP HCAs
- cxgb4: Chelsio T4 iWARP HCAs
- hfi1verbs: Intel Omni-Path HFI
- hns: HiSilicon Hip06 SoC
- i40iw: Intel Ethernet Connection X722 RDMA
- ipathverbs: QLogic InfiniPath HCAs
- mlx4: Mellanox ConnectX-3 InfiniBand HCAs
- mlx5: Mellanox Connect-IB/X-4+ InfiniBand HCAs
- mthca: Mellanox InfiniBand HCAs
- nes: Intel NetEffect NE020-based iWARP adapters
- ocrdma: Emulex OneConnect RDMA/RoCE device
- qedr: QLogic QL4xxx RoCE HCAs
- rxe: A software implementation of the RoCE protocol
- vmw_pvrdma: VMware paravirtual RDMA device
Description-md5: 64b42b06411dd091f1db96ec8c93d131
Task: samba-server, ubuntu-budgie-desktop
Supported: 5y
Package: ibverbs-providers
Architecture: amd64
Version: 17.1-1
Multi-Arch: same
Priority: optional
Section: net
Source: rdma-core
Origin: Ubuntu
Maintainer: Ubuntu Developers <[email protected]>
Original-Maintainer: Benjamin Drung <[email protected]>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 586
Provides: libcxgb3-1, libipathverbs1, libmlx4-1, libmlx5-1, libmthca1, libnes1
Depends: libc6 (>= 2.14), libibverbs1 (>= 17)
Breaks: libcxgb3-1, libipathverbs1, libmlx4-1, libmlx5-1, libmthca1, libnes1
Replaces: libcxgb3-1, libipathverbs1, libmlx4-1, libmlx5-1, libmthca1, libnes1
Filename: pool/main/r/rdma-core/ibverbs-providers_17.1-1_amd64.deb
Size: 159824
MD5sum: e289f4500f6dbec1f0794347af2a6c51
SHA1: a3b8df23610ed5ae12b3f5380f053dd5e1470408
SHA256: f3117e075111b94fc24d233d63a0ae2bf4ec60394ef1118add9eb25ad60be078
Homepage: https://github.com/linux-rdma/rdma-core
Description-en: User space provider drivers for libibverbs
libibverbs is a library that allows userspace processes to use RDMA
"verbs" as described in the InfiniBand Architecture Specification and
the RDMA Protocol Verbs Specification. iWARP ethernet NICs support
RDMA over hardware-offloaded TCP/IP, while InfiniBand is a
high-throughput, low-latency networking technology. InfiniBand host
channel adapters (HCAs) and iWARP NICs commonly support direct
hardware access from userspace (kernel bypass), and libibverbs
supports this when available.
.
A RDMA driver consists of a kernel portion and a user space portion.
This package contains the user space verbs drivers:
.
- bnxt_re: Broadcom NetXtreme-E RoCE HCAs
- cxgb3: Chelsio T3 iWARP HCAs
- cxgb4: Chelsio T4 iWARP HCAs
- hfi1verbs: Intel Omni-Path HFI
- hns: HiSilicon Hip06 SoC
- i40iw: Intel Ethernet Connection X722 RDMA
- ipathverbs: QLogic InfiniPath HCAs
- mlx4: Mellanox ConnectX-3 InfiniBand HCAs
- mlx5: Mellanox Connect-IB/X-4+ InfiniBand HCAs
- mthca: Mellanox InfiniBand HCAs
- nes: Intel NetEffect NE020-based iWARP adapters
- ocrdma: Emulex OneConnect RDMA/RoCE device
- qedr: QLogic QL4xxx RoCE HCAs
- rxe: A software implementation of the RoCE protocol
- vmw_pvrdma: VMware paravirtual RDMA device
Description-md5: 64b42b06411dd091f1db96ec8c93d131
Task: samba-server, ubuntu-budgie-desktop
Supported: 5y
The rxe_cfg script is not currently installed, but is reported as available:
mr_halfword@Haswell-Ubuntu:~$ rxe_cfg
Command 'rxe_cfg' not found, but can be installed with:
sudo apt install rdma-core
Scientific Linux release 6.10 with Kernel 3.10.33-rt32.33.el6rt.x86_64 has neither the siw nor rdma_rxe modules.
Can't see any user space providers for siw or rdma_rxe - the user space providers are split into individual packages.
Potential user space providers:
[mr_halfword@sandy-centos ~]$ ls /usr/lib64/*rdma*
/usr/lib64/libcxgb3-rdmav2.so /usr/lib64/libnes-rdmav2.so
/usr/lib64/libipathverbs-rdmav2.so /usr/lib64/librdmacm.so.1
/usr/lib64/libmlx4-rdmav2.so /usr/lib64/librdmacm.so.1.0.0
/usr/lib64/libmthca-rdmav2.so
Ubuntu 20.04.4 LTS with Kernel 5.4.0-122-generic has modules rdma_rxe and siw.
The user space providers:
mr_halfword@haswell-ubuntu:~$ ls /usr/lib/x86_64-linux-gnu/libibverbs
libbnxt_re-rdmav25.so libi40iw-rdmav25.so libocrdma-rdmav25.so
libcxgb4-rdmav25.so libipathverbs-rdmav25.so libqedr-rdmav25.so
libefa-rdmav25.so libmlx4-rdmav25.so librxe-rdmav25.so
libhfi1verbs-rdmav25.so libmlx5-rdmav25.so libsiw-rdmav25.so
libhns-rdmav25.so libmthca-rdmav25.so libvmw_pvrdma-rdmav25.so
The description of the package for the providers:
mr_halfword@haswell-ubuntu:~$ apt-cache show ibverbs-providers
Package: ibverbs-providers
Architecture: amd64
Version: 28.0-1ubuntu1
Multi-Arch: same
Priority: optional
Section: net
Source: rdma-core
Origin: Ubuntu
Maintainer: Ubuntu Developers <[email protected]>
Original-Maintainer: Benjamin Drung <[email protected]>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 889
Provides: libefa1, libipathverbs1, libmlx4-1, libmlx5-1, libmthca1
Depends: libc6 (>= 2.14), libibverbs1 (>= 25)
Breaks: libipathverbs1 (<< 15), libmlx4-1 (<< 15), libmlx5-1 (<< 15), libmthca1 (<< 15)
Replaces: libipathverbs1 (<< 15), libmlx4-1 (<< 15), libmlx5-1 (<< 15), libmthca1 (<< 15)
Filename: pool/main/r/rdma-core/ibverbs-providers_28.0-1ubuntu1_amd64.deb
Size: 232488
MD5sum: 60767d1fb12066de804772ff2b995e02
SHA1: 4624689436fc5b3d83ca107cebebca97c7667af0
SHA256: df4907eee048b15c2816ad72262a3867fe89810e1d0353410243ccb07d989c55
Homepage: https://github.com/linux-rdma/rdma-core
Description-en_GB: User space provider drivers for libibverbs
libibverbs is a library that allows userspace processes to use RDMA
"verbs" as described in the InfiniBand Architecture Specification and the
RDMA Protocol Verbs Specification. iWARP ethernet NICs support RDMA over
hardware-offloaded TCP/IP, while InfiniBand is a high-throughput, low-
latency networking technology. InfiniBand host channel adapters (HCAs)
and iWARP NICs commonly support direct hardware access from userspace
(kernel bypass), and libibverbs supports this when available.
.
A RDMA driver consists of a kernel portion and a user space portion. This
package contains the user space verbs drivers:
.
- bnxt_re: Broadcom NetXtreme-E RoCE HCAs
- cxgb4: Chelsio T4 iWARP HCAs
- efa: Amazon Elastic Fabric Adapter
- hfi1verbs: Intel Omni-Path HFI
- hns: HiSilicon Hip06 SoC
- i40iw: Intel Ethernet Connection X722 RDMA
- ipathverbs: QLogic InfiniPath HCAs
- mlx4: Mellanox ConnectX-3 InfiniBand HCAs
- mlx5: Mellanox Connect-IB/X-4+ InfiniBand HCAs
- mthca: Mellanox InfiniBand HCAs
- ocrdma: Emulex OneConnect RDMA/RoCE device
- qedr: QLogic QL4xxx RoCE HCAs
- rxe: A software implementation of the RoCE protocol
- siw: A software implementation of the iWarp protocol
- vmw_pvrdma: VMware paravirtual RDMA device
Description-md5: 9721015313d569f3a2a4e5be9c7c4152
Task: samba-server, ubuntustudio-video
For an initial test tried:
- Running Soft-RoCE in an AlmaLinux 8.6 PC which only had an on-board 1GbE Intel 82579V network device, with no RDMA hardware support.
- Using a PC running Ubuntu 18.04 with a ConnectX-2 VPI adapter fitted, with the link layer set to Ethernet. This has RoCEv1 RDMA support.
On the AlmaLinux PC, after it had booted, there were initially no RDMA devices, as expected:
[mr_halfword@haswell-alma ~]$ rdma link show
<<<no output>>>
Add the rxe0 Soft-RoCE device on the 1GbE network interface eno1:
[mr_halfword@haswell-alma ~]$ sudo rdma link add rxe0 type rxe netdev eno1
[sudo] password for mr_halfword:
Which shows up:
[mr_halfword@haswell-alma ~]$ rdma link show
link rxe0/1 state ACTIVE physical_state LINK_UP netdev eno1
The mtu on the underlying eno1 netdev has been left at the default of 1500, so as expected the RoCE device has an MTU of 1024:
[mr_halfword@haswell-alma ~]$ ibv_devinfo
hca_id: rxe0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 0222:15ff:fea9:f56b
sys_image_guid: 0222:15ff:fea9:f56b
vendor_id: 0xffffff
vendor_part_id: 0
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
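The 1500 to 1024 step can be explained by the RoCEv2 encapsulation overhead. A hedged sketch of the calculation (the 44-byte figure assumes IPv4 headers of IP 20 + UDP 8 + BTH 12 + ICRC 4; the exact reserve the rxe driver applies may differ slightly):

```shell
# Why a 1500-byte netdev MTU gives a 1024-byte RoCE active_mtu.
# The active IB MTU is the largest of the discrete IB MTU values whose
# payload plus the assumed 44 bytes of RoCEv2 encapsulation still fits.
roce_active_mtu() {
    netdev_mtu=$1
    overhead=44
    best=256
    for ib_mtu in 256 512 1024 2048 4096; do
        [ $(( ib_mtu + overhead )) -le "$netdev_mtu" ] && best=$ib_mtu
    done
    echo "$best"
}

roce_active_mtu 1500   # -> 1024
roce_active_mtu 9000   # -> 4096
```

By this estimate the underlying netdev would need jumbo frames (an MTU of roughly 4140 or more) before the rxe device could report an active_mtu of 4096.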
The GIDs on the rxe0 Soft-RoCE device are all RoCEv2:
[mr_halfword@haswell-alma ~]$ ~/mlnx-tools/ofed_scripts/show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
rxe0 1 0 fe80:0000:0000:0000:0222:15ff:fea9:f56b v2 eno1
rxe0 1 1 0000:0000:0000:0000:0000:ffff:c0a8:0078 192.168.0.120 v2 eno1
n_gids_found=2
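GID index 1 above is an IPv4-mapped GID. A small helper (illustrative name; assumes the full 32-hex-digit colon-separated form that show_gids prints) to recover the dotted IPv4 address:

```shell
# Convert an IPv4-mapped GID such as
# 0000:0000:0000:0000:0000:ffff:c0a8:0078 to dotted IPv4 notation.
# The last four bytes of the GID hold the address.
gid_to_ipv4() {
    g=$(printf '%s' "$1" | tr -d :)            # 32 hex digits
    a=$(printf '%s' "$g" | cut -c25-26)
    b=$(printf '%s' "$g" | cut -c27-28)
    c=$(printf '%s' "$g" | cut -c29-30)
    d=$(printf '%s' "$g" | cut -c31-32)
    echo "$(( 0x$a )).$(( 0x$b )).$(( 0x$c )).$(( 0x$d ))"
}

gid_to_ipv4 0000:0000:0000:0000:0000:ffff:c0a8:0078   # -> 192.168.0.120
```
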
Since the ConnectX-2 VPI only supports RoCEv1, it can't communicate with the rxe0 Soft-RoCE device.
man rdma_rxe says:
The rdma_rxe kernel module provides a software implementation of the RoCEv2 protocol. The RoCEv2 protocol is an RDMA transport protocol that exists on top of UDP/IPv4 or UDP/IPv6. The InfiniBand (IB) Base Transport Header (BTH) is encapsulated in the UDP packet.
For this test, reversed the roles of the two PCs:
- Fitted a Mellanox ConnectX-4 Lx to the AlmaLinux 8.6 PC, which supports hardware RDMA with RoCEv1 or RoCEv2.
- Used the Ubuntu 18.04 PC to run Soft-RoCE. The ConnectX-2 VPI fitted was left at its reset default of ports in Infiniband mode. The on-board 1GbE Intel I218-LM network interface is used for Soft-RoCE.
On the PC with the ConnectX-4 bring up the links:
[mr_halfword@haswell-alma ibv_message_passing]$ sudo ./alma_mlx5_eth_setup.sh
[sudo] password for mr_halfword:
Waiting for enp1s0f0 duplicate address detection to complete
Waiting for enp1s0f1 duplicate address detection to complete
[mr_halfword@haswell-alma ibv_message_passing]$ rdma link show
link rocep1s0f0/1 state ACTIVE physical_state LINK_UP netdev enp1s0f0
link rocep1s0f1/1 state ACTIVE physical_state LINK_UP netdev enp1s0f1
[mr_halfword@haswell-alma ibv_message_passing]$ ibv_devinfo
hca_id: rocep1s0f0
transport: InfiniBand (0)
fw_ver: 14.32.1010
node_guid: 9803:9b03:0077:e152
sys_image_guid: 9803:9b03:0077:e152
vendor_id: 0x02c9
vendor_part_id: 4117
hw_ver: 0x0
board_id: MT_2420110004
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: rocep1s0f1
transport: InfiniBand (0)
fw_ver: 14.32.1010
node_guid: 9803:9b03:0077:e153
sys_image_guid: 9803:9b03:0077:e152
vendor_id: 0x02c9
vendor_part_id: 4117
hw_ver: 0x0
board_id: MT_2420110004
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
Which has RoCEv1 and RoCEv2 GIDs:
[mr_halfword@haswell-alma ibv_message_passing]$ ~/mlnx-tools/ofed_scripts/show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
rocep1s0f0 1 0 fe80:0000:0000:0000:9a03:9bff:fe77:e152 v1 enp1s0f0
rocep1s0f0 1 1 fe80:0000:0000:0000:9a03:9bff:fe77:e152 v2 enp1s0f0
rocep1s0f1 1 0 fe80:0000:0000:0000:9a03:9bff:fe77:e153 v1 enp1s0f1
rocep1s0f1 1 1 fe80:0000:0000:0000:9a03:9bff:fe77:e153 v2 enp1s0f1
n_gids_found=4
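When scripting tests against ports with mixed v1/v2 GIDs, it can help to select a GID index by RoCE version. A hedged helper that parses show_gids-style output on stdin (the column layout above is assumed; the function name is illustrative):

```shell
# Print the first GID index on the given device whose RoCE version matches.
# The VER column can be field 5 or 6 depending on whether an IPv4 column
# is present, so scan the tail of each matching line for v1/v2.
pick_gid_index() {
    awk -v d="$1" -v v="$2" \
        '$1 == d { for (i = 5; i <= NF; i++) if ($i == v) { print $3; exit } }'
}

echo "rocep1s0f0 1 1 fe80:0000:0000:0000:9a03:9bff:fe77:e152 v2 enp1s0f0" \
    | pick_gid_index rocep1s0f0 v2   # -> 1
```
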
On the PC for Soft-RoCE, initially only the Infiniband RDMA devices are present (links down, as not connected to an Infiniband switch):
mr_halfword@Haswell-Ubuntu:~$ rdma link show
1/1: mlx4_0/1: subnet_prefix fe80:0000:0000:0000 lid 0 sm_lid 0 lmc 0 state DOWN physical_state POLLING
1/2: mlx4_0/2: subnet_prefix fe80:0000:0000:0000 lid 0 sm_lid 0 lmc 0 state DOWN physical_state POLLING
Attempted to run rdma link add as had been done with AlmaLinux 8.6, but that command isn't supported by the version of the rdma command installed with Ubuntu 18.04:
mr_halfword@Haswell-Ubuntu:~$ sudo rdma link add rxe0 type rxe netdev eno1
[sudo] password for mr_halfword:
Sorry, try again.
mr_halfword@Haswell-Ubuntu:~$ rdma link help
Usage: rdma link show [DEV/PORT_INDEX]
Therefore, install the package to get the rxe_cfg script on Ubuntu 18.04:
mr_halfword@Haswell-Ubuntu:~$ sudo apt install rdma-core
[sudo] password for mr_halfword:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
binutils-aarch64-linux-gnu cpp-8-aarch64-linux-gnu
gcc-8-aarch64-linux-gnu-base gcc-8-cross-base libasan5-arm64-cross
libatomic1-arm64-cross libc6-arm64-cross libc6-dev-arm64-cross
libgcc-8-dev-arm64-cross libgcc1-arm64-cross libgnat-8-arm64-cross
libgomp1-arm64-cross libitm1-arm64-cross libllvm6.0:i386 libllvm7
libllvm7:i386 libllvm8 libllvm8:i386 libllvm9 libllvm9:i386
liblsan0-arm64-cross libminizip1 libqt4-opengl libqtwebkit4
libstdc++6-arm64-cross libtsan0-arm64-cross libubsan1-arm64-cross
linux-libc-dev-arm64-cross shim
Use 'sudo apt autoremove' to remove them.
The following NEW packages will be installed
rdma-core
0 to upgrade, 1 to newly install, 0 to remove and 1 not to upgrade.
Need to get 56.7 kB of archives.
After this operation, 195 kB of additional disk space will be used.
Get:1 http://gb.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 rdma-core amd64 17.1-1ubuntu0.2 [56.7 kB]
Fetched 56.7 kB in 0s (748 kB/s)
Selecting previously unselected package rdma-core.
(Reading database ... 309037 files and directories currently installed.)
Preparing to unpack .../rdma-core_17.1-1ubuntu0.2_amd64.deb ...
Unpacking rdma-core (17.1-1ubuntu0.2) ...
Setting up rdma-core (17.1-1ubuntu0.2) ...
rdma-hw.target is a disabled or a static unit, not starting it.
rdma-ndd.service is a disabled or a static unit, not starting it.
Processing triggers for systemd (237-3ubuntu10.53) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Processing triggers for ureadahead (0.100.0-21) ...
ureadahead will be reprofiled on next reboot
Initially the rxe_cfg script shows no Soft-RoCE device:
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg
[sudo] password for mr_halfword:
rdma_rxe module not loaded
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
eno1 yes e1000e
Started rxe, which loaded the rdma_rxe module:
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg start
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
eno1 yes e1000e
mr_halfword@Haswell-Ubuntu:~$ lsmod | grep rxe
rdma_rxe 114688 0
ip6_udp_tunnel 16384 1 rdma_rxe
udp_tunnel 16384 1 rdma_rxe
ib_core 225280 14 rdma_cm,ib_ipoib,rdma_rxe,iw_cxgb4,rpcrdma,mlx4_ib,iw_cm,ib_iser,ib_umad,iw_cxgb3,iw_nes,rdma_ucm,ib_uverbs,ib_cm
Add a rxe device to the 1GbE network device eno1:
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg add eno1
Which shows up:
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg status
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
eno1 yes e1000e rxe0 1024 (3)
Since the underlying network device eno1 has the default mtu of 1500, as expected the MTU on the rxe0 device is 1024:
mr_halfword@Haswell-Ubuntu:~$ ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.9.1000
node_guid: 0002:c903:0050:4174
sys_image_guid: 0002:c903:0050:4177
vendor_id: 0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id: MT_0FC0110009
phys_port_cnt: 2
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand
port: 2
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand
hca_id: rxe0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: eeb1:d7ff:fe3e:5ff1
sys_image_guid: eeb1:d7ff:fe3e:5ff1
vendor_id: 0x0000
vendor_part_id: 0
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
There are multiple RoCEv2 GIDs allocated to the rxe0 device:
mr_halfword@Haswell-Ubuntu:~$ ~/mlnx-tools/ofed_scripts/show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx4_0 1 0 fe80:0000:0000:0000:0002:c903:0050:4175
mlx4_0 2 0 fe80:0000:0000:0000:0002:c903:0050:4176
rxe0 1 0 fe80:0000:0000:0000:eeb1:d7ff:fe3e:5ff1 v2 eno1
rxe0 1 1 0000:0000:0000:0000:0000:ffff:c0a8:0076 192.168.0.118 v2 eno1
rxe0 1 2 fe80:0000:0000:0000:ac79:9ebc:a4c0:a3c2 v2 eno1
n_gids_found=5
GID index zero has been allocated based upon the MAC address of the underlying network device eno1:
mr_halfword@Haswell-Ubuntu:~$ ip addr show dev eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether ec:b1:d7:3e:5f:f1 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.118/24 brd 192.168.0.255 scope global noprefixroute eno1
valid_lft forever preferred_lft forever
inet6 fe80::ac79:9ebc:a4c0:a3c2/64 scope link noprefixroute
valid_lft forever preferred_lft forever
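GID index zero is the EUI-64 form of the eno1 MAC address: the two halves of the MAC are joined with ff:fe and the universal/local bit of the first octet is flipped. A sketch of the derivation (illustrative helper name; zero-group compression of the result isn't handled, which this example doesn't need):

```shell
# Derive the EUI-64 based IPv6 link-local address from a MAC address.
mac_to_eui64_ll() {
    mac=$1
    oldIFS=$IFS; IFS=:; set -- $mac; IFS=$oldIFS     # split into 6 octets
    first=$(printf '%02x' $(( 0x$1 ^ 0x02 )))        # flip the U/L bit
    printf 'fe80::%s%s:%sff:fe%s:%s%s\n' "$first" "$2" "$3" "$4" "$5" "$6"
}

mac_to_eui64_ll ec:b1:d7:3e:5f:f1   # -> fe80::eeb1:d7ff:fe3e:5ff1
```

This matches the rxe0 GID index zero shown above for the eno1 MAC of ec:b1:d7:3e:5f:f1.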
After bringing up the Soft-RoCE device on the Ubuntu PC as above, tried to run the ibv_message_bw test program.
On the AlmaLinux PC with the ConnectX-4:
[mr_halfword@haswell-alma debug]$ ibv_message_bw/ibv_message_bw --thread=rx:0 --ib-dev=rocep1s0f0 --ib-port=1
Rx_0 connected
On the Ubuntu PC with Soft-RoCE:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/debug$ ibv_message_bw/ibv_message_bw --thread=tx:0 --ib-dev=rxe0 --ib-port=1
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Tx_0 connected
Both ends successfully initialised the RDMA Queue-Pairs, but no data was exchanged.
With ibv_message_bw initialised, but not transferring any data, checked for errors. On the AlmaLinux PC:
[mr_halfword@haswell-alma debug]$ ibv_display_local_infiniband_device_statistics/ibv_display_local_infiniband_device_statistics
Press Ctrl-C to stop the Infiniband port statistics collection
Statistics after 10 seconds
Counter name Device Port Counter value(delta)
unicast_rcv_packets rocep1s0f0 1 32576(+1184)
port_xmit_data rocep1s0f0 1 11885
port_rcv_packets rocep1s0f0 1 32576(+1184)
unicast_xmit_packets rocep1s0f0 1 594
port_rcv_data rocep1s0f0 1 895840(+32560)
port_xmit_packets rocep1s0f0 1 594
local_ack_timeout_err rocep1s0f0 1 5
req_cqe_error rocep1s0f0 1 16
req_cqe_flush_error rocep1s0f0 1 15
roce_adp_retrans rocep1s0f0 1 9
rx_write_requests rocep1s0f0 1 16
On the Ubuntu PC:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/debug$ ibv_display_local_infiniband_device_statistics/ibv_display_local_infiniband_device_statistics
Press Ctrl-C to stop the Infiniband port statistics collection
Statistics after 10 seconds
Counter name Device Port Counter value(delta)
completer_retry_err rxe0 1 1387(+74)
sent_pkts rxe0 1 22208(+1184)
Therefore, the Soft-RoCE device is reporting increasing retry errors.
Realised that when the ibv_message_bw program was first modified to support RoCE it used a fixed GID index of zero.
On the two PCs:
- On the one with the ConnectX-4 GID index zero is RoCEv1
- On the one with the Soft-RoCE GID index zero is RoCEv2
Both ends trying to use a different RoCE protocol version probably explains the issue.
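This mismatch can be checked mechanically before starting a test. A hedged helper (illustrative name; show_gids column layout assumed) that reports the RoCE version of a given device and GID index:

```shell
# Print the RoCE version (v1 or v2) of one device/GID-index from
# show_gids-style lines on stdin. The VER column position varies with
# the optional IPv4 column, so scan the tail of the matching line.
gid_version() {
    awk -v d="$1" -v i="$2" \
        '$1 == d && $3 == i { for (f = 5; f <= NF; f++) if ($f ~ /^v[12]$/) { print $f; exit } }'
}

echo "rocep1s0f0 1 0 fe80:0000:0000:0000:9a03:9bff:fe77:e152 v1 enp1s0f0" \
    | gid_version rocep1s0f0 0   # -> v1
```

Applied to the two show_gids tables above, it gives v1 for rocep1s0f0 index 0 but v2 for rxe0 index 0, i.e. the two fixed index-zero choices were incompatible.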
In section 7.1, when the rxe0 device was initially created, the GID indices were:
mr_halfword@Haswell-Ubuntu:~$ ~/mlnx-tools/ofed_scripts/show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx4_0 1 0 fe80:0000:0000:0000:0002:c903:0050:4175
mlx4_0 2 0 fe80:0000:0000:0000:0002:c903:0050:4176
rxe0 1 0 fe80:0000:0000:0000:eeb1:d7ff:fe3e:5ff1 v2 eno1
rxe0 1 1 0000:0000:0000:0000:0000:ffff:c0a8:0076 192.168.0.118 v2 eno1
rxe0 1 2 fe80:0000:0000:0000:ac79:9ebc:a4c0:a3c2 v2 eno1
n_gids_found=5
At the time of initial rxe0 creation the network cable was connected to a PlusNet Hub One router, and the underlying network device had autonegotiated pause as off:
mr_halfword@Haswell-Ubuntu:~$ ethtool -a eno1
Pause parameters for eno1:
Autonegotiate: on
RX: off
TX: off
The PlusNet Hub One router doesn't seem to have an interface which allows the pause setting of the built-in 4-port GbE switch to be examined or changed.
Therefore, unplugged the cable and plugged it into a port of a tp-link T1700-28TQ on which flow-control had been enabled. The underlying network device then autonegotiated to pause on:
mr_halfword@Haswell-Ubuntu:~$ ethtool -a eno1
Pause parameters for eno1:
Autonegotiate: on
RX: on
TX: on
However, the GID indices had changed following taking the link down and then back up; GID indices 1 and 2 have reversed:
mr_halfword@Haswell-Ubuntu:~$ ~/mlnx-tools/ofed_scripts/show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx4_0 1 0 fe80:0000:0000:0000:0002:c903:0050:4175
mlx4_0 2 0 fe80:0000:0000:0000:0002:c903:0050:4176
rxe0 1 0 fe80:0000:0000:0000:eeb1:d7ff:fe3e:5ff1 v2 eno1
rxe0 1 1 fe80:0000:0000:0000:ac79:9ebc:a4c0:a3c2 v2 eno1
rxe0 1 2 0000:0000:0000:0000:0000:ffff:c0a8:0076 192.168.0.118 v2 eno1
n_gids_found=5
The ibv_message_bw program has been modified to take a --ib-gid parameter to select the GID index to be used.
On the Ubuntu PC with the rxe0 Soft-RoCE device initially tried GID index zero which is RoCEv2:
in/debug$ ibv_message_bw/ibv_message_bw --thread=rx:0 --ib-dev=rxe0 --ib-port=1 --ib-gid=0
And when tried to connect from the AlmaLinux PC with the ConnectX-4 using GID index one, which is also RoCEv2, got the error:
[mr_halfword@haswell-alma debug]$ ibv_message_bw/ibv_message_bw --thread=tx:0 --ib-dev=rocep1s0f0 --ib-port=1 --ib-gid=1
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
ibv_modify_qp message_send_qp failed: Connection timed out
As in the previous section, the rxe0 interface has two IPv6 GIDs:
- 0 : which is link-scope and based upon the MAC address of the underlying eno1 device
- 1 : which is link-scope, but not sure where it has been allocated from
On the underlying device eno1 only the IPv6 address corresponding to GID index 1 is assigned:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/debug$ ip addr show dev eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether ec:b1:d7:3e:5f:f1 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.118/24 brd 192.168.0.255 scope global noprefixroute eno1
valid_lft forever preferred_lft forever
inet6 fe80::ac79:9ebc:a4c0:a3c2/64 scope link noprefixroute
valid_lft forever preferred_lft forever
On the AlmaLinux PC can ping the IPv6 address for the Ubuntu PC GID index one:
[mr_halfword@haswell-alma debug]$ ping fe80::ac79:9ebc:a4c0:a3c2%enp1s0f0
PING fe80::ac79:9ebc:a4c0:a3c2%enp1s0f0(fe80::ac79:9ebc:a4c0:a3c2%enp1s0f0) 56 data bytes
64 bytes from fe80::ac79:9ebc:a4c0:a3c2%enp1s0f0: icmp_seq=1 ttl=64 time=0.222 ms
64 bytes from fe80::ac79:9ebc:a4c0:a3c2%enp1s0f0: icmp_seq=2 ttl=64 time=0.223 ms
64 bytes from fe80::ac79:9ebc:a4c0:a3c2%enp1s0f0: icmp_seq=3 ttl=64 time=0.228 ms
^C
--- fe80::ac79:9ebc:a4c0:a3c2%enp1s0f0 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2038ms
rtt min/avg/max/mdev = 0.222/0.224/0.228/0.012 ms
But not the IPv6 address for the Ubuntu PC GID index zero:
[mr_halfword@haswell-alma debug]$ ping fe80::eeb1:d7ff:fe3e:5ff1%enp1s0f0
PING fe80::eeb1:d7ff:fe3e:5ff1%enp1s0f0(fe80::eeb1:d7ff:fe3e:5ff1%enp1s0f0) 56 data bytes
^C
--- fe80::eeb1:d7ff:fe3e:5ff1%enp1s0f0 ping statistics ---
29 packets transmitted, 0 received, 100% packet loss, time 28703ms
With the Ubuntu PC using GID index 1 (RoCEv2 with a reachable IPv6 address):
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/debug$ ibv_message_bw/ibv_message_bw --thread=rx:0 --ib-dev=rxe0 --ib-port=1 --ib-gid=1
Rx_0 connected
Rx_0 received 295698432 data bytes in 282 messages over last 10.0 seconds
Rx_0 received 1136656384 data bytes in 1084 messages over last 10.0 seconds
Rx_0 Total data bytes 2307915776 over 20.298034 seconds; 113.7 Mbytes/second
Rx_0 Total messages 2201 over 20.298034 seconds; 108 messages/second
Rx_0 Min message size=1048576 max message size=1048576 data verification=no
Rx_0 minor page faults=39 (4182 -> 4221)
Rx_0 major page faults=0 (0 -> 0)
Rx_0 voluntary context switches=0 (43 -> 43)
Rx_0 involuntary context switches=1650 (0 -> 1650)
Rx_0 user time=13.789141 system time=6.522378
And with the AlmaLinux PC using GID index 1 (RoCEv2) communication was possible:
[mr_halfword@haswell-alma debug]$ ibv_message_bw/ibv_message_bw --thread=tx:0 --ib-dev=rocep1s0f0 --ib-port=1 --ib-gid=1
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Tx_0 connected
Tx_0 transmitted 1003487232 data bytes in 957 messages over last 10.0 seconds
Tx_0 transmitted 1136656384 data bytes in 1084 messages over last 10.0 seconds
^C
Tx_0 Total data bytes 2307915776 over 20.319685 seconds; 113.6 Mbytes/second
Tx_0 Total messages 2201 over 20.319685 seconds; 108 messages/second
Tx_0 Min message size=1048576 max message size=1048576 data verification=no
Tx_0 minor page faults=1 (4229 -> 4230)
Tx_0 major page faults=0 (0 -> 0)
Tx_0 voluntary context switches=0 (63 -> 63)
Tx_0 involuntary context switches=147 (0 -> 147)
Tx_0 user time=20.313240 system time=0.000000
7.5 Changing NetworkManager configuration to generate an IPv6 link-scope address based upon the MAC address
On the Ubuntu PC the underlying network device eno1 has addr_gen_mode set to 1:
mr_halfword@Haswell-Ubuntu:~$ sysctl net.ipv6.conf.eno1.addr_gen_mode
net.ipv6.conf.eno1.addr_gen_mode = 1
According to ip-sysctl.txt this means:
1: do no generate a link-local address, use EUI64 for addresses generated from autoconf
Tried changing net.ipv6.conf.eno1.addr_gen_mode to 0, and taking the interface down and back up, but net.ipv6.conf.eno1.addr_gen_mode was restored to 1.
What is setting my IPv6 addr_gen_mode? explains this is the behaviour of NetworkManager, which is controlling the eno1 interface.
How do I get a stable IPv6 address in 16.04? has some more information.
Edited the /etc/NetworkManager/system-connections/on_board_ethernet file for the connection, and in the [ipv6] section changed from:
addr-gen-mode=stable-privacy
To:
addr-gen-mode=eui64
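An alternative to editing the keyfile by hand is nmcli, which exposes the same setting as the ipv6.addr-gen-mode property. A hedged sketch (the connection name on_board_ethernet is specific to this setup, and the helper falls back to a message when nmcli is unavailable or the change fails):

```shell
# Set the IPv6 address generation mode of a NetworkManager connection.
set_addr_gen_mode() {
    con=$1; mode=$2      # e.g. set_addr_gen_mode on_board_ethernet eui64
    if command -v nmcli >/dev/null 2>&1; then
        nmcli connection modify "$con" ipv6.addr-gen-mode "$mode" \
            && nmcli connection up "$con" \
            || echo "nmcli could not modify connection '$con'"
    else
        echo "nmcli not available; edit the keyfile by hand as above"
    fi
}

set_addr_gen_mode on_board_ethernet eui64
```
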
After re-booting, running rxe_cfg start got some errors:
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg start
[sudo] password for mr_halfword:
sh: line 0: echo: write error: Invalid argument
sh: line 0: echo: write error: Invalid argument
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
eno1 yes e1000e rxe0 1024 (3)
Realised that when some incorrect arguments had previously been given to rxe_cfg, the -n option (to indicate not a persistent change) hadn't been used, and therefore the file which stores the persistent configuration had some incorrect underlying network device names stored:
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg persistent
rxe0
igb_1
eno1
Therefore, removed the persistent file and then rebooted:
mr_halfword@Haswell-Ubuntu:~$ sudo rm /var/lib/rxe/rxe
After a reboot, and before the rxe0 Soft-RoCE device had been created, the eno1 device now had an IPv6 link-scope address based upon the MAC address:
mr_halfword@Haswell-Ubuntu:~$ ip addr show dev eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether ec:b1:d7:3e:5f:f1 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.118/24 brd 192.168.0.255 scope global noprefixroute eno1
valid_lft forever preferred_lft forever
inet6 fe80::eeb1:d7ff:fe3e:5ff1/64 scope link noprefixroute
valid_lft forever preferred_lft forever
Loaded the RXE modules (with no persistent instances):
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg start
[sudo] password for mr_halfword:
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
eno1 yes e1000e
Added rxe0 as a Soft-RoCE device using the underlying device eno1 (using -n so it is not added as a persistent instance):
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg -n add eno1
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg status
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
eno1 yes e1000e rxe0 1024 (3)
mr_halfword@Haswell-Ubuntu:~$ sudo rxe_cfg persistent
<<<no output>>>
Now, GID index 0 is based upon the MAC address (and is the only IPv6 GID):
mr_halfword@Haswell-Ubuntu:~$ ~/mlnx-tools/ofed_scripts/show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx4_0 1 0 fe80:0000:0000:0000:0002:c903:0050:4175
mlx4_0 2 0 fe80:0000:0000:0000:0002:c903:0050:4176
rxe0 1 0 fe80:0000:0000:0000:eeb1:d7ff:fe3e:5ff1 v2 eno1
rxe0 1 1 0000:0000:0000:0000:0000:ffff:c0a8:0076 192.168.0.118 v2 eno1
n_gids_found=4
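The GID at index 0 equals the interface's IPv6 link-local address because that address is derived from the MAC via modified EUI-64 (flip the universal/local bit of the first octet and insert ff:fe in the middle). A minimal sketch of the derivation (the helper name is invented for illustration):

```python
def mac_to_link_local_gid(mac: str) -> str:
    """Derive the IPv6 link-local GID from a MAC address (modified EUI-64)."""
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02                                 # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]    # insert ff:fe in the middle
    words = [(eui64[i] << 8) | eui64[i + 1] for i in range(0, 8, 2)]
    return "fe80:0000:0000:0000:" + ":".join(f"{w:04x}" for w in words)

# The eno1 MAC address from the 'ip addr show' output above
print(mac_to_link_local_gid("ec:b1:d7:3e:5f:f1"))    # fe80:0000:0000:0000:eeb1:d7ff:fe3e:5ff1
```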
And a test is successful, at the Soft-RoCE end:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/debug$ ibv_message_bw/ibv_message_bw --thread=rx:0 --ib-dev=rxe0 --ib-port=1 --ib-gid=0
Rx_0 connected
Rx_0 received 740294656 data bytes in 706 messages over last 10.0 seconds
Rx_0 received 1136656384 data bytes in 1084 messages over last 10.0 seconds
Rx_0 received 1136656384 data bytes in 1084 messages over last 10.0 seconds
Rx_0 Total data bytes 3358588928 over 29.542623 seconds; 113.7 Mbytes/second
Rx_0 Total messages 3203 over 29.542623 seconds; 108 messages/second
Rx_0 Min message size=1048576 max message size=1048576 data verification=no
Rx_0 minor page faults=66 (4182 -> 4248)
Rx_0 major page faults=0 (0 -> 0)
Rx_0 voluntary context switches=0 (102 -> 102)
Rx_0 involuntary context switches=2193 (1 -> 2194)
Rx_0 user time=20.879298 system time=8.675730
And at the ConnectX-4 end:
[mr_halfword@haswell-alma debug]$ ibv_message_bw/ibv_message_bw --thread=tx:0 --ib-dev=rocep1s0f0 --ib-port=1 --ib-gid=1
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Tx_0 connected
Tx_0 transmitted 996147200 data bytes in 950 messages over last 10.0 seconds
Tx_0 transmitted 1136656384 data bytes in 1084 messages over last 10.0 seconds
Tx_0 transmitted 1136656384 data bytes in 1084 messages over last 10.0 seconds
^C
Tx_0 Total data bytes 3358588928 over 29.564085 seconds; 113.6 Mbytes/second
Tx_0 Total messages 3203 over 29.564085 seconds; 108 messages/second
Tx_0 Min message size=1048576 max message size=1048576 data verification=no
Tx_0 minor page faults=2 (4232 -> 4234)
Tx_0 major page faults=0 (0 -> 0)
Tx_0 voluntary context switches=0 (63 -> 63)
Tx_0 involuntary context switches=15 (0 -> 15)
Tx_0 user time=29.614262 system time=0.000000
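The summary lines can be cross-checked from the message counts, since every message was the default 1 MiB (1048576 bytes) and the program reports decimal Mbytes/second:

```python
def mbytes_per_second(total_messages: int, message_size_bytes: int, seconds: float) -> float:
    """Throughput in decimal Mbytes/second, matching how the summaries above are reported."""
    return (total_messages * message_size_bytes) / seconds / 1e6

# Totals from the Rx_0 summary above: 3203 messages of 1048576 bytes in 29.542623 seconds
print(round(mbytes_per_second(3203, 1048576, 29.542623), 1))   # 113.7
```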
For this test the setup was:
- AlmaLinux 8.6 PC with a ConnectX-4 Lx connected to a tp-link T1700G-28TQ switch with a 10G SFP+ port. While it is a dual port card, only one port was used for this test.
- Ubuntu 18.04 PC using Soft-RoCE over the on-board 1GbE RJ45 port, connected to the same switch.
The ibv_message_bw program was used to measure the achieved throughput when using the default options (1 MiB messages).
With the Soft-RoCE end having a 1GbE port, and the ConnectX-4 end having a 10GbE port, the mismatch in port bandwidth is a way of testing the effect of Ethernet packet loss on throughput.
The effect of the MTU on the achieved throughput wasn't investigated in this test. The settings were left as:
- In the tp-link T1700G-28TQ switch the jumbo packet size was left at the maximum of 9216 bytes
- In the ConnectX-4, the alma_mlx5_eth_setup.sh script set the MTU on the ConnectX-4 ports to 9216 (the maximum in the switch), and so the RoCE MTU was the maximum of 4096.
- The Intel I218-LM used for the Soft-RoCE had its MTU at the default of 1500, so the RoCE MTU on the `rxe0` device is 1024
- The ibv_message_bw program ends up using an MTU of 1024, since that is the largest supported by both endpoints.
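The way the RDEV MTU tracks the netdev MTU can be sketched as below. The 44-byte overhead (IPv4 + UDP + BTH + ICRC) is an assumption about what the rxe driver reserves per packet, but it reproduces the 1500 -> 1024 and 9000 -> 4096 mappings seen in these notes:

```python
IB_MTUS = [256, 512, 1024, 2048, 4096]     # the valid InfiniBand MTU enum values
ROCEV2_OVERHEAD = 20 + 8 + 12 + 4          # IPv4 + UDP + BTH + ICRC (assumed; ignores extended headers)

def roce_mtu(netdev_mtu: int) -> int:
    """Largest IB MTU whose payload plus RoCEv2 headers fits in the netdev MTU."""
    return max(m for m in IB_MTUS if m + ROCEV2_OVERHEAD <= netdev_mtu)

print(roce_mtu(1500))   # 1024 - the default Ethernet MTU case
print(roce_mtu(9000))   # 4096 - the jumbo frame case later in these notes
```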
In the T1700G-28TQ switch Flow Control was set to disabled on all ports.
The ConnectX-4 has pause Autonegotiate set to off, and ethtool reports `Invalid argument` when trying to turn autonegotiation on. RX and TX pause are on by default, but with flow control disabled in the switch the RX and TX pause won't have any effect:
[mr_halfword@haswell-alma ~]$ ethtool -a enp1s0f0
Pause parameters for enp1s0f0:
Autonegotiate: off
RX: on
TX: on
The underlying `eno1` for the Soft-RoCE has Autonegotiate set to on, and RX and TX pause have negotiated to off:
mr_halfword@Haswell-Ubuntu:~$ ethtool -a eno1
Pause parameters for eno1:
Autonegotiate: on
RX: off
TX: off
Transmit end from ibv_message_bw:
[mr_halfword@haswell-alma release]$ ibv_message_bw/ibv_message_bw --thread=tx:0 --ib-dev=rocep1s0f0 --ib-port=1 --ib-gid=1
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Tx_0 connected
Tx_0 transmitted 60817408 data bytes in 58 messages over last 10.0 seconds
Tx_0 transmitted 51380224 data bytes in 49 messages over last 10.0 seconds
Tx_0 transmitted 50331648 data bytes in 48 messages over last 10.0 seconds
Tx_0 transmitted 47185920 data bytes in 45 messages over last 10.0 seconds
Tx_0 transmitted 51380224 data bytes in 49 messages over last 10.0 seconds
Tx_0 transmitted 40894464 data bytes in 39 messages over last 10.0 seconds
^C
Tx_0 Total data bytes 324009984 over 64.933925 seconds; 5.0 Mbytes/second
Tx_0 Total messages 309 over 64.933925 seconds; 5 messages/second
Tx_0 Min message size=1048576 max message size=1048576 data verification=no
Tx_0 minor page faults=0 (4230 -> 4230)
Tx_0 major page faults=0 (0 -> 0)
Tx_0 voluntary context switches=0 (63 -> 63)
Tx_0 involuntary context switches=135 (0 -> 135)
Tx_0 user time=64.967114 system time=0.001380
Receive end from ibv_message_bw:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/release$ ibv_message_bw/ibv_message_bw --thread=rx:0 --ib-dev=rxe0 --ib-port=1 --ib-gid=0
Rx_0 connected
Rx_0 received 32505856 data bytes in 31 messages over last 10.0 seconds
Rx_0 received 50331648 data bytes in 48 messages over last 10.0 seconds
Rx_0 received 51380224 data bytes in 49 messages over last 10.0 seconds
Rx_0 received 47185920 data bytes in 45 messages over last 10.0 seconds
Rx_0 received 51380224 data bytes in 49 messages over last 10.0 seconds
Rx_0 received 50331648 data bytes in 48 messages over last 10.0 seconds
Rx_0 Total data bytes 324009984 over 64.890333 seconds; 5.0 Mbytes/second
Rx_0 Total messages 309 over 64.890333 seconds; 5 messages/second
Rx_0 Min message size=1048576 max message size=1048576 data verification=no
Rx_0 minor page faults=72 (4183 -> 4255)
Rx_0 major page faults=0 (0 -> 0)
Rx_0 voluntary context switches=0 (36 -> 36)
Rx_0 involuntary context switches=1494 (0 -> 1494)
Rx_0 user time=59.185859 system time=5.744197
Transmit end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 110 seconds
Counter name Device Port Counter value(delta)
unicast_rcv_packets rocep1s0f0 1 46402(+6600)
port_xmit_data rocep1s0f0 1 17566604380(+2352547480)
port_rcv_packets rocep1s0f0 1 46402(+6600)
unicast_xmit_packets rocep1s0f0 1 63820008(+8546593)
port_rcv_data rocep1s0f0 1 958253(+136686)
port_xmit_packets rocep1s0f0 1 63821145(+8546864)
local_ack_timeout_err rocep1s0f0 1 4(+3)
packet_seq_err rocep1s0f0 1 9880(+1317)
roce_adp_retrans rocep1s0f0 1 9
rx_write_requests rocep1s0f0 1 308(+39)
Receive end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 100 seconds
Counter name Device Port Counter value(delta)
completer_retry_err rxe0 1 332(+57)
duplicate_request rxe0 1 2883(+33)
rcvd_pkts rxe0 1 6197050(+1017295)
sent_pkts rxe0 1 43257(+7225)
out_of_sequence rxe0 1 9218(+1513)
This shows issues caused by packet loss (from the 10G transmit port to a 1G receive port), in that:
- The statistics counters at both ends show errors due to lost packets at a high rate - the `out_of_sequence` and `packet_seq_err` counts both increasing by around 1000 per 10 second statistics interval.
- The achieved payload throughput is only 5 Mbytes/second over a 1Gbit link.
This reverses the direction of the previous test.
Transmit end from ibv_message_bw:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/release$ ibv_message_bw/ibv_message_bw --thread=tx:0 --ib-dev=rxe0 --ib-port=1 --ib-gid=0
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Tx_0 connected
Tx_0 transmitted 877658112 data bytes in 837 messages over last 10.0 seconds
Tx_0 transmitted 1130364928 data bytes in 1078 messages over last 10.0 seconds
Tx_0 transmitted 1129316352 data bytes in 1077 messages over last 10.0 seconds
Tx_0 transmitted 1130364928 data bytes in 1078 messages over last 10.0 seconds
Tx_0 transmitted 1129316352 data bytes in 1077 messages over last 10.0 seconds
Tx_0 transmitted 1130364928 data bytes in 1078 messages over last 10.0 seconds
Tx_0 transmitted 1129316352 data bytes in 1077 messages over last 10.0 seconds
^C
Tx_0 Total data bytes 8248098816 over 73.001395 seconds; 113.0 Mbytes/second
Tx_0 Total messages 7866 over 73.001395 seconds; 108 messages/second
Tx_0 Min message size=1048576 max message size=1048576 data verification=no
Tx_0 minor page faults=80 (4182 -> 4262)
Tx_0 major page faults=0 (0 -> 0)
Tx_0 voluntary context switches=0 (31 -> 31)
Tx_0 involuntary context switches=3786 (0 -> 3786)
Tx_0 user time=57.949088 system time=15.012900
Receive end from ibv_message_bw:
[mr_halfword@haswell-alma release]$ ibv_message_bw/ibv_message_bw --thread=rx:0 --ib-dev=rocep1s0f0 --ib-port=1 --ib-gid=1
Rx_0 connected
Rx_0 received 982515712 data bytes in 937 messages over last 10.0 seconds
Rx_0 received 1130364928 data bytes in 1078 messages over last 10.0 seconds
Rx_0 received 1129316352 data bytes in 1077 messages over last 10.0 seconds
Rx_0 received 1130364928 data bytes in 1078 messages over last 10.0 seconds
Rx_0 received 1129316352 data bytes in 1077 messages over last 10.0 seconds
Rx_0 received 1130364928 data bytes in 1078 messages over last 10.0 seconds
Rx_0 received 1129316352 data bytes in 1077 messages over last 10.0 seconds
Rx_0 Total data bytes 8248098816 over 72.991742 seconds; 113.0 Mbytes/second
Rx_0 Total messages 7866 over 72.991742 seconds; 108 messages/second
Rx_0 Min message size=1048576 max message size=1048576 data verification=no
Rx_0 minor page faults=0 (4224 -> 4224)
Rx_0 major page faults=0 (0 -> 0)
Rx_0 voluntary context switches=0 (63 -> 63)
Rx_0 involuntary context switches=105 (0 -> 105)
Rx_0 user time=72.941151 system time=0.000000
Transmit end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 80 seconds
Counter name Device Port Counter value(delta)
completer_retry_err rxe0 1 370
duplicate_request rxe0 1 4968
rcvd_pkts rxe0 1 7016375(+19323)
sent_pkts rxe0 1 7035150(+1105524)
out_of_sequence rxe0 1 10250
Receive end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 80 seconds
Counter name Device Port Counter value(delta)
unicast_rcv_packets rocep1s0f0 1 7789178(+1105539)
port_xmit_data rocep1s0f0 1 18206434905(+400577)
port_rcv_packets rocep1s0f0 1 7789323(+1105539)
unicast_xmit_packets rocep1s0f0 1 66267827(+19329)
port_rcv_data rocep1s0f0 1 2129763445(+304036970)
port_xmit_packets rocep1s0f0 1 66267829(+19330)
local_ack_timeout_err rocep1s0f0 1 4
packet_seq_err rocep1s0f0 1 10247
roce_adp_retrans rocep1s0f0 1 9
rx_write_requests rocep1s0f0 1 15430(+2156)
unicast_rcv_packets rocep1s0f1 1 11
port_rcv_packets rocep1s0f1 1 11
port_rcv_data rocep1s0f1 1 302
This shows:
- From the statistics counters, no evidence of packet loss - the `out_of_sequence` and `packet_seq_err` counts didn't increment during the test
- The achieved payload throughput is 113.0 Mbytes/second over a 1Gbit link.
In the T1700G-28TQ switch, Flow Control was set to Enabled on all ports.
The underlying `eno1` for the Soft-RoCE then changed to RX and TX pause on, which it autonegotiated with the switch:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/release$ ethtool -a eno1
Pause parameters for eno1:
Autonegotiate: on
RX: on
TX: on
The ConnectX-4 already has RX and TX pause on, so no change required.
Transmit end from ibv_message_bw:
[mr_halfword@haswell-alma release]$ ibv_message_bw/ibv_message_bw --thread=tx:0 --ib-dev=rocep1s0f0 --ib-port=1 --ib-gid=1
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Tx_0 connected
Tx_0 transmitted 881852416 data bytes in 841 messages over last 10.0 seconds
Tx_0 transmitted 1136656384 data bytes in 1084 messages over last 10.0 seconds
Tx_0 transmitted 1136656384 data bytes in 1084 messages over last 10.0 seconds
Tx_0 transmitted 1136656384 data bytes in 1084 messages over last 10.0 seconds
Tx_0 transmitted 1136656384 data bytes in 1084 messages over last 10.0 seconds
Tx_0 transmitted 1136656384 data bytes in 1084 messages over last 10.0 seconds
^C
Tx_0 Total data bytes 7169114112 over 63.079404 seconds; 113.7 Mbytes/second
Tx_0 Total messages 6837 over 63.079404 seconds; 108 messages/second
Tx_0 Min message size=1048576 max message size=1048576 data verification=no
Tx_0 minor page faults=0 (4231 -> 4231)
Tx_0 major page faults=0 (0 -> 0)
Tx_0 voluntary context switches=0 (86 -> 86)
Tx_0 involuntary context switches=228 (0 -> 228)
Tx_0 user time=63.013133 system time=0.001766
Receive end from ibv_message_bw:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/release$ ibv_message_bw/ibv_message_bw --thread=rx:0 --ib-dev=rxe0 --ib-port=1 --ib-gid=0
Rx_0 connected
Rx_0 received 987758592 data bytes in 942 messages over last 10.0 seconds
Rx_0 received 1135607808 data bytes in 1083 messages over last 10.0 seconds
Rx_0 received 1136656384 data bytes in 1084 messages over last 10.0 seconds
Rx_0 received 1136656384 data bytes in 1084 messages over last 10.0 seconds
Rx_0 received 1136656384 data bytes in 1084 messages over last 10.0 seconds
Rx_0 received 1136656384 data bytes in 1084 messages over last 10.0 seconds
Rx_0 Total data bytes 7169114112 over 63.070394 seconds; 113.7 Mbytes/second
Rx_0 Total messages 6837 over 63.070394 seconds; 108 messages/second
Rx_0 Min message size=1048576 max message size=1048576 data verification=no
Rx_0 minor page faults=72 (4182 -> 4254)
Rx_0 major page faults=0 (0 -> 0)
Rx_0 voluntary context switches=0 (8 -> 8)
Rx_0 involuntary context switches=5215 (0 -> 5215)
Rx_0 user time=42.594097 system time=20.497811
Transmit end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 70 seconds
Counter name Device Port Counter value(delta)
unicast_rcv_packets rocep1s0f0 1 8875716(+140903)
port_xmit_data rocep1s0f0 1 19848198293(+305825828)
port_rcv_packets rocep1s0f0 1 8875735(+140904)
unicast_xmit_packets rocep1s0f0 1 72243083(+1112039)
port_rcv_data rocep1s0f0 1 2236039827(+2892848)
port_xmit_packets rocep1s0f0 1 72243179(+1112039)
local_ack_timeout_err rocep1s0f0 1 4
packet_seq_err rocep1s0f0 1 10247
roce_adp_retrans rocep1s0f0 1 9
rx_write_requests rocep1s0f0 1 21909(+1084)
unicast_rcv_packets rocep1s0f1 1 11
port_rcv_packets rocep1s0f1 1 11
port_rcv_data rocep1s0f1 1 302
Receive end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 70 seconds
Counter name Device Port Counter value(delta)
completer_retry_err rxe0 1 370
duplicate_request rxe0 1 4968
rcvd_pkts rxe0 1 12924182(+1112049)
sent_pkts rxe0 1 8865684(+140904)
out_of_sequence rxe0 1 10250
This shows:
- From the statistics counters, no evidence of packet loss - the `out_of_sequence` and `packet_seq_err` counts didn't increment during the test
- The achieved payload throughput is 113.7 Mbytes/second over a 1Gbit link.
Transmit end from ibv_message_bw:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/release$ ibv_message_bw/ibv_message_bw --thread=tx:0 --ib-dev=rxe0 --ib-port=1 --ib-gid=0
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Tx_0 connected
Tx_0 transmitted 922746880 data bytes in 880 messages over last 10.0 seconds
Tx_0 transmitted 1130364928 data bytes in 1078 messages over last 10.0 seconds
Tx_0 transmitted 1129316352 data bytes in 1077 messages over last 10.0 seconds
Tx_0 transmitted 1130364928 data bytes in 1078 messages over last 10.0 seconds
Tx_0 transmitted 1129316352 data bytes in 1077 messages over last 10.0 seconds
Tx_0 transmitted 1130364928 data bytes in 1078 messages over last 10.0 seconds
Tx_0 transmitted 1129316352 data bytes in 1077 messages over last 10.0 seconds
Tx_0 transmitted 1130364928 data bytes in 1078 messages over last 10.0 seconds
^C
Tx_0 Total data bytes 8882487296 over 78.616022 seconds; 113.0 Mbytes/second
Tx_0 Total messages 8471 over 78.616022 seconds; 108 messages/second
Tx_0 Min message size=1048576 max message size=1048576 data verification=no
Tx_0 minor page faults=80 (4184 -> 4264)
Tx_0 major page faults=0 (0 -> 0)
Tx_0 voluntary context switches=0 (23 -> 23)
Tx_0 involuntary context switches=4213 (0 -> 4213)
Tx_0 user time=61.854381 system time=16.716450
Receive end from ibv_message_bw:
[mr_halfword@haswell-alma release]$ ibv_message_bw/ibv_message_bw --thread=rx:0 --ib-dev=rocep1s0f0 --ib-port=1 --ib-gid=1
Rx_0 connected
Rx_0 received 982515712 data bytes in 937 messages over last 10.0 seconds
Rx_0 received 1130364928 data bytes in 1078 messages over last 10.0 seconds
Rx_0 received 1129316352 data bytes in 1077 messages over last 10.0 seconds
Rx_0 received 1130364928 data bytes in 1078 messages over last 10.0 seconds
Rx_0 received 1129316352 data bytes in 1077 messages over last 10.0 seconds
Rx_0 received 1130364928 data bytes in 1078 messages over last 10.0 seconds
Rx_0 received 1129316352 data bytes in 1077 messages over last 10.0 seconds
Rx_0 Total data bytes 8882487296 over 78.606417 seconds; 113.0 Mbytes/second
Rx_0 Total messages 8471 over 78.606417 seconds; 108 messages/second
Rx_0 Min message size=1048576 max message size=1048576 data verification=no
Rx_0 minor page faults=0 (4225 -> 4225)
Rx_0 major page faults=0 (0 -> 0)
Rx_0 voluntary context switches=0 (63 -> 63)
Rx_0 involuntary context switches=342 (0 -> 342)
Rx_0 user time=78.549109 system time=0.000000
Transmit end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 90 seconds
Counter name Device Port Counter value(delta)
completer_retry_err rxe0 1 370
duplicate_request rxe0 1 4968
rcvd_pkts rxe0 1 14191842(+19147)
sent_pkts rxe0 1 17201033(+1105525)
out_of_sequence rxe0 1 10250
Receive end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 90 seconds
Counter name Device Port Counter value(delta)
unicast_rcv_packets rocep1s0f0 1 17125149(+1105531)
port_xmit_data rocep1s0f0 1 20138620474(+396661)
port_rcv_packets rocep1s0f0 1 17125298(+1105534)
unicast_xmit_packets rocep1s0f0 1 73428793(+19139)
port_rcv_data rocep1s0f0 1 4471079385(+304036420)
port_xmit_packets rocep1s0f0 1 73428794(+19138)
local_ack_timeout_err rocep1s0f0 1 4
packet_seq_err rocep1s0f0 1 10247
roce_adp_retrans rocep1s0f0 1 9
rx_write_requests rocep1s0f0 1 38767(+2156)
unicast_rcv_packets rocep1s0f1 1 11
port_rcv_packets rocep1s0f1 1 11
port_rcv_data rocep1s0f1 1 302
This shows:
- From the statistics counters, no evidence of packet loss - the `out_of_sequence` and `packet_seq_err` counts didn't increment during the test
- The achieved payload throughput is 113.0 Mbytes/second over a 1Gbit link.
Transmit and receive from Soft-RoCE end:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/release$ ibv_message_bw/ibv_message_bw --thread=tx:0,rx:1 --ib-dev=rxe0,rxe0 --ib-port=1,1 --ib-gid=0,0
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Tx_0 connected
Tx_0 transmitted 984612864 data bytes in 939 messages over last 10.0 seconds
Rx_1 connected
Rx_1 received 1074790400 data bytes in 1025 messages over last 10.0 seconds
Tx_0 transmitted 1115684864 data bytes in 1064 messages over last 10.0 seconds
Rx_1 received 1134559232 data bytes in 1082 messages over last 10.0 seconds
Tx_0 transmitted 1114636288 data bytes in 1063 messages over last 10.0 seconds
Rx_1 received 1134559232 data bytes in 1082 messages over last 10.0 seconds
Tx_0 transmitted 1114636288 data bytes in 1063 messages over last 10.0 seconds
Rx_1 received 1134559232 data bytes in 1082 messages over last 10.0 seconds
Tx_0 transmitted 1115684864 data bytes in 1064 messages over last 10.0 seconds
Rx_1 received 1134559232 data bytes in 1082 messages over last 10.0 seconds
Tx_0 transmitted 1114636288 data bytes in 1063 messages over last 10.0 seconds
Rx_1 received 1135607808 data bytes in 1083 messages over last 10.0 seconds
Tx_0 transmitted 1115684864 data bytes in 1064 messages over last 10.0 seconds
Rx_1 received 1134559232 data bytes in 1082 messages over last 10.0 seconds
Tx_0 transmitted 1114636288 data bytes in 1063 messages over last 10.0 seconds
Rx_1 received 1134559232 data bytes in 1082 messages over last 10.0 seconds
^C
Tx_0 Total data bytes 8842641408 over 79.294690 seconds; 111.5 Mbytes/second
Tx_0 Total messages 8433 over 79.294690 seconds; 106 messages/second
Tx_0 Min message size=1048576 max message size=1048576 data verification=no
Tx_0 minor page faults=64 (4183 -> 4247)
Tx_0 major page faults=0 (0 -> 0)
Tx_0 voluntary context switches=0 (14 -> 14)
Tx_0 involuntary context switches=124 (1 -> 125)
Tx_0 user time=78.925014 system time=0.403050
Rx_1 Total data bytes 9034530816 over 79.606598 seconds; 113.5 Mbytes/second
Rx_1 Total messages 8616 over 79.606598 seconds; 108 messages/second
Rx_1 Min message size=1048576 max message size=1048576 data verification=no
Rx_1 minor page faults=44 (4113 -> 4157)
Rx_1 major page faults=0 (0 -> 0)
Rx_1 voluntary context switches=0 (24 -> 24)
Rx_1 involuntary context switches=254 (4 -> 258)
Rx_1 user time=78.964145 system time=0.648358
Transmit and receive from ConnectX-4 end:
[mr_halfword@haswell-alma release]$ ibv_message_bw/ibv_message_bw --thread=rx:0,tx:1 --ib-dev=rocep1s0f0,rocep1s0f0 --ib-port=1,1 --ib-gid=1,1
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Rx_0 connected
Rx_0 received 882900992 data bytes in 842 messages over last 10.0 seconds
Tx_1 connected
Tx_1 transmitted 1003487232 data bytes in 957 messages over last 10.0 seconds
Rx_0 received 1114636288 data bytes in 1063 messages over last 10.0 seconds
Tx_1 transmitted 1135607808 data bytes in 1083 messages over last 10.0 seconds
Rx_0 received 1114636288 data bytes in 1063 messages over last 10.0 seconds
Tx_1 transmitted 1134559232 data bytes in 1082 messages over last 10.0 seconds
Rx_0 received 1115684864 data bytes in 1064 messages over last 10.0 seconds
Tx_1 transmitted 1134559232 data bytes in 1082 messages over last 10.0 seconds
Rx_0 received 1114636288 data bytes in 1063 messages over last 10.0 seconds
Tx_1 transmitted 1134559232 data bytes in 1082 messages over last 10.0 seconds
Rx_0 received 1115684864 data bytes in 1064 messages over last 10.0 seconds
Tx_1 transmitted 1134559232 data bytes in 1082 messages over last 10.0 seconds
Rx_0 received 1114636288 data bytes in 1063 messages over last 10.0 seconds
Tx_1 transmitted 1135607808 data bytes in 1083 messages over last 10.0 seconds
Rx_0 received 1114636288 data bytes in 1063 messages over last 10.0 seconds
Tx_1 transmitted 1134559232 data bytes in 1082 messages over last 10.0 seconds
^C
Rx_0 Total data bytes 8842641408 over 79.284979 seconds; 111.5 Mbytes/second
Rx_0 Total messages 8433 over 79.284979 seconds; 106 messages/second
Rx_0 Min message size=1048576 max message size=1048576 data verification=no
Rx_0 minor page faults=0 (4180 -> 4180)
Rx_0 major page faults=0 (0 -> 0)
Rx_0 voluntary context switches=0 (84 -> 84)
Rx_0 involuntary context switches=110 (0 -> 110)
Rx_0 user time=79.221968 system time=0.000332
Tx_1 Total data bytes 9034530816 over 79.615509 seconds; 113.5 Mbytes/second
Tx_1 Total messages 8616 over 79.615509 seconds; 108 messages/second
Tx_1 Min message size=1048576 max message size=1048576 data verification=no
Tx_1 minor page faults=0 (4232 -> 4232)
Tx_1 major page faults=0 (0 -> 0)
Tx_1 voluntary context switches=0 (64 -> 64)
Tx_1 involuntary context switches=811 (0 -> 811)
Tx_1 user time=79.543795 system time=0.000000
Soft-RoCE end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 90 seconds
Counter name Device Port Counter value(delta)
completer_retry_err rxe0 1 400(+4)
duplicate_request rxe0 1 4968
rcvd_pkts rxe0 1 22842108(+1128443)
sent_pkts rxe0 1 27055139(+1232733)
out_of_sequence rxe0 1 10250
ConnectX-4 end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 90 seconds
Counter name Device Port Counter value(delta)
unicast_rcv_packets rocep1s0f0 1 26953314(+1232755)
port_xmit_data rocep1s0f0 1 22455284985(+305734526)
port_rcv_packets rocep1s0f0 1 26953492(+1232769)
unicast_xmit_packets rocep1s0f0 1 81987579(+1128463)
port_rcv_data rocep1s0f0 1 6900946250(+302949524)
port_xmit_packets rocep1s0f0 1 81987676(+1128463)
local_ack_timeout_err rocep1s0f0 1 4
packet_seq_err rocep1s0f0 1 10247
roce_adp_retrans rocep1s0f0 1 9
rx_write_requests rocep1s0f0 1 64065(+3208)
unicast_rcv_packets rocep1s0f1 1 11
port_rcv_packets rocep1s0f1 1 11
port_rcv_data rocep1s0f1 1 302
This shows:
- From the statistics counters, the `out_of_sequence` and `packet_seq_err` counts didn't increment during the test. However:
  - The `completer_retry_err` count on the Soft-RoCE end increased by 31 during the test.
  - On the ConnectX-4 end the `local_ack_timeout_err` and `roce_adp_retrans` counts didn't increase during the test.
- The achieved payload throughput is 111.5 Mbytes/second from Soft-RoCE to ConnectX-4
- The achieved payload throughput is 113.5 Mbytes/second from ConnectX-4 to Soft-RoCE
On the Ubuntu 18.04 PC, set the MTU on the underlying `eno1` device to 9000 (the exact maximum wasn't determined, but attempts to set 9216 or 9200 were rejected):
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/release$ sudo ip link set eno1 mtu 9000
The `rxe0` device then reported a RoCE MTU of 4096:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/release$ rxe_cfg status
Cannot get wake-on-lan settings: Operation not permitted
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
eno1 yes e1000e rxe0 4096 (5)
And `ibv_devinfo` also reported an MTU of 4096 for `rxe0`.
Transmit and receive from Soft-RoCE end:
mr_halfword@Haswell-Ubuntu:~/ibv_message_passing/ibv_message_passing_c_project/bin/release$ ibv_message_bw/ibv_message_bw --thread=tx:0,rx:1 --ib-dev=rxe0,rxe0 --ib-port=1,1 --ib-gid=0,0
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Tx_0 connected
Tx_0 transmitted 1058013184 data bytes in 1009 messages over last 10.0 seconds
Rx_1 connected
Rx_1 received 995098624 data bytes in 949 messages over last 10.0 seconds
Tx_0 transmitted 1198522368 data bytes in 1143 messages over last 10.0 seconds
Rx_1 received 1218445312 data bytes in 1162 messages over last 10.0 seconds
Tx_0 transmitted 1198522368 data bytes in 1143 messages over last 10.0 seconds
Rx_1 received 1219493888 data bytes in 1163 messages over last 10.0 seconds
Tx_0 transmitted 1198522368 data bytes in 1143 messages over last 10.0 seconds
Rx_1 received 1218445312 data bytes in 1162 messages over last 10.0 seconds
Tx_0 transmitted 1198522368 data bytes in 1143 messages over last 10.0 seconds
Rx_1 received 1218445312 data bytes in 1162 messages over last 10.0 seconds
Tx_0 transmitted 1197473792 data bytes in 1142 messages over last 10.0 seconds
Rx_1 received 1218445312 data bytes in 1162 messages over last 10.0 seconds
Tx_0 transmitted 1198522368 data bytes in 1143 messages over last 10.0 seconds
Rx_1 received 1219493888 data bytes in 1163 messages over last 10.0 seconds
^C
Tx_0 Total data bytes 9316597760 over 77.722032 seconds; 119.9 Mbytes/second
Tx_0 Total messages 8885 over 77.722032 seconds; 114 messages/second
Tx_0 Min message size=1048576 max message size=1048576 data verification=no
Tx_0 minor page faults=99 (4182 -> 4281)
Tx_0 major page faults=0 (0 -> 0)
Tx_0 voluntary context switches=0 (12 -> 12)
Tx_0 involuntary context switches=4140 (1 -> 4141)
Tx_0 user time=61.310195 system time=16.405321
Rx_1 Total data bytes 9310306304 over 76.386658 seconds; 121.9 Mbytes/second
Rx_1 Total messages 8879 over 76.386658 seconds; 116 messages/second
Rx_1 Min message size=1048576 max message size=1048576 data verification=no
Rx_1 minor page faults=81 (4113 -> 4194)
Rx_1 major page faults=0 (0 -> 0)
Rx_1 voluntary context switches=0 (39 -> 39)
Rx_1 involuntary context switches=120 (1 -> 121)
Rx_1 user time=76.347896 system time=0.046456
Transmit and receive from ConnectX-4 end:
[mr_halfword@haswell-alma release]$ ibv_message_bw/ibv_message_bw --thread=rx:0,tx:1 --ib-dev=rocep1s0f0,rocep1s0f0 --ib-port=1,1 --ib-gid=1,1
Press Ctrl-C to tell the 1 transmit thread(s) to stop the test
Rx_0 connected
Rx_0 received 1098907648 data bytes in 1048 messages over last 10.0 seconds
Tx_1 connected
Tx_1 transmitted 1070596096 data bytes in 1021 messages over last 10.0 seconds
Rx_0 received 1198522368 data bytes in 1143 messages over last 10.0 seconds
Tx_1 transmitted 1219493888 data bytes in 1163 messages over last 10.0 seconds
Rx_0 received 1198522368 data bytes in 1143 messages over last 10.0 seconds
Tx_1 transmitted 1218445312 data bytes in 1162 messages over last 10.0 seconds
Rx_0 received 1198522368 data bytes in 1143 messages over last 10.0 seconds
Tx_1 transmitted 1218445312 data bytes in 1162 messages over last 10.0 seconds
Rx_0 received 1198522368 data bytes in 1143 messages over last 10.0 seconds
Tx_1 transmitted 1218445312 data bytes in 1162 messages over last 10.0 seconds
Rx_0 received 1198522368 data bytes in 1143 messages over last 10.0 seconds
Tx_1 transmitted 1219493888 data bytes in 1163 messages over last 10.0 seconds
Rx_0 received 1198522368 data bytes in 1143 messages over last 10.0 seconds
Tx_1 transmitted 1218445312 data bytes in 1162 messages over last 10.0 seconds
^C
Rx_0 Total data bytes 9316597760 over 77.713091 seconds; 119.9 Mbytes/second
Rx_0 Total messages 8885 over 77.713091 seconds; 114 messages/second
Rx_0 Min message size=1048576 max message size=1048576 data verification=no
Rx_0 minor page faults=0 (4180 -> 4180)
Rx_0 major page faults=0 (0 -> 0)
Rx_0 voluntary context switches=0 (83 -> 83)
Rx_0 involuntary context switches=195 (0 -> 195)
Rx_0 user time=77.639080 system time=0.000439
Tx_1 Total data bytes 9310306304 over 76.394978 seconds; 121.9 Mbytes/second
Tx_1 Total messages 8879 over 76.394978 seconds; 116 messages/second
Tx_1 Min message size=1048576 max message size=1048576 data verification=no
Tx_1 minor page faults=0 (4233 -> 4233)
Tx_1 major page faults=0 (0 -> 0)
Tx_1 voluntary context switches=0 (64 -> 64)
Tx_1 involuntary context switches=416 (0 -> 416)
Tx_1 user time=76.387299 system time=0.000664
Soft-RoCE end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 90 seconds
Counter name Device Port Counter value(delta)
completer_retry_err rxe0 1 460(+5)
duplicate_request rxe0 1 4984
rcvd_pkts rxe0 1 25408759(+305586)
sent_pkts rxe0 1 30744506(+446544)
out_of_sequence rxe0 1 10250
ConnectX-4 end device statistics from ibv_display_local_infiniband_device_statistics during the test:
Statistics after 90 seconds
Counter name Device Port Counter value(delta)
unicast_rcv_packets rocep1s0f0 1 30713330(+446577)
port_xmit_data rocep1s0f0 1 24812593778(+310665151)
port_rcv_packets rocep1s0f0 1 30713404(+446580)
unicast_xmit_packets rocep1s0f0 1 84625913(+305584)
port_rcv_data rocep1s0f0 1 9271486253(+308473711)
port_xmit_packets rocep1s0f0 1 84625943(+305582)
roce_adp_retrans_to rocep1s0f0 1 1
local_ack_timeout_err rocep1s0f0 1 4
packet_seq_err rocep1s0f0 1 10247
roce_adp_retrans rocep1s0f0 1 10
rx_write_requests rocep1s0f0 1 90413(+3449)
unicast_rcv_packets rocep1s0f1 1 24
port_rcv_packets rocep1s0f1 1 24
port_rcv_data rocep1s0f1 1 660
This shows:
- Increasing the RoCEv2 MTU from 1024 to 4096 has increased the payload throughput:
  - In the Soft-RoCE to ConnectX-4 direction from 111.5 to 119.9 Mbytes/second
  - In the ConnectX-4 to Soft-RoCE direction from 113.5 to 121.9 Mbytes/second
- With a RoCEv2 MTU of 1024 the payload uses 89 to 90% of the raw bit rate of a 1G link (some variation in each direction)
- With a RoCEv2 MTU of 4096 the payload uses 96 to 97% of the raw bit rate of a 1G link
- From the statistics counters, the `out_of_sequence` and `packet_seq_err` counts didn't increment during the test. However:
  - On the Soft-RoCE end the `completer_retry_err` count increased by 31 and the `duplicate_request` count increased by 16 during the test.
  - On the ConnectX-4 end the `roce_adp_retrans_to` and `roce_adp_retrans` counts both increased by 1 during the test.
- On Soft-RoCE end the
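The measured efficiencies are close to what the protocol overheads predict. As a rough check, assuming IPv4-based RoCEv2 and full-sized packets, the per-packet overhead is 8 (preamble) + 14 (Ethernet) + 20 (IPv4) + 8 (UDP) + 12 (BTH) + 4 (ICRC) + 4 (FCS) + 12 (inter-frame gap) = 82 bytes:

```shell
# Theoretical payload efficiency of RoCEv2 over Ethernet for each MTU,
# assuming 82 bytes of per-packet overhead (IPv4; IPv6 would add 20 bytes).
awk 'BEGIN {
    overhead = 82
    for (mtu = 1024; mtu <= 4096; mtu *= 4)
        printf "MTU %4d : payload efficiency %.1f%%\n", mtu, 100 * mtu / (mtu + overhead)
}'
```

This prints upper bounds of 92.6% for an MTU of 1024 and 98.0% for 4096, a few percent above the measured 89 to 90% and 96 to 97%, with the remainder plausibly being acknowledgement and retransmitted packets.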
On both this bi-directional test with an MTU of 4096, and the previous one with an MTU of 1024, the RDMA error count increases at the Soft-RoCE end:
- Didn't seem to correlate with the error counts at the ConnectX-4 end.
- In theory Ethernet pause should have prevented packet loss, but other counters, such as those of the underlying network device eno1 used for Soft-RoCE, weren't being monitored.
- Looking at the switch statistics, can't see any counters which measure dropped packets.
- Even though some RDMA errors were being reported, there was no apparent significant effect on the achieved payload bandwidth.
To analyse this in more detail, consider enhancing ibv_message_bw to report the change in statistics counters on the interface(s) over the test duration.
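A sketch of that enhancement as a standalone script: snapshot every counter file before and after running a command, then report only the counters which changed. This assumes the sysfs layout used by the rdma_rxe and mlx5 drivers; SYSFS_ROOT is overridable so the delta logic can be exercised without RDMA hardware present.

```shell
#!/bin/bash
# Report the change in RDMA statistics counters over the run of a command.
# Assumed sysfs layout: /sys/class/infiniband/<dev>/ports/<port>/{counters,hw_counters}/<name>
SYSFS_ROOT="${SYSFS_ROOT:-/sys/class/infiniband}"

snapshot () {
    # Emit one "<dev>/ports/<port>/.../<counter> <value>" line per counter file
    local dir file
    for dir in "${SYSFS_ROOT}"/*/ports/*/counters "${SYSFS_ROOT}"/*/ports/*/hw_counters
    do
        [ -d "${dir}" ] || continue
        for file in "${dir}"/*
        do
            [ -f "${file}" ] && echo "${file#${SYSFS_ROOT}/} $(cat "${file}")"
        done
    done
}

before=$(snapshot)
"$@"                 # the test program to run is passed on the command line
after=$(snapshot)

# Join the two snapshots on counter name and print only the changed counters
join <(echo "${before}") <(echo "${after}") |
    awk '$2 != $3 { printf "%s %+d (%d -> %d)\n", $1, $3 - $2, $2, $3 }'
```

For example `SYSFS_ROOT=/sys/class/infiniband ./rdma_counter_deltas.sh sleep 90` would report the counters which changed over a 90 second window.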
For this test was using:
- A Chelsio T420-CR in an AlmaLinux 8.6 PC, which has hardware iWARP offload. This has two 10G ports connected to a switch.
- For Soft-iWARP an on-board 1GbE Realtek Ethernet port in an Ubuntu 20.04 laptop, connected to the same switch as the T420-CR.
- As iWARP needs rdma_cm to establish the connection, and the ibv_message_bw programs currently use the verbs API directly, programs from rdmacm-utils were used for the test.
Install the rdmacm-utils package to get the rping and rstream utilities:
mr_halfword@haswell-ubuntu:~$ sudo apt install rdmacm-utils
[sudo] password for mr_halfword:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
command-not-found-data cpp-4.8 gir1.2-accounts-1.0 gir1.2-gdata-0.0
gir1.2-signon-1.0 gnome-software-common guile-2.0-libs libargon2-0 libasan0
libavresample3 libavutil55 libboost-date-time1.65.1
libboost-filesystem1.65.1 libboost-iostreams1.65.1 libboost-system1.65.1
libbrlapi0.6 libcgi-fast-perl libcgi-pm-perl libclass-accessor-perl
libdns-export1100 libdouble-conversion1 libegl1-mesa libevent-2.1-6
libexempi3 libfcgi-perl libfile-copy-recursive-perl libgcc-4.8-dev libgdbm5
libgutenprint-common libgutenprint9 libicu60 libisc-export169 libisc169
libisl15 libjson-c3 libllvm10 liblouis14 liblouisutdml8 liblwres160
libmozjs-52-0 liboauth0 libopensm5a liborcus-0.13-0 libperl5.26 libpoppler73
libpostproc54 libprocps6 libqgsttools-p1 libqpdf21 libraw16 libreadline7
libreoffice-style-galaxy libsane1 libsignon-glib1 libstdc++-4.8-dev
libswresample2 libswscale4 libusbmuxd4 libvpx5 libwsutil9 libx264-152
libxml-simple-perl libzeitgeist-1.0-1 nplan printer-driver-gutenprint
python-talloc python3-asn1crypto python3-bs4 python3-feedparser
python3-html5lib python3-oauth python3-soupsieve python3-webencodings
python3-zope.interface qpdf unity-lens-applications unity-lens-files
unity-lens-music unity-lens-video unity-scope-calculator
unity-scope-chromiumbookmarks unity-scope-firefoxbookmarks
unity-scope-manpages unity-scope-openclipart unity-scope-tomboy
unity-scope-video-remote unity-scope-virtualbox unity-scope-yelp
unity-scopes-runner wireshark-gtk
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
librdmacm1
The following NEW packages will be installed
librdmacm1 rdmacm-utils
0 to upgrade, 2 to newly install, 0 to remove and 8 not to upgrade.
Need to get 135 kB of archives.
After this operation, 578 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 librdmacm1 amd64 28.0-1ubuntu1 [64.9 kB]
Get:2 http://gb.archive.ubuntu.com/ubuntu focal/universe amd64 rdmacm-utils amd64 28.0-1ubuntu1 [70.1 kB]
Fetched 135 kB in 0s (1,006 kB/s)
Selecting previously unselected package librdmacm1:amd64.
(Reading database ... 218123 files and directories currently installed.)
Preparing to unpack .../librdmacm1_28.0-1ubuntu1_amd64.deb ...
Unpacking librdmacm1:amd64 (28.0-1ubuntu1) ...
Selecting previously unselected package rdmacm-utils.
Preparing to unpack .../rdmacm-utils_28.0-1ubuntu1_amd64.deb ...
Unpacking rdmacm-utils (28.0-1ubuntu1) ...
Setting up librdmacm1:amd64 (28.0-1ubuntu1) ...
Setting up rdmacm-utils (28.0-1ubuntu1) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...
And install ibverbs-utils to get ibv_devices:
mr_halfword@haswell-ubuntu:~$ sudo apt install ibverbs-utils
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
command-not-found-data cpp-4.8 gir1.2-accounts-1.0 gir1.2-gdata-0.0
gir1.2-signon-1.0 gnome-software-common guile-2.0-libs libargon2-0 libasan0
libavresample3 libavutil55 libboost-date-time1.65.1
libboost-filesystem1.65.1 libboost-iostreams1.65.1 libboost-system1.65.1
libbrlapi0.6 libcgi-fast-perl libcgi-pm-perl libclass-accessor-perl
libdns-export1100 libdouble-conversion1 libegl1-mesa libevent-2.1-6
libexempi3 libfcgi-perl libfile-copy-recursive-perl libgcc-4.8-dev libgdbm5
libgutenprint-common libgutenprint9 libicu60 libisc-export169 libisc169
libisl15 libjson-c3 libllvm10 liblouis14 liblouisutdml8 liblwres160
libmozjs-52-0 liboauth0 libopensm5a liborcus-0.13-0 libperl5.26 libpoppler73
libpostproc54 libprocps6 libqgsttools-p1 libqpdf21 libraw16 libreadline7
libreoffice-style-galaxy libsane1 libsignon-glib1 libstdc++-4.8-dev
libswresample2 libswscale4 libusbmuxd4 libvpx5 libwsutil9 libx264-152
libxml-simple-perl libzeitgeist-1.0-1 nplan printer-driver-gutenprint
python-talloc python3-asn1crypto python3-bs4 python3-feedparser
python3-html5lib python3-oauth python3-soupsieve python3-webencodings
python3-zope.interface qpdf unity-lens-applications unity-lens-files
unity-lens-music unity-lens-video unity-scope-calculator
unity-scope-chromiumbookmarks unity-scope-firefoxbookmarks
unity-scope-manpages unity-scope-openclipart unity-scope-tomboy
unity-scope-video-remote unity-scope-virtualbox unity-scope-yelp
unity-scopes-runner wireshark-gtk
Use 'sudo apt autoremove' to remove them.
The following NEW packages will be installed
ibverbs-utils
0 to upgrade, 1 to newly install, 0 to remove and 8 not to upgrade.
Need to get 53.2 kB of archives.
After this operation, 275 kB of additional disk space will be used.
Get:1 http://gb.archive.ubuntu.com/ubuntu focal/universe amd64 ibverbs-utils amd64 28.0-1ubuntu1 [53.2 kB]
Fetched 53.2 kB in 0s (812 kB/s)
Selecting previously unselected package ibverbs-utils.
(Reading database ... 218162 files and directories currently installed.)
Preparing to unpack .../ibverbs-utils_28.0-1ubuntu1_amd64.deb ...
Unpacking ibverbs-utils (28.0-1ubuntu1) ...
Setting up ibverbs-utils (28.0-1ubuntu1) ...
Processing triggers for man-db (2.9.1-1) ...
On the Ubuntu 20.04 laptop, as expected, there were initially no RDMA devices:
mr_halfword@haswell-ubuntu:~$ rdma link show
<<<no output>>>
Load the siw kernel module:
mr_halfword@haswell-ubuntu:~$ sudo modprobe siw
Add a Soft-iWARP device using the 1GbE Ethernet device as the underlying network interface:
mr_halfword@haswell-ubuntu:~$ sudo rdma link add siw0 type siw netdev enp1s0
Which shows up:
mr_halfword@haswell-ubuntu:~$ rdma link show
link siw0/1 state ACTIVE physical_state LINK_UP netdev enp1s0
mr_halfword@haswell-ubuntu:~$ ibv_devices
device node GUID
------ ----------------
siw0 fc45965558c40000
mr_halfword@haswell-ubuntu:~$ ibv_devinfo
hca_id: siw0
transport: iWARP (1)
fw_ver: 0.0.0
node_guid: fc45:9655:58c4:0000
sys_image_guid: fc45:9655:58c4:0000
vendor_id: 0x626d74
vendor_part_id: 1
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 1024 (3)
active_mtu: invalid MTU (0)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
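The modprobe and rdma link add above don't persist across a reboot. On a systemd-based distribution one approach (a sketch; the unit name soft-iwarp.service is made up) is a /etc/modules-load.d/siw.conf containing just the line `siw`, plus a oneshot unit which recreates the link:

```ini
# /etc/systemd/system/soft-iwarp.service (hypothetical unit name)
[Unit]
Description=Add Soft-iWARP link on enp1s0
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/rdma link add siw0 type siw netdev enp1s0

[Install]
WantedBy=multi-user.target
```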
The IPv6 link-local address on the underlying 1GbE network interface is:
$ ip addr show dev enp1s0
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether fc:45:96:55:58:c4 brd ff:ff:ff:ff:ff:ff
inet6 fe80::fe45:96ff:fe55:58c4/64 scope link
valid_lft forever preferred_lft forever
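The link-local addresses used below are derived from the interface MAC address by the modified EUI-64 rule: flip the universal/local bit of the first octet and insert ff:fe in the middle. A sketch to check this (zero-compression of whole groups isn't handled, which is sufficient for the addresses here):

```shell
# Derive the EUI-64 IPv6 link-local address from a 6-octet MAC address.
mac_to_linklocal () {
    local IFS=:
    set -- $1    # split the MAC into its six octets
    # ^ 2 flips the universal/local bit of the first octet; the middle of
    # the interface identifier is the fixed ff:fe filler.
    printf 'fe80::%x:%xff:fe%02x:%x\n' \
        $(( ((0x$1 ^ 2) << 8) | 0x$2 )) \
        0x$3 0x$4 $(( (0x$5 << 8) | 0x$6 ))
}

mac_to_linklocal fc:45:96:55:58:c4   # laptop enp1s0  -> fe80::fe45:96ff:fe55:58c4
mac_to_linklocal 00:07:43:15:22:90   # T420-CR port 0 -> fe80::207:43ff:fe15:2290
```

The results match the scope link addresses reported by ip addr on both hosts.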
On the AlmaLinux PC bring up the T420-CR network interfaces:
$ sudo ~/ibv_message_passing/alma_cxgb4_eth_setup.sh
Waiting for enp1s0f4 duplicate address detection to complete
Waiting for enp1s0f4d1 duplicate address detection to complete
Which have IPv6 link-local addresses:
[mr_halfword@haswell-alma ~]$ ip addr show dev enp1s0f4
3: enp1s0f4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP group default qlen 1000
link/ether 00:07:43:15:22:90 brd ff:ff:ff:ff:ff:ff
inet6 fe80::207:43ff:fe15:2290/64 scope link
valid_lft forever preferred_lft forever
[mr_halfword@haswell-alma ~]$ ip addr show dev enp1s0f4d1
4: enp1s0f4d1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP group default qlen 1000
link/ether 00:07:43:15:22:98 brd ff:ff:ff:ff:ff:ff
inet6 fe80::207:43ff:fe15:2298/64 scope link
valid_lft forever preferred_lft forever
For this on the Ubuntu laptop with Soft-iWARP run rping as a server (the %2 scope suffix is the interface index of enp1s0 from the ip addr output above):
$ rping -s -C 5 -a fe80::fe45:96ff:fe55:58c4%2
server DISCONNECT EVENT...
wait for RDMA_READ_ADV state 10
And on the AlmaLinux PC with the T420-CR run rping as a client:
[mr_halfword@haswell-alma ~]$ rping -c -C 5 -a fe80::fe45:96ff:fe55:58c4%3
client DISCONNECT EVENT...
Both the client and server exit upon completion of the test.
A wireshark capture made from the enp1s0 underlying 1GbE interface on the laptop running Soft-iWARP showed an iWARP TCP connection was established, data exchanged and then disconnected.
ibv_display_local_infiniband_device_statistics run on the AlmaLinux PC showed the following statistics from the T420-CR RDMA device:
Statistics after 60 seconds
Counter name Device Port Counter value(delta)
ip6OutSegs iwp1s0f4 43(+43)
ip6InSegs iwp1s0f4 41(+41)
The wireshark capture for the iWARP TCP stream showed a total of 84 packets, from the initial SYN to the final ACK after termination, which matches the 43+41 count of ip6OutSegs and ip6InSegs segments from the T420-CR RDMA statistics.
Tried reversing the roles of the client and server from the previous test.
On the AlmaLinux PC with the T420-CR run rping as a server, which doesn't exit:
[mr_halfword@haswell-alma ~]$ rping -s -C 5 -a fe80::207:43ff:fe15:2290%3
On the Ubuntu laptop with Soft-iWARP run rping as a client, which reports an error and doesn't exit:
mr_halfword@haswell-ubuntu:~$ rping -c -C 5 -a fe80::207:43ff:fe15:2290%2
rdma_connect: Connection refused
connect error -1
From wireshark on the Ubuntu laptop can see a TCP SYN request to the expected destination IPv6 address on the AlmaLinux PC, but a TCP RST in response.
Noticed that iwpmd (port mapping services for iWARP) was running on the AlmaLinux PC, but not on the Ubuntu laptop with Soft-iWARP.
On the AlmaLinux PC sudo netstat -a -p shows iwpmd is listening on UDP 0.0.0.0:sdp-portmapper, which is port number 3935:
[mr_halfword@haswell-alma ~]$ grep sdp.portmapper /etc/services
sdp-portmapper 3935/tcp # SDP Port Mapper Protocol
sdp-portmapper 3935/udp # SDP Port Mapper Protocol
Looking back at the wireshark capture from 10.4, prior to the TCP iWARP connection being established, could see the AlmaLinux PC sending 3 packets to UDP port 3935 on the Ubuntu PC, which resulted in ICMPv6 Destination Unreachable (Port Unreachable) responses.
On the Ubuntu laptop installed rdma-core to get iwpmd:
mr_halfword@haswell-ubuntu:~$ sudo apt install rdma-core
[sudo] password for mr_halfword:
Sorry, try again.
[sudo] password for mr_halfword:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
command-not-found-data cpp-4.8 gir1.2-accounts-1.0 gir1.2-gdata-0.0
gir1.2-signon-1.0 gnome-software-common guile-2.0-libs libargon2-0 libasan0
libavresample3 libavutil55 libboost-date-time1.65.1
libboost-filesystem1.65.1 libboost-iostreams1.65.1 libboost-system1.65.1
libbrlapi0.6 libcgi-fast-perl libcgi-pm-perl libclass-accessor-perl
libdns-export1100 libdouble-conversion1 libegl1-mesa libevent-2.1-6
libexempi3 libfcgi-perl libfile-copy-recursive-perl libgcc-4.8-dev libgdbm5
libgutenprint-common libgutenprint9 libicu60 libisc-export169 libisc169
libisl15 libjson-c3 libllvm10 liblouis14 liblouisutdml8 liblwres160
libmozjs-52-0 liboauth0 libopensm5a liborcus-0.13-0 libperl5.26 libpoppler73
libpostproc54 libprocps6 libqgsttools-p1 libqpdf21 libraw16 libreadline7
libreoffice-style-galaxy libsane1 libsignon-glib1 libstdc++-4.8-dev
libswresample2 libswscale4 libusbmuxd4 libvpx5 libwsutil9 libx264-152
libxml-simple-perl libzeitgeist-1.0-1 nplan printer-driver-gutenprint
python-talloc python3-asn1crypto python3-bs4 python3-feedparser
python3-html5lib python3-oauth python3-soupsieve python3-webencodings
python3-zope.interface qpdf unity-lens-applications unity-lens-files
unity-lens-music unity-lens-video unity-scope-calculator
unity-scope-chromiumbookmarks unity-scope-firefoxbookmarks
unity-scope-manpages unity-scope-openclipart unity-scope-tomboy
unity-scope-video-remote unity-scope-virtualbox unity-scope-yelp
unity-scopes-runner wireshark-gtk
Use 'sudo apt autoremove' to remove them.
The following NEW packages will be installed
rdma-core
0 to upgrade, 1 to newly install, 0 to remove and 2 not to upgrade.
Need to get 58.6 kB of archives.
After this operation, 210 kB of additional disk space will be used.
Get:1 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 rdma-core amd64 28.0-1ubuntu1 [58.6 kB]
Fetched 58.6 kB in 0s (633 kB/s)
Selecting previously unselected package rdma-core.
(Reading database ... 218181 files and directories currently installed.)
Preparing to unpack .../rdma-core_28.0-1ubuntu1_amd64.deb ...
Unpacking rdma-core (28.0-1ubuntu1) ...
Setting up rdma-core (28.0-1ubuntu1) ...
iwpmd.service is a disabled or a static unit, not starting it.
rdma-hw.target is a disabled or a static unit, not starting it.
rdma-ndd.service is a disabled or a static unit, not starting it.
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for ureadahead (0.100.0-21) ...
ureadahead will be reprofiled on next reboot
Processing triggers for systemd (245.4-4ubuntu3.17) ...