Notes on the PCIe DMA bandwidth achieved using FPGAs, with fpga_sio used for the FPGA designs and the test software.
The PC is an HP Z4 G4 with an Intel W-2123 CPU, running AlmaLinux 8.10.
Unless specified otherwise, three FPGAs are fitted:
- NiteFury : Artix-7 PCIe2 x4
- TEF1001 : Kintex-7 PCIe2 x4
- XCKU5P_DUAL_QSFP : Kintex UltraScale+ PCIe3 x8
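
As a rough yardstick for the measurements below, the raw link rate for the PCIe2 x4 and PCIe3 x8 configurations listed above can be worked out from the standard per-lane transfer rates and line encodings. This is only a sketch: the function and table names are illustrative, and the figures are per direction and exclude packet (TLP/DLLP) overhead, so achieved DMA bandwidth will be lower.

```python
# Per-lane transfer rate in GT/s and line-encoding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
}


def raw_link_mbytes_per_sec(generation, num_lanes):
    """Raw link rate in units of 10^6 bytes/sec, in one direction."""
    gigatransfers_per_sec, encoding_efficiency = PCIE_GENERATIONS[generation]
    bits_per_sec = gigatransfers_per_sec * 1e9 * encoding_efficiency * num_lanes
    return bits_per_sec / 8 / 1e6


for label, generation, num_lanes in [("PCIe2 x4", 2, 4), ("PCIe3 x8", 3, 8)]:
    rate = raw_link_mbytes_per_sec(generation, num_lanes)
    print(f"{label}: {rate:.1f} Mbytes/sec raw link rate per direction")
```
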
The following bitstreams are used, in which the loopback is connected directly on the DMA bridge with fixed connections between adjacent channels:
$ bin/release/identify_pcie_fpga_design/display_identified_pcie_fpga_designs
Opening device 0000:15:00.0 (10ee:7024) with IOMMU group 41
Enabled bus master for 0000:15:00.0
Opening device 0000:2d:00.0 (10ee:9038) with IOMMU group 85
Enabled bus master for 0000:2d:00.0
Opening device 0000:36:00.0 (10ee:7024) with IOMMU group 86
Enabled bus master for 0000:36:00.0
Design TEF1001_dma_stream_loopback:
PCI device 0000:15:00.0 IOMMU group 41
DMA bridge bar 2 AXI Stream
Channel ID addr_alignment len_granularity num_address_bits
H2C 0 1 1 64
H2C 1 1 1 64
C2H 0 1 1 64
C2H 1 1 1 64
User access build timestamp : C1310413 - 24/02/2024 16:16:19
Quad SPI registers at bar 0 offset 0x2000
XADC registers at bar 0 offset 0x3000
IIC registers at bar 0 offset 0x0
bit-banged I2C GPIO registers at bar 0 offset 0x1000
All 2 master ports in AXI4-Stream Switch are disabled
Design XCKU5P_DUAL_QSFP_dma_stream_loopback:
PCI device 0000:2d:00.0 IOMMU group 85
DMA bridge bar 2 AXI Stream
Channel ID addr_alignment len_granularity num_address_bits
H2C 0 1 1 64
H2C 1 1 1 64
H2C 2 1 1 64
H2C 3 1 1 64
C2H 0 1 1 64
C2H 1 1 1 64
C2H 2 1 1 64
C2H 3 1 1 64
User access build timestamp : F5B0E950 - 30/11/2024 14:37:16
Quad SPI registers at bar 0 offset 0x0
SYSMON registers at bar 0 offset 0x1000
All 4 master ports in AXI4-Stream Switch are disabled
Design NiteFury_dma_stream_loopback:
PCI device 0000:36:00.0 IOMMU group 86
DMA bridge bar 2 AXI Stream
Channel ID addr_alignment len_granularity num_address_bits
H2C 0 1 1 64
H2C 1 1 1 64
C2H 0 1 1 64
C2H 1 1 1 64
User access build timestamp : C1310A5B - 24/02/2024 16:41:27
Quad SPI registers at bar 0 offset 0x0
XADC registers at bar 0 offset 0x1000
All 2 master ports in AXI4-Stream Switch are disabled
The FPGA designs don't contain an AXI4-Stream Switch. The reads of the undefined switch registers returned values with the most significant bit set, which resulted in all the ports being reported as disabled.
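
The effect can be illustrated with a minimal sketch. It assumes that bit 31 of each master-port routing register is the "disable" flag, which matches the behaviour described above but is not taken from the fpga_sio source. The function names and the 0xFFFFFFFF read values are hypothetical.

```python
# Assumed register layout: bit 31 of each AXI4-Stream Switch master-port
# routing register disables that port.
MI_DISABLE_MASK = 0x8000_0000


def master_port_disabled(register_value: int) -> bool:
    """Return True when the most significant bit of a 32-bit value is set."""
    return (register_value & MI_DISABLE_MASK) != 0


def count_disabled_ports(register_values) -> int:
    """Count how many routing register values decode as a disabled port."""
    return sum(master_port_disabled(value) for value in register_values)


# Hypothetical values standing in for reads of addresses with no switch behind them.
undefined_reads = [0xFFFF_FFFF, 0xFFFF_FFFF]
print(f"{count_disabled_ports(undefined_reads)} of {len(undefined_reads)} "
      "master ports decode as disabled")
```

Any read value with the top bit set decodes the same way, so a design without the switch would be reported as having every port disabled rather than as having no switch.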
Result with all streams tested in parallel:
[mr_halfword@skylake-alma release]$ xilinx_dma_bridge_for_pcie/test_dma_bridge_parallel_streams
Opening device 0000:15:00.0 (10ee:7024) with IOMMU group 41
Enabled bus master for 0000:15:00.0
Opening device 0000:2d:00.0 (10ee:9038) with IOMMU group 85
Enabled bus master for 0000:2d:00.0
Opening device 0000:36:00.0 (10ee:7024) with IOMMU group 86
Enabled bus master for 0000:36:00.0
Using num_descriptors=64 bytes_per_buffer=0x1000000 data_mapping_size_words=0x10000000
Selecting test of TEF1001_dma_stream_loopback design PCI device 0000:15:00.0 IOMMU group 41 H2C channel 0 C2H channel 1
Selecting test of TEF1001_dma_stream_loopback design PCI device 0000:15:00.0 IOMMU group 41 H2C channel 1 C2H channel 0
Selecting test of XCKU5P_DUAL_QSFP_dma_stream_loopback design PCI device 0000:2d:00.0 IOMMU group 85 H2C channel 0 C2H channel 1
Selecting test of XCKU5P_DUAL_QSFP_dma_stream_loopback design PCI device 0000:2d:00.0 IOMMU group 85 H2C channel 1 C2H channel 0
Selecting test of XCKU5P_DUAL_QSFP_dma_stream_loopback design PCI device 0000:2d:00.0 IOMMU group 85 H2C channel 2 C2H channel 3
Selecting test of XCKU5P_DUAL_QSFP_dma_stream_loopback design PCI device 0000:2d:00.0 IOMMU group 85 H2C channel 3 C2H channel 2
Selecting test of NiteFury_dma_stream_loopback design PCI device 0000:36:00.0 IOMMU group 86 H2C channel 0 C2H channel 1
Selecting test of NiteFury_dma_stream_loopback design PCI device 0000:36:00.0 IOMMU group 86 H2C channel 1 C2H channel 0
Press Ctrl-C to stop test
0000:15:00.0 0 -> 1 818.597 Mbytes/sec (8170504192 bytes in 9.981102 secs)
0000:15:00.0 1 -> 0 818.597 Mbytes/sec (8170504192 bytes in 9.981107 secs)
0000:2d:00.0 0 -> 1 1487.514 Mbytes/sec (14864613376 bytes in 9.992921 secs)
0000:2d:00.0 1 -> 0 1487.513 Mbytes/sec (14864613376 bytes in 9.992929 secs)
0000:2d:00.0 2 -> 3 1487.513 Mbytes/sec (14864613376 bytes in 9.992930 secs)
0000:2d:00.0 3 -> 2 1487.513 Mbytes/sec (14864613376 bytes in 9.992931 secs)
0000:36:00.0 0 -> 1 818.584 Mbytes/sec (8170504192 bytes in 9.981270 secs)
0000:36:00.0 1 -> 0 818.583 Mbytes/sec (8170504192 bytes in 9.981276 secs)
<<snip>>
0000:15:00.0 0 -> 1 818.598 Mbytes/sec (8187281408 bytes in 10.001590 secs)
0000:15:00.0 1 -> 0 818.598 Mbytes/sec (8187281408 bytes in 10.001589 secs)
0000:2d:00.0 0 -> 1 1487.519 Mbytes/sec (14864613376 bytes in 9.992892 secs)
0000:2d:00.0 1 -> 0 1487.519 Mbytes/sec (14864613376 bytes in 9.992892 secs)
0000:2d:00.0 2 -> 3 1487.519 Mbytes/sec (14864613376 bytes in 9.992893 secs)
0000:2d:00.0 3 -> 2 1487.519 Mbytes/sec (14864613376 bytes in 9.992893 secs)
0000:36:00.0 0 -> 1 818.594 Mbytes/sec (8187281408 bytes in 10.001636 secs)
0000:36:00.0 1 -> 0 818.594 Mbytes/sec (8187281408 bytes in 10.001636 secs)
^C 0000:15:00.0 0 -> 1 818.596 Mbytes/sec (5167382528 bytes in 6.312490 secs)
0000:15:00.0 1 -> 0 818.597 Mbytes/sec (5167382528 bytes in 6.312486 secs)
0000:2d:00.0 0 -> 1 1487.491 Mbytes/sec (8506048512 bytes in 5.718388 secs)
0000:2d:00.0 1 -> 0 1487.491 Mbytes/sec (8506048512 bytes in 5.718386 secs)
0000:2d:00.0 2 -> 3 1487.491 Mbytes/sec (8506048512 bytes in 5.718385 secs)
0000:2d:00.0 3 -> 2 1487.491 Mbytes/sec (8506048512 bytes in 5.718385 secs)
0000:36:00.0 0 -> 1 818.590 Mbytes/sec (5167382528 bytes in 6.312539 secs)
0000:36:00.0 1 -> 0 818.591 Mbytes/sec (5167382528 bytes in 6.312535 secs)
Overall test statistics:
0000:15:00.0 0 -> 1 818.599 Mbytes/sec (782824898560 bytes in 956.298503 secs)
0000:15:00.0 1 -> 0 818.599 Mbytes/sec (782824898560 bytes in 956.298504 secs)
0000:2d:00.0 0 -> 1 1487.534 Mbytes/sec (1421650952192 bytes in 955.709649 secs)
0000:2d:00.0 1 -> 0 1487.534 Mbytes/sec (1421650952192 bytes in 955.709654 secs)
0000:2d:00.0 2 -> 3 1487.534 Mbytes/sec (1421650952192 bytes in 955.709654 secs)
0000:2d:00.0 3 -> 2 1487.534 Mbytes/sec (1421650952192 bytes in 955.709654 secs)
0000:36:00.0 0 -> 1 818.598 Mbytes/sec (782824898560 bytes in 956.299334 secs)
0000:36:00.0 1 -> 0 818.598 Mbytes/sec (782824898560 bytes in 956.299336 secs)
0000:15:00.0 0 -> 1 Test pattern verified in 268435456 words
0000:15:00.0 1 -> 0 Test pattern verified in 268435456 words
0000:2d:00.0 0 -> 1 Test pattern verified in 268435456 words
0000:2d:00.0 1 -> 0 Test pattern verified in 268435456 words
0000:2d:00.0 2 -> 3 Test pattern verified in 268435456 words
0000:2d:00.0 3 -> 2 Test pattern verified in 268435456 words
0000:36:00.0 0 -> 1 Test pattern verified in 268435456 words
0000:36:00.0 1 -> 0 Test pattern verified in 268435456 words
Overall PASS
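
The reported rates are consistent with bytes divided by seconds in units of 10^6 bytes. The sketch below recomputes them from the overall test statistics above and sums the streams of each device. The helper names are illustrative and are not part of the test program.

```python
# (device, stream, bytes transferred, elapsed seconds) copied from the
# "Overall test statistics" of the run with all streams tested in parallel.
OVERALL_STATS = [
    ("0000:15:00.0", "0 -> 1", 782824898560, 956.298503),
    ("0000:15:00.0", "1 -> 0", 782824898560, 956.298504),
    ("0000:2d:00.0", "0 -> 1", 1421650952192, 955.709649),
    ("0000:2d:00.0", "1 -> 0", 1421650952192, 955.709654),
    ("0000:2d:00.0", "2 -> 3", 1421650952192, 955.709654),
    ("0000:2d:00.0", "3 -> 2", 1421650952192, 955.709654),
    ("0000:36:00.0", "0 -> 1", 782824898560, 956.299334),
    ("0000:36:00.0", "1 -> 0", 782824898560, 956.299336),
]


def mbytes_per_sec(num_bytes, seconds):
    """Transfer rate in units of 10^6 bytes per second."""
    return num_bytes / seconds / 1e6


def per_device_totals(stats):
    """Sum the rate of every stream belonging to the same PCI device."""
    totals = {}
    for device, _stream, num_bytes, seconds in stats:
        totals[device] = totals.get(device, 0.0) + mbytes_per_sec(num_bytes, seconds)
    return totals


for device, total in per_device_totals(OVERALL_STATS).items():
    print(f"{device} {total:.3f} Mbytes/sec summed over its streams")
```
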
Result with only one pair of streams tested on each device:
[mr_halfword@skylake-alma release]$ xilinx_dma_bridge_for_pcie/test_dma_bridge_parallel_streams --stream_device 0000:15:00.0,0 --stream_device 0000:2d:00.0,0 --stream_device 0000:36:00.0,0
Opening device 0000:15:00.0 (10ee:7024) with IOMMU group 41
Enabled bus master for 0000:15:00.0
Opening device 0000:2d:00.0 (10ee:9038) with IOMMU group 85
Enabled bus master for 0000:2d:00.0
Opening device 0000:36:00.0 (10ee:7024) with IOMMU group 86
Enabled bus master for 0000:36:00.0
Using num_descriptors=64 bytes_per_buffer=0x1000000 data_mapping_size_words=0x10000000
Selecting test of TEF1001_dma_stream_loopback design PCI device 0000:15:00.0 IOMMU group 41 H2C channel 0 C2H channel 1
Selecting test of XCKU5P_DUAL_QSFP_dma_stream_loopback design PCI device 0000:2d:00.0 IOMMU group 85 H2C channel 0 C2H channel 1
Selecting test of NiteFury_dma_stream_loopback design PCI device 0000:36:00.0 IOMMU group 86 H2C channel 0 C2H channel 1
Press Ctrl-C to stop test
0000:15:00.0 0 -> 1 1637.195 Mbytes/sec (16357785600 bytes in 9.991346 secs)
0000:2d:00.0 0 -> 1 5952.485 Mbytes/sec (59508785152 bytes in 9.997302 secs)
0000:36:00.0 0 -> 1 1637.144 Mbytes/sec (16357785600 bytes in 9.991661 secs)
<<snip>>
0000:15:00.0 0 -> 1 1637.195 Mbytes/sec (16357785600 bytes in 9.991348 secs)
0000:2d:00.0 0 -> 1 5952.636 Mbytes/sec (59525562368 bytes in 9.999866 secs)
0000:36:00.0 0 -> 1 1637.189 Mbytes/sec (16374562816 bytes in 10.001631 secs)
^C 0000:15:00.0 0 -> 1 1637.194 Mbytes/sec (16374562816 bytes in 10.001600 secs)
0000:2d:00.0 0 -> 1 5952.540 Mbytes/sec (59525562368 bytes in 10.000027 secs)
0000:36:00.0 0 -> 1 1637.179 Mbytes/sec (16374562816 bytes in 10.001693 secs)
0000:15:00.0 0 -> 1 1304.633 Mbytes/sec (855638016 bytes in 0.655846 secs)
0000:2d:00.0 0 -> 1 1301.989 Mbytes/sec (234881024 bytes in 0.180402 secs)
0000:36:00.0 0 -> 1 1279.002 Mbytes/sec (838860800 bytes in 0.655871 secs)
Overall test statistics:
0000:15:00.0 0 -> 1 1637.198 Mbytes/sec (1408833159168 bytes in 860.515055 secs)
0000:2d:00.0 0 -> 1 5952.659 Mbytes/sec (5119517130752 bytes in 860.038640 secs)
0000:36:00.0 0 -> 1 1637.195 Mbytes/sec (1408816381952 bytes in 860.506279 secs)
0000:15:00.0 0 -> 1 Test pattern verified in 268435456 words
0000:2d:00.0 0 -> 1 Test pattern verified in 268435456 words
0000:36:00.0 0 -> 1 Test pattern verified in 268435456 words
Overall PASS
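
One way to read the two runs together is to compare, for each device, the sum of the per-stream rates from the parallel run against the single stream rate from this run. The sketch below does that using the overall statistics reported above. The dictionary and function names are illustrative only.

```python
# Mbytes/sec taken from the two "Overall test statistics" blocks above.
# Parallel run: one entry per stream. Single run: the one stream tested.
PARALLEL_STREAMS = {
    "0000:15:00.0": [818.599, 818.599],
    "0000:2d:00.0": [1487.534, 1487.534, 1487.534, 1487.534],
    "0000:36:00.0": [818.598, 818.598],
}
SINGLE_STREAM = {
    "0000:15:00.0": 1637.198,
    "0000:2d:00.0": 5952.659,
    "0000:36:00.0": 1637.195,
}


def parallel_to_single_ratio(device):
    """Ratio of the summed parallel stream rates to the single stream rate."""
    return sum(PARALLEL_STREAMS[device]) / SINGLE_STREAM[device]


for device in SINGLE_STREAM:
    ratio = parallel_to_single_ratio(device)
    print(f"{device} parallel total / single stream = {ratio:.4f}")
```

On these figures the ratio is within 0.1% of 1 for every device, which suggests the total per-device bandwidth is unchanged and is shared evenly between the active streams.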