@wence-
Created June 28, 2022 16:25
dask-cuda merge scaling (dask backend)
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 40000000
Base-chunks | 4
Other-chunks | 4
Broadcast | default
Protocol | ucx
Device(s) | 0
RMM Pool | True
Frac-match | 0.6
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 4.77 GiB
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
336.17 ms | 14.18 GiB/s
269.08 ms | 17.72 GiB/s
259.39 ms | 18.38 GiB/s
245.53 ms | 19.42 GiB/s
268.69 ms | 17.75 GiB/s
246.03 ms | 19.38 GiB/s
184.55 ms | 25.84 GiB/s
241.35 ms | 19.76 GiB/s
280.05 ms | 17.03 GiB/s
240.33 ms | 19.84 GiB/s
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 40000000
Base-chunks | 64
Other-chunks | 64
Broadcast | default
Protocol | ucx
Device(s) | 0
RMM Pool | True
Frac-match | 0.6
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 76.29 GiB
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
4.36 s | 17.50 GiB/s
1.98 s | 38.53 GiB/s
2.06 s | 36.98 GiB/s
1.96 s | 38.97 GiB/s
1.94 s | 39.29 GiB/s
1.94 s | 39.36 GiB/s
1.93 s | 39.59 GiB/s
1.93 s | 39.50 GiB/s
2.04 s | 37.39 GiB/s
2.09 s | 36.59 GiB/s
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 40000000
Base-chunks | 8
Other-chunks | 8
Broadcast | default
Protocol | ucx
Device(s) | 0
RMM Pool | True
Frac-match | 0.6
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 9.54 GiB
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
719.84 ms | 13.25 GiB/s
629.59 ms | 15.15 GiB/s
617.55 ms | 15.44 GiB/s
631.35 ms | 15.11 GiB/s
624.07 ms | 15.28 GiB/s
719.56 ms | 13.25 GiB/s
683.50 ms | 13.95 GiB/s
644.49 ms | 14.80 GiB/s
658.17 ms | 14.49 GiB/s
583.49 ms | 16.34 GiB/s
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 40000000
Base-chunks | 128
Other-chunks | 128
Broadcast | default
Protocol | ucx
Device(s) | 0
RMM Pool | True
Frac-match | 0.6
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 152.59 GiB
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
3.83 s | 39.80 GiB/s
2.94 s | 51.98 GiB/s
2.87 s | 53.16 GiB/s
3.01 s | 50.75 GiB/s
2.91 s | 52.41 GiB/s
2.90 s | 52.59 GiB/s
2.93 s | 52.01 GiB/s
3.10 s | 49.21 GiB/s
2.91 s | 52.36 GiB/s
2.97 s | 51.36 GiB/s
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 40000000
Base-chunks | 16
Other-chunks | 16
Broadcast | default
Protocol | ucx
Device(s) | 0
RMM Pool | True
Frac-match | 0.6
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 19.07 GiB
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
1.66 s | 11.50 GiB/s
988.85 ms | 19.29 GiB/s
1.09 s | 17.58 GiB/s
1.11 s | 17.14 GiB/s
1.04 s | 18.30 GiB/s
1.04 s | 18.27 GiB/s
970.95 ms | 19.64 GiB/s
971.89 ms | 19.63 GiB/s
971.03 ms | 19.64 GiB/s
1.08 s | 17.60 GiB/s
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 40000000
Base-chunks | 256
Other-chunks | 256
Broadcast | default
Protocol | ucx
Device(s) | 0
RMM Pool | True
Frac-match | 0.6
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 305.18 GiB
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
6.08 s | 50.18 GiB/s
5.64 s | 54.14 GiB/s
5.88 s | 51.87 GiB/s
5.54 s | 55.11 GiB/s
5.85 s | 52.18 GiB/s
5.60 s | 54.53 GiB/s
5.75 s | 53.11 GiB/s
5.65 s | 54.03 GiB/s
5.51 s | 55.41 GiB/s
5.85 s | 52.16 GiB/s
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 40000000
Base-chunks | 32
Other-chunks | 32
Broadcast | default
Protocol | ucx
Device(s) | 0
RMM Pool | True
Frac-match | 0.6
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 38.15 GiB
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
4.07 s | 9.38 GiB/s
1.51 s | 25.30 GiB/s
1.49 s | 25.65 GiB/s
1.43 s | 26.72 GiB/s
1.37 s | 27.94 GiB/s
1.48 s | 25.69 GiB/s
1.45 s | 26.33 GiB/s
1.43 s | 26.67 GiB/s
1.48 s | 25.74 GiB/s
1.43 s | 26.74 GiB/s
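
The "Data processed" and "Throughput" figures in the runs above appear to follow from the chunk counts and wall-clock times by simple arithmetic. The sketch below reproduces them under one assumption that is inferred from the reported numbers and not taken from the benchmark source: each row occupies 16 bytes (for example, two 8-byte columns). The function names and the `bytes_per_row` parameter are hypothetical and are not part of dask-cuda.

```python
GIB = 2**30  # bytes per GiB


def data_processed_gib(rows_per_chunk, base_chunks, other_chunks, bytes_per_row=16):
    """Estimate the total size of both input frames in GiB.

    bytes_per_row=16 is an assumption inferred from the reported
    "Data processed" values, not read from the benchmark code.
    """
    total_rows = (base_chunks + other_chunks) * rows_per_chunk
    return total_rows * bytes_per_row / GIB


def throughput_gib_s(data_gib, wall_clock_s):
    """Throughput is the data processed divided by the wall-clock time."""
    return data_gib / wall_clock_s


# First run of the 4-chunk configuration: 336.17 ms wall clock.
size = data_processed_gib(40_000_000, base_chunks=4, other_chunks=4)
print(f"{size:.2f} GiB")                              # 4.77 GiB
print(f"{throughput_gib_s(size, 0.33617):.2f} GiB/s")  # 14.18 GiB/s
```

The same formula gives 76.29 GiB for the 64-chunk configuration and 305.18 GiB for the 256-chunk configuration, matching the headers above.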