- Generate `.nvprof` files:

```shell
nvprof -o pytorch.nvprof -f python3 test_pytorch.py
nvprof -o pytorch_mp.nvprof -f python3 test_pytorch_mp.py
nvprof -o cupy.nvprof -f python3 test_cupy.py
nvprof -o cupy_mp.nvprof -f python3 test_cupy_mp.py
```
- Open the NVIDIA Visual Profiler (NVVP) and load the `.nvprof` files.
| | PyTorch_for | PyTorch_mp | Cupy_for | Cupy_mp |
|---|---|---|---|---|
| Initialization time | 41.74 s | 42.86 s | 24.69 s | 23.27 s |
| Query time | 2.94 s | 3.15 s | 4.01 s | 3.92 s |
- The multi-threaded version of PyTorch is slower than the "for"-loop version.
- The multi-threaded version of CuPy is faster than the "for"-loop version.
- In PyTorch, the kernels for different GPUs are all executed on the default streams.
- In CuPy, the kernels for different GPUs are executed on separate streams.
- Kernel execution is asynchronous in PyTorch, while in CuPy there are some gaps between kernel executions.
- Are there better ways to utilize multiple GPUs in PyTorch or CuPy?
- What is the role of streams in multi-GPU execution?
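For reference, the thread-per-GPU dispatch pattern that the `*_mp` scripts presumably follow can be sketched with the standard library alone. This is a hypothetical skeleton, not the actual test code: `run_query`, `query_all`, and the placeholder arithmetic are stand-ins for the real per-device kernel launches.

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(dev, data):
    # Hypothetical per-GPU worker. In real code this would select the
    # device (e.g. cupy.cuda.Device(dev) or torch.cuda.device(dev)),
    # optionally create a non-default stream, and launch kernels there.
    # Placeholder compute stands in for the actual query:
    return sum(x * x for x in data)

def query_all(num_gpus, shards):
    # One worker thread per device, so that kernel launches on
    # different GPUs can overlap instead of being serialized by a
    # Python-level "for" loop over devices.
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        futures = [pool.submit(run_query, dev, shard)
                   for dev, shard in enumerate(shards)]
        return [f.result() for f in futures]

if __name__ == "__main__":
    print(query_all(2, [[1, 2, 3], [4, 5]]))  # [14, 41]
```

Whether this pattern actually overlaps GPU work depends on the points observed above: if all threads launch onto the default stream (as in the PyTorch case), the launches can still serialize, whereas per-thread streams (as in the CuPy case) allow true concurrency.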