Add torch.cuda.cudart().cudaProfilerStart()
and torch.cuda.cudart().cudaProfilerStop()
where profiling should start and stop.

Launch the profiler with (--capture-range=cudaProfilerApi makes nsys honor those start/stop calls):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
nsys profile \
  -w true \
  -t cuda,nvtx,osrt,cudnn,cublas \
  --capture-range=cudaProfilerApi \
  torchrun --nproc-per-node 4 ring_attn.py
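The start/stop calls are easy to mismatch, so it can help to keep them behind a small helper; here is a minimal sketch (the `profiled_region` name is my own, and the guard lets it degrade to a plain call when torch or CUDA is absent):

```python
def profiled_region(fn, *args, **kwargs):
    """Run fn inside the CUDA profiler window that
    nsys --capture-range=cudaProfilerApi records."""
    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:
        use_cuda = False
    if use_cuda:
        torch.cuda.cudart().cudaProfilerStart()
    try:
        return fn(*args, **kwargs)
    finally:
        if use_cuda:
            # make sure queued kernels land inside the capture window
            torch.cuda.synchronize()
            torch.cuda.cudart().cudaProfilerStop()
```

Usage: `profiled_region(model, batch)` profiles just that forward pass instead of the whole run.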
""" | |
test performance and correctness of ring attention vs. single gpu attention | |
torchrun --nproc-per-node 4 ring_attn.py | |
using 4 H100s I get: | |
Rank 0 single gpu attention: 261.78 ms | |
Rank 0 ring attention: 73.34 ms | |
""" | |
import os
import math
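What the benchmark above measures can be sketched in a single process: ring attention streams K/V chunks past each Q chunk and merges the partial results with online-softmax rescaling, so the concatenated output matches full attention. A NumPy sketch of that accumulation (my own simplification; the real ring_attn.py passes the chunks between ranks with torch.distributed, which is not shown):

```python
import numpy as np

def full_attention(q, k, v):
    # Reference: ordinary softmax attention over the whole sequence.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def ring_attention_sim(q_chunks, k_chunks, v_chunks):
    # Each "rank" owns one q chunk and sees every k/v chunk in turn,
    # merging partial attention with online-softmax rescaling.
    outs = []
    for q in q_chunks:
        m = np.full(q.shape[0], -np.inf)   # running row max of scores
        l = np.zeros(q.shape[0])           # running softmax denominator
        acc = np.zeros_like(q)             # running weighted sum of v
        for k, v in zip(k_chunks, v_chunks):
            s = q @ k.T / np.sqrt(q.shape[-1])
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)      # rescale previously merged partials
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ v
            m = m_new
        outs.append(acc / l[:, None])
    return np.concatenate(outs)
```

The rescaling is the same trick flash attention uses, which is why no rank ever needs the full score matrix in memory.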
""" | |
test performance and correctness of ulysses parallel attention vs single gpu attention | |
torchrun --nproc-per-node 2 benchmark_attn.py | |
using two H100s I get: | |
Rank 0 single gpu attention: 1698.14 ms | |
Rank 0 ulysses attention: 912.84 ms | |
running pip install para-attn should install everything needed | |
""" |