@alexarmbr
alexarmbr / profiling.md
Last active April 3, 2025 18:27
GPU profiling

Nsight Systems

Add `torch.cuda.cudart().cudaProfilerStart()` and `torch.cuda.cudart().cudaProfilerStop()` where profiling should start and stop, then launch the profiler with:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
nsys profile \
-w true \
-t cuda,nvtx,osrt,cudnn,cublas \
```
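A minimal sketch of how the profiler start/stop calls bracket the region of interest. The workload here is a hypothetical placeholder (a few matmuls), not from the gist; the point is that only the code between the two cudart calls is captured when nsys is launched with a capture range tied to the profiler API.

```python
import torch

def workload():
    # hypothetical stand-in for the real work being profiled
    a = torch.randn(1024, 1024, device="cuda")
    for _ in range(10):
        a = a @ a
    torch.cuda.synchronize()

if torch.cuda.is_available():
    workload()  # warmup runs stay outside the profiled region
    torch.cuda.cudart().cudaProfilerStart()
    workload()  # only this region is captured by nsys
    torch.cuda.cudart().cudaProfilerStop()
```

Passing `--capture-range=cudaProfilerApi` to `nsys profile` makes the trace honor these start/stop calls instead of recording the whole process.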
@alexarmbr
alexarmbr / ring_attn.py
Last active March 15, 2025 05:59
Ring-Flash Attention
"""
test performance and correctness of ring attention vs. single gpu attention
torchrun --nproc-per-node 4 ring_attn.py
using 4 H100s I get:
Rank 0 single gpu attention: 261.78 ms
Rank 0 ring attention: 73.34 ms
"""
import os
import math
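The gist's code is truncated here, but the core idea of ring attention can be sketched on CPU: each rank holds one K/V chunk and passes it around the ring, merging partial attention outputs with online-softmax rescaling so the final result matches full attention. This NumPy version simulates the ring passes with a loop over chunks; the function names and shapes are illustrative, not from the gist.

```python
import numpy as np

def attention(q, k, v):
    # reference single-device attention: softmax(q k^T / sqrt(d)) v
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def ring_attention(q, k, v, n_ranks=4):
    # Each loop iteration sees one K/V chunk, as if it had just arrived from
    # the neighboring rank; partials merge via online-softmax rescaling.
    d = q.shape[-1]
    out = np.zeros_like(q)
    m = np.full(q.shape[0], -np.inf)   # running row max
    l = np.zeros(q.shape[0])           # running softmax denominator
    for k_c, v_c in zip(np.array_split(k, n_ranks), np.array_split(v, n_ranks)):
        s = q @ k_c.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)      # rescale old partials to the new max
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ v_c
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 64)) for _ in range(3))
assert np.allclose(ring_attention(q, k, v), attention(q, k, v))
```

In the real multi-GPU version, the loop body overlaps computation on the current chunk with point-to-point sends/receives of the next one, which is where the speedup over single-GPU attention comes from.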
@alexarmbr
alexarmbr / benchmark_para_attn.py
Created March 5, 2025 18:47
A minimal correctness test and benchmark of Ulysses-style parallel attention from ParaAttention
"""
test performance and correctness of ulysses parallel attention vs single gpu attention
torchrun --nproc-per-node 2 benchmark_attn.py
using two H100s I get:
Rank 0 single gpu attention: 1698.14 ms
Rank 0 ulysses attention: 912.84 ms
running `pip install para-attn` should install everything needed
"""