@dcasati
Last active May 24, 2024 10:42
torchrun
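#!/usr/bin/env bash
# Launch one node of a 2-node, 8-processes-per-node NCCL job with torchrun.
set -x  # echo each command as it runs

# Verbose logging for PyTorch distributed and NCCL.
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export LOGLEVEL=DEBUG
export NCCL_DEBUG=WARN

# Load the system NCCL library and the nccl-rdma-sharp plugins.
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/:/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libnccl.so

# NCCL and UCX transport settings.
export NCCL_IB_PCI_RELAXED_ORDERING=1
export UCX_IB_PCI_RELAXED_ORDERING=on
export UCX_MEM_EVENTS=n
export NCCL_IB_DISABLE=0  # 0 leaves the InfiniBand transport enabled
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0

# This is node 1 of 2; the other node must use --node_rank=0.
# Flags use the underscore spelling throughout for consistency.
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --max_restarts=1 \
  --node_rank=1 \
  --rdzv_id=test \
  --rdzv_endpoint=task-rahuls-healthy-league-service:8081 \
  oci_launch_scripts/k8s_nccl_job.py --backend=nccl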
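
The script hardcodes `--node_rank=1`, so each node needs its own copy with a different rank. As a minimal sketch, assuming a hypothetical `NODE_RANK` environment variable set per node (neither it nor the `build_torchrun_args` helper is part of the original script), the same arguments could be produced by one shared function:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original script: prints the torchrun
# arguments for one node, one per line. NODE_RANK is an assumed per-node
# environment variable; it defaults to 0 when unset or empty.
build_torchrun_args() {
  local node_rank="${NODE_RANK:-0}"
  # Reject anything that is not a non-negative integer.
  case "$node_rank" in
    *[!0-9]*)
      echo "NODE_RANK must be a non-negative integer" >&2
      return 1
      ;;
  esac
  printf '%s\n' \
    --nproc_per_node=8 \
    --nnodes=2 \
    --max_restarts=1 \
    "--node_rank=${node_rank}" \
    --rdzv_id=test \
    --rdzv_endpoint=task-rahuls-healthy-league-service:8081
}
```

One way to use it is to read the lines into an array with `mapfile -t args < <(build_torchrun_args)` and then run `torchrun "${args[@]}" oci_launch_scripts/k8s_nccl_job.py --backend=nccl`, so the same script can run unchanged on every node.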

dcasati commented Sep 25, 2023

updates
