@OguzPastirmaci
Created October 24, 2023 15:19
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-test-a100
spec:
  # One MPI slot per GPU on each worker (8 GPUs requested per worker below).
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          initContainers:
          - name: node-ordering-by-rack
            image: oguzpastirmaci/node-ordering-by-rack:init-mpijob-v1
            volumeMounts:
            - name: node-ordering-by-rack
              mountPath: "/node-ordering-by-rack"
            - name: mpi-job-config
              mountPath: /etc/mpi
            - name: ssh-auth
              mountPath: /root/.ssh
          volumes:
          # Scratch volume shared between the init container and the launcher container.
          - name: node-ordering-by-rack
            emptyDir: {}
          containers:
          - image: oguzpastirmaci/nccl-tests:2.18.5
            name: nccl-tests
            volumeMounts:
            - name: node-ordering-by-rack
              mountPath: "/node-ordering-by-rack"
            env:
            - name: OMPI_ALLOW_RUN_AS_ROOT
              value: "1"
            - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
              value: "1"
            # Uncomment to keep the launcher idle for debugging instead of running the test.
            #command: ['sleep', '86400']
            command: ["/bin/bash", "-c"]
            args: ["mpirun \
                  --bind-to numa \
                  --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0 --mca coll ^hcoll \
                  -x RX_QUEUE_LEN=8192 \
                  -x IB_RX_QUEUE_LEN=8192 \
                  -x UCX_TLS=tcp \
                  -x HCOLL_ENABLE_MCAST_ALL=0 \
                  -x coll_hcoll_enable=0 \
                  -x NCCL_DEBUG_SUBSYS=NONE \
                  -x NCCL_DEBUG=INFO \
                  -x NCCL_IB_TIMEOUT=22 \
                  -x NCCL_IB_HCA=mlx5 \
                  -x NCCL_IB_SL=0 \
                  -x NCCL_IB_TC=41 \
                  -x NCCL_IB_GID_INDEX=3 \
                  -x NCCL_IB_QPS_PER_CONNECTION=4 \
                  -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
                  -x NCCL_CUMEM_ENABLE=0 \
                  /opt/nccl_tests/build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
                  "]
            resources:
              requests:
                cpu: 2
                memory: 128Mi
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            # 16 sriov-net attachments, matching the 16 RDMA VFs requested below.
            k8s.v1.cni.cncf.io/networks: sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net, sriov-net
        spec:
          containers:
          - image: oguzpastirmaci/nccl-tests:2.18.5
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            name: nccl
            resources:
              requests:
                cpu: 100
                memory: 750Gi
                nvidia.com/gpu: 8
                nvidia.com/sriov_rdma_vf: 16
              limits:
                nvidia.com/gpu: 8
                nvidia.com/sriov_rdma_vf: 16
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          volumes:
          # Memory-backed emptyDir mounted at /dev/shm.
          - emptyDir:
              medium: Memory
            name: dshm