@Steboss
Created June 4, 2025 13:03
Example of using SLURM with Fuji
#!/bin/bash
#SBATCH -A something
#SBATCH -p some_partition
#SBATCH -N 2 # number of nodes to use
#SBATCH -t # time limit, to be filled in
#SBATCH -J # job name, to be filled in
export CONFIG="fuji-70B-v2-flash"
export CONTAINER="my-container"
export BASE_DIR="this is the dir where I want to save the outputs from SLURM + where my Python script is"
export BASE_SCRIPT="this is the name of the Python script I am using"
export GBS="global batch size"
read -r -d '' cmd <<'EOF'
# the only XLA FLAG I've used
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.9
# cd to the dir where I want to save outputs
cd "${BASE_DIR}"
python3 "${BASE_SCRIPT}" \
  --output_log_file=/opt/host/output.log \
  --module=text.gpt.c4_trainer \
  --config="${CONFIG}" \
  --jax_backend=gpu \
  --trainer_dir=/opt/host/axlearn-checkpoints \
  --data_dir=gs://axlearn-public/tensorflow_datasets \
  --ici_fsdp=8 \
  --dcn_dp=2 \
  --gbs="${GBS}" \
  --ga=1 \
  --seq_len=4096 \
  --max_step=301 \
  --write_summary_steps=300 \
  --num_processes="${SLURM_NTASKS}" \
  --distributed_coordinator="${SLURM_LAUNCH_NODE_IPADDR}:12345" \
  --process_id="${SLURM_PROCID}" \
  --world_size=16
EOF
# folder for reporting output from slurm
FOLDER="some_folder"
mkdir -p "${FOLDER}"
OUTFILE="${FOLDER}/output-%j.txt"
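# NOTE: MOUNTS and EXPORTS are not set anywhere in this script. Define them
# before this point (MOUNTS presumably as a --container-mounts=... option that
# provides /opt/host inside the container); if left unset they expand to nothing.
srun \
-o "${OUTFILE}" \
-e "${OUTFILE}" \
--container-image="${CONTAINER}" \
${MOUNTS} \
${EXPORTS} \
--container-remap-root \
bash -c "${cmd}"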
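
The script hard-codes `--world_size=16` next to `--ici_fsdp=8 --dcn_dp=2` and `-N 2`. A minimal sketch of a pre-flight check that could run before `srun`, assuming the Python script expects the product of the two mesh axes to equal the world size (this matches the numbers above, 8 x 2 = 16) and assuming 8 GPUs per node. The helper names `compute_world_size` and `check_mesh`, and the `GPUS_PER_NODE` variable, are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original gist): derive the device count
# from the node count and GPUs per node instead of hard-coding it.
compute_world_size() {
  local nodes="$1" gpus_per_node="$2"
  echo $(( nodes * gpus_per_node ))
}

# Hypothetical helper: fail if the mesh axes (ici_fsdp x dcn_dp) do not
# multiply out to world_size. Assumes that is what the trainer requires.
check_mesh() {
  local ici_fsdp="$1" dcn_dp="$2" world_size="$3"
  if (( ici_fsdp * dcn_dp != world_size )); then
    echo "mesh ${ici_fsdp}x${dcn_dp} does not match world_size=${world_size}" >&2
    return 1
  fi
}

# SLURM_JOB_NUM_NODES is set by SLURM inside a job; fall back to the 2 nodes
# requested with -N. GPUS_PER_NODE=8 is an assumption about the hardware.
NODES="${SLURM_JOB_NUM_NODES:-2}"
GPUS_PER_NODE="${GPUS_PER_NODE:-8}"
WORLD_SIZE="$(compute_world_size "${NODES}" "${GPUS_PER_NODE}")"

# Mesh values copied from the command above.
check_mesh 8 2 "${WORLD_SIZE}" && echo "world_size=${WORLD_SIZE}"
```

With this in place, `--world_size=16` could be replaced by `--world_size=${WORLD_SIZE}`, provided `WORLD_SIZE` is exported so it is visible inside the container.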