@yzchen
Last active January 9, 2019 19:40
Slurm job script template
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4 # 4 tasks per node
#SBATCH --gres=gpu:4 # 4 GPUs per node
#SBATCH --exclusive
#SBATCH --partition=batch
#SBATCH -J p1-weak
#SBATCH -o /scratch/P100-Exps/Weak/logs/p1.out
#SBATCH -e /scratch/P100-Exps/Weak/errs/p1.err
#SBATCH --time=2-00:00:00
#SBATCH --constraint=[p100]
#SBATCH --exclude dgpu502-25
module load tensorflow
module load openmpi
module load nccl
# Build Horovod with NCCL-based GPU allreduce and install it into the user site-packages.
HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod --user
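# Heartbeat: append a timestamp to the job log every 5 s, in the background.
# 34560 iterations x 5 s = 48 h, matching the --time limit above.
for i in {1..34560}
do
    date >> /scratch/P100-Exps/Weak/logs/p1.out
    sleep 5
done &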
# Launch 4 ranks on the node (one per GPU); use the ob1 PML and exclude the openib BTL.
mpirun -np 4 -npernode 4 \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python3 -u /scratch/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model=resnet50 \
--data_dir=/scratch/tf_imagenet/train \
--train_dir=/scratch/P100-Exps/Weak/ckpts/p1 \
--data_name=imagenet --print_training_accuracy=true \
--weight_decay=1e-4 --optimizer=momentum --use_fp16=true \
--nodistortions \
--batch_size=128 --num_epochs=90 \
--variable_update=horovod --horovod_device=gpu