To diagnose a node with a bad GPU (here, `ip-10-1-69-242`) on ParallelCluster, do the following:
- Run the NVIDIA GPU reset command, where `0` is the device index reported by `nvidia-smi` for the GPU you want to reset:
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
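Before (or after) resetting, it can help to confirm the GPU is actually unhealthy. A minimal check, assuming the kernel log is readable with `sudo` on the compute node (node name is the example from above):

```bash
# Check overall GPU state on the suspect node
srun -w ip-10-1-69-242 nvidia-smi

# Look for NVIDIA Xid errors in the kernel log, a common symptom of a failing GPU
srun -w ip-10-1-69-242 sudo dmesg | grep -i xid
```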
- If that doesn't succeed, generate a bug report:
srun -w ip-10-1-69-242 nvidia-bug-report.sh
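The script typically writes a compressed log named `nvidia-bug-report.log.gz` to the working directory. On ParallelCluster, `/home` is often shared between the head and compute nodes, in which case the file is already visible from the head node; otherwise, a sketch for copying it off before the node goes away (the home-directory path is an assumption for illustration):

```bash
# Copy the bug report off the node before it is terminated
scp ip-10-1-69-242:~/nvidia-bug-report.log.gz ./
```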
- Grab the instance ID:
srun -w ip-10-1-69-242 cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " "
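If you're scripting this, a small sketch that captures the instance ID into a shell variable for the terminate step below (same node name as above):

```bash
# Capture the instance ID so it can be reused in the terminate step
INSTANCE_ID=$(srun -w ip-10-1-69-242 cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")
echo "$INSTANCE_ID"
```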
- Grab the output of `nvidia-bug-report.sh`, then terminate the instance, replacing `<instance-id>` with the instance ID from above:
aws ec2 terminate-instances \
    --instance-ids <instance-id>
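If you captured the ID into `INSTANCE_ID` as sketched above, the same call can be run non-interactively (assumes the AWS CLI is configured with credentials permitted to terminate instances in this account and region):

```bash
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```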
- ParallelCluster will launch a replacement instance, and you'll see the new instance come up in the EC2 console.
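To watch for the replacement from the CLI instead of the console, a sketch using `aws ec2 describe-instances` (the tag key follows the ParallelCluster 3 tagging convention, and `<cluster-name>` is a placeholder for your cluster's name; both are assumptions to adjust for your setup):

```bash
# List pending/running instances belonging to the cluster,
# newest launches included, to spot the replacement node
aws ec2 describe-instances \
    --filters "Name=tag:parallelcluster:cluster-name,Values=<cluster-name>" \
              "Name=instance-state-name,Values=pending,running" \
    --query "Reservations[].Instances[].[InstanceId,State.Name,LaunchTime]" \
    --output table
```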