Cross EC2/Hyperpod cluster

In this guide we'll show you how to launch g6e instances in EC2 and connect to a HyperPod cluster via SSH. These instances live in the same AZ and mount the same filesystems as the cluster. We assume you already have a HyperPod cluster that you created by following the workshop content.

  1. Launch the g6e instances in the same VPC as your filesystem, in the Public Subnet created by the initial CloudFormation template, so you can SSH directly into each host using its public IP or hostname.

Make sure you're in your Local Environment:

exit # exit the cluster to local environment
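The launch described in step 1 could be sketched with the AWS CLI roughly as follows. Every ID, the key name, and the g6e.xlarge size are hypothetical placeholders, not values from this guide. The function only prints the command so you can review it before running it yourself.

```shell
# Hypothetical placeholders: replace with values from your own account.
AMI_ID="ami-xxxxxxxx"
SUBNET_ID="subnet-xxxxxxxx"   # the Public Subnet from the CloudFormation stack
SG_ID="sg-xxxxxxxx"           # security group that allows SSH
KEY_NAME="my-key"             # existing EC2 key pair
INSTANCE_TYPE="g6e.xlarge"    # assumed size; any g6e size follows the same pattern

# Print (do not run) an `aws ec2 run-instances` command for one g6e instance.
launch_cmd() {
  echo aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type "$INSTANCE_TYPE" \
    --subnet-id "$SUBNET_ID" \
    --security-group-ids "$SG_ID" \
    --key-name "$KEY_NAME" \
    --associate-public-ip-address \
    --count 1
}

launch_cmd
```

Placing the instance in the same subnet and VPC as the filesystem is what lets it mount the shared filesystems; the public IP flag is what makes direct SSH possible.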

Set up the Mountpoint CSI driver.

First, set up Mountpoint by following the instructions in the docs.

Steps to set up NVMe with Mountpoint:

Next we'll tell Mountpoint for Amazon S3 to cache on the 28 TB of local NVMe storage available on each P5 instance.

  1. Mount the NVMe disks as a single mount. This needs to be done on each P5 instance:
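One common way to present several NVMe disks as a single mount is a RAID 0 array built with mdadm. The sketch below is not necessarily the approach the original steps use. The device names, the /dev/md0 array name, the ext4 filesystem, and the /mnt/nvme mount point are all assumptions, and the function only prints the commands rather than running them.

```shell
# Print a RAID 0 plan for the given devices (assumed approach, dry run only).
# Usage: raid0_plan <mount_point> <device> [<device> ...]
raid0_plan() {
  local mount_point="$1"
  shift
  local count="$#"
  # Stripe all devices into one array, then format and mount it.
  echo "sudo mdadm --create /dev/md0 --level=0 --raid-devices=$count $*"
  echo "sudo mkfs.ext4 /dev/md0"
  echo "sudo mkdir -p $mount_point"
  echo "sudo mount /dev/md0 $mount_point"
}

# Example with two hypothetical devices; list your real ones with `lsblk`.
raid0_plan /mnt/nvme /dev/nvme1n1 /dev/nvme2n1
```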
@sean-smith
sean-smith / bad-gpu-pc.md
Created May 2, 2024 16:23
Diagnose GPU Failures

Diagnose GPU Failures on ParallelCluster

To diagnose a node with a bad GPU (here ip-10-1-69-242) on ParallelCluster, do the following:

  1. Run the nvidia-smi reset command, where 0 is the device index (as shown by nvidia-smi) of the GPU you want to reset:
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
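If more than one GPU on the node needs a reset, the same command can be generated per device index. This is a minimal sketch; the function name is hypothetical, and it prints the commands instead of executing them so they can be reviewed first.

```shell
# Print one reset command per GPU index for a given node (dry run only).
# Usage: reset_cmds <node> <index> [<index> ...]
reset_cmds() {
  local node="$1"
  shift
  local idx
  for idx in "$@"; do
    echo "srun -w $node sudo nvidia-smi --gpu-reset -i $idx"
  done
}

# Example: reset commands for GPUs 0 and 3 on the node from the step above.
reset_cmds ip-10-1-69-242 0 3
```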
sean-smith / resize-ebs.md
Created April 25, 2024 19:42
Resize EBS Volume

Run out of EBS space on an EC2 instance?

  1. Make sure the instance's IAM role has the arn:aws:iam::aws:policy/AmazonEC2FullAccess policy attached.

  2. Create a script called resize.sh with the following contents:

#!/bin/bash

# Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB.
SIZE=${1:-20}
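The `${1:-20}` expansion above falls back to 20 when the first argument is missing or empty. A small sketch of the same default with input validation added is shown below; the `parse_size` helper is hypothetical and not part of the original script.

```shell
# Return the requested size in GiB, defaulting to 20, and reject bad input.
parse_size() {
  local size="${1:-20}"
  # Accept only positive integers such as 20 or 100.
  if ! [[ "$size" =~ ^[1-9][0-9]*$ ]]; then
    echo "SIZE must be a positive integer (GiB), got: $size" >&2
    return 1
  fi
  echo "$size"
}

parse_size        # no argument: falls back to the default
parse_size 100    # explicit size
```

Validating early avoids passing a malformed size to later commands in the script.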
sean-smith / install_nccl.md
Last active April 22, 2024 23:20
Install NCCL

Install NCCL on a Cluster

To install NCCL on the cluster, we need to install it into the /opt/nccl directory on every node. To do this, we'll create a script and then run it on all nodes using the srun command.

  1. Create a script ./install-nccl.sh and make it executable with chmod +x install-nccl.sh:
#!/bin/bash

# install nccl
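Running the script once per node with srun could look like the sketch below. The helper name and the node count are assumptions; `-N` sets the number of nodes and `--ntasks-per-node=1` keeps it to one run per node. The function checks that the script is executable and then prints the command rather than running it.

```shell
# Print the srun command that runs a script once on each of <nodes> nodes.
# Usage: run_on_nodes_cmd <nodes> <script>
run_on_nodes_cmd() {
  local nodes="$1" script="$2"
  # Fail early if the chmod +x step was skipped or the path is wrong.
  if [ ! -x "$script" ]; then
    echo "not executable: $script" >&2
    return 1
  fi
  echo "srun -N $nodes --ntasks-per-node=1 $script"
}
```

For example, `run_on_nodes_cmd 4 ./install-nccl.sh` prints the command for a four-node cluster once the script exists and is executable.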

Install AWS OFI NCCL

  1. Change into the shared directory
cd /fsx
  1. Create a script install-nccl-aws-ofi.sh to install AWS OFI NCCL:
sean-smith / torch_distributed.py
Created March 4, 2024 21:03
This is a fork of Meta's torch_distributed.py that works on SageMaker HyperPod
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#
import os
import sys
#!/bin/bash
# run as root, then validate with:
# chronyc sources -v
# chronyc tracking
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html#configure-time-sync
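# install chrony
apt install -y chrony
# add the Amazon Time Sync Service (169.254.169.123) and the time.aws.com pool after the stock pool comment
sed -i '/\# See http:\/\/www.pool.ntp.org\/join.html for more information./a server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4\npool time.aws.com iburst' /etc/chrony/chrony.conf
# enable at boot, then restart so the new sources take effect
systemctl enable --now chrony
/etc/init.d/chrony restart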
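Running the sed command above a second time adds the two source lines again. A minimal idempotent variant is sketched below; note that it appends the lines to the end of the file instead of inserting them after the pool comment, and the `add_time_sources` helper name is hypothetical.

```shell
# Append the Amazon Time Sync sources to a chrony config only if missing.
# Usage: add_time_sources <path-to-chrony.conf>
add_time_sources() {
  local conf="$1"
  if ! grep -q '^server 169\.254\.169\.123' "$conf"; then
    printf '%s\n' \
      'server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4' \
      'pool time.aws.com iburst' >> "$conf"
  fi
}
```

To try it safely, point it at a copy of the config (for example a file in /tmp) before using it on /etc/chrony/chrony.conf as root.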

Activate virtualenvs with python

  1. Install virtualenvwrapper, my favorite way of creating virtualenvs:
sudo apt-get install virtualenvwrapper
  1. Install it on the compute nodes as well, where 4 is the number of compute nodes:
sean-smith / python-3-10.md
Last active January 30, 2024 17:50
Python 3.10 on Hyperpods

Ubuntu 20.04

  1. Create a script install-python.sh with the following content:
#!/bin/bash

sudo apt update
sudo apt upgrade -y
sudo apt install software-properties-common -y