Cross EC2/Hyperpod cluster

In this guide we'll show you how to launch g6e instances in EC2 and connect them to a HyperPod cluster via SSH. The instances live in the same Availability Zone and mount the same filesystem as the cluster. We assume you already have a HyperPod cluster created by following the workshop content.

  1. We need to launch our g6e instance in the same VPC as your filesystem and in the Public Subnet created by the initial CloudFormation template, so you can SSH directly into the host using its public IP/hostname (see the env_vars sketch below for where the subnet and security group IDs come from).

Make sure you're in your Local Environment:

exit # exit the cluster to local environment
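The launch command in the next step sources env_vars, which is expected to export SECURITY_GROUP and PUBLIC_SUBNET_ID. If you haven't written that file yet, here's a minimal sketch that pulls the values from the workshop's CloudFormation stack outputs; the stack name sagemaker-hyperpod and the placeholder IDs are assumptions, so substitute your own:

# list the stack outputs to find the security group and public subnet IDs
# (assumption: your workshop stack is named sagemaker-hyperpod)
aws cloudformation describe-stacks --stack-name sagemaker-hyperpod \
    --query 'Stacks[0].Outputs' --output table

# write env_vars with the values you find
cat <<EOF > env_vars
export SECURITY_GROUP=sg-xxxxxxxxxxxx       # replace with your security group id
export PUBLIC_SUBNET_ID=subnet-xxxxxxxxxxxx # replace with your public subnet id
EOF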
  2. Next we spin up a new g6e instance running the Ubuntu 20.04 DLAMI in the Public Subnet we created in 0. Prerequisites:
# generate an SSH keypair if one doesn't exist yet
if [ ! -f "$HOME/.ssh/id_rsa.pub" ]; then
    ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
fi
# import the public key into EC2 as "ssh_key"
aws ec2 import-key-pair --key-name ssh_key --public-key-material fileb://$HOME/.ssh/id_rsa.pub

source env_vars
ubuntu_ami=$(aws ssm get-parameter --name /aws/service/deeplearning/ami/x86_64/base-oss-nvidia-driver-gpu-ubuntu-20.04/latest/ami-id --region us-east-1 --query "Parameter.Value" --output text | tr -d '"')

# launch the instance
aws ec2 run-instances \
    --image-id ${ubuntu_ami} \
    --instance-type g6e.48xlarge \
    --key-name ssh_key \
    --security-group-ids ${SECURITY_GROUP} \
    --subnet-id ${PUBLIC_SUBNET_ID} \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=bastion}]'
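run-instances returns as soon as the request is accepted, so before grabbing the public IP in the next step you can block until the instance is actually running; this uses the standard aws ec2 wait waiter with the same Name tag we just applied:

# wait until the bastion instance reaches the running state
aws ec2 wait instance-running \
    --filters 'Name=tag:Name,Values=bastion'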
  3. Next, add the host's info to your ~/.ssh/config:
# grab the public ip from the following command:
public_ip=$(aws ec2 describe-instances --filters 'Name=tag:Name,Values=bastion' --query 'Reservations[*].Instances[*].PublicIpAddress' --output text)

# add the hostname to the ~/.ssh/config
cat <<EOF >> ~/.ssh/config
Host bastion
  User ubuntu
  Hostname ${public_ip}
EOF

Next, modify your security group to allow inbound SSH traffic:

aws ec2 authorize-security-group-ingress \
    --group-id ${SECURITY_GROUP} \
    --protocol tcp \
    --port 22 \
    --cidr 0.0.0.0/0
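Opening port 22 to 0.0.0.0/0 exposes SSH to the whole internet. A tighter alternative, sketched here assuming your machine has a stable public IP, restricts ingress to just that address (checkip.amazonaws.com is AWS's IP echo endpoint):

# allow SSH only from your current public IP
my_ip=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
    --group-id ${SECURITY_GROUP} \
    --protocol tcp \
    --port 22 \
    --cidr ${my_ip}/32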

Confirm that you can ssh in:

ssh bastion
# then exit back to local environment
exit
  4. Next we're going to mount the /fsx filesystem from the cluster. To do this, SSH back into the bastion (ssh bastion) and install the Lustre client:
wget -O - https://fsx-lustre-client-repo-public-keys.s3.amazonaws.com/fsx-ubuntu-public-key.asc | gpg --dearmor | sudo tee /usr/share/keyrings/fsx-ubuntu-public-key.gpg >/dev/null
sudo bash -c 'echo "deb [signed-by=/usr/share/keyrings/fsx-ubuntu-public-key.gpg] https://fsx-lustre-client-repo.s3.amazonaws.com/ubuntu focal main" > /etc/apt/sources.list.d/fsxlustreclientrepo.list && apt-get update'
sudo apt install -y lustre-client-modules-$(uname -r)
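Before mounting, you can sanity-check that the client module installed correctly; these are standard kernel tooling commands, not part of the workshop content:

# load the lustre kernel module and confirm it registered
sudo modprobe lustre
lsmod | grep lustre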
  5. Next, navigate to the FSx Console and click on your FSx filesystem. Click Attach to get the mount commands, which will look similar to the following:
sudo mkdir /fsx
sudo mount -t lustre -o relatime,flock fs-007d09da6ab2684eg.fsx.us-west-2.amazonaws.com@tcp:/4gb3nbem /fsx
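If you'd rather stay on the command line than use the console, the same DNS name and mount name are available from the FSx API; this sketch assumes you have a single filesystem in the region (add a --file-system-ids filter otherwise):

# look up the filesystem DNS name and Lustre mount name, then mount
read dns_name mount_name <<< $(aws fsx describe-file-systems \
    --query 'FileSystems[0].[DNSName,LustreConfiguration.MountName]' \
    --output text)
sudo mkdir -p /fsx
sudo mount -t lustre -o relatime,flock ${dns_name}@tcp:/${mount_name} /fsx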
  6. Next we need to re-map the ubuntu user's home directory to /fsx/ubuntu, which gives the bastion the same SSH keys the cluster uses:
# on the bastion: temporarily allow SSH to the root user
sudo cp /home/ubuntu/.ssh/authorized_keys /root/.ssh/authorized_keys
exit # go back to local machine

# ssh in as root, since usermod can't run while ubuntu is logged in
ssh root@bastion
# point ubuntu's home at the fsx directory, which holds the ssh key for the cluster
usermod -d /fsx/ubuntu ubuntu
# remove the temporary root access
rm /root/.ssh/authorized_keys
exit
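To confirm the remap took effect, check the passwd entry from your local machine; getent is standard on Ubuntu:

# the ubuntu user's home directory should now point at /fsx/ubuntu
ssh bastion 'getent passwd ubuntu'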
  7. Next, add the cluster's private IP address to your local SSH config. To do this, we'll connect to the cluster, grab the IP address, then exit and add it to our local ~/.ssh/config.
# grab the private ip address:
Admin:~ $ ./easy-ssh.sh -c controller-machine ml-cluster
....
root@ip-10-1-100-227:/usr/bin# hostname -I
10.1.84.107 169.254.0.1
root@ip-10-1-100-227:/usr/bin# exit

# add the ip to the ~/.ssh/config where 10.1.84.107 is the ip from hostname -I
cat <<EOF >> ~/.ssh/config
Host ml-cluster
  User ubuntu
  Hostname 10.1.84.107
EOF
  8. Now we can SSH into this host, using the bastion as a jump host:
ssh -J bastion ml-cluster
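Optionally, add a ProxyJump line under the Host ml-cluster entry you created above; ProxyJump is standard OpenSSH (7.3+) and saves you typing the -J flag each time:

# in ~/.ssh/config, under Host ml-cluster, add:
#   ProxyJump bastion
# then connect with just:
ssh ml-cluster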

Voila! We just connected the g6e instance to the cluster via SSH and mounted the same filesystem on both.
