@visionscaper
Last active December 18, 2023 11:09
Setting up a TPU and Ubuntu VM instance for use with PyTorch on Google Cloud

Note: Using Google Cloud is not free of charge!

Contents

Background

Running PyTorch on Google Cloud TPUs

Prerequisite

Creating the VM instance

Creating the TPU

Setting required environment variables

Adding a firewall rule to be able to remotely access Jupyter notebooks and Tensorboard

[Optional] Creating and mounting a persistent disk

[Optional] Copying local data to your mounted persistent disk

[Optional] Setting up an SSH key for Github access on your VM

Testing the TPU on the VM instance

Using Jupyter notebooks

Running and accessing Tensorboard

Background

I wanted to experiment with training/finetuning large language models, such as GPT-2, on TPUs (e.g. TPU v3-8) using PyTorch/XLA. Further, I wanted to use a VM with Ubuntu installed, since Debian is less user friendly (to me at least!). Unfortunately, the information needed to set up the VM and TPU the way I wanted was scattered across different places on the web; no single tutorial showed all the steps involved in detail. This gist is my attempt to bring all the details together.

If you have any ideas for improvements, please let me know in the comments!

Running PyTorch on Google Cloud TPUs

This gist shows how to set up a Google Cloud TPU and an Ubuntu VM instance, using the gcloud CLI, for use with PyTorch. The goal is to be able to run PyTorch code that leverages the TPU, both directly from a terminal and through Jupyter notebooks in the browser. Running PyTorch code on TPUs requires PyTorch/XLA.

I also wanted to be able to access a Tensorboard server running on the VM instance, to follow the progress of training experiments remotely.

Prerequisite

In order to follow the steps below you need to set up the gcloud CLI on your local machine.

Be sure you have set the Google Cloud project. You can set it as follows:

gcloud config set project <PROJECT_ID>

If you haven't done so already, you can also set the default region and zone:

gcloud config set compute/region <REGION>
gcloud config set compute/zone <ZONE>

Creating the VM instance

First we need to create a VM instance to which the TPU will be connected. To do so, you need to make a few choices:

  • Select a zone that fits your needs and where your TPU type of interest is available. This is important because all resources, e.g. the VM, the TPU and disks, should be available in the same zone to minimize networking latency. In this gist we choose zone europe-west4-a.

  • Select a virtual machine type. N1 instance types support TPUs, see here for an overview. Assuming we want to attach a TPUv2 or v3 with 8 TPU cores, a n1-standard-16 with 16 CPU cores would suffice.

  • Select a VM instance image. We need to select an image from the deeplearning-platform-release image project. The following command gives an overview of the available Ubuntu deep learning images for use with PyTorch/XLA (also see here):

    gcloud compute images list \
          --project deeplearning-platform-release \
          --no-standard-images | grep -i ubuntu | grep -i xla
    
    NAME                                                           PROJECT                        FAMILY                                            DEPRECATED  STATUS
            
    pytorch-1-6-xla-notebooks-v20201105-ubuntu-1804                deeplearning-platform-release  pytorch-1-6-xla-notebooks-ubuntu-1804                         READY
    pytorch-1-6-xla-v20201105-ubuntu-1804                          deeplearning-platform-release  pytorch-1-6-xla-ubuntu-1804                                   READY
    pytorch-1-7-xla-notebooks-v20210329-ubuntu-1804                deeplearning-platform-release  pytorch-1-7-xla-notebooks-ubuntu-1804                         READY
    pytorch-1-7-xla-v20210329-ubuntu-1804                          deeplearning-platform-release  pytorch-1-7-xla-ubuntu-1804                                   READY
    pytorch-1-8-xla-notebooks-v20210619-ubuntu-1804                deeplearning-platform-release  pytorch-1-8-xla-notebooks-ubuntu-1804                         READY
    pytorch-1-8-xla-v20210619-ubuntu-1804                          deeplearning-platform-release  pytorch-1-8-xla-ubuntu-1804                                   READY
    pytorch-1-9-xla-notebooks-v20210617-ubuntu-1804                deeplearning-platform-release  pytorch-1-9-xla-notebooks-ubuntu-1804                         READY
    pytorch-1-9-xla-v20210617-ubuntu-1804                          deeplearning-platform-release  pytorch-1-9-xla-ubuntu-1804                                   READY
    pytorch-latest-xla-v20210617-ubuntu-1804                       deeplearning-platform-release  pytorch-latest-xla-ubuntu-1804 
    

    Here we select pytorch-1-9-xla-notebooks-v20210617-ubuntu-1804 because it already has a Jupyter notebook server preinstalled.

  • Select a boot disk size. Different tutorials suggest different boot disk sizes. Here we choose 200GB, which should be more than enough to install the Python and apt packages you need to run your code. Later on we will set up a network drive, connected to the VM, where we can store our code, model checkpoints and (Tensorboard) logs.

Now, let's create the VM instance:

  gcloud compute instances create <INSTANCE_NAME> \
    --zone=europe-west4-a  \
    --machine-type=n1-standard-16  \
    --image-family=pytorch-1-9-xla-notebooks-ubuntu-1804  \
    --image-project=deeplearning-platform-release  \
    --boot-disk-size=200GB \
    --scopes=https://www.googleapis.com/auth/cloud-platform

Creating the TPU

The TPU is created from the VM instance. To do so, SSH into the VM:

gcloud compute ssh <INSTANCE_NAME>

Now, let's create the TPU. Since we created the VM instance with a PyTorch 1.9 image, we choose --version=pytorch-1.9. In this case we create a v3-8 TPU, but you could, for instance, also create a v2-8 TPU. We give the TPU the same name as the VM instance, although I don't think this is required. Don't forget to use the same zone as you used for the VM instance.

(vm)$ gcloud compute tpus create <INSTANCE_NAME> \
    --zone=europe-west4-a \
    --network=default \
    --version=pytorch-1.9 \
    --accelerator-type=v3-8

You can check if your TPU is available as follows:

(vm)$ gcloud compute tpus list --zone=europe-west4-a	

Setting required environment variables

To use PyTorch/XLA, the XRT_TPU_CONFIG environment variable needs to be set. To do this only once, you can add it to the ~/.profile file on your VM.

XRT_TPU_CONFIG must have the following value:

(vm)$ export XRT_TPU_CONFIG="tpu_worker;0;<LOCAL_IP_ADDRESS>:8470"

You can find the LOCAL_IP_ADDRESS of your TPU as follows:

(vm)$ gcloud compute tpus describe <INSTANCE_NAME> --zone europe-west4-a

Among other things, it will log the following:

...
networkEndpoints:
- ipAddress: <LOCAL_IP_ADDRESS>
  port: 8470
...

Use your TPU's local IP address for the XRT_TPU_CONFIG environment variable as indicated above.

To add the export to your ~/.profile:

nano ~/.profile

At the end of the file add:

export XRT_TPU_CONFIG="tpu_worker;0;<LOCAL_IP_ADDRESS>:8470"

Please be sure to start a new SSH session so that the export defined in ~/.profile takes effect.
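As an aside, the expected shape of the XRT_TPU_CONFIG value can be illustrated with a small Python sketch. The parser function and the sample address below are hypothetical, purely for illustration:

```python
import re

def parse_xrt_tpu_config(value):
    """Split an XRT_TPU_CONFIG value of the form 'tpu_worker;0;<ip>:8470'
    into its worker name, ordinal, host and port components."""
    worker, ordinal, endpoint = value.split(";")
    host, port = endpoint.rsplit(":", 1)
    if not re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host):
        raise ValueError(f"not an IPv4 address: {host}")
    return worker, int(ordinal), host, int(port)

# Sample value with a made-up TPU-internal address
print(parse_xrt_tpu_config("tpu_worker;0;10.240.1.2:8470"))
# -> ('tpu_worker', 0, '10.240.1.2', 8470)
```

If the parser raises, the value does not have the worker;ordinal;ip:port shape described above.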

Adding a firewall rule to be able to remotely access Jupyter notebooks and Tensorboard

To be able to access Jupyter notebooks, and the Tensorboard server on the VM instance, we need to allow incoming requests on the ports these applications listen on. To do so we need to add a firewall rule.

gcloud compute firewall-rules create allow-jupyter-tensorboard \
    --action allow \
    --target-tags deeplearning-vm \
    --source-ranges <YOUR_PUBLIC_IP>/32 \
    --rules tcp:6006,tcp:8888

You can omit --source-ranges to allow incoming requests from any IP address; however, this is insecure and advised against.
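As a sanity check on the --source-ranges value: the /32 suffix denotes a CIDR block containing exactly one address, i.e. only your own machine may connect. Python's standard ipaddress module can confirm this (the addresses below are placeholders from the documentation range):

```python
import ipaddress

# /32 means the network contains exactly one host address
single = ipaddress.ip_network("203.0.113.5/32")
print(single.num_addresses)  # -> 1

# by contrast, /24 would admit 256 addresses
subnet = ipaddress.ip_network("203.0.113.0/24")
print(subnet.num_addresses)  # -> 256
```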

Further, it is assumed here that the VM instance has the deeplearning-vm network tag; this tag associates the firewall rule with the VM instance. You can check which network tags are attached to the VM instance as follows:

gcloud compute instances describe <INSTANCE_NAME> --zone=europe-west4-a
...
tags:
  ...
  items:
  ...
  - deeplearning-vm
... 

If the tag you want to use is missing, you can add it as follows:

gcloud compute instances add-tags <INSTANCE_NAME> \
    --zone europe-west4-a \
    --tags <TAG_TO_ADD>

[Optional] Creating and mounting a persistent disk

Creating a separate network disk for your code, model/training checkpoints, Tensorboard logs, etc. is convenient because it allows us to attach the disk to another instance later on; our storage becomes independent of the specific compute we set up.

To create a disk with 500 GB of storage, from your local command line, execute the following command. See here for different disk types you can choose.

gcloud compute disks create <DISK_NAME> \
  --zone europe-west4-a \
  --size 500 \
  --type pd-balanced

The disk can now be attached to your VM instance as follows:

gcloud compute instances attach-disk <INSTANCE_NAME> \
  --zone europe-west4-a \
  --disk <DISK_NAME>

To now format and mount the disk you first need to be logged in to the VM again:

gcloud compute ssh <INSTANCE_NAME>

Here, I would like to refer to the Google Cloud documentation on formatting and mounting a persistent disk, which is quite straightforward in my opinion.

Don't forget to configure automatic mounting of your disk on VM start!

I usually create a symbolic link from my user's home directory, on the VM instance, to the disk mount point. For instance, if I had configured the disk to be mounted at /mnt/disks/persistent:

ln -s /mnt/disks/persistent workspace

To be sure my VM instance user (e.g. freddy) is allowed to read and write on the disk, I change the owner from root to freddy:

sudo chown freddy:freddy -R workspace/

[Optional] Copying local data to your mounted persistent disk

There are many ways of getting access to the data you need for your training purposes. Often I have data locally that I need to copy to the persistent disk, so I can use it on my VM.

To do so we can use the gcloud version of scp. Here we assume you have set up a symbolic link workspace, as in the previous section. Note the --recurse option, since your data is assumed to be stored in a set of nested directories. Execute from your local machine:

gcloud compute scp --recurse <LOCAL_PATH_TO_DATA> <INSTANCE_NAME>:~/workspace/<REMOTE_PATH_TO_DATA>

[Optional] Setting up an SSH key for Github access on your VM

Usually you will have your code on GitHub (or some other git hosting service). In order to clone and interact with your repo on the VM, you need to provide GitHub with an SSH key.

To create an SSH key for access from the VM, first SSH into the VM (see above) and follow GitHub's instructions on how to create a new SSH key and how to add it to your GitHub account.

It basically amounts to creating the SSH key pair as follows:

ssh-keygen -t ed25519 -C "[email protected]"

... and subsequently printing the public key and copying it to your clipboard:

cat ~/.ssh/id_ed25519.pub

The copied public key should now be added as a new SSH key to your GitHub account.

Testing the TPU on the VM instance

Log in to your VM instance:

gcloud compute ssh <INSTANCE_NAME>

Check if the XRT_TPU_CONFIG environment variable is set:

echo $XRT_TPU_CONFIG

If it is not set, please set this environment variable first, see above.

Now start python and do a simple calculation on your TPU!

(vm)$ python
Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> import torch_xla.core.xla_model as xm
>>> dev = xm.xla_device()
>>> t1 = torch.rand(3, 3, device=dev)
>>> print(t1)
tensor([[0.5644, 0.1594, 0.2903],
        [0.0312, 0.8482, 0.8137],
        [0.5985, 0.6595, 0.6019]], device='xla:1')
>>> t2 = torch.ones(3, 3, device=dev)
>>> t1+t2
tensor([[1.5644, 1.1594, 1.2903],
        [1.0312, 1.8482, 1.8137],
        [1.5985, 1.6595, 1.6019]], device='xla:1')

Using Jupyter notebooks

In order to open Jupyter notebooks you first need to know the public IP address of your VM instance (be sure your VM is running when executing this command):

gcloud compute instances describe <INSTANCE_NAME> --zone=europe-west4-a
...
networkInterfaces:
- accessConfigs:
  - kind: compute#accessConfig
    name: external-nat
    natIP: <VM_PUBLIC_IP_ADDRESS>
    networkTier: PREMIUM
    type: ONE_TO_ONE_NAT
...

Log in to your VM instance:

gcloud compute ssh <INSTANCE_NAME>

Change directory to the root of your repository, or wherever you keep your notebooks, and start the Jupyter notebook server:

(vm)$ jupyter notebook --ip 0.0.0.0 --no-browser --port=8888
...
To access the notebook, open this file in a browser:
    ...
    or http://127.0.0.1:8888/?token=<ACCESS_TOKEN>

Locally, in your browser, you should now be able to visit the Jupyter notebook server at

http://<VM_PUBLIC_IP_ADDRESS>:8888/?token=<ACCESS_TOKEN>
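Since Jupyter prints the URL with the loopback address, you effectively just swap in the VM's public IP while keeping the port and access token. A small hypothetical helper (the function name and sample values are made up for illustration) sketches that substitution:

```python
from urllib.parse import urlsplit, urlunsplit

def remote_notebook_url(local_url, public_ip):
    """Replace the loopback host in the URL printed by Jupyter with the
    VM's public IP, keeping the port and access token intact."""
    parts = urlsplit(local_url)
    netloc = f"{public_ip}:{parts.port}" if parts.port else public_ip
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

# Sample values, not a real token or address
print(remote_notebook_url("http://127.0.0.1:8888/?token=abc123", "34.90.12.34"))
# -> http://34.90.12.34:8888/?token=abc123
```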

Running and accessing Tensorboard

This assumes that you have installed Tensorboard on your VM instance, e.g. using pip install tensorboard.

Log in to your VM instance:

gcloud compute ssh <INSTANCE_NAME>

Start Tensorboard, pointing it at the directory that contains your Tensorboard logs:

(vm)$ nohup tensorboard --logdir <PATH_TO_LOGS> --port 6006 --bind_all &

To access the tensorboard go to

http://<VM_PUBLIC_IP_ADDRESS>:6006/

See the previous section for how you can find VM_PUBLIC_IP_ADDRESS.

License

This gist is subject to a GNU General Public License v3.0.
