This TPU VM cheatsheet was tested with the following library versions:
| Library      | Version |
|--------------|---------|
| JAX          | 0.3.25  |
| FLAX         | 0.6.4   |
| Datasets     | 2.10.1  |
| Transformers | 4.27.1  |
| Chex         | 0.1.6   |
Please note that it could work with later versions - but it's not guaranteed ;)
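As a quick sanity check, the pinned versions above can be compared against what is actually installed. A minimal sketch using only the standard library; the PyPI package names are assumed to match the table:

```python
# Sketch: compare installed library versions against the cheatsheet's pins.
# The pinned versions come from the table above; missing packages show up as None.
from importlib import metadata

PINNED = {
    "jax": "0.3.25",
    "flax": "0.6.4",
    "datasets": "2.10.1",
    "transformers": "4.27.1",
    "chex": "0.1.6",
}

def check_versions(pinned=PINNED):
    """Return a dict mapping package -> (installed, expected) for every mismatch."""
    mismatches = {}
    for pkg, expected in pinned.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[pkg] = (installed, expected)
    return mismatches
```

An empty result from `check_versions()` means the environment matches the table exactly.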
First, create a persistent disk that will later hold the datasets:

```bash
gcloud compute disks create lms --zone us-central1-a --size 1024GB
```
Make sure that your disk is in the same zone as your TPU VM!
The following command creates a v3-8 TPU VM and attaches the previously created disk to it:
```bash
gcloud alpha compute tpus tpu-vm create lms --zone us-central1-a --accelerator-type v3-8 \
  --version tpu-vm-base --data-disk source=projects/<project-name>/zones/us-central1-a/disks/lms
```
Just run the following command to SSH into the TPU VM:
```bash
gcloud alpha compute tpus tpu-vm ssh lms --zone us-central1-a
```
After SSH'ing into the TPU VM, run the following commands, e.g. in a tmux session:
```bash
sudo apt update -y && sudo apt install -y python3-venv
python3 -m venv $HOME/dev
source $HOME/dev/bin/activate
pip install "jax[tpu]==0.3.25" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pip install ipython requests
git clone https://github.com/huggingface/transformers.git
git clone https://github.com/huggingface/datasets.git
git clone https://github.com/google/flax.git
cd transformers && git checkout v4.27.1 && pip install -e . && cd ..
cd datasets && git checkout 2.10.1 && pip install -e . && cd ..
cd flax && git checkout v0.6.4 && pip install -e . && cd ..
pip install chex==0.1.6
```
```bash
# Useful symlinks
ln -s $HOME/transformers/examples/flax/language-modeling/run_bart_dlm_flax.py run_bart_dlm_flax.py
ln -s $HOME/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py
ln -s $HOME/transformers/examples/flax/language-modeling/run_mlm_flax.py run_mlm_flax.py
ln -s $HOME/transformers/examples/flax/language-modeling/run_t5_mlm_flax.py run_t5_mlm_flax.py
```
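Equivalently, the symlinks can be created from Python with `pathlib`. A sketch; it assumes the `transformers` checkout lives under `$HOME`, as in the clone commands above:

```python
# Sketch: create the same example-script symlinks with pathlib.
from pathlib import Path

# Scripts shipped in transformers' Flax language-modeling examples (listed above).
SCRIPTS = ("run_bart_dlm_flax.py", "run_clm_flax.py",
           "run_mlm_flax.py", "run_t5_mlm_flax.py")

def link_examples(repo=None, dest=None):
    """Symlink the Flax LM example scripts from `repo` into `dest` (default: $HOME)."""
    repo = Path(repo) if repo else Path.home() / "transformers"
    dest = Path(dest) if dest else Path.home()
    src_dir = repo / "examples" / "flax" / "language-modeling"
    links = []
    for name in SCRIPTS:
        link = dest / name
        if not link.exists():  # skip links that already exist
            link.symlink_to(src_dir / name)
        links.append(link)
    return links
```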
The attached disk needs to be formatted first using:
```bash
sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb
```
After that it can be mounted via:
```bash
sudo mkdir -p /mnt/datasets
sudo mount -o discard,defaults /dev/sdb /mnt/datasets/
sudo chmod a+w /mnt/datasets
```
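The mount can also be verified from Python by parsing `/proc/mounts`. A small sketch (Linux-only; `/mnt/datasets` is the mount point chosen above):

```python
# Sketch: check whether a path is an active mount point via /proc/mounts (Linux).
def is_mounted(target, mounts_path="/proc/mounts"):
    """Return True if `target` appears as a mount point in the mounts file."""
    with open(mounts_path) as f:
        # Each line: <device> <mount point> <fstype> <options> <dump> <pass>
        return any(line.split()[1] == target for line in f if line.strip())
```

On the TPU VM, `is_mounted("/mnt/datasets")` should return `True` after the mount command succeeds.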
The `HF_DATASETS_CACHE` environment variable should now point to the mounted disk:
```bash
export HF_DATASETS_CACHE=/mnt/datasets/huggingface
```
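The same variable can be set from Python, as long as it happens before `datasets` is imported (the library reads the cache location at import time). A minimal sketch, with the path assumed as above:

```python
# Sketch: point the Hugging Face datasets cache at the mounted disk from Python.
import os

def set_datasets_cache(path="/mnt/datasets/huggingface"):
    """Set HF_DATASETS_CACHE to `path`.

    Must run before `import datasets`, which reads this variable on import.
    """
    os.environ["HF_DATASETS_CACHE"] = path
    return os.environ["HF_DATASETS_CACHE"]
```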
The following commands create and activate a swapfile:
```bash
cd /mnt/datasets
sudo fallocate -l 50G ./swapfile
sudo chmod 600 ./swapfile
sudo mkswap ./swapfile
sudo swapon ./swapfile
```
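Whether the swapfile is active can be confirmed by reading `SwapTotal` from `/proc/meminfo`. A sketch (Linux-only):

```python
# Sketch: read total swap (in KiB) from /proc/meminfo to confirm swapon worked.
def swap_total_kib(meminfo_path="/proc/meminfo"):
    """Return SwapTotal in KiB, or None if the field is missing."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("SwapTotal:"):
                return int(line.split()[1])  # /proc/meminfo reports the value in kB
    return None
```

With the 50G swapfile above enabled (and no other swap configured), `swap_total_kib()` should report roughly 52428800 KiB.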
Install TensorBoard to get better training metric visualizations:
```bash
pip install tensorboard tensorflow
```
Note: installing TensorFlow avoids the following warning:

```
[21:19:25] - WARNING - __main__ - Unable to display metrics through TensorBoard because some package are not installed: No module named 'tensorflow'
```