This guide is based on a more comprehensive one that can be found at https://note.com/kohya_ss/n/nbf7ce8d80f29. Everything needed for the environment has been packaged into a docker image that can easily run on different cloud providers. (Disclaimer: there seems to be a version mismatch with xformers, so we leave it disabled because it won't work.)
Note: the docker image used for this exercise is based on the Stable Diffusion 1.5 release and includes the updated VAEs. By using the model contained in this image, or any fine-tuned models derived from it, you agree to its terms of use.
- Make sure you have an account with Vast.ai and some credit.
- Navigate to https://console.vast.ai/create/ and filter by A100 GPU type (others may work, but will probably be slower)
- Click the "Edit Image Config" button to customize the image
- In the popup, select the jupyter-python notebook option and check "Use Jupyter Lab". For the image, enter:
wolfgangmeyers/sd_finetune:latest
- Click "Select and Save"
- Rent a machine with the desired number of GPUs. For this tutorial I've selected a machine with 2 PCIe A100 cards.
- Navigate to https://console.vast.ai/instances/ and you should see your instance starting up. This will take a few minutes, as the docker image is about 17GB and has to be downloaded onto the instance before you can connect. (It also builds a new custom image each time.)
- Once the instance has finished loading the image, you should see an "Open" button appear on the right.
- Click on the Open button to launch the Jupyter Lab interface in another browser tab. It should look something like this:
- Before fine-tuning you will need to add training data. For this tutorial, I used the dataset that was used to train Pokemon Diffusers. I converted the dataset to the expected format, where each image file (.jpg) is paired with a caption file (.caption, a plain text file). You can find a zip file of the dataset here.
- This zip file is pretty big, so it's easiest to download it directly to the instance. Open up a terminal from the launcher and run:
cd /
wget https://aibrush-private-artifacts.s3.us-west-2.amazonaws.com/pokemon.zip && unzip pokemon.zip && mv pokemon train_data
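If you are preparing your own dataset instead, the only requirement is that each .jpg has a matching .caption file next to it. The short Python sketch below is not part of the original guide; it simply checks that every image in /train_data has a caption:

```python
# Sanity check: report any images in /train_data that are missing a .caption file.
from pathlib import Path

train_dir = Path("/train_data")
images = sorted(train_dir.glob("*.jpg"))
missing = [p.name for p in images if not p.with_suffix(".caption").exists()]

print(f"{len(images)} images found, {len(missing)} missing captions")
for name in missing:
    print("  no caption for:", name)
```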
- Activate the python virtual environment
source venv_diffusers/bin/activate
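- (Optional) Before configuring accelerate, you can confirm that PyTorch inside the virtual environment can see both GPUs:
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"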
- Configure accelerate. This is where you can set accelerate up for multiple GPUs; since this machine has 2, I configured it for multi-GPU.
/$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]:
Do you want to use Megatron-LM ? [yes/NO]:
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: bf16
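- (Optional) accelerate saves these answers to a default config file. You can confirm what was saved (along with version information) with:
accelerate env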
- Merge captions into metadata file
python merge_captions_to_metadata.py train_data meta_cap.json
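Conceptually, this step just collects each image's caption text into a single JSON metadata file keyed by image. The sketch below is illustrative only (it is not the actual kohya script, and the real key format and fields may differ), but it shows roughly what ends up in meta_cap.json:

```python
# Illustrative only: roughly what merge_captions_to_metadata.py produces.
# The real script's key format and extra fields may differ.
import json
from pathlib import Path

train_dir = Path("train_data")
metadata = {}
for img in sorted(train_dir.glob("*.jpg")):
    caption_file = img.with_suffix(".caption")
    metadata[img.stem] = {"caption": caption_file.read_text().strip()}

Path("meta_cap.json").write_text(json.dumps(metadata, indent=2, ensure_ascii=False))
```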
- Convert to latents and bucket by resolution
python prepare_buckets_latents.py train_data meta_cap.json meta_lat.json model.ckpt --batch_size 4 --max_resolution 512,512 --mixed_precision no
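This step encodes the images into latents with the model's VAE and groups them into aspect-ratio buckets so that differently shaped images can still be batched together. As a rough illustration of the bucketing idea (not kohya's actual implementation; the bucket sizes and limits here are made up for the example), a bucket can be chosen by snapping each image's aspect ratio to the closest candidate resolution whose area fits within 512x512:

```python
# Illustrative aspect-ratio bucketing: pick the (w, h) bucket whose aspect
# ratio is closest to the image's, keeping the total area within 512x512.
def make_buckets(max_area=512 * 512, step=64, min_size=256, max_size=768):
    buckets = []
    for w in range(min_size, max_size + 1, step):
        for h in range(min_size, max_size + 1, step):
            if w * h <= max_area:
                buckets.append((w, h))
    return buckets

def choose_bucket(width, height, buckets):
    ratio = width / height
    return min(buckets, key=lambda wh: abs(wh[0] / wh[1] - ratio))

buckets = make_buckets()
print(choose_bucket(1280, 720, buckets))  # a wide image lands in a wide bucket
```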
- Launch the training script. This is the part that takes a while. Feel free to tweak parameters as you see fit.
accelerate launch --num_cpu_threads_per_process 8 fine_tune.py --pretrained_model_name_or_path=model.ckpt --in_json meta_lat.json --train_data_dir=train_data --output_dir=fine_tuned --shuffle_caption --train_batch_size=1 --learning_rate=5e-6 --max_train_steps=10000 --use_8bit_adam --gradient_checkpointing --mixed_precision=bf16 --save_every_n_epochs=4
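Note that the run ends at --max_train_steps, not after a fixed number of epochs, so how many passes over the data you get (and how many checkpoints --save_every_n_epochs=4 produces) depends on dataset size, per-GPU batch size, and GPU count. A back-of-the-envelope estimate, assuming each optimizer step consumes train_batch_size images per GPU (the exact accounting depends on the training script):

```python
# Rough estimate only -- replace num_images with the size of your dataset.
import math

num_images = 833          # assumption: number of images in train_data
batch_size = 1            # --train_batch_size
num_gpus = 2
max_train_steps = 10000

steps_per_epoch = math.ceil(num_images / (batch_size * num_gpus))
print("approx epochs:", max_train_steps / steps_per_epoch)  # ~24 with these example numbers
```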
- Once completed, the checkpoint files are saved in the fine_tuned/ folder. To make these visible in the Jupyter Lab file interface, run the following:
mv fine_tuned /root/
- Now you can use the file manager to navigate to the fine_tuned folder and download the new fine-tuned checkpoint files. The one with the most training will be last.ckpt, with the earlier checkpoints named by epoch number.
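To try a downloaded checkpoint out, one option (not covered in the original guide, and assuming a reasonably recent diffusers version that supports loading a single .ckpt file) is to load it directly with diffusers:

```python
# Minimal sketch: load the fine-tuned checkpoint with diffusers and generate an image.
# Assumes a recent diffusers release that provides from_single_file.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "last.ckpt", torch_dtype=torch.float16
).to("cuda")

image = pipe("a cute water-type creature, pokemon style").images[0]
image.save("sample.png")
```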