@btrude
Last active November 23, 2024 10:55
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Interacting with Jukebox",
"provenance": [],
"collapsed_sections": [],
"machine_shape": "hm"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "uq8uLwZCn0BV",
"colab_type": "text"
},
"source": [
"IMPORTANT NOTE ON SYSTEM REQUIREMENTS:\n",
"\n",
"If you are connecting to a hosted runtime, make sure it has a P100 GPU (optionally run !nvidia-smi to confirm). Go to Edit>Notebook Settings to set this.\n",
"\n",
"CoLab may first assign you a lower memory machine if you are using a hosted runtime. If so, the first time you try to load the 5B model, it will run out of memory, and then you'll be prompted to restart with more memory (then return to the top of this CoLab). If you continue to have memory issues after this (or run into issues on your own home setup), switch to the 1B model.\n",
"\n",
"If you are using a local GPU, we recommend V100 or P100 with 16GB GPU memory for best performance. For GPU’s with less memory, we recommend using the 1B model and a smaller batch size throughout. \n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8qEqdj8u0gdN",
"colab_type": "code",
"colab": {}
},
"source": [
"!nvidia-smi -L"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "VAMZK4GNA_PM",
"colab_type": "text"
},
"source": [
"Mount Google Drive to save sample levels as they are generated."
]
},
{
"cell_type": "code",
"metadata": {
"id": "ZPdMgaH_BPGN",
"colab_type": "code",
"colab": {}
},
"source": [
"from google.colab import drive\n",
"drive.mount('/content/gdrive')"
],
"execution_count": 0,
"outputs": []
},
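{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, copy finished sample folders from the runtime into Drive so they survive a disconnect. The cell below is a minimal sketch assuming the default output layout (a folder named after `--name` in the working directory) and the standard `MyDrive` mount point; adjust both paths to your setup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"\n",
"# Hypothetical example: copy a finished sample folder (named after --name,\n",
"# e.g. \"sample_1b\") into Drive. Adjust both paths to your own layout.\n",
"shutil.copytree(\"sample_1b\", \"/content/gdrive/MyDrive/jukebox_samples/sample_1b\")"
]
},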
{
"source": [
"Setup"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install git+https://github.com/openai/jukebox.git\n",
"!git clone https://github.com/openai/jukebox.git\n",
"!git clone https://gist.github.com/3615a96d1c25ec8d9fb8bcaacf647e8b.git\n",
"!git clone https://gist.github.com/btrude/8f2600072439f61a0603495d0fc7ff07.git\n",
"!mv 3615a96d1c25ec8d9fb8bcaacf647e8b/sample_extended.py jukebox/jukebox/\n",
"!mv 8f2600072439f61a0603495d0fc7ff07/train_no_ddp.py jukebox/jukebox/"
]
},
{
"source": [
"#### NOTE: This ipynb provides additional functionality compared to the default opanai notebook. These differences are enumerated below, but also it is worth noting that here the `levels` parameter actually works to stop the process after the specified level(s) have completed. You will notice that the examples below all use --levels=1 when doing bottom level ('level 2') sampling, and --levels=3 for upsampling, though --levels=2 should be used if you only need to do the middle level for whatever reason.\n",
"\n",
"#### This ipynb will also reduce or increase the number of `n_samples` as provided in each command so beware that mismatching `n_samples` won't be enforced and instead the number of samples will be automatically reduced. If increasing, the existing codes are duplicated up to the new `n_samples` which may cause memory issues as well."
],
"cell_type": "markdown",
"metadata": {}
},
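{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `n_samples` adjustment described above can be pictured roughly as follows. This is an illustrative sketch of the behaviour, not the code in `sample_extended.py`; it assumes the saved codes are a list with one batch-first token tensor per level."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: resize saved codes to a new n_samples by\n",
"# truncating, or by duplicating existing codes until the batch is full.\n",
"def match_n_samples(zs, n_samples):\n",
"    resized = []\n",
"    for z in zs:  # one torch tensor of codes per level, shape (batch, length)\n",
"        if z.shape[0] >= n_samples:\n",
"            resized.append(z[:n_samples])\n",
"        else:\n",
"            reps = -(-n_samples // z.shape[0])  # ceil division\n",
"            resized.append(z.repeat(reps, 1)[:n_samples])\n",
"    return resized"
]
},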
{
"cell_type": "markdown",
"metadata": {
"id": "o7CzSiv0MmFP",
"colab_type": "text"
},
"source": [
"# Sampling\n",
"---\n",
"\n",
"To sample normally, run the following command. Model can be `5b`, `5b_lyrics`, `1b_lyrics`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./jukebox/jukebox/sample_extended.py --model=1b_lyrics --name=sample_1b --levels=1 --sample_length_in_seconds=20 \\\n",
"--total_sample_length_in_seconds=180 --sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125"
]
},
{
"source": [
"The above generates the first `sample_length_in_seconds` seconds of audio from a song of total length `total_sample_length_in_seconds`.\n",
"\n",
"To continue sampling from already generated codes for a longer duration, you can run"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./jukebox/jukebox/sample_extended.py --model=1b_lyrics --name=sample_1b_continued --levels=1 --mode=continue \\\n",
"--codes_file=sample_1b/level_2/data.pth.tar --sample_length_in_seconds=40 --total_sample_length_in_seconds=180 \\\n",
"--sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125"
]
},
{
"source": [
"If you stopped sampling at only the first level and want to upsample the saved codes, you can run"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./jukebox/jukebox/sample_extended.py --model=1b_lyrics --name=sample_1b_upsamples --levels=3 --mode=upsample \\\n",
"--codes_file=sample_1b/level_2/data.pth.tar --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 \\\n",
"--sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125"
]
},
{
"source": [
"If you want to prompt the model with your own creative piece or any other music, first save them as wave files and run"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./jukebox/jukebox/sample_extended.py --model=1b_lyrics --name=sample_1b_prompted --levels=1 --mode=primed \\\n",
"--audio_file=path/to/recording.wav,awesome-mix.wav,fav-song.wav,etc.wav --prompt_length_in_seconds=12 \\\n",
"--sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125"
]
},
{
"source": [
"This ipynb also includes an additional mode, truncate, that lets you remove unwanted seconds from the end of sampling output. Consider the continuation example above which produced a 40 second sample. If we reuse that codes file in truncate mode we can remove any unwanted audio at the end of the sample and then continue like normal. Notice that `--sample_length_in_seconds` is reduced by 5 in this example."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./jukebox/jukebox/sample_extended.py --model=1b_lyrics --name=sample_1b_truncated --levels=3 --mode=truncate \\\n",
"--codes_file=sample_1b/level_2/data.pth.tar --sample_length_in_seconds=35 --total_sample_length_in_seconds=180 \\\n",
"--sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125"
]
},
{
"source": [
"Each of these examples matches the README as provided by openai, but this ipynb needs additional functionality in order to change sample metadata from the command line. This means that each level's prompts, lyrics, and temperature are provided as hyperparameters for the purpose of this notebook. These parameters are listed below along with an example.\n",
"\n",
"```\n",
"l2_meta_artist: The artist prompt for level 2\n",
"l2_meta_genre: The genre prompt for level 2\n",
"l2_meta_lyrics: The lyrics for level 2\n",
"```\n",
"```\n",
"l1_meta_artist: The artist prompt for level 1\n",
"l1_meta_genre: The genre prompt for level 1\n",
"l1_meta_lyrics: The lyrics for level 1\n",
"```\n",
"```\n",
"l0_meta_artist: The artist prompt for level 0\n",
"l0_meta_genre: The genre prompt for level 0\n",
"l0_meta_lyrics: The lyrics for level 0\n",
"```\n",
"```\n",
"temperature: The temperature for level 2\n",
"l1_temperature: The temperature for level 1\n",
"10_temperature: The temperature for level 0\n",
"```\n",
"```\n",
"pref_codes: Prefer specific codes on continuation or upsample (*see example below)\n",
"```\n",
"If you do not provide these parameters in the above cells, you will be using the defaults (artist=unknown, genre=unknown, lyrics=\"\"). A correct example with these things specified looks like:"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./jukebox/jukebox/sample_extended.py --model=1b_lyrics --name=sample_1b_raekwon --levels=1 --sample_length_in_seconds=20 \\\n",
"--total_sample_length_in_seconds=180 --sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125 \\\n",
"--temperature=0.98 --l2_meta_artist=raekwon --l2_meta_genre=psychedelic --l2_meta_lyrics='hello world good raekwon lyrics'"
]
},
{
"source": [
"Afterwards if we find one or more codes that we would like to continue on, we can then specify the `pref_codes` parameter in continuation mode and discard the unwanted codes. For example, if we found that we like samples 1, 3, and 5 from the previous command, we can use `--pref_codes=1,3,5`"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ./jukebox/jukebox/sample_extended.py --model=1b_lyrics --name=sample_1b_raekwon_continued --levels=1 --sample_length_in_seconds=40 \\\n",
"--total_sample_length_in_seconds=180 --sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125 \\\n",
"--temperature=0.98 --l2_meta_artist=raekwon --l2_meta_genre=psychedelic --l2_meta_lyrics='hello world good raekwon lyrics' \\\n",
"--mode=continue --codes_file=sample_1b_raekwon/level_2/data.pth.tar --pref_codes=1,3,5\n",
"\n",
"# Or if we only liked one set of codes then we should also add a comma to ensure the data is passed as a tuple to python: e.g. --pref_codes=3,"
]
},
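{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conceptually, `pref_codes` just keeps the samples at the given indices from the saved codes file. The cell below is an illustration of that idea, assuming the codes file is a `torch.save`d dict with a `zs` entry holding one batch-first tensor per level; it is not the actual implementation in `sample_extended.py`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"# Illustrative sketch: keep only the preferred samples (here 1, 3 and 5) from\n",
"# a saved codes file. Indices refer to the sample order of the previous run,\n",
"# matching --pref_codes=1,3,5 above.\n",
"data = torch.load(\"sample_1b_raekwon/level_2/data.pth.tar\", map_location=\"cpu\")\n",
"pref = [1, 3, 5]\n",
"zs = [z[pref] for z in data[\"zs\"]]  # batch-first indexing per level\n",
"print([tuple(z.shape) for z in zs])"
]
},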
{
"source": [
"# Training\n",
"---\n",
"NOTE: Certain training steps require that you download files from google drive in the left sidebar and then edit them in a text editor and put them back into google drive. If you are doing this frequently you should store a local copy for easy reuse.\n",
"\n",
"## VQVAE\n",
"To train a small vqvae, run"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python jukebox/train_no_ddp.py --hps=small_vqvae --name=small_vqvae --sample_length=262144 --bs=4 \\\n",
"--audio_files_dir={audio_files_dir} --labels=False --train --aug_shift --aug_blend"
]
},
{
"source": [
"Here, `{audio_files_dir}` is the directory in which you can put the audio files for your dataset. The above trains a two-level VQ-VAE with `downs_t = (5,3)`, and `strides_t = (2, 2)` meaning we downsample the audio by `2**5 = 32` to get the first level of codes, and `2**8 = 256` to get the second level codes. \n",
"Checkpoints are stored in the `logs` folder (e.g. `jukebox/logs/small_vqvae`)."
],
"cell_type": "markdown",
"metadata": {}
},
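{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the arithmetic above. It also computes `sample_length = n_ctx * downsample_of_level` (with `n_ctx = 8192`), which is where the `--sample_length` values used for the prior and upsampler commands in the next section come from."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Per-level hop lengths from strides_t and downs_t, then the cumulative\n",
"# downsampling factors: 2**5 = 32 for level 0 and 32 * 2**3 = 256 for level 1.\n",
"downs_t = (5, 3)\n",
"strides_t = (2, 2)\n",
"hops = [s ** d for s, d in zip(strides_t, downs_t)]   # [32, 8]\n",
"downsamples = list(np.cumprod(hops))                  # [32, 256]\n",
"\n",
"# sample_length = n_ctx * downsample for each level's prior.\n",
"n_ctx = 8192\n",
"print(downsamples, [n_ctx * d for d in downsamples])  # [32, 256] [262144, 2097152]"
]
},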
{
"source": [
"## Prior\n",
"### Train prior or upsamplers\n",
"Once the VQ-VAE is trained, we can restore it from its saved checkpoint and train priors on the learnt codes. \n",
"To train the top-level prior, we can run"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python jukebox/train_no_ddp.py --hps=small_vqvae,small_prior,all_fp16,cpu_ema --name=small_prior \\\n",
"--sample_length=2097152 --bs=4 --audio_files_dir={audio_files_dir} --labels=False --train --test --aug_shift --aug_blend \\\n",
"--restore_vqvae=logs/small_vqvae/checkpoint_latest.pth.tar --prior --levels=2 --level=1 --weight_decay=0.01 --save_iters=1000"
]
},
{
"source": [
"To train the upsampler, we can run\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python jukebox/train_no_ddp.py --hps=small_vqvae,small_upsampler,all_fp16,cpu_ema --name=small_upsampler \\\n",
"--sample_length=262144 --bs=4 --audio_files_dir={audio_files_dir} --labels=False --train --test --aug_shift --aug_blend \\\n",
"--restore_vqvae=logs/small_vqvae/checkpoint_latest.pth.tar --prior --levels=2 --level=0 --weight_decay=0.01 --save_iters=1000"
]
},
{
"source": [
"We pass `sample_length = n_ctx * downsample_of_level` so that after downsampling the tokens match the n_ctx of the prior hps. \n",
"Here, `n_ctx = 8192` and `downsamples = (32, 256)`, giving `sample_lengths = (8192 * 32, 8192 * 256) = (65536, 2097152)` respectively for the bottom and top level.\n",
"\n",
"### Learning rate annealing\n",
"To get the best sample quality anneal the learning rate to 0 near the end of training. To do so, continue training from the latest \n",
"checkpoint and run with\n",
"```\n",
"--restore_prior=\"path/to/checkpoint\" --lr_use_linear_decay --lr_start_linear_decay={already_trained_steps} --lr_decay={decay_steps_as_needed}\n",
"```\n",
"\n",
"### Reuse pre-trained VQ-VAE and train top-level prior on new dataset from scratch.\n",
"#### Train without labels\n",
"Our pre-trained VQ-VAE can produce compressed codes for a wide variety of genres of music, and the pre-trained upsamplers \n",
"can upsample them back to audio that sound very similar to the original audio.\n",
"To re-use these for a new dataset of your choice, you can retrain just the top-level \n",
"\n",
"To train top-level on a new dataset, run\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python jukebox/train.py --hps=vqvae,small_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_prior \\\n",
"--sample_length=1048576 --bs=4 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \\\n",
"--labels=False --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000"
]
},
{
"source": [
"Training the `small_prior` with a batch size of 2, 4, and 8 requires 6.7 GB, 9.3 GB, and 15.8 GB of GPU memory, respectively. A few days to a week of training typically yields reasonable samples when the dataset is homogeneous (e.g. all piano pieces, songs of the same style, etc).\n",
"\n",
"#### Sample from new model\n",
"You can then run sample.py with the top-level of our models replaced by your new model. To do so,\n",
"- Add an entry `my_model=(\"vqvae\", \"upsampler_level_0\", \"upsampler_level_1\", \"small_prior\")` in `MODELS` in `make_models.py`. \n",
"- Update the `small_prior` dictionary in `hparams.py` to include `restore_prior='path/to/checkpoint'`. If you\n",
"you changed any hps directly in the command line script (eg:`heads`), make sure to update them in the dictionary too so \n",
"that `make_models` restores our checkpoint correctly.\n",
"- Run sample.py as outlined in the sampling section, but now with `--model=my_model` \n",
"\n",
"For example, let's say we trained `small_vqvae`, `small_prior`, and `small_upsampler` under `/path/to/jukebox/logs`. In `make_models.py`, we are going to declare a tuple of the new models as `my_model`.\n",
"```\n",
"MODELS = {\n",
" '5b': (\"vqvae\", \"upsampler_level_0\", \"upsampler_level_1\", \"prior_5b\"),\n",
" '5b_lyrics': (\"vqvae\", \"upsampler_level_0\", \"upsampler_level_1\", \"prior_5b_lyrics\"),\n",
" '1b_lyrics': (\"vqvae\", \"upsampler_level_0\", \"upsampler_level_1\", \"prior_1b_lyrics\"),\n",
" 'my_model': (\"my_small_vqvae\", \"my_small_upsampler\", \"my_small_prior\"),\n",
"}\n",
"```\n",
"\n",
"Next, in `hparams.py`, we add them to the registry with the corresponding `restore_`paths and any other command line options used during training. Another important note is that for top-level priors with lyric conditioning, we have to locate a self-attention layer that shows alignment between the lyric and music tokens. Look for layers where `prior.prior.transformer._attn_mods[layer].attn_func` is either 6 or 7. If your model is starting to sing along lyrics, it means some layer, head pair has learned alignment. Congrats!\n",
"```\n",
"my_small_vqvae = Hyperparams(\n",
" restore_vqvae='/path/to/jukebox/logs/small_vqvae/checkpoint_some_step.pth.tar',\n",
")\n",
"my_small_vqvae.update(small_vqvae)\n",
"HPARAMS_REGISTRY[\"my_small_vqvae\"] = my_small_vqvae\n",
"\n",
"my_small_prior = Hyperparams(\n",
" restore_prior='/path/to/jukebox/logs/small_prior/checkpoint_latest.pth.tar',\n",
" level=1,\n",
" labels=False,\n",
" # TODO For the two lines below, if `--labels` was used and the model is\n",
" # trained with lyrics, find and enter the layer, head pair that has learned\n",
" # alignment.\n",
" alignment_layer=47,\n",
" alignment_head=0,\n",
")\n",
"my_small_prior.update(small_prior)\n",
"HPARAMS_REGISTRY[\"my_small_prior\"] = my_small_prior\n",
"\n",
"my_small_upsampler = Hyperparams(\n",
" restore_prior='/path/to/jukebox/logs/small_upsampler/checkpoint_latest.pth.tar',\n",
" level=0,\n",
" labels=False,\n",
")\n",
"my_small_upsampler.update(small_upsampler)\n",
"HPARAMS_REGISTRY[\"my_small_upsampler\"] = my_small_upsampler\n",
"```\n",
"\n",
"\n",
"#### Train with labels \n",
"To train with you own metadata for your audio files, implement `get_metadata` in `data/files_dataset.py` to return the \n",
"`artist`, `genre` and `lyrics` for a given audio file. For now, you can pass `''` for lyrics to not use any lyrics.\n",
"\n",
"For training with labels, we'll use `small_labelled_prior` in `hparams.py`, and we set `labels=True,labels_v3=True`. \n",
"We use 2 kinds of labels information:\n",
"- Artist/Genre: \n",
" - For each file, we return an artist_id and a list of genre_ids. The reason we have a list and not a single genre_id \n",
" is that in v2, we split genres like `blues_rock` into a bag of words `[blues, rock]`, and we pass atmost \n",
" `max_bow_genre_size` of those, in `v3` we consider it as a single word and just set `max_bow_genre_size=1`.\n",
" - Update the `v3_artist_ids` and `v3_genre_ids` to use ids from your new dataset. \n",
" - In `small_labelled_prior`, set the hps `y_bins = (number_of_genres, number_of_artists)` and `max_bow_genre_size=1`. \n",
"- Timing: \n",
" - For each chunk of audio, we return the `total_length` of the song, the `offset` the current audio chunk is at and \n",
" the `sample_length` of the audio chunk. We have three timing embeddings: total_length, our current position, and our \n",
" current position as a fraction of the total length, and we divide the range of these values into `t_bins` discrete bins. \n",
" - In `small_labelled_prior`, set the hps `min_duration` and `max_duration` to be the shortest/longest duration of audio \n",
" files you want for your dataset, and `t_bins` for how many bins you want to discretize timing information into. Note \n",
" `min_duration * sr` needs to be at least `sample_length` to have an audio chunk in it.\n",
"\n",
"After these modifications, to train a top-level with labels, run\n"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python jukebox/train.py --hps=vqvae,small_labelled_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_prior_labels \\\n",
"--sample_length=1048576 --bs=4 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \\\n",
"--labels=True --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000"
]
},
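{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of what a `get_metadata` implementation could look like. The exact method signature in your checkout of `jukebox/data/files_dataset.py` may differ, and `self.metadata` here is a hypothetical lookup table you would build from your own dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hedged sketch, not the shipped implementation: return artist, genre and\n",
"# lyrics for a given audio file. Pass '' for lyrics to train without them.\n",
"def get_metadata(self, filename, test):\n",
"    # self.metadata is a hypothetical dict you populate from your dataset, e.g.\n",
"    # {\"songs/track01.wav\": {\"artist\": \"...\", \"genre\": \"...\", \"lyrics\": \"...\"}}\n",
"    info = self.metadata.get(filename, {})\n",
"    return info.get(\"artist\", \"unknown\"), info.get(\"genre\", \"unknown\"), info.get(\"lyrics\", \"\")"
]
},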
{
"source": [
"For sampling, follow same instructions as above but use `small_labelled_prior` instead of `small_prior`.\n",
"\n",
"#### Train with lyrics\n",
"To train in addition with lyrics, update `get_metadata` in `data/files_dataset.py` to return `lyrics` too.\n",
"For training with lyrics, we'll use `small_single_enc_dec_prior` in `hparams.py`. \n",
"- Lyrics: \n",
" - For each file, we linearly align the lyric characters to the audio, find the position in lyric that corresponds to \n",
" the midpoint of our audio chunk, and pass a window of `n_tokens` lyric characters centred around that. \n",
" - In `small_single_enc_dec_prior`, set the hps `use_tokens=True` and `n_tokens` to be the number of lyric characters \n",
" to use for an audio chunk. Set it according to the `sample_length` you're training on so that its large enough that \n",
" the lyrics for an audio chunk are almost always found inside a window of that size.\n",
" - If you use a non-English vocabulary, update `text_processor.py` with your new vocab and set\n",
" `n_vocab = number of characters in vocabulary` accordingly in `small_single_enc_dec_prior`. In v2, we had a `n_vocab=80` \n",
" and in v3 we missed `+` and so `n_vocab=79` of characters. \n",
"\n",
"After these modifications, to train a top-level with labels and lyrics, run"
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python jukebox/train.py --hps=vqvae,small_single_enc_dec_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_single_enc_dec_prior_labels \\\n",
"--sample_length=786432 --bs=4 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \\\n",
"--labels=True --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000"
]
},
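{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the lyric windowing described above (a hypothetical helper, not jukebox's own code):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical illustration: map the audio chunk midpoint linearly into the\n",
"# lyric string and take n_tokens characters centred on it.\n",
"def lyric_window(lyrics, offset, sample_length, total_length, n_tokens):\n",
"    midpoint = (offset + sample_length / 2) / total_length  # fraction of the song\n",
"    centre = int(len(lyrics) * midpoint)                    # character position\n",
"    start = max(0, min(centre - n_tokens // 2, max(0, len(lyrics) - n_tokens)))\n",
"    return lyrics[start:start + n_tokens]"
]
},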
{
"source": [
"To simplify hps choices, here we used a `single_enc_dec` model like the `1b_lyrics` model that combines both encoder and \n",
"decoder of the transformer into a single model. We do so by merging the lyric vocab and vq-vae vocab into a single \n",
"larger vocab, and flattening the lyric tokens and the vq-vae codes into a single sequence of length `n_ctx + n_tokens`. \n",
"This uses `attn_order=12` which includes `prime_attention` layers with keys/values from lyrics and queries from audio. \n",
"If you instead want to use a model with the usual encoder-decoder style transformer, use `small_sep_enc_dec_prior`.\n",
"\n",
"For sampling, follow same instructions as [above](#sample-from-new-model) but use `small_single_enc_dec_prior` instead of \n",
"`small_prior`. To also get the alignment between lyrics and samples in the saved html, you'll need to set `alignment_layer` \n",
"and `alignment_head` in `small_single_enc_dec_prior`. To find which layer/head is best to use, run a forward pass on a training example,\n",
"save the attention weight tensors for all prime_attention layers, and pick the (layer, head) which has the best linear alignment \n",
"pattern between the lyrics keys and music queries.\n",
"\n",
"### Fine-tune pre-trained top-level prior to new style(s)\n",
"Previously, we showed how to train a small top-level prior from scratch. Assuming you have a GPU with at least 15 GB of memory and support for fp16, you could fine-tune from our pre-trained 1B top-level prior. Here are the steps:\n",
"\n",
"- Support `--labels=True` by implementing `get_metadata` in `jukebox/data/files_dataset.py` for your dataset.\n",
"- Add new entries in `jukebox/data/ids`. We recommend replacing existing mappings (e.g. rename `\"unknown\"`, etc with styles of your choice). This uses the pre-trained style vectors as initialization and could potentially save some compute.\n",
"\n",
"After these modifications, run "
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python jukebox/train.py --hps=vqvae,prior_1b_lyrics,all_fp16,cpu_ema --name=finetuned \\\n",
"--sample_length=1048576 --bs=1 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \\\n",
"--labels=True --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000"
]
},
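{
"cell_type": "markdown",
"metadata": {},
"source": [
"The lyric-alignment layers mentioned earlier can be located programmatically. A minimal sketch, assuming `prior` is the loaded top-level prior (for example, the last model returned by `make_models`): it lists the self-attention layers whose `attn_func` is 6 or 7, which are the candidate layers to inspect for a linear alignment pattern."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: list candidate lyric-alignment layers, i.e. those whose attention\n",
"# function is 6 or 7 as described above. Assumes `prior` is the loaded\n",
"# top-level prior (see the sampling / make_models instructions earlier).\n",
"candidates = [\n",
"    i for i, attn in enumerate(prior.prior.transformer._attn_mods)\n",
"    if attn.attn_func in (6, 7)\n",
"]\n",
"print(candidates)"
]
},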
{
"source": [
"To get the best sample quality, it is recommended to anneal the learning rate in the end. Training the 5B top-level requires GPipe which is not supported in this release."
],
"cell_type": "markdown",
"metadata": {}
},
{
"source": [
"#### Citation\n",
"```\n",
"@article{dhariwal2020jukebox,\n",
" title={Jukebox: A Generative Model for Music},\n",
" author={Dhariwal, Prafulla and Jun, Heewoo and Payne, Christine and Kim, Jong Wook and Radford, Alec and Sutskever, Ilya},\n",
" journal={arXiv preprint arXiv:2005.00341},\n",
" year={2020}\n",
"}\n",
"```"
],
"cell_type": "markdown",
"metadata": {}
}
]
}
@leonardog27

Hello btrude. I am training a small VQ-VAE with 40 samples (60 s each, 400 MB total). It is creating checkpoints; checkpoint_step_20001.pth.tar is 7.3 MB.
How long should it be trained? Can I rename that first checkpoint file and use it to train the prior?
Merry Christmas
