Last active
December 19, 2024 15:43
noise2music-inspired-automatic-music-captioning.ipynb
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"provenance": [], | |
"gpuType": "T4", | |
"authorship_tag": "ABX9TyNECq1gn5XkIoCrgufdMuwK", | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
}, | |
"language_info": { | |
"name": "python" | |
}, | |
"accelerator": "GPU" | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/carlthome/c635d6e96c2542bdc47d0f0d7373551d/noise2music-inspired-automatic-music-captioning.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"# noise2music-inspired automatic music captioning\n", | |
"\n", | |
"In [noise2music](https://google-research.github.io/noise2music/), the training dataset is created by pseudo-labeling a vast collection of unlabeled music audio using two advanced deep learning models. A large language model generates a diverse set of general music-related descriptive sentences to serve as potential captions. These captions are then matched to individual music clips through zero-shot classification, leveraging a pre-trained joint embedding model designed for music and text.\n", | |
"\n", | |
"So being curious, let's try the following:\n", | |
"\n", | |
"1. Generate a lot of music descriptions with a Meta Llama 3.2 LLM.\n", | |
"1. Embed the generated music descriptions with a LAION CLAP text encoder.\n", | |
"1. Index the text embeddings for nearest neighbor retrieval with FAISS.\n", | |
"1. Use the corresponding audio encoder to embed an audio example.\n", | |
"1. Use the audio embedding as search query for retrieving text embeddings.\n", | |
"\n", | |
"**Could this simple method produce reasonable audio captions?**" | |
], | |
"metadata": { | |
"id": "EVsGy9kxKMn-" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"pip install -q datasets faiss-cpu" | |
], | |
"metadata": { | |
"id": "vhPejzFwbCo1" | |
}, | |
"execution_count": 1, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"import torch\n", | |
"import faiss\n", | |
"import transformers\n", | |
"import datasets\n", | |
"import polars as pl\n", | |
"import librosa as lr\n", | |
"import numpy as np\n", | |
"import tqdm.auto as tqdm\n", | |
"import seaborn as sns\n", | |
"\n", | |
"# Configure plotting.\n", | |
"pl.Config.set_fmt_str_lengths(256)\n", | |
"sns.set_style(\"ticks\")\n", | |
"sns.set_theme(\"notebook\")\n", | |
"\n", | |
"# Download some example audio files.\n", | |
"dataset = datasets.load_dataset(\"marsyas/gtzan\", trust_remote_code=True)\n", | |
"\n", | |
"# Download a pretrained text generation model.\n", | |
"text_generator = transformers.pipeline(\n", | |
" task=\"text-generation\",\n", | |
" model=\"meta-llama/Llama-3.2-1B-Instruct\",\n", | |
" torch_dtype=torch.bfloat16,\n", | |
" device_map=\"auto\",\n", | |
")\n", | |
"\n", | |
"# Download a pretrained CLAP model.\n", | |
"clap_model = transformers.ClapModel.from_pretrained(\"laion/larger_clap_general\")\n", | |
"clap_processor = transformers.ClapProcessor.from_pretrained(\"laion/larger_clap_general\")" | |
], | |
"metadata": { | |
"id": "Ktlv3YEs9agI", | |
"outputId": "8d9854d0-1bb7-4d48-b2b9-5224fe65b97d", | |
"colab": { | |
"base_uri": "https://localhost:8080/" | |
} | |
}, | |
"execution_count": 2, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stderr", | |
"text": [ | |
"Device set to use cuda:0\n" | |
] | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"descriptions = pl.read_parquet(\"music_descriptions.parquet\")\n", | |
"descriptions" | |
], | |
"metadata": { | |
"id": "utr6A-OvlhP6", | |
"outputId": "e28d13a7-d7a1-4d6a-d874-fd76ba2f7fa7", | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 443 | |
} | |
}, | |
"execution_count": 6, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"shape: (100, 1)\n", | |
"┌──────────────────────────────────────────────────────────────────────────────────────────────────┐\n", | |
"│ generated_text │\n", | |
"│ --- │\n", | |
"│ str │\n", | |
"╞══════════════════════════════════════════════════════════════════════════════════════════════════╡\n", | |
"│ The piece I'm describing is a hauntingly beautiful electronic pop track that blends elements of │\n", | |
"│ ambient, synth-pop, and indie electronic music. It begins with a delicate │\n", | |
"│ This piece of music has a prominent, pulsing electronic beat, accompanied by the thumping │\n", | |
"│ bassline of a drum machine. The synths provide a bright, │\n", | |
"│ The piece I'm describing is a blend of electronic dance music and hip-hop, with a prominent │\n", | |
"│ bassline and driving beat. The instrumentation features a combination of synthes │\n", | |
"│ The piece in question is a fusion of electronic and rock elements, featuring a prominent bass │\n", | |
"│ line, driving drum patterns, and intricate synthesizer work. The instrumentation is │\n", | |
"│ This piece of music is a blend of indie rock and electronic elements. It begins with a driving │\n", | |
"│ beat, accompanied by the prominent use of synthesizers and a puls │\n", | |
"│ … │\n", | |
"│ Imagine a mesmerizing piece of music that combines elements of electronic dance music, indie │\n", | |
"│ rock, and world music. The sound is a dynamic blend of pulsating syn │\n", | |
"│ Imagine a contemporary electronic dance track with a haunting, atmospheric quality. The │\n", | |
"│ foundation is provided by a prominent bassline, driving a rhythmic pattern that's both driving │\n", | |
"│ This piece of music is a high-energy electronic dance track. It features a prominent synthesizer │\n", | |
"│ riff, often accompanied by a driving kick drum and energetic percussion. The │\n", | |
"│ The piece I've chosen is a 2010 electronic dance track by Swedish DJ, Avicii. │\n", | |
"│ │\n", | |
"│ This song has a typical electronic dance music (ED │\n", | |
"│ This piece of music is a fusion of electronic and orchestral elements, featuring a prominent │\n", | |
"│ synthesizer as the main instrument. It begins with a soft, pulsing │\n", | |
"└──────────────────────────────────────────────────────────────────────────────────────────────────┘" | |
], | |
"text/html": [ | |
"<div><style>\n", | |
".dataframe > thead > tr,\n", | |
".dataframe > tbody > tr {\n", | |
" text-align: right;\n", | |
" white-space: pre-wrap;\n", | |
"}\n", | |
"</style>\n", | |
"<small>shape: (100, 1)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>generated_text</th></tr><tr><td>str</td></tr></thead><tbody><tr><td>"The piece I'm describing is a hauntingly beautiful electronic pop track that blends elements of ambient, synth-pop, and indie electronic music. It begins with a delicate"</td></tr><tr><td>"This piece of music has a prominent, pulsing electronic beat, accompanied by the thumping bassline of a drum machine. The synths provide a bright,"</td></tr><tr><td>"The piece I'm describing is a blend of electronic dance music and hip-hop, with a prominent bassline and driving beat. The instrumentation features a combination of synthes"</td></tr><tr><td>"The piece in question is a fusion of electronic and rock elements, featuring a prominent bass line, driving drum patterns, and intricate synthesizer work. The instrumentation is"</td></tr><tr><td>"This piece of music is a blend of indie rock and electronic elements. It begins with a driving beat, accompanied by the prominent use of synthesizers and a puls"</td></tr><tr><td>…</td></tr><tr><td>"Imagine a mesmerizing piece of music that combines elements of electronic dance music, indie rock, and world music. The sound is a dynamic blend of pulsating syn"</td></tr><tr><td>"Imagine a contemporary electronic dance track with a haunting, atmospheric quality. The foundation is provided by a prominent bassline, driving a rhythmic pattern that's both driving"</td></tr><tr><td>"This piece of music is a high-energy electronic dance track. It features a prominent synthesizer riff, often accompanied by a driving kick drum and energetic percussion. The"</td></tr><tr><td>"The piece I've chosen is a 2010 electronic dance track by Swedish DJ, Avicii. \n", | |
"\n", | |
"This song has a typical electronic dance music (ED"</td></tr><tr><td>"This piece of music is a fusion of electronic and orchestral elements, featuring a prominent synthesizer as the main instrument. It begins with a soft, pulsing"</td></tr></tbody></table></div>" | |
] | |
}, | |
"metadata": {}, | |
"execution_count": 6 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"colab": { | |
"base_uri": "https://localhost:8080/", | |
"height": 210 | |
}, | |
"id": "yrF4CdDk8irv", | |
"outputId": "7adcc31f-d3e7-4589-c0de-0214cd01037d" | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stderr", | |
"text": [ | |
"Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n" | |
] | |
}, | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": [ | |
"shape: (3, 1)\n", | |
"┌───────────────────────────────────────────────────────────────────────────────────────────────┐\n", | |
"│ generated_text │\n", | |
"│ --- │\n", | |
"│ str │\n", | |
"╞═══════════════════════════════════════════════════════════════════════════════════════════════╡\n", | |
"│ This piece of music is a blend of electronic dance music (EDM) and indie rock, with a dash of │\n", | |
"│ synth-pop. It features a prominent, puls │\n", | |
"│ This piece of music is a fusion of electronic and indie rock elements. It starts with a │\n", | |
"│ prominent piano melody, accompanied by a minimalist drum pattern and a soft, │\n", | |
"│ This piece of music is a fusion of electronic and orchestral elements, featuring a prominent │\n", | |
"│ synthesizer as the main instrument. It begins with a soft, pulsing │\n", | |
"└───────────────────────────────────────────────────────────────────────────────────────────────┘" | |
], | |
"text/html": [ | |
"<div><style>\n", | |
".dataframe > thead > tr,\n", | |
".dataframe > tbody > tr {\n", | |
" text-align: right;\n", | |
" white-space: pre-wrap;\n", | |
"}\n", | |
"</style>\n", | |
"<small>shape: (3, 1)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>generated_text</th></tr><tr><td>str</td></tr></thead><tbody><tr><td>"This piece of music is a blend of electronic dance music (EDM) and indie rock, with a dash of synth-pop. It features a prominent, puls"</td></tr><tr><td>"This piece of music is a fusion of electronic and indie rock elements. It starts with a prominent piano melody, accompanied by a minimalist drum pattern and a soft,"</td></tr><tr><td>"This piece of music is a fusion of electronic and orchestral elements, featuring a prominent synthesizer as the main instrument. It begins with a soft, pulsing"</td></tr></tbody></table></div>" | |
] | |
}, | |
"metadata": {}, | |
"execution_count": 4 | |
} | |
], | |
"source": [ | |
"# Load existing descriptions.\n", | |
"try:\n", | |
" descriptions = pl.read_parquet(\"music_descriptions.parquet\")\n", | |
"except FileNotFoundError:\n", | |
" pass\n", | |
"\n", | |
"# Generate a lot of music descriptions.\n", | |
"messages = [\n", | |
" {\"role\": \"system\", \"content\": \"You are a music reviewer who is specific, brief and accurate. You work with helping people find music.\"},\n", | |
" {\"role\": \"user\", \"content\": \"Imagine any random piece of popular music and describe how it sounds in passive voice. Mention instruments, genres and vibes. Don't mention titles or artists.\"},\n", | |
"]\n", | |
"for _ in tqdm.trange(100, desc=\"Generating descriptions\"):\n", | |
" descriptions = text_generator(\n", | |
" messages,\n", | |
" num_return_sequences=100,\n", | |
" return_full_text=False,\n", | |
" do_sample=True,\n", | |
" num_beams=1,\n", | |
" max_new_tokens=32,\n", | |
" )\n", | |
"\n", | |
"\n", | |
"descriptions = pl.DataFrame(descriptions)\n", | |
"\n", | |
"# Save the descriptions to file.\n", | |
"descriptions.write_parquet(\"music_descriptions.parquet\")\n", | |
"\n", | |
"# Show a few examples.\n", | |
"descriptions.sample(n=3)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"num_dimensions = clap_model.config.projection_dim\n", | |
"index = faiss.IndexHNSWFlat(num_dimensions)" | |
], | |
"metadata": { | |
"id": "WFJIhAWJ0unl" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Tokenize text descriptions.\n", | |
"inputs = clap_processor(text=descriptions[\"generated_text\"].to_list(), return_tensors=\"pt\", padding=True)\n", | |
"\n", | |
"# Populate local vector database.\n", | |
"batch_size = 8\n", | |
"for i in tqdm.trange(0, len(inputs[\"input_ids\"]), batch_size, desc=\"Indexing descriptions\"):\n", | |
" input_ids = inputs[\"input_ids\"][i:i + batch_size]\n", | |
" attention_mask = inputs[\"attention_mask\"][i:i + batch_size]\n", | |
"\n", | |
" # Embed the tokens.\n", | |
" text_embeddings = clap_model.get_text_features(input_ids, attention_mask)\n", | |
"\n", | |
" # Add embeddings to the index.\n", | |
" index.add(text_embeddings.numpy(force=True))" | |
], | |
"metadata": { | |
"id": "-vzUKMwoIAyq" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Load an example audio file.\n", | |
"audio_file = lr.example(\"trumpet\")\n", | |
"waveform, samplerate = lr.load(audio_file, sr=clap_processor.feature_extractor.sampling_rate)\n", | |
"\n", | |
"# Compute audio embedding.\n", | |
"inputs = clap_processor(audios=waveform, return_tensors=\"pt\", sampling_rate=clap_processor.feature_extractor.sampling_rate)\n", | |
"audio_embedding = clap_model.get_audio_features(**inputs)\n", | |
"audio_embedding.shape" | |
], | |
"metadata": { | |
"id": "MHxCjhxEHKmd" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Search for similar embeddings.\n", | |
"max_results = 1000\n", | |
"similarities, neighbor_ids = index.search(audio_embedding.numpy(force=True), k=max_results)\n", | |
"similarities.shape" | |
], | |
"metadata": { | |
"id": "2sjJQ04ROh8i" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Lookup underlying text descriptions.\n", | |
"matches = descriptions[neighbor_ids[0][:max_results]].with_columns(pl.Series(\"scores\", similarities[0][:max_results]))\n", | |
"matches.top_k(5, by=\"scores\")" | |
], | |
"metadata": { | |
"id": "0G7HvVXXOv3g" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"sns.histplot(similarities[0], bins=max_results//100);" | |
], | |
"metadata": { | |
"id": "WULzrBDjS-HQ" | |
}, | |
"execution_count": null, | |
"outputs": [] | |
} | |
] | |
} |
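
The retrieval step in the notebook boils down to a nearest-neighbor search between one audio embedding and many text embeddings. The sketch below illustrates that idea with NumPy only, as an exact (brute-force) search rather than the approximate FAISS index used above. The captions, the 3-dimensional vectors standing in for CLAP embeddings, and the helper name `retrieve_captions` are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def retrieve_captions(audio_embedding, text_embeddings, captions, k=2):
    """Return the k captions whose embeddings are most similar to the audio embedding."""
    # Normalize to unit length so that the dot product equals cosine similarity.
    audio = audio_embedding / np.linalg.norm(audio_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)

    # One similarity score per caption.
    scores = texts @ audio

    # Sort from most to least similar and keep the top k.
    order = np.argsort(-scores)[:k]
    return [(captions[i], float(scores[i])) for i in order]


# Made-up embeddings standing in for real CLAP outputs (assumption).
captions = [
    "a bright solo trumpet",
    "a heavy distorted guitar riff",
    "a soft piano ballad",
]
text_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.0, 0.0, 1.0],
])
audio_embedding = np.array([1.0, 0.2, 0.1])

results = retrieve_captions(audio_embedding, text_embeddings, captions)
for caption, score in results:
    print(f"{score:.3f}  {caption}")
```

An exact search like this is only practical for small caption sets. For a large number of generated descriptions, an approximate index such as the HNSW index in the notebook trades a little recall for much faster queries.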