Distilabel is an AI Feedback (AIF) framework for building datasets with and for LLMs.

Installation

You will need Python 3.9 or higher, up to Python 3.12, since support for the latter is still a work in progress.

To install the latest release of the package from PyPI, you can use the following command:

pip install distilabel --upgrade

Alternatively, you may want to install it from source, i.e. the latest unreleased version, using the following command:

pip install "distilabel @ git+https://github.com/argilla-io/distilabel.git@develop" --upgrade

!!! NOTE We are installing from develop since that's the branch we use to collect all the features, bug fixes, and improvements that will be part of the next release. If you want to install from a specific branch, you can replace develop with the branch name.
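
For example, to install from the repository's main branch instead of develop, the command would look like this:

pip install "distilabel @ git+https://github.com/argilla-io/distilabel.git@main" --upgrade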

Extras

Additionally, some extra dependencies are available as part of distilabel, mainly to add support for its LLM integrations. Here's a list of the available extras, followed by an example install command:

  • anthropic: for using models available in the Anthropic API via the AnthropicLLM integration.

  • argilla: for exporting the generated datasets to Argilla.

  • cohere: for using models available in Cohere via the CohereLLM integration.

  • groq: for using models available in Groq through the groq Python client via the GroqLLM integration.

  • hf-inference-endpoints: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.

  • hf-transformers: for using models available in the transformers package via the TransformersLLM integration.

  • litellm: for using LiteLLM to call any LLM using the OpenAI format via the LiteLLM integration.

  • llama-cpp: for using the llama-cpp-python Python bindings for llama.cpp via the LlamaCppLLM integration.

  • mistralai: for using models available in the Mistral AI API via the MistralLLM integration.

  • ollama: for using Ollama and its available models via the OllamaLLM integration.

  • openai: for using OpenAI API models via the OpenAILLM integration, as well as the other integrations that are based on OpenAI and rely on its client, such as AnyscaleLLM, AzureOpenAILLM, and TogetherLLM.

  • vertexai: for using Google Vertex AI proprietary models via the VertexAILLM integration.

  • vllm: for using the vllm serving engine via the vLLM integration.
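
For example, to install distilabel together with the openai and argilla extras (any combination from the list above follows the same pattern):

pip install "distilabel[openai,argilla]" --upgrade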

Recommendations / Notes

The mistralai dependency requires Python 3.9 or higher, so if you want to use the distilabel.llms.MistralLLM implementation, you will need Python 3.9 or higher.

In some cases, like transformers and vllm, installing flash-attn is recommended if you are using a GPU accelerator, since it will speed up the inference process. However, it needs to be installed separately, as it's not included in the distilabel dependencies.

pip install flash-attn --no-build-isolation

Also, if you want to use the llama-cpp-python integration for running local LLMs, note that the installation process may get a bit trickier depending on which OS you are using, so we recommend reading through the Installation section of their docs.
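
As a rough starting point, a CPU-only build typically installs straight from PyPI, while hardware-accelerated backends are selected through CMake flags passed via the CMAKE_ARGS environment variable. The exact flag below is an assumption that varies across llama-cpp-python versions and hardware, so double-check their docs:

pip install llama-cpp-python

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir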

Quickstart

To start off, distilabel is a framework for building pipelines for generating synthetic data using LLMs. It defines a [Pipeline][distilabel.pipeline.Pipeline] which orchestrates the execution of [Step][distilabel.steps.base.Step] subclasses, and those are connected as nodes in a Directed Acyclic Graph (DAG).

That being said, in this guide we will walk you through the process of creating a simple pipeline that uses the [OpenAILLM][distilabel.llms.OpenAILLM] class to generate text. The [Pipeline][distilabel.pipeline.Pipeline] will load a dataset that contains a column named prompt from the Hugging Face Hub via the step [LoadDataFromHub][distilabel.steps.LoadDataFromHub] and then use the [OpenAILLM][distilabel.llms.OpenAILLM] class to generate text based on the dataset using the [TextGeneration][distilabel.steps.tasks.TextGeneration] task.

from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(  # (1)
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:  # (2)
    load_dataset = LoadDataFromHub(  # (3)
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    text_generation = TextGeneration(  # (4)
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),  # (5)
    )

    load_dataset >> text_generation  # (6)

if __name__ == "__main__":
    distiset = pipeline.run(  # (7)
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")  # (8)
  1. We define a [Pipeline][distilabel.pipeline.Pipeline] with the name simple-text-generation-pipeline and the description "A simple text generation pipeline". Note that the name is mandatory and will be used to calculate the cache signature path, so changing the name will change the cache path and the pipeline will be identified as a different one.

  2. We are using the [Pipeline][distilabel.pipeline.Pipeline] context manager, meaning that every [Step][distilabel.steps.base.Step] subclass that is defined within the context manager will be added to the pipeline automatically.

  3. We define a [LoadDataFromHub][distilabel.steps.LoadDataFromHub] step named load_dataset that will load a dataset from the Hugging Face Hub, as provided via runtime parameters in the pipeline.run method below, though it can also be defined within the class instance via the arg repo_id=.... This step produces output batches with the rows from the dataset, with the column prompt mapped to the instruction field.

  4. We define a [TextGeneration][distilabel.steps.tasks.TextGeneration] task named text_generation that will generate text based on the instruction field from the dataset. This task will use the [OpenAILLM][distilabel.llms.OpenAILLM] class with the model gpt-3.5-turbo.

  5. We define the [OpenAILLM][distilabel.llms.OpenAILLM] class with the model gpt-3.5-turbo that will be used by the [TextGeneration][distilabel.steps.tasks.TextGeneration] task. In this case, since the [OpenAILLM][distilabel.llms.OpenAILLM] is used, we assume that the OPENAI_API_KEY environment variable is set (see the snippet after this list), and the OpenAI API will be used to generate the text.

  6. We connect the load_dataset step to the text_generation task using the rshift operator (>>), meaning that the output from the load_dataset step will be used as input for the text_generation task.

  7. We run the pipeline with the parameters for the load_dataset and text_generation steps. The load_dataset step will use the repository distilabel-internal-testing/instruction-dataset-mini and the test split, and the text_generation task will use the generation_kwargs with the temperature set to 0.7 and the max_new_tokens set to 512.

  8. Optionally, we can push the generated [Distiset][distilabel.distiset.Distiset] to the Hugging Face Hub repository distilabel-example. This will allow you to share the generated dataset with others and use it in other pipelines.
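
As mentioned in point 5 above, the [OpenAILLM][distilabel.llms.OpenAILLM] reads its credentials from the environment, so the simplest setup is to export the variable before running the script (the key value below is a placeholder):

export OPENAI_API_KEY="sk-..."

Alternatively, the key can usually be passed programmatically, e.g. OpenAILLM(model="gpt-3.5-turbo", api_key=...), though keeping it in the environment avoids hardcoding secrets.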

Minimal example

distilabel gives you a lot of flexibility to create your pipelines, but to start right away, you can omit a lot of the details and rely on the default values:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
from datasets import load_dataset


dataset = load_dataset("distilabel-internal-testing/instruction-dataset-mini", split="test")

with Pipeline() as pipeline:  # (1)
    TextGeneration(llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"))  # (2)


if __name__ == "__main__":    
    distiset = pipeline.run(dataset=dataset)  # (3)
    distiset.push_to_hub(repo_id="distilabel-example")
  1. The [Pipeline][distilabel.pipeline.Pipeline] can take no arguments and will generate a default name on its own that will be tracked internally.

  2. Just as with the [Pipeline][distilabel.pipeline.Pipeline], the [Step][distilabel.steps.base.Step]s don't explicitly need a name.

  3. You can generate the dataset as you would normally do with Hugging Face and pass the dataset to the run method.
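
Once the run finishes, the returned [Distiset][distilabel.distiset.Distiset] behaves like a dictionary of Hugging Face datasets, so you can inspect the generated rows locally before (or instead of) pushing them to the Hub. A minimal sketch, assuming the default leaf name "default" and a "train" split (both depend on your pipeline and may differ):

print(distiset)  # overview of the generated subsets and their splits
print(distiset["default"]["train"][0])  # first generated row, assuming default leaf/split names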
