Distilabel is an AI Feedback (AIF) framework for building datasets with and for LLMs. |
You will need Python 3.9 or higher, and up to Python 3.12, since support for the latter is still a work in progress.
To install the latest release of the package from PyPI you can use the following command:
```bash
pip install distilabel --upgrade
```
Alternatively, you may want to install it from source, i.e. the latest unreleased version; to do so, you can use the following command:

```bash
pip install "distilabel @ git+https://github.com/argilla-io/distilabel.git@develop" --upgrade
```
!!! NOTE
    We are installing from `develop` since that's the branch we use to collect all the features, bug fixes, and improvements that will be part of the next release. If you want to install from a specific branch, you can replace `develop` with the branch name.
Additionally, as part of `distilabel` some extra dependencies are available, mainly to add support for some of the LLM integrations we support. Here's a list of the available extras:

- `anthropic`: for using models available in the Anthropic API via the `AnthropicLLM` integration.
- `argilla`: for exporting the generated datasets to Argilla.
- `cohere`: for using models available in Cohere via the `CohereLLM` integration.
- `groq`: for using models available in Groq using the `groq` Python client via the `GroqLLM` integration.
- `hf-inference-endpoints`: for using the Hugging Face Inference Endpoints via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the `transformers` package via the `TransformersLLM` integration.
- `litellm`: for using `LiteLLM` to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the `llama-cpp-python` Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in the Mistral AI API via the `MistralAILLM` integration.
- `ollama`: for using Ollama and their available models via the `OllamaLLM` integration.
- `openai`: for using OpenAI API models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using Google Vertex AI proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the `vllm` serving engine via the `vLLM` integration.
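Extras can be combined in a single installation using standard pip syntax. For example, to install the latest release with the `openai` and `vllm` extras from the list above:

```bash
pip install "distilabel[openai,vllm]" --upgrade
```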
!!! NOTE
    The `mistralai` dependency requires Python 3.9 or higher, so if you want to use the `distilabel.llms.MistralLLM` implementation, you will need to have Python 3.9 or higher.
In some cases, like `transformers` and `vllm`, the installation of `flash-attn` is recommended if you are using a GPU accelerator, since it will speed up the inference process; however, the installation needs to be done separately, as it's not included in the `distilabel` dependencies:
```bash
pip install flash-attn --no-build-isolation
```
Also, if you want to use the `llama-cpp-python` integration for running local LLMs, note that the installation process may get a bit trickier depending on which OS you are using, so we recommend reading through the Installation section of their docs.
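As a rough illustration (not from their docs, so double-check the flags for your platform and `llama.cpp` version), hardware-specific backends are typically enabled by passing CMake flags at install time:

```bash
# Assumed example: build llama-cpp-python with CUDA support. The exact
# CMake flag has changed across versions; consult their installation docs.
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```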
To start off, `distilabel` is a framework for building pipelines that generate synthetic data using LLMs. It defines a [`Pipeline`][distilabel.pipeline.Pipeline] which orchestrates the execution of [`Step`][distilabel.steps.base.Step] subclasses, connected as nodes in a Directed Acyclic Graph (DAG).
That being said, in this guide we will walk you through the process of creating a simple pipeline that uses the [`OpenAILLM`][distilabel.llms.OpenAILLM] class to generate text. The [`Pipeline`][distilabel.pipeline.Pipeline] will load a dataset that contains a column named `prompt` from the Hugging Face Hub via the [`LoadDataFromHub`][distilabel.steps.LoadDataFromHub] step, and then use the [`OpenAILLM`][distilabel.llms.OpenAILLM] class to generate text based on the dataset using the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task.
```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(  # (1)
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:  # (2)
    load_dataset = LoadDataFromHub(  # (3)
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    text_generation = TextGeneration(  # (4)
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),  # (5)
    )

    load_dataset >> text_generation  # (6)

if __name__ == "__main__":
    distiset = pipeline.run(  # (7)
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")  # (8)
```
1. We define a [`Pipeline`][distilabel.pipeline.Pipeline] with the name `simple-text-generation-pipeline` and the description `A simple text generation pipeline`. Note that the `name` is mandatory and will be used to calculate the `cache` signature path, so changing the name will change the cache path and the pipeline will be identified as a different one.

2. We are using the [`Pipeline`][distilabel.pipeline.Pipeline] context manager, meaning that every [`Step`][distilabel.steps.base.Step] subclass that is defined within the context manager will be added to the pipeline automatically.

3. We define a [`LoadDataFromHub`][distilabel.steps.LoadDataFromHub] step named `load_dataset` that will load a dataset from the Hugging Face Hub, as provided via runtime parameters in the `pipeline.run` method below, but it can also be defined within the class instance via the arg `repo_id=...`. This step will produce output batches with the rows from the dataset, and the column `prompt` will be mapped to the `instruction` field.

4. We define a [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task named `text_generation` that will generate text based on the `instruction` field from the dataset. This task will use the [`OpenAILLM`][distilabel.llms.OpenAILLM] class with the model `gpt-3.5-turbo`.

5. We define the [`OpenAILLM`][distilabel.llms.OpenAILLM] class with the model `gpt-3.5-turbo` that will be used by the [`TextGeneration`][distilabel.steps.tasks.TextGeneration] task. In this case, since the [`OpenAILLM`][distilabel.llms.OpenAILLM] is used, we assume that the `OPENAI_API_KEY` environment variable is set, and the OpenAI API will be used to generate the text.

6. We connect the `load_dataset` step to the `text_generation` task using the `rshift` operator (`>>`), meaning that the output from the `load_dataset` step will be used as the input for the `text_generation` task.

7. We run the pipeline with the parameters for the `load_dataset` and `text_generation` steps. The `load_dataset` step will use the repository `distilabel-internal-testing/instruction-dataset-mini` and the `test` split, and the `text_generation` task will use the `generation_kwargs` with `temperature` set to `0.7` and `max_new_tokens` set to `512`.

8. Optionally, we can push the generated [`Distiset`][distilabel.distiset.Distiset] to the Hugging Face Hub repository `distilabel-example`. This will allow you to share the generated dataset with others and use it in other pipelines; a quick way to inspect the resulting distiset before pushing is sketched right after this list.
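The `pipeline.run` call returns a [`Distiset`][distilabel.distiset.Distiset], a dict-like object with one subset per leaf step. As a minimal sketch, assuming the pipeline above (a single leaf step, whose generations we expect under the `default` key), you can inspect it before pushing:

```python
# Minimal sketch: inspect the Distiset returned by pipeline.run above.
# We assume a single leaf step exposed under the "default" subset; check
# the actual keys of your distiset if they differ.
print(list(distiset.keys()))            # e.g. ["default"]
print(distiset["default"]["train"][0])  # first generated row
```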
`distilabel` gives you a lot of flexibility to create your pipelines, but to start right away, you can omit a lot of the details and rely on default values:
```python
from datasets import load_dataset

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

dataset = load_dataset("distilabel-internal-testing/instruction-dataset-mini", split="test")

with Pipeline() as pipeline:  # (1)
    TextGeneration(llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"))  # (2)

if __name__ == "__main__":
    distiset = pipeline.run(dataset=dataset)  # (3)
    distiset.push_to_hub(repo_id="distilabel-example")
```
1. The [`Pipeline`][distilabel.pipeline.Pipeline] can take no arguments and will generate a default name on its own that will be tracked internally.

2. Just as with the [`Pipeline`][distilabel.pipeline.Pipeline], the [`Step`][distilabel.steps.base.Step]s don't explicitly need a name.

3. You can generate the dataset as you would normally do with Hugging Face and pass it to the `run` method.
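Once pushed, the generated dataset can be pulled back with the `datasets` library like any other Hub dataset. A minimal sketch, assuming the `distilabel-example` repository from above lives under your username and exposes the `default` subset:

```python
from datasets import load_dataset

# Hypothetical repo id: replace "<your-username>" with the account the
# Distiset was pushed to; adjust the subset name if your distiset differs.
dataset = load_dataset("<your-username>/distilabel-example", "default", split="train")
print(dataset[0])
```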