Skip to content

Instantly share code, notes, and snippets.

@bikash119
Last active September 27, 2024 16:07
Show Gist options
  • Save bikash119/33aa5f31002b95def5f1b648ddccecb5 to your computer and use it in GitHub Desktop.
Save bikash119/33aa5f31002b95def5f1b648ddccecb5 to your computer and use it in GitHub Desktop.

Generating Synthetic Data with Distilabel

πŸ§‘ human (Sep 27, 2024, 05:42 PM)

help me with code to generate synthetic data using distilabel

πŸ€– assistant (Sep 27, 2024, 05:42 PM)

Certainly, I can help you with code to generate synthetic data using Distilabel. Distilabel is a library for creating labeled datasets using language models. Let's create a simple example to demonstrate how to use it.

<antArtifact identifier="distilabel-synthetic-data" type="application/vnd.ant.code" language="python" title="Synthetic Data Generation with Distilabel">
from distilabel import Pipeline
from distilabel.tasks import OpenEndedGeneration
from distilabel.llm import OpenAILLM
from distilabel.utils import format_messages

# Define your OpenAI API key
api_key = "your-api-key-here"

# Create an instance of the OpenAI language model
llm = OpenAILLM(model="gpt-3.5-turbo", api_key=api_key)

# Define the task (in this case, open-ended generation)
task = OpenEndedGeneration()

# Create a pipeline
pipeline = Pipeline(llm=llm, task=task)

# Define your prompt
prompt = "Generate a short story about a robot learning to paint."

# Generate synthetic data
results = pipeline.run(
    [format_messages(prompt)],
    num_generations=5,  # Number of examples to generate
)

# Print the generated data
for idx, result in enumerate(results, 1):
    print(f"Example {idx}:")
    print(result.data)
    print("\n" + "-"*50 + "\n")
</antArtifact>

This code sets up a simple pipeline using Distilabel to generate synthetic data. Here's a breakdown of what the code does:

  1. We import the necessary modules from Distilabel.
  2. We create an instance of the OpenAI language model (you'll need to replace "your-api-key-here" with your actual OpenAI API key).
  3. We define the task as open-ended generation.
  4. We create a pipeline combining the language model and the task.
  5. We define a prompt for generating short stories about robots learning to paint.
  6. We run the pipeline to generate 5 examples of synthetic data.
  7. Finally, we print out the generated data.

To use this code, you'll need to install Distilabel and its dependencies:

pip install distilabel

Also, make sure you have an OpenAI API key and replace "your-api-key-here" with your actual key.

Would you like me to explain any part of this code in more detail?

πŸ§‘ human (Sep 27, 2024, 05:54 PM)

help me use vLLM for the generation and use instructor to generate structured data as output

πŸ€– assistant (Sep 27, 2024, 05:54 PM)

Certainly! I'll help you modify the code to use vLLM for generation and Instructor to generate structured data as output. This combination will allow for faster inference with vLLM and structured output using Instructor.

<antArtifact identifier="distilabel-synthetic-data" type="application/vnd.ant.code" language="python" title="Synthetic Data Generation with Distilabel, vLLM, and Instructor">
import os
from typing import List
from pydantic import BaseModel, Field
from instructor import OpenAISchema
from distilabel import Pipeline
from distilabel.tasks import OpenEndedGeneration
from distilabel.llm import vLLMLLM
from distilabel.utils import format_messages

# Define the structure for our output data
class Story(OpenAISchema):
    title: str = Field(description="The title of the short story")
    content: str = Field(description="The main content of the short story")
    themes: List[str] = Field(description="A list of themes present in the story")

class StoryDataset(BaseModel):
    stories: List[Story]

# Set up vLLM
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
llm = vLLMLLM(
    model="openai/gpt-3.5-turbo",
    api_key="dummy",  # vLLM doesn't need a real API key
    max_tokens=500,
    temperature=0.7,
)

# Define the task
task = OpenEndedGeneration()

# Create a pipeline
pipeline = Pipeline(llm=llm, task=task)

# Define your prompt
prompt = """Generate a short story about a robot learning to paint. 
The story should have a title, main content, and a list of themes.
Provide the output in the following JSON format:
{
  "title": "Story title",
  "content": "Main story content",
  "themes": ["Theme 1", "Theme 2", "Theme 3"]
}"""

# Generate synthetic data
results = pipeline.run(
    [format_messages(prompt)],
    num_generations=5,  # Number of examples to generate
)

# Parse and structure the generated data
structured_data = StoryDataset(stories=[])

for result in results:
    try:
        story = Story.from_response(result.data)
        structured_data.stories.append(story)
    except Exception as e:
        print(f"Error parsing story: {e}")

# Print the structured data
for idx, story in enumerate(structured_data.stories, 1):
    print(f"Story {idx}:")
    print(f"Title: {story.title}")
    print(f"Content: {story.content}")
    print(f"Themes: {', '.join(story.themes)}")
    print("\n" + "-"*50 + "\n")
</antArtifact>

This updated code uses vLLM for generation and Instructor to create structured output. Here's a breakdown of the changes and additions:

  1. We import necessary modules from Instructor and Pydantic.
  2. We define Story and StoryDataset classes to structure our output data.
  3. We set up vLLM instead of the standard OpenAI model. Note that vLLM doesn't require a real API key, but we still set the OPENAI_API_KEY environment variable to avoid errors.
  4. We modify the prompt to request output in a specific JSON format.
  5. After generating the data, we use Instructor to parse the output into structured Story objects.
  6. Finally, we print the structured data.

To use this code, you'll need to install the additional required packages:

pip install distilabel vllm instructor

Make sure you have the necessary GPU support for vLLM, as it's designed for GPU acceleration.

This setup allows for faster generation with vLLM and provides structured output, making it easier to work with the generated data in your applications.

Would you like me to explain any part of this code in more detail or make any further modifications?

πŸ§‘ human (Sep 27, 2024, 08:44 PM)

What is distilabel and how can I use it?

πŸ€– assistant (Sep 27, 2024, 08:44 PM)

Distilabel is an open-source Python library designed to simplify the process of creating labeled datasets using language models (LLMs). It's particularly useful for generating synthetic data, which can be valuable for training and evaluating machine learning models. Let me explain its key features and how you can use it.

Key features of Distilabel:

  1. Flexibility: It supports various LLMs and can be used for different types of tasks.
  2. Pipeline-based: It uses a pipeline architecture, making it easy to combine different components.
  3. Customizable: You can define custom tasks, prompts, and evaluation metrics.
  4. Integration: It works well with popular ML libraries and frameworks.

Here's how you can use Distilabel:

  1. Installation: First, install Distilabel using pip:

    pip install distilabel
    
  2. Basic Usage:

    • Import necessary modules
    • Set up a language model (LLM)
    • Define a task
    • Create a pipeline
    • Run the pipeline to generate data
  3. Common Use Cases:

    • Generating synthetic datasets
    • Creating labeled data for specific tasks
    • Augmenting existing datasets
    • Evaluating LLM performance on various tasks
  4. Customization:

    • You can create custom tasks by subclassing existing task classes
    • Define custom prompts for specific data generation needs
    • Implement custom evaluation metrics
  5. Integration:

    • Distilabel can be integrated into larger ML workflows
    • It can be used with various LLMs (OpenAI, Hugging Face models, etc.)
    • The generated data can be easily exported for use in other tools

Here's a simple example of how you might use Distilabel:

from distilabel import Pipeline
from distilabel.tasks import OpenEndedGeneration
from distilabel.llm import OpenAILLM
from distilabel.utils import format_messages

llm = OpenAILLM(model="gpt-3.5-turbo", api_key="your-api-key")
task = OpenEndedGeneration()
pipeline = Pipeline(llm=llm, task=task)

prompt = "Generate a short story about space exploration."
results = pipeline.run([format_messages(prompt)], num_generations=3)

for result in results:
    print(result.data)

This script would generate three short stories about space exploration using GPT-3.5-turbo.

Distilabel is particularly useful when you need to:

  • Create large amounts of labeled data quickly
  • Generate diverse datasets for robust model training
  • Evaluate LLM performance on specific tasks
  • Create benchmarks for comparing different models or approaches

Would you like me to go into more detail on any specific aspect of using Distilabel, or perhaps show you a more complex example?

πŸ§‘ human (Sep 27, 2024, 08:47 PM)

what are the key apis of distilabel to generate sythentic data?

πŸ€– assistant (Sep 27, 2024, 08:47 PM)

Certainly! I'll explain the key APIs of Distilabel that are essential for generating synthetic data. These APIs form the core of Distilabel's functionality for data generation.

  1. Pipeline API: The Pipeline class is the central API for data generation in Distilabel. It combines a language model (LLM) and a task to create a data generation pipeline.

    Key methods:

    • __init__(llm, task): Initialize the pipeline with an LLM and a task.
    • run(messages, num_generations): Run the pipeline to generate data.
  2. LLM API: Distilabel provides various LLM classes that serve as interfaces to different language models. The most commonly used ones are:

    • OpenAILLM: For OpenAI models
    • vLLMLLM: For using vLLM (which can run various models more efficiently)
    • HuggingFaceLLM: For Hugging Face models

    Key methods:

    • __init__(): Initialize the LLM with model-specific parameters.
    • generate(): Generate text based on input prompts.
  3. Task API: Tasks define the specific data generation or labeling job. The main task classes include:

    • OpenEndedGeneration: For general text generation
    • InstructionFollowing: For generating responses to instructions
    • QA: For question-answering tasks

    Key methods:

    • get_prompt(): Get the prompt for the task.
    • parse_result(): Parse the generated result.
  4. Utils API: Utility functions that help in formatting messages and handling data:

    • format_messages(): Format input messages for the LLM.

Here's a basic example that illustrates how these APIs work together:

from distilabel import Pipeline
from distilabel.tasks import OpenEndedGeneration
from distilabel.llm import OpenAILLM
from distilabel.utils import format_messages

# Set up the LLM
llm = OpenAILLM(model="gpt-3.5-turbo", api_key="your-api-key")

# Define the task
task = OpenEndedGeneration()

# Create the pipeline
pipeline = Pipeline(llm=llm, task=task)

# Define your prompt
prompt = "Generate a short story about artificial intelligence."

# Run the pipeline
results = pipeline.run(
    [format_messages(prompt)],
    num_generations=3  # Generate 3 examples
)

# Process the results
for idx, result in enumerate(results, 1):
    print(f"Story {idx}:")
    print(result.data)
    print("-" * 50)

This example demonstrates the core workflow:

  1. Set up an LLM (in this case, OpenAI's GPT-3.5-turbo).
  2. Define a task (OpenEndedGeneration for free-form text generation).
  3. Create a pipeline combining the LLM and task.
  4. Format the input prompt using format_messages().
  5. Run the pipeline to generate synthetic data.
  6. Process and display the results.

These APIs provide a flexible framework for generating synthetic data. You can customize various aspects:

  • Use different LLMs by changing the LLM class.
  • Define custom tasks by subclassing existing task classes.
  • Modify prompts to generate specific types of data.
  • Adjust generation parameters (like temperature, max_tokens) in the LLM initialization.

Would you like me to elaborate on any specific part of these APIs or show you a more advanced example of using them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment