@bikash119
Last active September 27, 2024 16:08

synthetic data

🧑 human (Sep 27, 2024, 05:45 PM)

help me create synthetic data with distilabel

🤖 assistant (Sep 27, 2024, 05:45 PM)

Certainly! I'd be happy to help you create synthetic data using Distilabel. Let's walk through the process of setting up a simple pipeline to generate synthetic data. We'll use the OpenAI API for this example, but you can adjust it based on your preferred LLM provider.

Here's a step-by-step guide to create a basic synthetic data generation pipeline:

  1. First, make sure you have Distilabel installed (the quotes keep shells such as zsh from expanding the brackets):
pip install "distilabel[openai]" --upgrade
  2. Set up your OpenAI API key as an environment variable:
export OPENAI_API_KEY=your_api_key_here
  3. Now, let's create a Python script to set up and run the pipeline. We'll call it generate_synthetic_data.py:
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

# Define the pipeline
with Pipeline(
    name="synthetic-data-generation",
    description="A pipeline to generate synthetic data based on prompts",
) as pipeline:
    # Step 1: Load a dataset with prompts
    load_dataset = LoadDataFromHub(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    # Step 2: Generate synthetic data using the prompts
    text_generation = TextGeneration(
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )

    # Connect the steps
    load_dataset >> text_generation

if __name__ == "__main__":
    # Run the pipeline
    synthetic_dataset = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    
    # Save the generated dataset to the Hugging Face Hub
    synthetic_dataset.push_to_hub(repo_id="your-username/synthetic-data-example")
  4. Run the script:
python generate_synthetic_data.py

This script does the following:

  1. It creates a pipeline with two steps:

    • LoadDataFromHub: Loads a dataset with prompts from the Hugging Face Hub.
    • TextGeneration: Generates synthetic data based on the prompts using the OpenAI GPT-3.5-turbo model.
  2. It runs the pipeline, which will:

    • Load the prompts from the specified dataset.
    • Generate synthetic text for each prompt.
  3. Finally, it saves the generated synthetic dataset to your Hugging Face Hub account.
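
To make the `output_mappings` argument concrete, here is a minimal plain-Python sketch of the column renaming it performs. It assumes the source dataset exposes a `prompt` column and that `TextGeneration` reads an `instruction` column. The helper `apply_output_mappings` is hypothetical and only imitates the behavior; it is not Distilabel's own code.

```python
from typing import Dict, List


def apply_output_mappings(
    rows: List[Dict[str, str]], mappings: Dict[str, str]
) -> List[Dict[str, str]]:
    """Rename columns in each row according to {old_name: new_name}."""
    return [
        # Columns without a mapping keep their original name.
        {mappings.get(column, column): value for column, value in row.items()}
        for row in rows
    ]


rows = [{"prompt": "Write a haiku about rain."}]
renamed = apply_output_mappings(rows, {"prompt": "instruction"})
print(renamed)
```

If your prompt dataset already has an `instruction` column, the mapping is unnecessary and can be dropped.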

You can customize this pipeline by:

  • Using a different dataset of prompts (change the repo_id in the parameters).
  • Adjusting the generation parameters (e.g., temperature, max_new_tokens).
  • Using a different LLM (e.g., replace OpenAILLM with another supported LLM like AnthropicLLM or MistralLLM).

Remember to replace "your-username/synthetic-data-example" with your actual Hugging Face username and desired repository name.
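
As a rough sketch of the first two customizations, the nested `parameters` dictionary can be built by a small helper so that the dataset and generation settings are easy to vary between runs. The function name `build_run_parameters` is hypothetical; the keys assume the step names `load_dataset` and `text_generation` used in the script above.

```python
from typing import Any, Dict


def build_run_parameters(
    repo_id: str,
    split: str = "test",
    temperature: float = 0.7,
    max_new_tokens: int = 512,
) -> Dict[str, Dict[str, Any]]:
    """Return the nested dict passed as `parameters` to `pipeline.run()`."""
    return {
        # Runtime parameters for the loading step.
        "load_dataset": {"repo_id": repo_id, "split": split},
        # Runtime parameters for the LLM inside the generation task.
        "text_generation": {
            "llm": {
                "generation_kwargs": {
                    "temperature": temperature,
                    "max_new_tokens": max_new_tokens,
                }
            }
        },
    }


params = build_run_parameters(
    "distilabel-internal-testing/instruction-dataset-mini", temperature=0.2
)
print(params["text_generation"]["llm"]["generation_kwargs"])
```

You would then call `pipeline.run(parameters=params)` in place of the inline dictionary.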

This example provides a basic setup for generating synthetic data. Depending on your specific needs, you may want to add more steps to the pipeline, such as data cleaning, formatting, or additional processing of the generated text.

Would you like me to explain any part of this process in more detail or help you customize it further?

🧑 human (Sep 27, 2024, 05:52 PM)

help me use vLLM for the generation and use instructor to generate structured data as output

🤖 assistant (Sep 27, 2024, 05:52 PM)

Certainly! I'd be happy to help you use vLLM for generation and Instructor to generate structured data as output. We'll modify the previous example to incorporate these changes. First, let's set up the environment and then create a new script.

  1. Install the necessary packages:
pip install "distilabel[vllm,outlines]" instructor pydantic --upgrade
  2. Now, let's create a new Python script called generate_structured_data.py. We'll use vLLM to run a local model and constrain its output to a Pydantic schema:
from typing import List
from pydantic import BaseModel, Field
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

# Define the structure for our synthetic data
class SyntheticDataItem(BaseModel):
    title: str = Field(..., description="A short title for the generated content")
    content: str = Field(..., description="The main body of generated content")
    tags: List[str] = Field(..., description="A list of relevant tags for the content")

# Instructor patches API clients (OpenAI, Anthropic, ...) and cannot wrap a local
# vLLM engine. This helper therefore relies on the `structured_output` argument of
# Distilabel's vLLM class, which is backed by Outlines. It assumes a Distilabel
# version that supports that argument.
def StructuredvLLM(model: str) -> vLLM:
    """Return a vLLM whose generations are JSON matching SyntheticDataItem."""
    return vLLM(
        model=model,
        structured_output={"format": "json", "schema": SyntheticDataItem},
    )
# Define the pipeline
with Pipeline(
    name="structured-synthetic-data-generation",
    description="A pipeline to generate structured synthetic data based on prompts",
) as pipeline:
    # Step 1: Load a dataset with prompts
    load_dataset = LoadDataFromHub(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    # Step 2: Generate structured synthetic data using the prompts
    text_generation = TextGeneration(
        name="text_generation",
        llm=StructuredvLLM(model="meta-llama/Llama-2-7b-chat-hf"),  # Adjust the model as needed
    )

    # Connect the steps
    load_dataset >> text_generation

if __name__ == "__main__":
    # Run the pipeline
    synthetic_dataset = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    
    # Save the generated dataset to the Hugging Face Hub
    synthetic_dataset.push_to_hub(repo_id="your-username/structured-synthetic-data-example")

This script does the following:

  1. We define a SyntheticDataItem class using Pydantic, which specifies the structure of our synthetic data.

  2. We create a StructuredvLLM helper that configures the vLLM model to return output matching that structure.

  3. In the pipeline, we use this custom LLM to generate structured synthetic data based on the input prompts.

  4. The generated data will be in JSON format, following the structure defined in SyntheticDataItem.
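
Even with constrained generation, it is worth checking each row before you rely on it. The sketch below assumes every `generation` value is a JSON string and uses only the standard library; `ParsedItem` and `parse_generation` are hypothetical names that mirror the `SyntheticDataItem` fields. With Pydantic available, `SyntheticDataItem.model_validate_json(...)` does the same job more concisely.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedItem:
    """Plain-Python mirror of the SyntheticDataItem fields."""

    title: str
    content: str
    tags: List[str]


def parse_generation(generation: str) -> ParsedItem:
    """Parse one JSON string and check that the expected fields are present."""
    data = json.loads(generation)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    title, content, tags = data.get("title"), data.get("content"), data.get("tags")
    if not isinstance(title, str) or not isinstance(content, str):
        raise ValueError("title and content must be strings")
    if not isinstance(tags, list) or not all(isinstance(tag, str) for tag in tags):
        raise ValueError("tags must be a list of strings")
    return ParsedItem(title=title, content=content, tags=tags)


sample = '{"title": "Rain", "content": "A short poem.", "tags": ["poetry"]}'
item = parse_generation(sample)
print(item)
```

Rows that raise `ValueError` can be dropped or sent back through the pipeline.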

To run this script:

python generate_structured_data.py

Note:

  • Make sure you have the necessary permissions and resources to run the specified model (in this case, "meta-llama/Llama-2-7b-chat-hf"). You might need to adjust the model name based on what you have available locally or have permission to use.
  • The vLLM integration in Distilabel loads the model on the machine that runs the pipeline. If you need a remote vLLM server instead, one option is to start vLLM's OpenAI-compatible server and point OpenAILLM at it through its base_url argument.
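
For the remote case, here is one possible sketch. It assumes a vLLM server exposing its OpenAI-compatible API at `http://<host>:8000/v1`, and relies on the `base_url` and `api_key` arguments of Distilabel's `OpenAILLM`. The helper names and the host `gpu-box` are made up for illustration.

```python
def vllm_server_llm_kwargs(
    model: str, host: str = "localhost", port: int = 8000
) -> dict:
    """Keyword arguments for an OpenAI-compatible client aimed at a vLLM server."""
    return {
        "model": model,
        "base_url": f"http://{host}:{port}/v1",
        # Placeholder value; only checked if the server was started with an API key.
        "api_key": "EMPTY",
    }


def build_remote_llm(model: str, host: str = "localhost", port: int = 8000):
    """Create a Distilabel OpenAILLM that talks to the remote vLLM server."""
    # Imported lazily so the helper above works without Distilabel installed.
    from distilabel.llms import OpenAILLM

    return OpenAILLM(**vllm_server_llm_kwargs(model, host, port))


kwargs = vllm_server_llm_kwargs("meta-llama/Llama-2-7b-chat-hf", host="gpu-box")
print(kwargs["base_url"])
```

The object returned by `build_remote_llm(...)` can be passed as the `llm` argument of `TextGeneration` in place of the local model.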

This setup will generate structured synthetic data with a title, content, and tags for each input prompt. You can further customize the SyntheticDataItem class to include any fields you need in your synthetic data.

Remember to replace "your-username/structured-synthetic-data-example" with your actual Hugging Face username and desired repository name.

Would you like me to explain any part of this process in more detail or help you customize it further?

🧑 human (Sep 27, 2024, 08:44 PM)

What is distilabel and how should I use it?

🤖 assistant (Sep 27, 2024, 08:44 PM)

Distilabel is a framework for generating synthetic data and AI feedback using Large Language Models (LLMs). It's designed to help engineers create fast, reliable, and scalable pipelines for data generation based on verified research papers. Here's an overview of what Distilabel is and how you can use it:

What is Distilabel?

<antArtifact identifier="distilabel-overview" type="text/markdown" title="Distilabel Framework Overview">
# Distilabel Framework Overview

- **Purpose**: Framework for synthetic data generation and AI feedback using LLMs
- **Key Components**:
  1. Pipeline: Orchestrates the execution of Steps
  2. Steps: Connected as nodes in a Directed Acyclic Graph (DAG)
  3. LLM Integrations: Supports various LLM providers (OpenAI, Anthropic, etc.)
  4. Tasks: Predefined operations like TextGeneration, UltraFeedback, etc.
- **Features**:
  - Flexible pipeline creation
  - Support for multiple LLM providers
  - Easy dataset loading and exporting
  - Customizable data generation tasks
- **Use Cases**:
  - Creating synthetic datasets for model training
  - Generating AI feedback for model evaluation
  - Building complex data processing pipelines
</antArtifact>

How to use Distilabel:

  1. Installation: Install Distilabel using pip:

    pip install distilabel
    

    You may want to install extras for specific LLM integrations, e.g.:

    pip install distilabel[openai]
    
  2. Create a Pipeline: Define a pipeline that outlines your data generation process. This typically involves:

    • Loading a dataset (if needed)
    • Defining generation tasks
    • Connecting steps in the pipeline
  3. Configure LLMs: Choose and configure the LLM you want to use for generation. Distilabel supports various providers like OpenAI, Anthropic, and local models via vLLM.

  4. Define Tasks: Use predefined tasks like TextGeneration or create custom tasks for your specific needs.

  5. Run the Pipeline: Execute the pipeline to generate your synthetic data or AI feedback.

  6. Export Results: Save or export the generated data, often to the Hugging Face Hub for easy sharing and versioning.

Here's a basic example of how you might use Distilabel:

<antArtifact identifier="distilabel-basic-example" type="application/vnd.ant.code" language="python" title="Basic Distilabel Usage Example">
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

# Define the pipeline
with Pipeline(name="simple-text-generation") as pipeline:
    # Load dataset
    load_dataset = LoadDataFromHub(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    # Define text generation task
    text_generation = TextGeneration(
        name="text_generation",
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )

    # Connect steps
    load_dataset >> text_generation

# Run the pipeline
if __name__ == "__main__":
    synthetic_data = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "example-dataset/prompts",
                "split": "train",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 100,
                    }
                }
            },
        },
    )
    
    # Save the generated dataset
    synthetic_data.push_to_hub(repo_id="your-username/synthetic-data-example")
</antArtifact>

This example demonstrates loading a dataset, generating text using OpenAI's GPT-3.5-turbo, and saving the results to the Hugging Face Hub.

To use Distilabel effectively:

  1. Understand your data generation needs
  2. Choose appropriate LLMs and tasks
  3. Design your pipeline to reflect your desired data flow
  4. Experiment with different configurations and parameters
  5. Validate and iterate on your generated data

Distilabel is particularly useful for creating large, diverse datasets for training and evaluating AI models, especially in scenarios where manual data creation would be time-consuming or impractical.
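
For the "validate and iterate" step, a quick summary of the output often reveals problems before any deeper evaluation. This is a small standard-library sketch; it assumes each row is a dictionary with a `generation` key, as produced by `TextGeneration`, and the function name `summarize_generations` is hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize_generations(rows: List[Dict[str, Optional[str]]]) -> Dict[str, float]:
    """Report row count, share of empty generations, and mean length in words."""
    # Treat missing or None generations as empty strings.
    texts = [(row.get("generation") or "").strip() for row in rows]
    non_empty = [text for text in texts if text]
    return {
        "rows": len(texts),
        "empty_fraction": (len(texts) - len(non_empty)) / len(texts) if texts else 0.0,
        "mean_words": mean(len(text.split()) for text in non_empty) if non_empty else 0.0,
    }


rows = [
    {"generation": "one two three"},
    {"generation": ""},
    {"generation": None},
    {"generation": "four"},
]
summary = summarize_generations(rows)
print(summary)
```

A high empty fraction or a very short mean length usually points at a prompt, token-limit, or chat-template problem worth fixing before generating at scale.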

Would you like me to elaborate on any specific aspect of using Distilabel?

🧑 human (Sep 27, 2024, 08:47 PM)

what are the key apis of distilabel to generate synthetic data?

🤖 assistant (Sep 27, 2024, 08:47 PM)

To generate synthetic data with Distilabel, there are several key APIs and components you'll work with. Let me break these down for you:

<antArtifact identifier="distilabel-key-apis" type="text/markdown" title="Key Distilabel APIs for Synthetic Data Generation">
# Key Distilabel APIs for Synthetic Data Generation

1. Pipeline
   - Main class: `distilabel.pipeline.Pipeline`
   - Purpose: Orchestrates the entire data generation process
   - Key methods:
     - `__init__()`: Initialize the pipeline
     - `run()`: Execute the pipeline

2. Steps
   - Base class: `distilabel.steps.base.Step`
   - Common subclasses:
     - `distilabel.steps.LoadDataFromHub`: Load data from Hugging Face Hub
     - `distilabel.steps.LoadDataFromDicts`: Load data from Python dictionaries
   - Purpose: Define individual operations in the pipeline

3. Tasks
   - Base class: `distilabel.steps.tasks.Task`
   - Common subclasses:
     - `distilabel.steps.tasks.TextGeneration`: Generate text
     - `distilabel.steps.tasks.UltraFeedback`: Rate generations with an LLM judge
   - Purpose: Define specific data generation or processing tasks

4. LLMs
   - Base class: `distilabel.llms.LLM`
   - Common subclasses:
     - `distilabel.llms.OpenAILLM`: Use OpenAI models
     - `distilabel.llms.AnthropicLLM`: Use Anthropic models
     - `distilabel.llms.vLLM`: Use local models with vLLM
   - Purpose: Interface with different language models

5. Distiset
   - Main class: `distilabel.distiset.Distiset`
   - Purpose: Represent and manipulate the generated dataset
   - Key methods:
     - `push_to_hub()`: Save the dataset to Hugging Face Hub
     - `save_to_disk()`: Save the dataset to a local directory

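6. Quality Evaluation
   - Tasks: `distilabel.steps.tasks.UltraFeedback`, `distilabel.steps.tasks.QualityScorer`
   - Purpose: Rate generated data with an LLM acting as judge
   - Inputs: an `instruction` column plus a list of candidate responses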

</antArtifact>

Now, let me explain how these components work together to generate synthetic data:

  1. Pipeline: The Pipeline is the core of Distilabel. It defines the overall structure of your data generation process. You create a pipeline and add steps to it.

    from distilabel.pipeline import Pipeline
    
    with Pipeline(name="my-synthetic-data-pipeline") as pipeline:
        # Add steps here
  2. Steps and Tasks: Steps are the building blocks of your pipeline. Tasks are specific types of steps designed for data generation or processing. You'll typically use a combination of data loading steps and task steps.

    from distilabel.steps import LoadDataFromHub
    from distilabel.steps.tasks import TextGeneration
    
    load_data = LoadDataFromHub(name="load_data", output_mappings={"prompt": "instruction"})
    generate_text = TextGeneration(name="generate_text", llm=some_llm)
    
    # Connect steps in the pipeline
    load_data >> generate_text
  3. LLMs: LLMs are used within tasks to generate or process data. You'll configure an LLM and pass it to a task.

    from distilabel.llms import OpenAILLM
    
    llm = OpenAILLM(model="gpt-3.5-turbo")
    generate_text = TextGeneration(name="generate_text", llm=llm)
  4. Running the Pipeline: Once your pipeline is set up, you run it to generate data.

    synthetic_data = pipeline.run(
        parameters={
            "load_data": {"repo_id": "example/dataset", "split": "train"},
            "generate_text": {
                "llm": {"generation_kwargs": {"temperature": 0.7, "max_new_tokens": 100}}
            }
        }
    )
  5. Working with Generated Data: The run() method returns a Distiset, which represents your generated dataset. You can manipulate this data or save it.

    # Save to Hugging Face Hub
    synthetic_data.push_to_hub(repo_id="your-username/synthetic-data")
    
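    # Convert one split to a Pandas DataFrame; a Distiset maps subset names to
    # datasets, and "default" is the subset name when there is a single leaf step
    df = synthetic_data["default"]["train"].to_pandas()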
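  6. Evaluating Data Quality: You can add an LLM-judge task such as UltraFeedback to the pipeline to rate your generated data.

    from distilabel.steps.tasks import UltraFeedback
    
    # Expects an `instruction` column and a `generations` column (a list of responses)
    rate = UltraFeedback(name="rate", llm=llm, aspect="overall-rating")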

To effectively use these APIs for synthetic data generation:

  1. Design your pipeline to reflect your data generation needs.
  2. Choose appropriate LLMs and tasks for your use case.
  3. Configure your steps and tasks with suitable parameters.
  4. Run the pipeline and iterate based on the results.
  5. Evaluate your generated data, for example with an LLM-judge task, and use the results to improve it.
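
If you have reference answers for some prompts, a reference-based overlap score is a cheap first check. The sketch below is a plain-Python unigram F1, similar in spirit to ROUGE-1. It is a stand-in written for illustration, not a Distilabel API, and it uses naive whitespace tokenization.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over lowercased whitespace tokens shared by candidate and reference."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("the cat", "the cat sat")
print(round(score, 3))
```

Scores like this only measure surface overlap, so treat them as a filter for obvious failures and not as a measure of quality.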

Remember, the flexibility of Distilabel allows you to create complex pipelines with multiple steps and tasks, enabling sophisticated data generation processes.
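
To illustrate how chaining steps with `>>` can describe a graph, here is a toy sketch. `ToyStep` and `edges` are hypothetical and deliberately simplified; they show the idea of operator overloading building a DAG and do not reproduce Distilabel's internals.

```python
from typing import List, Tuple


class ToyStep:
    """A named node that remembers which steps come after it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.downstream: List["ToyStep"] = []

    def __rshift__(self, other: "ToyStep") -> "ToyStep":
        # `a >> b` records the edge a -> b and returns b so chains keep going.
        self.downstream.append(other)
        return other


def edges(root: ToyStep) -> List[Tuple[str, str]]:
    """Collect (parent, child) name pairs reachable from root."""
    found: List[Tuple[str, str]] = []
    stack, seen = [root], set()
    while stack:
        step = stack.pop()
        if step.name in seen:
            continue
        seen.add(step.name)
        for child in step.downstream:
            found.append((step.name, child.name))
            stack.append(child)
    return found


load, generate, rate = ToyStep("load"), ToyStep("generate"), ToyStep("rate")
load >> generate >> rate
graph = edges(load)
print(graph)
```

In a real pipeline the same pattern lets one loading step feed several tasks, or several tasks feed one combining step.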

Would you like me to elaborate on any specific part of this process or show a more complex example?
