@init27
Created May 23, 2025 18:45
## Step 1: Create a Sample from GSM8K
Create a Python script named `prepare_gsm8k.py`:
```python
import os
from datasets import load_dataset
# Create directories if they don't exist
os.makedirs('data/output', exist_ok=True)
# Load GSM8K dataset
gsm8k = load_dataset('gsm8k', 'main')
# Take 50 samples from the training set
samples = gsm8k['train'].select(range(50))
# Create a text file with the questions
with open('data/output/gsm8k_sample.txt', 'w') as f:
    for i, item in enumerate(samples):
        f.write(f"Problem {i+1}:\n{item['question']}\n\n")

print(f"Created sample with {len(samples)} problems")
print("Sample saved to data/output/gsm8k_sample.txt")
```
Run the script:
```bash
python prepare_gsm8k.py
```
## Step 2: Generate Chain-of-Thought Examples
Export your API key:
```bash
export LLAMA_API_KEY="your-together-api-key"
```
Run the create command to generate CoT examples:
```bash
synthetic-data-kit create data/output/gsm8k_sample.txt --type cot
```
This will:
- Process the sample GSM8K problems
- Generate detailed chain-of-thought reasoning for each problem
- Save the output to `data/generated/gsm8k_sample_cot_examples.json`
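After the create step finishes, it is worth inspecting the output before curating. A hypothetical inspection snippet, assuming the JSON is a list of objects with the `question`/`reasoning`/`answer` fields shown later in this guide:

```python
# Hypothetical helper: count examples and the average number of reasoning
# steps, assuming each "reasoning" string uses "Step N:" markers.
def summarize(examples):
    steps = [ex["reasoning"].count("Step") for ex in examples]
    return {"count": len(examples), "avg_steps": sum(steps) / len(examples)}

examples = [
    {"question": "q1", "reasoning": "Step 1: a\nStep 2: b", "answer": "x"},
    {"question": "q2", "reasoning": "Step 1: a", "answer": "y"},
]
print(summarize(examples))  # {'count': 2, 'avg_steps': 1.5}
```

In practice you would `json.load` the file at `data/generated/gsm8k_sample_cot_examples.json` and pass it to `summarize`.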
## Step 3: Curate Examples by Quality
Next, curate the generated examples to keep only high-quality reasoning:
```bash
synthetic-data-kit curate data/generated/gsm8k_sample_cot_examples.json --threshold 8.0
```
This will:
- Rate each example for quality on a scale of 1-10
- Filter out examples below the threshold
- Save the curated examples to `data/cleaned/gsm8k_sample_cot_examples_cleaned.json`
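Conceptually, the curation step reduces to threshold filtering. A minimal sketch, where the 1-10 rating is supplied directly (the kit obtains it from an LLM judge):

```python
# Sketch of curation: keep only examples rated at or above the threshold.
def curate(rated_examples, threshold=8.0):
    return [ex for ex in rated_examples if ex["rating"] >= threshold]

rated = [
    {"question": "q1", "rating": 9.0},
    {"question": "q2", "rating": 6.5},
    {"question": "q3", "rating": 8.0},
]
kept = curate(rated, threshold=8.0)
print([ex["question"] for ex in kept])  # ['q1', 'q3']
```

Note that the comparison is inclusive: an example rated exactly 8.0 survives a `--threshold 8.0` run under this reading.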
## Step 4: Save in Fine-Tuning Format
Finally, save the curated examples in a format suitable for fine-tuning:
```bash
# Save as ChatML format (for Llama models)
synthetic-data-kit save-as data/cleaned/gsm8k_sample_cot_examples_cleaned.json -f chatml
# Alternatively, save as OpenAI fine-tuning format
synthetic-data-kit save-as data/cleaned/gsm8k_sample_cot_examples_cleaned.json -f ft
```
The output will be saved to `data/final/`.
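The chat-format conversion can be pictured as wrapping each example in role-tagged messages. A simplified sketch of the idea; the kit's exact output schema may differ:

```python
def to_chat_messages(example):
    """Wrap a CoT example in user/assistant messages (simplified sketch)."""
    return {
        "messages": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant",
             "content": f"{example['reasoning']}\nAnswer: {example['answer']}"},
        ]
    }

ex = {"question": "What is 2 + 2?",
      "reasoning": "Step 1: Add 2 and 2.",
      "answer": "4"}
record = to_chat_messages(ex)
print(record["messages"][1]["content"])  # Step 1: Add 2 and 2.\nAnswer: 4
```

Keeping the reasoning and the final answer in one assistant turn is what teaches the model to produce the chain of thought before the answer.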
## Step 5: Automate the Entire Pipeline
Create a bash script `run_gsm8k_pipeline.sh`:
```bash
#!/bin/bash
# Set your API key
export LLAMA_API_KEY="your-together-api-key"
# Step 1: Prepare the GSM8K sample
python prepare_gsm8k.py
# Step 2: Generate CoT examples
synthetic-data-kit create data/output/gsm8k_sample.txt --type cot
# Step 3: Curate examples
synthetic-data-kit curate data/generated/gsm8k_sample_cot_examples.json --threshold 8.0
# Step 4: Save in fine-tuning format
synthetic-data-kit save-as data/cleaned/gsm8k_sample_cot_examples_cleaned.json -f chatml
echo "Pipeline complete! Fine-tuning data ready in data/final/"
```
Make it executable and run:
```bash
chmod +x run_gsm8k_pipeline.sh
./run_gsm8k_pipeline.sh
```
## Examining the Generated Examples
The generated CoT examples follow this structure:
```json
{
  "question": "How many clips did Natalia sell in total during April and May?",
  "reasoning": "Step 1: First, I need to determine the number of clips Natalia sold in April, which is given as 48.\nStep 2: Then, I calculate the number of clips she sold in May, which is half of the number sold in April.\nStep 3: To find the total number of clips sold, I add the number of clips sold in April and May.",
  "answer": "72"
}
```
These examples can be used to fine-tune models to develop better chain-of-thought reasoning capabilities for solving mathematical problems.