Created
May 23, 2025 18:45
## Step 1: Create a Sample from GSM8K

Create a Python script named `prepare_gsm8k.py`:

```python
import os

from datasets import load_dataset

# Create the output directory if it doesn't exist
os.makedirs('data/output', exist_ok=True)

# Load the GSM8K dataset
gsm8k = load_dataset('gsm8k', 'main')

# Take 50 samples from the training set
samples = gsm8k['train'].select(range(50))

# Write the questions to a text file
with open('data/output/gsm8k_sample.txt', 'w') as f:
    for i, item in enumerate(samples):
        f.write(f"Problem {i+1}:\n{item['question']}\n\n")

print(f"Created sample with {len(samples)} problems")
print("Sample saved to data/output/gsm8k_sample.txt")
```

Run the script:

```bash
python prepare_gsm8k.py
```
## Step 2: Generate Chain-of-Thought Examples

Export your API key:

```bash
export LLAMA_API_KEY="your-together-api-key"
```

Run the create command to generate CoT examples:

```bash
synthetic-data-kit create data/output/gsm8k_sample.txt --type cot
```

This will:

- Process the sample GSM8K problems
- Generate detailed chain-of-thought reasoning for each problem
- Save the output to `data/generated/gsm8k_sample_cot_examples.json`
## Step 3: Curate Examples by Quality

Next, curate the generated examples to keep only high-quality reasoning:

```bash
synthetic-data-kit curate data/generated/gsm8k_sample_cot_examples.json --threshold 8.0
```

This will:

- Rate each example for quality on a scale of 1-10
- Filter out examples below the threshold
- Save the curated examples to `data/cleaned/gsm8k_sample_cot_examples_cleaned.json`
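Conceptually, the threshold acts as a simple cutoff over model-assigned scores. A minimal sketch of that filtering step, with hypothetical ratings and field names (the real scores come from the rating model):

```python
# Hypothetical rated examples -- schema and scores are assumptions for illustration
rated = [
    {"question": "Q1", "answer": "72", "rating": 9.2},
    {"question": "Q2", "answer": "5",  "rating": 6.5},
    {"question": "Q3", "answer": "18", "rating": 8.0},
]

threshold = 8.0

# Keep only examples at or above the threshold
kept = [ex for ex in rated if ex["rating"] >= threshold]
print(len(kept))  # 2
```

Raising the threshold trades dataset size for quality; with small samples like this 50-problem run, a high cutoff can leave very few examples.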
## Step 4: Save in Fine-Tuning Format

Finally, save the curated examples in a format suitable for fine-tuning:

```bash
# Save as ChatML format (for Llama models)
synthetic-data-kit save-as data/cleaned/gsm8k_sample_cot_examples_cleaned.json -f chatml

# Alternatively, save as OpenAI fine-tuning format
synthetic-data-kit save-as data/cleaned/gsm8k_sample_cot_examples_cleaned.json -f ft
```

The output will be saved to `data/final/`.
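Chat-style fine-tuning formats represent each example as a list of role-tagged messages. A rough sketch of what one curated CoT example becomes (the system prompt and exact field mapping here are assumptions, not the tool's verbatim output):

```python
# One curated example in the schema shown later in this guide
example = {
    "question": "How many clips did Natalia sell in total during April and May?",
    "reasoning": "Step 1: April sales are 48.\nStep 2: May is half of April, 24.\nStep 3: 48 + 24 = 72.",
    "answer": "72",
}

# Convert to a chat-style message list; the system prompt is illustrative
messages = [
    {"role": "system", "content": "You are a math tutor. Show your reasoning step by step."},
    {"role": "user", "content": example["question"]},
    {"role": "assistant", "content": f"{example['reasoning']}\nAnswer: {example['answer']}"},
]
print(len(messages))  # 3
```

The key idea is that the reasoning and final answer both land in the assistant turn, so the model learns to emit the chain of thought before the answer.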
## Step 5: Automate the Entire Pipeline

Create a bash script `run_gsm8k_pipeline.sh`:

```bash
#!/bin/bash
set -e  # Stop at the first failed step

# Set your API key
export LLAMA_API_KEY="your-together-api-key"

# Step 1: Prepare the GSM8K sample
python prepare_gsm8k.py

# Step 2: Generate CoT examples
synthetic-data-kit create data/output/gsm8k_sample.txt --type cot

# Step 3: Curate examples
synthetic-data-kit curate data/generated/gsm8k_sample_cot_examples.json --threshold 8.0

# Step 4: Save in fine-tuning format
synthetic-data-kit save-as data/cleaned/gsm8k_sample_cot_examples_cleaned.json -f chatml

echo "Pipeline complete! Fine-tuning data ready in data/final/"
```

Make it executable and run:

```bash
chmod +x run_gsm8k_pipeline.sh
./run_gsm8k_pipeline.sh
```
## Examining the Generated Examples

The generated CoT examples follow this structure:

```json
{
  "question": "How many clips did Natalia sell in total during April and May?",
  "reasoning": "Step 1: First, I need to determine the number of clips Natalia sold in April, which is given as 48.\nStep 2: Then, I calculate the number of clips she sold in May, which is half of the number sold in April.\nStep 3: To find the total number of clips sold, I add the number of clips sold in April and May.",
  "answer": "72"
}
```

These examples can be used to fine-tune models to develop better chain-of-thought reasoning capabilities for solving mathematical problems.
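The arithmetic in the example above can be checked mechanically. A small sketch that follows its reasoning steps (April sales of 48, May at half of April):

```python
# Step 1: April sales, as given in the example
april = 48

# Step 2: May is half of April
may = april // 2

# Step 3: total is April plus May
total = april + may
print(total)  # 72
```

Running this confirms the example's final answer of 72, which is the kind of consistency check worth spot-applying to curated CoT data before fine-tuning.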