@init27
Created May 23, 2025 18:45
## Step 1: Create a Sample from GSM8K
Create a Python script named `prepare_gsm8k.py`:
```python
import os
from datasets import load_dataset
# Create directories if they don't exist
os.makedirs('data/output', exist_ok=True)
# Load GSM8K dataset
gsm8k = load_dataset('gsm8k', 'main')
# Take 50 samples from the training set
samples = gsm8k['train'].select(range(50))
# Create a text file with the questions
with open('data/output/gsm8k_sample.txt', 'w') as f:
    for i, item in enumerate(samples):
        f.write(f"Problem {i+1}:\n{item['question']}\n\n")

print(f"Created sample with {len(samples)} problems")
print("Sample saved to data/output/gsm8k_sample.txt")
```
Run the script:
```bash
python prepare_gsm8k.py
```
## Step 2: Generate Chain-of-Thought Examples
Export your API key:
```bash
export LLAMA_API_KEY="your-together-api-key"
```
Run the create command to generate CoT examples:
```bash
synthetic-data-kit create data/output/gsm8k_sample.txt --type cot
```
This will:
- Process the sample GSM8K problems
- Generate detailed chain-of-thought reasoning for each problem
- Save the output to `data/generated/gsm8k_sample_cot_examples.json`
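After the create step finishes, it is worth inspecting the output before curating. A hypothetical inspection snippet, assuming the JSON is a list of objects with the `question`/`reasoning`/`answer` fields shown later in this guide:

```python
# Hypothetical helper: count examples and the average number of reasoning
# steps, assuming each "reasoning" string uses "Step N:" markers.
def summarize(examples):
    steps = [ex["reasoning"].count("Step") for ex in examples]
    return {"count": len(examples), "avg_steps": sum(steps) / len(examples)}

examples = [
    {"question": "q1", "reasoning": "Step 1: a\nStep 2: b", "answer": "x"},
    {"question": "q2", "reasoning": "Step 1: a", "answer": "y"},
]
print(summarize(examples))  # {'count': 2, 'avg_steps': 1.5}
```

In practice you would `json.load` the file at `data/generated/gsm8k_sample_cot_examples.json` and pass it to `summarize`.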
## Step 3: Curate Examples by Quality
Next, curate the generated examples to keep only high-quality reasoning:
```bash
synthetic-data-kit curate data/generated/gsm8k_sample_cot_examples.json --threshold 8.0
```
This will:
- Rate each example for quality on a scale of 1-10
- Filter out examples below the threshold
- Save the curated examples to `data/cleaned/gsm8k_sample_cot_examples_cleaned.json`
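Conceptually, the curation step reduces to threshold filtering. A minimal sketch, where the 1-10 rating is supplied directly (the kit obtains it from an LLM judge):

```python
# Sketch of curation: keep only examples rated at or above the threshold.
def curate(rated_examples, threshold=8.0):
    return [ex for ex in rated_examples if ex["rating"] >= threshold]

rated = [
    {"question": "q1", "rating": 9.0},
    {"question": "q2", "rating": 6.5},
    {"question": "q3", "rating": 8.0},
]
kept = curate(rated, threshold=8.0)
print([ex["question"] for ex in kept])  # ['q1', 'q3']
```

Note that the comparison is inclusive: an example rated exactly 8.0 survives a `--threshold 8.0` run under this reading.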
## Step 4: Save in Fine-Tuning Format
Finally, save the curated examples in a format suitable for fine-tuning:
```bash
# Save as ChatML format (for Llama models)
synthetic-data-kit save-as data/cleaned/gsm8k_sample_cot_examples_cleaned.json -f chatml
# Alternatively, save as OpenAI fine-tuning format
synthetic-data-kit save-as data/cleaned/gsm8k_sample_cot_examples_cleaned.json -f ft
```
The output will be saved to `data/final/`.
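The chat-format conversion can be pictured as wrapping each example in role-tagged messages. A simplified sketch of the idea; the kit's exact output schema may differ:

```python
def to_chat_messages(example):
    """Wrap a CoT example in user/assistant messages (simplified sketch)."""
    return {
        "messages": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant",
             "content": f"{example['reasoning']}\nAnswer: {example['answer']}"},
        ]
    }

ex = {"question": "What is 2 + 2?",
      "reasoning": "Step 1: Add 2 and 2.",
      "answer": "4"}
record = to_chat_messages(ex)
print(record["messages"][1]["content"])  # Step 1: Add 2 and 2.\nAnswer: 4
```

Keeping the reasoning and the final answer in one assistant turn is what teaches the model to produce the chain of thought before the answer.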
## Step 5: Automate the Entire Pipeline
Create a bash script `run_gsm8k_pipeline.sh`:
```bash
#!/bin/bash
# Set your API key
export LLAMA_API_KEY="your-together-api-key"
# Step 1: Prepare the GSM8K sample
python prepare_gsm8k.py
# Step 2: Generate CoT examples
synthetic-data-kit create data/output/gsm8k_sample.txt --type cot
# Step 3: Curate examples
synthetic-data-kit curate data/generated/gsm8k_sample_cot_examples.json --threshold 8.0
# Step 4: Save in fine-tuning format
synthetic-data-kit save-as data/cleaned/gsm8k_sample_cot_examples_cleaned.json -f chatml
echo "Pipeline complete! Fine-tuning data ready in data/final/"
```
Make it executable and run:
```bash
chmod +x run_gsm8k_pipeline.sh
./run_gsm8k_pipeline.sh
```
## Examining the Generated Examples
The generated CoT examples follow this structure:
```json
{
  "question": "How many clips did Natalia sell in total during April and May?",
  "reasoning": "Step 1: First, I need to determine the number of clips Natalia sold in April, which is given as 48.\nStep 2: Then, I calculate the number of clips she sold in May, which is half of the number sold in April.\nStep 3: To find the total number of clips sold, I add the number of clips sold in April and May.",
  "answer": "72"
}
```
These examples can be used to fine-tune models to develop better chain-of-thought reasoning capabilities for solving mathematical problems.