Cleaning and validating your fine-tuning dataset is one of the most important steps to ensure effective model training. Poorly formatted or inconsistent data leads to degraded model quality, wasted GPU hours, and hard-to-debug behaviors.
This doc covers:
- Common formatting pitfalls
- Best practices
- A sample validation script
## Common formatting pitfalls

- ❌ Bad: Mixing instruction-style with chat-style examples.
  ✅ Fix: Maintain a consistent format (e.g., instruction → response).
- ❌ Bad: Extra tabs, newlines, and HTML entities.
  ✅ Fix: Normalize whitespace, remove control characters, and strip out noise (see the sketch after this list).
- ❌ Bad: Low-quality or hallucinated completions in the dataset.
  ✅ Fix: Use filtered logs from LangSmith/LangFuse or hand-curated examples.
- ❌ Bad: Prompts or completions that exceed the model's context length.
  ✅ Fix: Truncate or limit inputs using tokenizer checks.
- ❌ Bad: Prompts and completions that don't logically go together.
  ✅ Fix: Filter out examples with vague or incorrect completions.
- ❌ Bad: No clear prompt cues for the model.
  ✅ Fix: Use system/instruction tags or structured formatting.
- ❌ Bad: Eval/test examples or metadata leaking into the training set.
  ✅ Fix: Separate train/val/test splits strictly and version your datasets.
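For the whitespace/entity pitfall, a minimal cleaning helper might look like this (illustrative only; the right rules depend on what noise your data actually contains):

```python
import html
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize whitespace, decode HTML entities, and drop control characters."""
    text = html.unescape(text)                  # e.g. &amp; -> &, &nbsp; -> non-breaking space
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> regular space
    # Drop control characters, keeping newlines and tabs
    text = "".join(ch for ch in text if ch in "\n\t" or unicodedata.category(ch)[0] != "C")
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs
    return text.strip()
```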
## Best practices

- Use `.jsonl` format (newline-separated JSON objects).
- Keep consistent keys like `"prompt"` and `"completion"`.
- Visual check: sample 10 examples randomly before running (see the snippet below).
- Track dataset source/version with metadata.
- Validate that all strings are UTF-8 and properly escaped.
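A quick way to do the visual check, assuming the `fine_tune_data.jsonl` filename used later in this doc:

```python
import json
import random

# Print 10 random examples for a quick eyeball check before training
with open("fine_tune_data.jsonl", encoding="utf-8") as f:
    lines = f.readlines()

for line in random.sample(lines, k=min(10, len(lines))):
    ex = json.loads(line)
    print(repr(ex["prompt"])[:120], "->", repr(ex["completion"])[:120])
```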
## A sample validation script

```python
import json
from pathlib import Path

from transformers import AutoTokenizer

REQUIRED_KEYS = {"prompt", "completion"}
TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
MAX_TOKENS = 1024

def validate_line(line, idx):
    """Return an error message for a bad line, or None if it passes."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as e:
        return f"Line {idx}: Invalid JSON - {e}"
    if not REQUIRED_KEYS.issubset(example):
        return f"Line {idx}: Missing keys {REQUIRED_KEYS - example.keys()}"
    for key in REQUIRED_KEYS:
        val = example[key]
        if not isinstance(val, str):
            return f"Line {idx}: '{key}' should be a string"
        if val.strip() == "":
            return f"Line {idx}: '{key}' is empty"
    total_tokens = len(TOKENIZER.encode(example["prompt"] + example["completion"]))
    if total_tokens > MAX_TOKENS:
        return f"Line {idx}: exceeds token limit ({total_tokens} > {MAX_TOKENS})"
    # Flag literal \uXXXX or \xNN escape sequences left in the raw text;
    # prefer storing real UTF-8 characters instead
    if "\\u" in line or "\\x" in line:
        return f"Line {idx}: Contains raw escape sequences"
    return None

if __name__ == "__main__":
    path = Path("fine_tune_data.jsonl")
    errors = []
    with path.open("r", encoding="utf-8") as f:
        for idx, line in enumerate(f, 1):
            err = validate_line(line, idx)
            if err:
                errors.append(err)
    if errors:
        print("❌ Found issues:")
        for e in errors:
            print(" -", e)
    else:
        print("✅ All lines look good!")
```
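When the validator flags over-length examples, one option is to trim completions to fit. A minimal sketch, assuming the same GPT-2 tokenizer and 1024-token budget as above (naive truncation can cut a completion mid-sentence, so review what it drops):

```python
from transformers import AutoTokenizer

TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
MAX_TOKENS = 1024

def truncate_completion(example, max_tokens=MAX_TOKENS):
    """Trim the completion so prompt + completion fits within max_tokens."""
    prompt_ids = TOKENIZER.encode(example["prompt"])
    completion_ids = TOKENIZER.encode(example["completion"])
    budget = max_tokens - len(prompt_ids)
    if budget <= 0:
        return None  # the prompt alone is too long; better to drop the example
    example["completion"] = TOKENIZER.decode(completion_ids[:budget])
    return example
```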
Use tools like:

- Datasette for browsing `.jsonl`
- Jupyter + `pandas.read_json(..., lines=True)` (see the example below)
- `jq` or `grep` for command-line filtering
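For instance, a quick pandas session for spotting outliers (column names assume the `"prompt"`/`"completion"` schema above):

```python
import pandas as pd

# Load the whole .jsonl into a DataFrame for quick inspection
df = pd.read_json("fine_tune_data.jsonl", lines=True)
print(df.shape)
print(df["prompt"].str.len().describe())           # character-length stats to spot outliers
print(df.sample(min(5, len(df)), random_state=0))  # eyeball a few rows
```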
Good fine-tuning starts with clean, consistent, high-quality data. Run validators early, catch issues before training, and keep iterating!