🧹 DATA-SANITIZATION

Overview

Cleaning and validating your fine-tuning dataset is one of the most important steps to ensure effective model training. Poorly formatted or inconsistent data leads to degraded model quality, wasted GPU hours, and hard-to-debug behaviors.

This doc covers:

  • Common formatting pitfalls
  • Best practices
  • A sample validation script

🚨 Common Pitfalls

1. Mismatched Prompt/Completion Style

  • ❌ Bad: Mixing instruction-style with chat-style.
  • ✅ Fix: Maintain a consistent format (e.g., instruction → response); a quick schema check follows below.
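
One way to catch mixed schemas is to scan the file before training. This is a minimal sketch, assuming instruction-style records use the "prompt"/"completion" keys from this doc and chat-style records use a "messages" key; adjust the key sets to whatever your pipeline actually emits:

import json

CHAT_KEYS = {"messages"}
INSTRUCTION_KEYS = {"prompt", "completion"}

def detect_styles(path):
    """Report which record schemas appear in a .jsonl file."""
    styles = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if CHAT_KEYS & record.keys():
                styles.add("chat")
            if INSTRUCTION_KEYS <= record.keys():
                styles.add("instruction")
    return styles

# More than one style in a single file is a red flag.
print(detect_styles("fine_tune_data.jsonl"))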

2. Whitespace & Special Characters

  • ❌ Issues with extra tabs, newlines, HTML entities.
  • ✅ Normalize whitespace, remove control characters, and strip out noise (see the sketch below).
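
A minimal normalization pass using only the standard library; the exact rules here (keeping newlines and tabs, collapsing runs of spaces) are assumptions you should tune to your corpus:

import html
import re
import unicodedata

def normalize_text(text):
    # Decode HTML entities such as &amp; and &nbsp;, then map non-breaking spaces to spaces.
    text = html.unescape(text).replace("\u00a0", " ")
    # Drop non-printable control characters, but keep newlines and tabs.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\n\t")
    # Collapse runs of spaces/tabs and trim each line.
    text = "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return text.strip()

print(normalize_text("Hello&nbsp;world\t\t and   some&amp;noise \x07"))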

3. Uneven Data Quality

  • ❌ Low-quality or hallucinated completions in the dataset.
  • ✅ Use filtered logs from LangSmith/LangFuse or hand-curated examples.

4. Token Overflows

  • ❌ Prompts or completions that exceed the model's context length.
  • ✅ Truncate or limit inputs using tokenizer checks (see the sketch below).
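
If you prefer truncating to dropping long examples, do it with the tokenizer rather than character counts. A minimal sketch reusing the same gpt2 tokenizer and 1024-token limit as the validation script further down; swap in the tokenizer and context length of the model you are actually fine-tuning:

from transformers import AutoTokenizer

TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
MAX_TOKENS = 1024

def truncate_completion(prompt, completion, max_tokens=MAX_TOKENS):
    """Trim the completion so prompt + completion fits in the context window."""
    prompt_ids = TOKENIZER.encode(prompt)
    budget = max_tokens - len(prompt_ids)
    if budget <= 0:
        raise ValueError("Prompt alone exceeds the token budget")
    completion_ids = TOKENIZER.encode(completion)[:budget]
    return TOKENIZER.decode(completion_ids)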

5. Label Noise or Intent Mismatch

  • ❌ Prompts and completions that don't logically go together.
  • ✅ Filter examples with vague or incorrect completions.

6. Missing Instructional Signals

  • ❌ No clear prompt cues for the model.
  • ✅ Use system/instruction tags or structured formatting (example template below).
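
One common approach is to wrap every prompt in an explicit template so the model always sees the same cues. The tag names below are illustrative assumptions, not a required standard:

PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{context}\n\n"
    "### Response:\n"
)

def build_prompt(instruction, context=""):
    """Wrap raw text in explicit instruction/response markers."""
    return PROMPT_TEMPLATE.format(instruction=instruction, context=context)

print(build_prompt("Summarize the text.", "LoRA reduces the number of trainable parameters..."))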

7. Data Leakage

  • ❌ Eval/test examples or metadata in training set.
  • ✅ Separate train/val/test strictly and version datasets (see the split sketch below).
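
A minimal sketch of a deterministic split plus a hash-based leakage check; the split fractions and the choice to key only on the prompt are assumptions to adapt:

import hashlib
import json
import random

def split_dataset(path, val_frac=0.05, test_frac=0.05, seed=42):
    """Shuffle once, split deterministically, then verify no prompt overlap."""
    with open(path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]

    random.Random(seed).shuffle(examples)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    test = examples[:n_test]
    val = examples[n_test:n_test + n_val]
    train = examples[n_test + n_val:]

    # Hash prompts so identical examples cannot sit in two splits.
    def prompt_hashes(split):
        return {hashlib.sha256(ex["prompt"].encode("utf-8")).hexdigest() for ex in split}

    overlap = prompt_hashes(train) & (prompt_hashes(val) | prompt_hashes(test))
    if overlap:
        raise ValueError(f"Leakage: {len(overlap)} prompts shared between splits")
    return train, val, test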

✅ Best Practices

  • Use .jsonl format (newline-separated JSON objects).
  • Keep consistent keys like "prompt" and "completion".
  • Visual check: sample 10 examples randomly before running (see the snippet after this list).
  • Track dataset source/version with metadata.
  • Validate all strings are UTF-8 and properly escaped.
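
For the random visual check, something this small is usually enough (the file name matches the validation script below):

import json
import random

def sample_examples(path, k=10, seed=0):
    """Print k random records for a quick eyeball pass before training."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    for line in random.Random(seed).sample(lines, min(k, len(lines))):
        record = json.loads(line)
        print(json.dumps(record, ensure_ascii=False, indent=2)[:500])
        print("-" * 40)

sample_examples("fine_tune_data.jsonl")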

🧪 Sample JSONL Validation Script (Python)

import json
from pathlib import Path

from transformers import AutoTokenizer

REQUIRED_KEYS = {"prompt", "completion"}
TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
MAX_TOKENS = 1024


def validate_line(line, idx):
    """Return an error message for one JSONL line, or None if it passes."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as e:
        return f"Line {idx}: Invalid JSON - {e}"

    if not isinstance(example, dict):
        return f"Line {idx}: Expected a JSON object"

    if not REQUIRED_KEYS.issubset(example):
        return f"Line {idx}: Missing keys {REQUIRED_KEYS - example.keys()}"

    for key in REQUIRED_KEYS:
        val = example[key]
        if not isinstance(val, str):
            return f"Line {idx}: '{key}' should be a string"
        if val.strip() == "":
            return f"Line {idx}: '{key}' is empty"
        # A literal backslash-u or backslash-x surviving JSON decoding
        # usually means the text was double-escaped upstream.
        if "\\u" in val or "\\x" in val:
            return f"Line {idx}: '{key}' contains raw escape sequences"

    # Reject examples that would overflow the model's context window.
    total_tokens = len(TOKENIZER.encode(example["prompt"] + example["completion"]))
    if total_tokens > MAX_TOKENS:
        return f"Line {idx}: exceeds token limit ({total_tokens} > {MAX_TOKENS})"

    return None


if __name__ == "__main__":
    path = Path("fine_tune_data.jsonl")
    errors = []

    with path.open("r", encoding="utf-8") as f:
        for idx, line in enumerate(f, 1):
            err = validate_line(line, idx)
            if err:
                errors.append(err)

    if errors:
        print("❌ Found issues:")
        for e in errors:
            print(" -", e)
    else:
        print("βœ… All lines look good!")

🧠 Pro Tip: Visualization Helps

Use tools like:

  • Datasette for browsing .jsonl
  • Jupyter + pandas.read_json(..., lines=True) (see the snippet after this list)
  • jq or grep for command-line filtering
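
For example, a quick pandas pass over the same file gives you the shape, rough length statistics, and a random spot check:

import pandas as pd

# Load the JSONL file into a DataFrame for quick inspection.
df = pd.read_json("fine_tune_data.jsonl", lines=True)

print(df.shape)                                  # number of examples and columns
print(df["prompt"].str.len().describe())         # rough prompt length distribution
print(df.sample(min(5, len(df)), random_state=0).to_string())  # spot-check a few rows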

Summary

Good fine-tuning starts with clean, consistent, high-quality data. Run validators early, catch issues before training, and keep iterating!
