🧹 DATA-SANITIZATION

Overview

Cleaning and validating your fine-tuning dataset is one of the most important steps to ensure effective model training. Poorly formatted or inconsistent data leads to degraded model quality, wasted GPU hours, and hard-to-debug behaviors.

This doc covers:

  • Common formatting pitfalls
  • Best practices
  • A sample validation script

🚨 Common Pitfalls

1. Mismatched Prompt/Completion Style

  • ❌ Bad: Mixing instruction-style with chat-style.
  • ✅ Fix: Maintain a consistent format (e.g., instruction → response); a quick schema check follows below.
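
One way to catch mixed schemas is to scan the file before training. This is a minimal sketch, assuming instruction-style records use the "prompt"/"completion" keys from this doc and chat-style records use a "messages" key; adjust the key sets to whatever your pipeline actually emits:

import json

CHAT_KEYS = {"messages"}
INSTRUCTION_KEYS = {"prompt", "completion"}

def detect_styles(path):
    """Report which record schemas appear in a .jsonl file."""
    styles = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if CHAT_KEYS & record.keys():
                styles.add("chat")
            if INSTRUCTION_KEYS <= record.keys():
                styles.add("instruction")
    return styles

# More than one style in a single file is a red flag.
print(detect_styles("fine_tune_data.jsonl"))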

2. Whitespace & Special Characters

  • ❌ Issues with extra tabs, newlines, HTML entities.
  • ✅ Normalize whitespace, remove control characters, and strip out noise (see the sketch below).
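
A minimal normalization pass using only the standard library; the exact rules here (keeping newlines and tabs, collapsing runs of spaces) are assumptions you should tune to your corpus:

import html
import re
import unicodedata

def normalize_text(text):
    # Decode HTML entities such as &amp; and &nbsp;, then map non-breaking spaces to spaces.
    text = html.unescape(text).replace("\u00a0", " ")
    # Drop non-printable control characters, but keep newlines and tabs.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\n\t")
    # Collapse runs of spaces/tabs and trim each line.
    text = "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return text.strip()

print(normalize_text("Hello&nbsp;world\t\t and   some&amp;noise \x07"))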

3. Uneven Data Quality

  • ❌ Low-quality or hallucinated completions in the dataset.
  • ✅ Use filtered logs from LangSmith/LangFuse or hand-curated examples.

4. Token Overflows

  • ❌ Prompts or completions that exceed the model's context length.
  • ✅ Truncate or limit inputs using tokenizer checks (see the sketch below).
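
If you prefer truncating to dropping long examples, do it with the tokenizer rather than character counts. A minimal sketch reusing the same gpt2 tokenizer and 1024-token limit as the validation script further down; swap in the tokenizer and context length of the model you are actually fine-tuning:

from transformers import AutoTokenizer

TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
MAX_TOKENS = 1024

def truncate_completion(prompt, completion, max_tokens=MAX_TOKENS):
    """Trim the completion so prompt + completion fits in the context window."""
    prompt_ids = TOKENIZER.encode(prompt)
    budget = max_tokens - len(prompt_ids)
    if budget <= 0:
        raise ValueError("Prompt alone exceeds the token budget")
    completion_ids = TOKENIZER.encode(completion)[:budget]
    return TOKENIZER.decode(completion_ids)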

5. Label Noise or Intent Mismatch

  • ❌ Prompts and completions that don't logically go together.
  • ✅ Filter examples with vague or incorrect completions.

6. Missing Instructional Signals

  • ❌ No clear prompt cues for the model.
  • ✅ Use system/instruction tags or structured formatting (example template below).
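
One common approach is to wrap every prompt in an explicit template so the model always sees the same cues. The tag names below are illustrative assumptions, not a required standard:

PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{context}\n\n"
    "### Response:\n"
)

def build_prompt(instruction, context=""):
    """Wrap raw text in explicit instruction/response markers."""
    return PROMPT_TEMPLATE.format(instruction=instruction, context=context)

print(build_prompt("Summarize the text.", "LoRA reduces the number of trainable parameters..."))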

7. Data Leakage

  • ❌ Eval/test examples or metadata in training set.
  • ✅ Separate train/val/test strictly and version datasets (see the split sketch below).
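
A minimal sketch of a deterministic split plus a hash-based leakage check; the split fractions and the choice to key only on the prompt are assumptions to adapt:

import hashlib
import json
import random

def split_dataset(path, val_frac=0.05, test_frac=0.05, seed=42):
    """Shuffle once, split deterministically, then verify no prompt overlap."""
    with open(path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]

    random.Random(seed).shuffle(examples)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    test = examples[:n_test]
    val = examples[n_test:n_test + n_val]
    train = examples[n_test + n_val:]

    # Hash prompts so identical examples cannot sit in two splits.
    def prompt_hashes(split):
        return {hashlib.sha256(ex["prompt"].encode("utf-8")).hexdigest() for ex in split}

    overlap = prompt_hashes(train) & (prompt_hashes(val) | prompt_hashes(test))
    if overlap:
        raise ValueError(f"Leakage: {len(overlap)} prompts shared between splits")
    return train, val, test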

✅ Best Practices

  • Use .jsonl format (newline-separated JSON objects).
  • Keep consistent keys like "prompt" and "completion".
  • Visual check: sample 10 examples randomly before running (see the snippet after this list).
  • Track dataset source/version with metadata.
  • Validate all strings are UTF-8 and properly escaped.
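
For the random visual check, something this small is usually enough (the file name matches the validation script below):

import json
import random

def sample_examples(path, k=10, seed=0):
    """Print k random records for a quick eyeball pass before training."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    for line in random.Random(seed).sample(lines, min(k, len(lines))):
        record = json.loads(line)
        print(json.dumps(record, ensure_ascii=False, indent=2)[:500])
        print("-" * 40)

sample_examples("fine_tune_data.jsonl")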

🧪 Sample JSONL Validation Script (Python)

import json
from pathlib import Path

from transformers import AutoTokenizer

REQUIRED_KEYS = {"prompt", "completion"}
TOKENIZER = AutoTokenizer.from_pretrained("gpt2")
MAX_TOKENS = 1024


def validate_line(line, idx):
    """Return an error message for one JSONL line, or None if it passes."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as e:
        return f"Line {idx}: Invalid JSON - {e}"

    if not isinstance(example, dict):
        return f"Line {idx}: Expected a JSON object"

    if not REQUIRED_KEYS.issubset(example):
        return f"Line {idx}: Missing keys {REQUIRED_KEYS - example.keys()}"

    for key in REQUIRED_KEYS:
        val = example[key]
        if not isinstance(val, str):
            return f"Line {idx}: '{key}' should be a string"
        if val.strip() == "":
            return f"Line {idx}: '{key}' is empty"
        # A literal backslash-u or backslash-x surviving JSON decoding
        # usually means the text was double-escaped upstream.
        if "\\u" in val or "\\x" in val:
            return f"Line {idx}: '{key}' contains raw escape sequences"

    # Reject examples that would overflow the model's context window.
    total_tokens = len(TOKENIZER.encode(example["prompt"] + example["completion"]))
    if total_tokens > MAX_TOKENS:
        return f"Line {idx}: exceeds token limit ({total_tokens} > {MAX_TOKENS})"

    return None


if __name__ == "__main__":
    path = Path("fine_tune_data.jsonl")
    errors = []

    with path.open("r", encoding="utf-8") as f:
        for idx, line in enumerate(f, 1):
            err = validate_line(line, idx)
            if err:
                errors.append(err)

    if errors:
        print("❌ Found issues:")
        for e in errors:
            print(" -", e)
    else:
        print("βœ… All lines look good!")

🧠 Pro Tip: Visualization Helps

Use tools like:

  • Datasette for browsing .jsonl
  • Jupyter + pandas.read_json(..., lines=True) (see the snippet after this list)
  • jq or grep for command-line filtering
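
For example, a quick pandas pass over the same file gives you the shape, rough length statistics, and a random spot check:

import pandas as pd

# Load the JSONL file into a DataFrame for quick inspection.
df = pd.read_json("fine_tune_data.jsonl", lines=True)

print(df.shape)                                  # number of examples and columns
print(df["prompt"].str.len().describe())         # rough prompt length distribution
print(df.sample(min(5, len(df)), random_state=0).to_string())  # spot-check a few rows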

Summary

Good fine-tuning starts with clean, consistent, high-quality data. Run validators early, catch issues before training, and keep iterating!
