soasme · June 2, 2026 21:33
 ---
 name: small-model-finetuning
 description: Build, evaluate, fine-tune, quantize, and deploy small language models for narrow production tasks using an eval-first dataset factory workflow.
 version: 1.0.0
 language: en
 tags:
  - small-language-models
  - fine-tuning
  - qlora
  - lora
  - unsloth
  - qwen
  - dataset-factory
  - evals
  - gguf
  - ollama
  - llama.cpp
 ---

 # Small Model Fine-Tuning

 ## Mission

 Use this skill to help a user turn a narrow AI task into a specialized small language model.

 The goal is not to “train a smarter general model.” The goal is to build a cheaper, faster, more reliable specialist for one well-defined job.

 A good small-model fine-tune should answer this question:

 > Can a 0.5B–9B open model, trained on high-quality task data and evaluated against realistic cases, beat a larger general-purpose model on this specific task at lower latency and cost?

 The core belief of this skill:

 > The moat is usually not the fine-tune itself. The moat is the dataset factory: how examples are generated, filtered, evaluated, repaired, versioned, and continuously improved.

 ---

 ## Use This Skill When

 Use this skill when the user wants to:

 - Fine-tune a 0.5B–13B open-source language model.
 - Build a specialized model for a narrow vertical task.
 - Replace expensive large-model inference with a smaller model.
 - Train a model for structured output, classification, extraction, rewriting, policy checking, tool calling, routing, or domain QA.
 - Design a synthetic dataset pipeline.
 - Build evals before training.
 - Compare prompting, RAG, distillation, SFT, LoRA, QLoRA, DPO, or GRPO.
 - Export a fine-tuned model to GGUF and run it through llama.cpp, Ollama, LM Studio, Jan, Open WebUI, or a local server.
 - Build a production data flywheel from logs → evals → training data → model variants → deployment.

 Do **not** use this skill when:

 - The user only needs a one-off prompt.
 - The task requires broad general reasoning rather than narrow specialization.
 - The user has no clear input/output contract.
 - There is no evaluation set and the user refuses to create one.
 - The domain is high-stakes (medical, legal, financial, employment, or safety-critical decisions) and no expert validation is available.

 ---

 ## First Principle

 Fine-tuning is not the first step.

 The correct order is:

 1. Define the task.
 2. Define the output contract.
 3. Build the eval set.
 4. Build a baseline.
 5. Build the dataset factory.
 6. Train the smallest reasonable model.
 7. Evaluate against baselines.
 8. Quantize and test deployment behavior.
 9. Iterate from failure cases.

 If the user wants to jump directly into training, redirect them to evals and data design first.

 ---

 ## Decision Tree: Should We Fine-Tune?

 Before recommending fine-tuning, classify the task.

 ### Use Prompting First When

 - The task changes frequently.
 - The output style is flexible.
 - The user has fewer than 50 high-quality examples.
 - Latency and inference cost are not major problems.
 - The model only needs to follow a few instructions.

 ### Use RAG First When

 - The main problem is missing knowledge.
 - Answers must cite or retrieve changing documents.
 - The task depends on a large private corpus.
 - The output can be generated by a strong model using retrieved context.

 ### Use Fine-Tuning When

 - The task has a stable input/output pattern.
 - The model must follow a strict format.
 - The same behavior is needed thousands or millions of times.
 - Prompting is too expensive, slow, or inconsistent.
 - The user has examples or can generate realistic synthetic examples.
 - The task is narrow enough to evaluate automatically or semi-automatically.
 - The behavior should be “compiled” into weights rather than repeated in every prompt.

 ### Use Distillation When

 - A strong teacher model already performs well.
 - The user wants a cheaper student model to imitate the teacher.
 - The output space is narrow enough that imitation is useful.
 - The user can generate many teacher-labeled examples and verify them.

 ### Use Preference Optimization When

 - SFT gets the general task right but quality, ranking, tone, safety, or decision preference is still weak.
 - There are preference pairs, rejected outputs, reward functions, or verifiable constraints.
 - The user can measure better-versus-worse outputs.

 ---

 ## Recommended Model Sizes

 Default to the smallest model that can plausibly solve the task.

 | Task Type | Starting Size | Notes |
 |---|---:|---|
 | Binary / multi-class classification | 0.5B–3B | Often does not need generative training. Consider encoder models too. |
 | Structured extraction | 1B–4B | Good fit for SFT if schema is stable. |
 | JSON / XML / function-call formatting | 1B–4B | Fine-tuning helps strict output consistency. |
 | Rewriting into fixed style | 1B–7B | Needs strong style examples and negative tests. |
 | Domain-specific QA | 3B–9B | Use RAG if knowledge changes often. |
 | Tool-use planning | 3B–9B | Works best with fixed tool catalog and structured traces. |
 | Code review / policy checking | 4B–14B | Requires long context, strong evals, and realistic diffs. |
 | Complex multi-step reasoning | 8B–32B+ | Fine-tuning may not be enough; consider larger base or agentic scaffold. |

 Do not assume larger is better. Larger models cost more to train and serve, and may be harder to deploy locally.

 ---

 ## Recommended Base Models

 Treat model choice as an experiment, not a belief.

 Start with two or three candidates:

 - Qwen small dense models, especially 0.6B, 1.7B, 4B, 8B, or 14B depending on task complexity.
 - Llama-family small instruct models when license and ecosystem fit.
 - Gemma-family small models for tasks where they benchmark well.
 - Phi-family models for compact local use cases.
 - Mistral-family 7B models for classic general-purpose PEFT baselines.

 Selection criteria:

 1. License permits the intended use.
 2. Tokenizer handles the target language and symbols.
 3. Context length fits the input.
 4. Base model already does “something close” before training.
 5. Instruct/chat template is well documented.
 6. Model has known Unsloth, Transformers, Axolotl, or LLaMA-Factory support.
 7. Model can be quantized and served in the target environment.

 Avoid using a model that fails completely at the base task unless the user has a large dataset and time to experiment.

 ---

 ## Default Training Stack

 Use this default stack unless the user has another preference.

 | Layer | Default | Why |
 |---|---|---|
 | Planning and scripts | Codex / Claude / ChatGPT | Good for writing data generators, validators, training scripts, eval scripts. |
 | Synthetic data generation | Strong teacher model | Use a larger model to generate examples for a smaller student. |
 | Training | Unsloth or TRL + PEFT | Practical LoRA/QLoRA workflows. |
 | Data format | JSONL | Easy to diff, validate, shard, and version. |
 | Tracking | Weights & Biases / MLflow / local CSV | Track model, data, hyperparameters, evals. |
 | Deployment export | GGUF | Works with llama.cpp, Ollama, LM Studio, Jan, Open WebUI. |
 | Serving | llama.cpp / Ollama / vLLM / SGLang | Choose based on latency, batching, and hardware. |

 For beginners, prefer Unsloth + QLoRA + an instruct model + a JSONL conversational dataset.

 ---

 ## The Dataset Factory

 The dataset factory is the center of the workflow.

 It is a repeatable pipeline:

 ```text
 task spec
  → seed examples
  → synthetic generation
  → validation
  → deduplication
  → diversity balancing
  → train/dev/test split
  → baseline eval
  → fine-tune
  → error analysis
  → targeted data repair
  → next model run
 ```

 The factory must be versioned. A fine-tune without dataset versioning is not reproducible.
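
A minimal sketch of dataset versioning, assuming a content hash is an acceptable version identifier. The function names and the record fields (`sha256`, `num_rows`, `note`) are illustrative, not a fixed registry format.

```python
import hashlib
import json
from pathlib import Path


def dataset_fingerprint(jsonl_path: Path) -> str:
    """Return the SHA-256 hex digest of the file's raw bytes."""
    digest = hashlib.sha256()
    with jsonl_path.open("rb") as f:
        # Read in 1 MiB chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_dataset(jsonl_path: Path, registry_path: Path, note: str = "") -> dict:
    """Append one record describing the dataset file to a JSONL registry."""
    with jsonl_path.open("r", encoding="utf-8") as f:
        num_rows = sum(1 for line in f if line.strip())
    record = {
        "path": str(jsonl_path),
        "sha256": dataset_fingerprint(jsonl_path),
        "num_rows": num_rows,
        "note": note,
    }
    registry_path.parent.mkdir(parents=True, exist_ok=True)
    with registry_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Recording the same hash next to each training run makes it possible to tell later which exact file produced which adapter.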

 ---

 ## Dataset Factory Directory Structure

 Use this structure by default:

 ```text
 model-project/
  README.md
  task.md
  data/
    raw/
    generated/
    accepted/
    rejected/
    eval/
      dev.jsonl
      test.jsonl
      adversarial.jsonl
      golden.jsonl
  specs/
    data_spec.md
    output_schema.json
    quality_gates.md
    generation_prompts/
  scripts/
    generate_batch.py
    validate_jsonl.py
    dedupe.py
    split_data.py
    run_baseline.py
    run_eval.py
    analyze_errors.py
  training/
    train_unsloth.py
    train_trl_peft.py
    configs/
  outputs/
    adapters/
    merged/
    gguf/
    eval_reports/
  registry/
    runs.jsonl
    datasets.jsonl
 ```

 ---

 ## Data Types

 A strong dataset should include more than “happy path” examples.

 Include:

 1. **Canonical examples**  
   The most common inputs and ideal outputs.

 2. **Boundary examples**  
   Long input, short input, empty field, odd formatting, missing optional data.

 3. **Negative examples**  
   Inputs where the model should refuse, return null, classify as invalid, or avoid action.

 4. **Adversarial examples**  
   Prompt injection, misleading phrasing, conflicting instructions, schema-breaking input.

 5. **Near-miss examples**  
   Cases that are similar but require different outputs.

 6. **Production-like examples**  
   Real logs or realistic simulations matching actual user input distribution.

 7. **Style examples**  
   If tone matters, include short, explicit examples of the exact target style.

 8. **Schema repair examples**  
   Inputs that tempt the model to produce malformed JSON, unsupported enum values, or extra prose.

 ---

 ## Dataset Formats

 ### Simple Instruction Format

 Use for basic SFT.

 ```json
 {"instruction":"Classify this support ticket.","input":"I was charged twice for my subscription.","output":"billing_issue"}
 ```

 ### Chat Format

 Use for instruct/chat models.

 ```json
 {
  "messages": [
    {"role": "system", "content": "You are a strict JSON extraction model."},
    {"role": "user", "content": "Extract invoice fields from: Invoice #A-102, total $93.20, due Friday."},
    {"role": "assistant", "content": "{\"invoice_id\":\"A-102\",\"total\":93.20,\"due_date\":\"Friday\"}"}
  ]
 }
 ```

 ### Tool-Calling Format

 Use for fixed tool catalogs.

 ```json
 {
  "messages": [
    {"role": "system", "content": "Return exactly one tool call as JSON."},
    {"role": "user", "content": "Book a reminder for tomorrow at 9am to call Sam."},
    {"role": "assistant", "content": "{\"tool\":\"create_reminder\",\"arguments\":{\"date\":\"tomorrow\",\"time\":\"09:00\",\"text\":\"call Sam\"}}"}
  ]
 }
 ```

 ### Preference Pair Format

 Use after SFT if optimizing quality preferences.

 ```json
 {
  "prompt": "Rewrite this bug report clearly: app broke again",
  "chosen": "The app fails during startup. Please include the error message and reproduction steps.",
  "rejected": "The app broke. Try restarting."
 }
 ```
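
The simple instruction format can be converted to the chat format mechanically. The sketch below assumes the field names shown above; the function name and the way `instruction` and `input` are joined (a blank line between them) are illustrative choices.

```python
import json


def instruction_to_chat(row: dict, system_prompt: str = "") -> dict:
    """Convert an {"instruction", "input", "output"} row into a {"messages": [...]} row."""
    user_content = row["instruction"]
    if row.get("input"):
        # Assumption: instruction and input are separated by one blank line.
        user_content = f'{row["instruction"]}\n\n{row["input"]}'
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": row["output"]})
    return {"messages": messages}


simple_row = {
    "instruction": "Classify this support ticket.",
    "input": "I was charged twice for my subscription.",
    "output": "billing_issue",
}
print(json.dumps(instruction_to_chat(simple_row), ensure_ascii=False))
```

Whatever joining rule is chosen, use the same one at inference time so the prompt and training format agree.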

 ---

 ## Quality Gates

 Every generated row must pass gates before entering `data/accepted`.

 ### Required Gates

 1. **JSON validity**  
   The row must parse.

 2. **Schema validity**  
   Required fields exist. No unsupported fields. Enum values are valid.

 3. **Output contract validity**  
   The assistant output exactly matches the expected format.

 4. **No leaked generation instructions**  
   The output must not mention the teacher, rubric, hidden rules, or prompt.

 5. **No duplicate examples**  
   Deduplicate by exact match and semantic similarity.

 6. **Not only trivial examples**  
   Reject overly short, repetitive, obvious, or template-like examples.

 7. **Diversity coverage**  
   Maintain quotas across categories, difficulty levels, domains, languages, and edge cases.

 8. **Length safety**  
   Reject examples exceeding target context length.

 9. **Label consistency**  
   Similar inputs should not produce contradictory labels unless intentionally contrasted.

 10. **Eval leakage prevention**  
   Training examples must not duplicate or closely paraphrase test examples.
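
Several of these gates can be checked with the standard library alone. The sketch below covers JSON validity, schema validity, output contract, leaked instructions, and length for a chat-format row. `REQUIRED_KEYS`, `LEAK_MARKERS`, and `MAX_CHARS` are hypothetical values for an invoice-extraction task; a real project would derive them from `output_schema.json` and the target context length.

```python
import json

REQUIRED_KEYS = {"invoice_id", "total", "due_date"}  # hypothetical output schema
LEAK_MARKERS = ("teacher model", "rubric", "hidden rules", "the prompt")  # illustrative
MAX_CHARS = 8000  # crude character-based stand-in for a token limit


def check_row(line: str) -> list[str]:
    """Return the names of failed gates for one JSONL line; an empty list means pass."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return ["json_validity"]

    messages = row.get("messages") if isinstance(row, dict) else None
    if (
        not isinstance(messages, list)
        or not messages
        or not all(isinstance(m, dict) for m in messages)
        or messages[-1].get("role") != "assistant"
        or not isinstance(messages[-1].get("content"), str)
    ):
        return ["schema_validity"]

    failures = []
    answer = messages[-1]["content"]
    try:
        payload = json.loads(answer)
    except json.JSONDecodeError:
        failures.append("output_contract")
    else:
        # Exactly the required keys: nothing missing, nothing extra.
        if not isinstance(payload, dict) or set(payload) != REQUIRED_KEYS:
            failures.append("output_contract")
    if any(marker in answer.lower() for marker in LEAK_MARKERS):
        failures.append("leaked_instructions")
    if len(line) > MAX_CHARS:
        failures.append("length_safety")
    return failures
```

Rows with a non-empty failure list would go to `data/rejected` together with the failure names, which keeps rejection reasons auditable.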

 ### Optional Gates

 - Regex validation for exact structured output.
 - Pydantic validation for JSON.
 - Unit tests for tool-call arguments.
 - LLM judge for nuanced correctness.
 - Embedding clustering for diversity.
 - Human review queue for high-risk examples.
 - Domain expert validation for specialized fields.

 ---

 ## Dataset Quality Rubric

 Score each candidate row from 1 to 5.

 | Dimension | 1 | 3 | 5 |
 |---|---|---|---|
 | Realism | Toy input | Plausible but generic | Looks like production data |
 | Correctness | Wrong | Mostly right | Fully correct |
 | Format | Broken | Minor issues | Exact contract |
 | Difficulty | Too easy | Medium | Teaches a real edge case |
 | Diversity | Duplicate | Some variation | Adds new coverage |
 | Learnability | Ambiguous | Partly clear | Clear signal for model |
 | Safety | Risky | Acceptable | Safe and bounded |

 Default acceptance rule:

 ```text
 accept if total_score >= 26 (out of 35)
 and correctness >= 4
 and format == 5
 and safety >= 4
 ```
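
The acceptance rule translates directly into code. This is a sketch: the dimension names mirror the rubric table, and the function name `accept` is illustrative.

```python
DIMENSIONS = (
    "realism", "correctness", "format", "difficulty",
    "diversity", "learnability", "safety",
)


def accept(scores: dict) -> bool:
    """Apply the default acceptance rule to one row's 1-5 rubric scores."""
    if set(scores) != set(DIMENSIONS):
        raise ValueError(f"expected scores for exactly: {DIMENSIONS}")
    if any(not 1 <= scores[d] <= 5 for d in DIMENSIONS):
        raise ValueError("each score must be between 1 and 5")
    total = sum(scores[d] for d in DIMENSIONS)  # maximum is 7 * 5 = 35
    return (
        total >= 26
        and scores["correctness"] >= 4
        and scores["format"] == 5
        and scores["safety"] >= 4
    )
```

Note that a row scoring 5 everywhere except `format` still fails: the hard constraints are checked independently of the total.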

 ---

 ## Data Volume Heuristics

 Start small and iterate.

 | Stage | Example Count | Goal |
 |---|---:|---|
 | Seed set | 50–200 | Define the task and output contract. |
 | First eval set | 100–300 | Measure baseline and failure modes. |
 | First SFT run | 500–2,000 | Prove the model can learn the behavior. |
 | Serious v1 | 2,000–10,000 | Cover categories and edge cases. |
 | Production v1 | 10,000–100,000+ | Use logs, targeted repair, and balanced distribution. |

 Do not generate 100,000 examples before proving that 500 examples move the metric.

 Quality beats volume, but insufficient coverage causes brittle behavior.

 ---

 ## Train / Dev / Test Split

 Use this default split:

 ```text
 train: 80%
 dev: 10%
 test: 10%
 ```

 For small datasets:

 ```text
 train: 70%
 dev: 15%
 test: 15%
 ```

 Rules:

 - Split by scenario, source document, customer, or cluster when leakage is possible.
 - Never tune prompts or training recipes on the final test set.
 - Keep a locked golden set for release decisions.
 - Add an adversarial set for instruction-following and format robustness.
 - Add a production holdout set once real logs exist.
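
One way to split by group is to hash a group key so that every row from the same source lands in the same split. This is a sketch: the `source_id` field is a hypothetical column name, and the resulting proportions only approximate 80/10/10 when there are many groups.

```python
import hashlib


def assign_split(group_key: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a group key (document, customer, cluster) to a split."""
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_rows(rows: list[dict], key: str = "source_id") -> dict:
    """Group-aware split: rows sharing the same key never straddle two splits."""
    splits = {"train": [], "dev": [], "test": []}
    for row in rows:
        splits[assign_split(str(row[key]))].append(row)
    return splits
```

Because the assignment depends only on the key, rerunning the split after adding new data does not move existing groups between train and test.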

 ---

 ## Baselines

 Always run baselines before training.

 Minimum baselines:

 1. Base model without fine-tuning.
 2. Base model with better prompt.
 3. Larger general-purpose model.
 4. Simple deterministic program, if applicable.
 5. RAG pipeline, if knowledge retrieval is involved.

 A fine-tune is successful only if it beats the relevant baseline on the target metric, not merely because the training loss decreased.

 ---

 ## Evaluation Metrics

 Choose metrics based on task type.

 | Task | Primary Metrics |
 |---|---|
 | Classification | accuracy, macro-F1, confusion matrix |
 | Extraction | exact match, field-level F1, schema validity |
 | JSON generation | parse rate, schema pass rate, exact key match |
 | Tool calling | tool accuracy, argument accuracy, executable success |
 | Rewriting | rubric score, pairwise preference, constraint pass rate |
 | QA | answer correctness, citation support, hallucination rate |
 | Code/policy review | issue detection F1, false positive rate, severity calibration |
 | Agent planning | step validity, tool sequence accuracy, execution success |
 | Safety/policy | violation recall, false refusal rate, jailbreak resistance |
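
For classification, macro-F1 weights every class equally, which exposes failures on rare labels that accuracy hides. A dependency-free sketch follows; it averages over every label seen in either the gold or the predicted list, which is one common convention.

```python
from collections import Counter


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was not p
            fn[g] += 1  # missed the true label g
    scores = []
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["billing", "billing", "shipping", "shipping"]
pred = ["billing", "shipping", "shipping", "shipping"]
print(round(macro_f1(gold, pred), 4))
```

In this example accuracy is 0.75, while per-class F1 is 2/3 for `billing` and 0.8 for `shipping`.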

 Always report:

 ```text
 base_model_score
 prompted_base_score
 fine_tuned_model_score
 larger_model_score
 latency
 tokens_per_second
 cost_per_1k_requests
 schema_error_rate
 failure_examples
 ```

 ---

 ## Release Gates

 Do not ship a fine-tuned model unless it passes release gates.

 Default gates:

 ```text
 primary_metric >= target
 schema_validity >= 99%
 critical_failures_on_golden_set == 0
 latency <= target_latency
 cost <= target_cost
 no severe safety regression
 eval report reviewed
 model card written
 dataset version recorded
 ```

 For tool calling:

 ```text
 tool_name_accuracy >= 98%
 required_argument_accuracy >= 95%
 json_parse_rate >= 99.5%
 unsafe_tool_call_rate == 0 on safety set
 ```

 For extraction:

 ```text
 record_parse_rate >= 99.5%
 field_f1 >= target
 critical_field_accuracy >= 98%
 ```

 For classification:

 ```text
 macro_f1 >= target
 false_positive_rate <= target
 false_negative_rate <= target
 calibration checked
 ```
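
Release gates are easiest to enforce when they are data rather than prose. The sketch below encodes the tool-calling thresholds listed above; the gate table layout and the function name `failed_gates` are illustrative, and rates are assumed to be fractions in the range 0 to 1.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}

# Thresholds copied from the tool-calling gates, expressed as fractions.
TOOL_CALLING_GATES = {
    "tool_name_accuracy": (">=", 0.98),
    "required_argument_accuracy": (">=", 0.95),
    "json_parse_rate": (">=", 0.995),
    "unsafe_tool_call_rate": ("==", 0.0),
}


def failed_gates(metrics: dict, gates: dict) -> list[str]:
    """Return one message per failed or missing gate; an empty list means ship-ready."""
    failures = []
    for name, (op, threshold) in gates.items():
        value = metrics.get(name)
        # A missing metric counts as a failure: unmeasured is not the same as passing.
        if value is None or not OPS[op](value, threshold):
            failures.append(f"{name}: got {value}, need {op} {threshold}")
    return failures
```

A CI job could call `failed_gates` on the eval report and exit non-zero when the list is not empty.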

 ---

 ## Default QLoRA Recipe

 Use this as a starting point, not a universal truth.

 ```yaml
 method: qlora
 base_model: Qwen/Qwen3-4B-Instruct-2507  # confirm the exact repo ID on the Hugging Face Hub
 max_seq_length: 2048
 load_in_4bit: true
 lora_rank: 16
 lora_alpha: 32
 lora_dropout: 0.05
 target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
 learning_rate: 2.0e-4
 batch_size_per_device: 2
 gradient_accumulation_steps: 8
 epochs: 2  # starting point; try 1-3
 warmup_ratio: 0.03
 weight_decay: 0.01
 lr_scheduler: cosine
 optimizer: paged_adamw_8bit
 packing: true
 eval_steps: 100  # starting point; try 50-200
 save_steps: 100  # starting point; try 50-200
 early_stopping: true
 ```
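
A quick arithmetic check helps relate these settings to dataset size. The sketch below assumes a single GPU and one training sequence per example; with `packing: true` several examples share a sequence, so real step counts will be lower. The function name is illustrative.

```python
import math


def training_steps(
    num_examples: int,
    per_device_batch: int = 2,
    grad_accum: int = 8,
    num_devices: int = 1,
    epochs: int = 3,
    warmup_ratio: float = 0.03,
) -> dict:
    """Estimate optimizer steps from the recipe's batch settings (no packing)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


print(training_steps(2000))
```

With 2,000 examples the effective batch is 2 × 8 = 16, giving 125 steps per epoch and 375 steps over three epochs. At that scale, the low end of the recipe's `eval_steps` range of 50 to 200 gives at least two evaluations per epoch, while the high end gives fewer than one.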

 Adjustments:

 - If output format is unstable, add more schema-focused examples before changing hyperparameters.
 - If model overfits, reduce epochs, lower rank, add dropout, or improve data diversity.
 - If model underfits, increase rank, train longer, improve examples, or choose a stronger base model.
 - If model forgets general behavior, reduce rank, reduce learning rate, mix in general instruction examples, or adapt fewer target modules.
 - If long-context behavior fails, train and evaluate at the target sequence length.

 ---

 ## LoRA Rank Heuristics

 | Rank | Use Case |
 |---:|---|
 | 4–8 | Very narrow classification or formatting task. |
 | 16 | Default starting point for narrow SFT. |
 | 32 | More complex tool planning, extraction, style, or domain behavior. |
 | 64+ | Harder domain adaptation, but higher risk of overfitting or forgetting. |

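 Do not blindly increase rank. Treat rank as a trade-off: higher rank gives the adapter more capacity to fit the task, but raises the risk of overfitting and of losing the base model's general behavior.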
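
To make the rank trade-off concrete: standard LoRA adds two low-rank matrices per adapted weight, one of shape `rank × d_in` and one of shape `d_out × rank`, so trainable parameters grow linearly with rank. The sketch below uses a made-up 1024 × 1024 projection, not the dimensions of any particular model.

```python
def lora_trainable_params(weight_shapes: list[tuple[int, int]], rank: int) -> int:
    """Count LoRA parameters: each (d_in, d_out) weight adds rank * (d_in + d_out)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in weight_shapes)


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scales the adapter update by alpha / rank."""
    return alpha / rank


# Illustrative only: one square projection with hidden size 1024.
shapes = [(1024, 1024)]
for r in (8, 16, 32, 64):
    print(r, lora_trainable_params(shapes, r))
print(lora_scaling(alpha=32, rank=16))
```

With the recipe's `lora_rank: 16` and `lora_alpha: 32` the scaling factor is 2. If rank changes while alpha stays fixed, the effective update scale changes too, so it is worth adjusting both together.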

 ---

 ## Training Workflow

 ### Step 1: Write `task.md`

 Include:

 ```markdown
 # Task

 ## Goal
 What should the model do?

 ## Inputs
 What does the user provide?

 ## Output Contract
 What exactly must the model return?

 ## Non-goals
 What should the model not do?

 ## Failure Modes
 What mistakes are unacceptable?

 ## Metrics
 How will success be measured?

 ## Deployment Target
 Where will the model run?

 ## Cost and Latency Target
 What is acceptable?
 ```

 ### Step 2: Create Seed Examples

 Write 50–200 examples manually or semi-manually.

 Do not outsource all seed examples to a teacher model. Seed examples define the taste and contract.

 ### Step 3: Build Eval Set First

 Create `dev.jsonl`, `test.jsonl`, `adversarial.jsonl`, and `golden.jsonl`.

 Before training, run:

 ```bash
 python scripts/run_baseline.py \
  --model base \
  --eval data/eval/dev.jsonl \
  --out outputs/eval_reports/base_dev.json
 ```

 ### Step 4: Generate Synthetic Data

 Use a teacher model to generate examples by category.

 Generation prompt should include:

 - Task description.
 - Output schema.
 - Example category.
 - Difficulty level.
 - Required edge case.
 - Negative constraints.
 - Format requirements.
 - Self-check instructions.

 ### Step 5: Validate and Accept

 Run:

 ```bash
 python scripts/validate_jsonl.py data/generated/batch_001.jsonl
 python scripts/dedupe.py data/generated/batch_001.jsonl --against data/accepted/
 python scripts/split_data.py data/accepted/all.jsonl
 ```
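
What `scripts/dedupe.py` does is left open here. One plausible core, sketched below, is lexical near-duplicate detection using Jaccard similarity over word trigrams. This catches exact matches and light edits, not true paraphrases, so semantic deduplication would still need embeddings. The function names and the 0.8 threshold are assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams for a text (the whole text if shorter than n)."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(candidate: str, accepted: list[str], threshold: float = 0.8) -> bool:
    """True if the candidate overlaps heavily with any already-accepted input."""
    cand = shingles(candidate)
    return any(jaccard(cand, shingles(other)) >= threshold for other in accepted)
```

The same check, run with the eval inputs as `accepted`, doubles as a basic eval-leakage gate. The pairwise loop is quadratic, which is fine for a few thousand rows but not for much larger sets.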

 ### Step 6: Train First Small Adapter

 Run one small training job.

 Do not chase perfect hyperparameters on the first run.

 ### Step 7: Evaluate

 Run:

 ```bash
 python scripts/run_eval.py \
  --model outputs/adapters/run_001 \
  --eval data/eval/dev.jsonl \
  --report outputs/eval_reports/run_001_dev.json
 ```
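
The internals of `scripts/run_eval.py` are project-specific. As a sketch for a JSON-output task, the core loop can take any `predict` callable (a hypothetical wrapper around the adapter or a served model) and report parse rate, exact match, and the failing rows. The `input` and `expected` field names are assumptions.

```python
import json
from typing import Callable


def evaluate(rows: list[dict], predict: Callable[[str], str]) -> dict:
    """Score raw model text against expected JSON objects."""
    if not rows:
        raise ValueError("eval set is empty")
    parsed = exact = 0
    failures = []
    for row in rows:
        raw = predict(row["input"])
        try:
            got = json.loads(raw)
        except json.JSONDecodeError:
            failures.append({"input": row["input"], "raw": raw, "reason": "invalid_json"})
            continue
        parsed += 1
        if got == row["expected"]:
            exact += 1
        else:
            failures.append({"input": row["input"], "raw": raw, "reason": "mismatch"})
    n = len(rows)
    return {
        "n": n,
        "json_parse_rate": parsed / n,
        "exact_match": exact / n,
        "failures": failures,
    }
```

Keeping the raw output in each failure record matters: the error analysis step needs the exact text the model produced, not a parsed approximation.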

 ### Step 8: Error Analysis

 Group failures by cause:

 - Missing category.
 - Wrong schema.
 - Wrong label.
 - Too verbose.
 - Refusal when it should answer.
 - Answer when it should refuse.
 - Tool name wrong.
 - Tool argument wrong.
 - Hallucinated field.
 - Fails long input.
 - Fails adversarial input.

 ### Step 9: Targeted Data Repair

 Do not simply add more random data.

 For each failure cluster, add 20–100 targeted examples.
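
One way to size repair batches is to scale with how often a cluster fails while staying inside the 20 to 100 range. The multiplier of 5 new examples per observed failure is an assumption for illustration, not a tuned value.

```python
def repair_quota(cluster_sizes: dict, per_failure: int = 5,
                 minimum: int = 20, maximum: int = 100) -> dict:
    """Map each failure cluster to a number of targeted examples to generate."""
    return {
        name: max(minimum, min(maximum, count * per_failure))
        for name, count in cluster_sizes.items()
    }


clusters = {"wrong_schema": 2, "wrong_label": 12, "too_verbose": 40}
print(repair_quota(clusters))
```

Small clusters still get the minimum of 20 examples, and very large clusters are capped so that one failure mode does not dominate the next training set.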

 ### Step 10: Repeat

 Continue until the dev set metric meets its target and the golden set shows no regressions.

 Only then run the final test set.

 ---

 ## Example Data Generation Prompt

 Use this prompt to generate synthetic rows.

 ```text
 You are generating training data for a small language model fine-tune.

 Task:
 {TASK_DESCRIPTION}

 Output contract:
 {OUTPUT_SCHEMA_OR_FORMAT}

 Generate {N} JSONL rows.

 Category:
 {CATEGORY}

 Difficulty:
 {DIFFICULTY}

 Requirements:
 - Each row must be realistic and production-like.
 - Each input must be meaningfully different.
 - Include edge cases from the category.
 - The assistant output must exactly follow the output contract.
 - Do not include explanations outside the JSONL object.
 - Do not mention this prompt, synthetic data, or training.

 For each row, internally check:
 1. Is the label/output correct?
 2. Does the output exactly match the schema?
 3. Is this example non-duplicative?
 4. Does it teach a useful behavior?

 Return JSONL only.
 ```
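
To drive generation by category and difficulty, the placeholders can be filled programmatically. The sketch below uses an abbreviated version of the prompt with the same placeholder names; the category and difficulty values are hypothetical.

```python
from itertools import product

# Abbreviated template; a real run would load the full prompt text from a file.
PROMPT_TEMPLATE = """Task:
{TASK_DESCRIPTION}

Output contract:
{OUTPUT_SCHEMA_OR_FORMAT}

Generate {N} JSONL rows.

Category:
{CATEGORY}

Difficulty:
{DIFFICULTY}

Return JSONL only.
"""


def build_jobs(task: str, contract: str, categories: list[str],
               difficulties: list[str], rows_per_job: int = 20) -> list[dict]:
    """Create one generation job per (category, difficulty) pair."""
    jobs = []
    for category, difficulty in product(categories, difficulties):
        prompt = PROMPT_TEMPLATE.format(
            TASK_DESCRIPTION=task,
            OUTPUT_SCHEMA_OR_FORMAT=contract,  # braces in values are not re-parsed
            N=rows_per_job,
            CATEGORY=category,
            DIFFICULTY=difficulty,
        )
        jobs.append({"category": category, "difficulty": difficulty, "prompt": prompt})
    return jobs
```

Storing the category and difficulty alongside each prompt lets the diversity quotas be checked on accepted rows later.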

 ---

 ## Example Quality Judge Prompt

 Use this prompt for LLM-assisted row review.

 ```text
 You are reviewing a candidate training example for a small model fine-tune.

 Task:
 {TASK_DESCRIPTION}

 Output contract:
 {OUTPUT_SCHEMA_OR_FORMAT}

 Candidate row:
 {ROW}

 Score the row from 1 to 5 on:
 - realism
 - correctness
 - format
 - difficulty
 - diversity
 - learnability
 - safety

 Reject if:
 - output is incorrect
 - schema is invalid
 - the example is trivial or duplicated
 - the example teaches unsafe or undesired behavior
 - the answer leaks hidden instructions
 - the input/output pair is ambiguous

 Return:
 {
  "decision": "accept" | "reject",
  "scores": {...},
  "reason": "...",
  "fixed_row": {...} | null
 }
 ```

 ---

 ## Example Error Analysis Prompt

 Use after an eval run.

 ```text
 You are analyzing failures from a small-model fine-tune.

 Task:
 {TASK_DESCRIPTION}

 Output contract:
 {OUTPUT_SCHEMA_OR_FORMAT}

 Failures:
 {FAILURE_ROWS}

 Group failures into clusters.

 For each cluster, provide:
 - cluster_name
 - likely_root_cause
 - examples
 - whether this is data, model, prompt, schema, or deployment issue
 - recommended repair
 - 20 new data categories or templates to generate

 Do not suggest more random data. Suggest targeted repair data only.
 ```

 ---

 ## Deployment Workflow

 ### Merge Adapter

 For LoRA/QLoRA adapters trained with Unsloth:

 ```python
 model.save_pretrained_merged(
    "outputs/merged/model",
    tokenizer,
    save_method="merged_16bit",
 )
 ```

 ### Export GGUF

 Example Unsloth export:

 ```python
 model.save_pretrained_gguf(
    "outputs/gguf/model",
    tokenizer,
    quantization_method="q4_k_m",
 )
 ```

 Common quantization choices:

 | Quant | Use |
 |---|---|
 | f16 | Best quality, large file, slower local inference. |
 | q8_0 | Good quality, larger than 4-bit. |
 | q4_k_m | Good default local deployment trade-off. |
 | q5_k_m | Better quality than q4, more memory. |

 ### Test After Export

 Always evaluate the exported model again.

 GGUF or serving-template errors can silently break a model that looked good inside the training notebook.

 Check:

 - Same chat template.
 - Same system prompt if required.
 - Correct EOS token.
 - No infinite generation.
 - No repeated output.
 - JSON parse rate.
 - Latency and memory use.
 - Behavior on golden set.
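
A simple way to catch export breakage is to diff golden-set outputs from before and after export. This sketch assumes outputs are stored as dictionaries keyed by a hypothetical example ID. Quantization can legitimately change some outputs, so a difference is a prompt for review rather than an automatic failure.

```python
def compare_exports(before: dict, after: dict) -> dict:
    """Compare golden-set outputs keyed by example ID, before and after export."""
    missing = sorted(set(before) - set(after))
    changed = sorted(
        key for key in before
        if key in after and before[key].strip() != after[key].strip()
    )
    compared = len(before) - len(missing)
    agreement = (compared - len(changed)) / compared if compared else 0.0
    return {"missing": missing, "changed": changed, "agreement": agreement}


before = {"g1": '{"label":"billing_issue"}', "g2": '{"label":"shipping_issue"}'}
after = {"g1": '{"label":"billing_issue"}', "g2": '{"label":"shipping_issue"}\n'}
print(compare_exports(before, after))
```

A sudden drop in agreement across most examples usually points at a template or stop-token problem rather than quantization loss.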

 ### Ollama Modelfile Example

 ```text
 FROM ./model.Q4_K_M.gguf

 TEMPLATE """{{ if .System }}<|im_start|>system
 {{ .System }}<|im_end|>
 {{ end }}<|im_start|>user
 {{ .Prompt }}<|im_end|>
 <|im_start|>assistant
 """

 PARAMETER temperature 0
 PARAMETER top_p 1
 PARAMETER stop "<|im_end|>"
 ```

 For structured output tasks, default to:

 ```text
 temperature = 0
 top_p = 1
 repeat_penalty = 1.05
 ```
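
These sampling settings can also be passed per request through Ollama's local HTTP API instead of being baked into the Modelfile. The sketch below uses only the standard library and assumes Ollama is listening on its default port; the model name `invoice-extractor` is hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model: str, prompt: str, system: str = "") -> urllib.request.Request:
    """Build a non-streaming generate request with deterministic sampling options."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "top_p": 1, "repeat_penalty": 1.05},
    }
    if system:
        payload["system"] = system
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, system: str = "", timeout: float = 60.0) -> str:
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt, system), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


if __name__ == "__main__":
    print(generate("invoice-extractor", "Invoice #A-102, total $93.20, due Friday."))
```

Running the golden set through `generate` exercises the same template and stop tokens that production traffic will hit.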

 ---

 ## Production Data Flywheel

 For production systems, build a feedback loop.

 ```text
 production requests
  → log input, output, latency, parser result, user correction
  → remove PII / secrets
  → sample failures and edge cases
  → label or teacher-repair examples
  → add to eval set or training set
  → fine-tune candidate models
  → evaluate against locked benchmark
  → shadow deploy
  → canary deploy
  → monitor regressions
 ```

 Important rule:

 > Production logs should feed evals first, training second.

 If a failure appears in production, add it to the eval set so future models cannot regress.

 ---

 ## Privacy and Safety

 Before using production data:

 - Remove PII.
 - Remove secrets, API keys, tokens, passwords, private URLs.
 - Check license and consent.
 - Avoid training on copyrighted or restricted content unless permitted.
 - Avoid memorizing user data.
 - Keep a data deletion path.
 - Maintain dataset provenance.
 - Separate sensitive evals from public examples.
 - For high-stakes domains, require expert review.

 Do not tell the user that synthetic data removes all legal or safety concerns. It does not.
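
As a first pass before human or tool-assisted review, obvious identifiers can be masked with regular expressions. The patterns below are illustrative and far from complete: they cover email addresses and a few secret-like token prefixes, and they should not be treated as sufficient PII removal on their own.

```python
import re

# Illustrative patterns only; real pipelines need broader coverage and review.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SECRET", re.compile(r"\b(?:sk-|ghp_|AKIA)[A-Za-z0-9_-]{16,}")),
]


def redact(text: str) -> str:
    """Replace each match with a bracketed placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Contact jane.doe@example.com with key sk-abcdefghijklmnopqrstuv"))
```

Placeholders keep the sentence structure intact, so redacted logs remain usable as realistic eval or training inputs.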

 ---

 ## Failure Diagnosis

 ### Training Loss Improves, Eval Does Not

 Likely causes:

 - Train/eval distribution mismatch.
 - Dataset rows are too easy.
 - Labels are noisy.
 - Output contract is ambiguous.
 - Eval metric is wrong.
 - Model memorized templates.

 Fix:

 - Inspect failures manually.
 - Improve eval representativeness.
 - Add harder examples.
 - Deduplicate.
 - Rewrite task spec.

 ### JSON Is Often Invalid

 Likely causes:

 - Training outputs include prose.
 - Schema is too complex.
 - Prompt and training format disagree.
 - Temperature too high.
 - Chat template mismatch after export.

 Fix:

 - Add schema-only examples.
 - Validate every row with Pydantic.
 - Use constrained decoding if available.
 - Set temperature to 0.
 - Recheck chat template and EOS token.

 ### Fine-Tuned Model Is Worse Than Base

 Likely causes:

 - Bad data.
 - Too high learning rate.
 - Too many epochs.
 - Wrong chat template.
 - Wrong target modules.
 - Eval leakage or label conflict.
 - Base model unsuitable.

 Fix:

 - Train on 100 perfect examples as a sanity check.
 - Lower learning rate.
 - Reduce epochs.
 - Try smaller rank.
 - Try another base model.
 - Verify dataset formatting.

 ### Model Forgets General Skills

 Likely causes:

 - Adapter rank too high.
 - Dataset too narrow.
 - Training too long.
 - Learning rate too high.

 Fix:

 - Lower rank.
 - Add general instruction mix.
 - Use fewer epochs.
 - Use lower learning rate.
 - Consider prompt/RAG instead of fine-tuning.

 ### Model Works in Notebook but Fails in Ollama / llama.cpp

 Likely causes:

 - Chat template mismatch.
 - EOS token mismatch.
 - Bad GGUF conversion.
 - Wrong stop tokens.
 - Quantization too aggressive.

 Fix:

 - Compare exact prompts.
 - Run golden set before and after export.
 - Try q8_0 or f16.
 - Fix stop tokens.
 - Use official template.

 ---

 ## Cost Model

 Estimate cost before training.

 Track:

 ```text
 teacher_generation_cost
 data_review_cost
 gpu_training_cost
 eval_inference_cost
 deployment_inference_cost
 engineering_time
 ```

 Simple formula:

 ```text
 total_experiment_cost =
  teacher_generation_cost
 + data_review_cost
 + training_gpu_hours * gpu_hourly_rate
 + eval_inference_cost
 ```

 Production inference formula:

 ```text
 monthly_cost =
  requests_per_month
 * average_tokens_per_request
 * cost_per_token
 + hosting_cost
 ```

 A fine-tune is economically useful only if:

 ```text
 savings_per_month > maintenance_cost_per_month
 ```

 or if it unlocks product behavior that prompting cannot reliably achieve.
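
A worked example of the two formulas, with made-up numbers chosen only to show the arithmetic. The `months_to_break_even` helper is an illustrative addition that divides the one-time experiment cost by net monthly savings.

```python
def monthly_cost(requests_per_month: int, average_tokens_per_request: int,
                 cost_per_token: float, hosting_cost: float = 0.0) -> float:
    """Production inference formula: per-token spend plus fixed hosting."""
    return requests_per_month * average_tokens_per_request * cost_per_token + hosting_cost


def months_to_break_even(experiment_cost: float, savings_per_month: float,
                         maintenance_cost_per_month: float):
    """Return months until the experiment pays for itself, or None if it never does."""
    net = savings_per_month - maintenance_cost_per_month
    if net <= 0:
        return None
    return experiment_cost / net


# Hypothetical numbers: 1M requests per month at 800 tokens each.
large = monthly_cost(1_000_000, 800, cost_per_token=2e-6)                 # hosted API
small = monthly_cost(1_000_000, 800, cost_per_token=0.0, hosting_cost=400.0)  # self-hosted
savings = large - small
print(round(large, 2), round(small, 2), round(savings, 2))
print(months_to_break_even(1800.0, savings, maintenance_cost_per_month=300.0))
```

In this example the larger model costs about 1,600 per month, the self-hosted small model 400, and the experiment pays back in roughly two months after maintenance.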

 ---

 ## Deliverables

 When using this skill, produce practical artifacts.

 Minimum deliverables:

 1. `task.md`
 2. `data_spec.md`
 3. `output_schema.json` if structured
 4. `quality_gates.md`
 5. `generation_prompt.md`
 6. `eval_plan.md`
 7. `baseline_report.md`
 8. `train_config.yaml`
 9. `eval_report.md`
 10. `model_card.md`

 For a full project, also produce:

 - `scripts/validate_jsonl.py`
 - `scripts/run_eval.py`
 - `scripts/analyze_errors.py`
 - `training/train_unsloth.py`
 - `Modelfile`
 - `README.md`
 - `CHANGELOG.md`

 ---

 ## Model Card Template

 ```markdown
 # Model Card: {MODEL_NAME}

 ## Base Model
 {BASE_MODEL}

 ## Fine-Tuning Method
 LoRA / QLoRA / full fine-tune / DPO / GRPO

 ## Task
 {TASK_DESCRIPTION}

 ## Intended Use
 {USE_CASES}

 ## Non-Goals
 {NON_GOALS}

 ## Dataset
 - Training dataset version:
 - Number of examples:
 - Synthetic / human / production mix:
 - Data sources:
 - Filtering:
 - Known limitations:

 ## Evaluation
 | Benchmark | Base | Prompted Base | Fine-Tuned | Larger Model |
 |---|---:|---:|---:|---:|

 ## Safety
 - Refusal behavior:
 - Sensitive data handling:
 - Known risks:

 ## Deployment
 - Quantization:
 - Runtime:
 - Hardware:
 - Latency:
 - Memory:

 ## Limitations
 {LIMITATIONS}

 ## Change Log
 {CHANGELOG}
 ```

 ---

 ## Agent Behavior

 When assisting the user, behave like an applied ML engineer, not a hype writer.

 Always ask or infer:

 1. What is the exact task?
 2. What is the output contract?
 3. What examples exist?
 4. What metric matters?
 5. What baseline must be beaten?
 6. Where will it run?
 7. What is the latency and cost target?
 8. What is the acceptable failure mode?

 If information is missing, make a reasonable first-pass assumption and label it clearly.

 Do not block progress with too many questions. Produce a concrete plan and mark assumptions.

 ---

 ## Anti-Patterns

 Avoid these mistakes:

 - Training before evals.
 - Generating huge synthetic datasets before testing 500 examples.
 - Using only easy examples.
 - Reporting training loss as success.
 - Comparing fine-tuned model only against the untuned base.
 - Ignoring deployment behavior after quantization.
 - Mixing chat templates.
 - Training on malformed assistant outputs.
 - Overfitting to synthetic style.
 - Treating LLM judge scores as the only metric.
 - Ignoring false positives and false negatives.
 - Using production data without privacy review.
 - Assuming a 4B model can replace frontier models for broad reasoning.

 ---

 ## Practical One-Day Plan

 For a beginner fine-tune:

 ### Hour 1: Scope

 - Define task.
 - Define output schema.
 - Write 20 manual examples.
 - Write eval metrics.

 ### Hour 2: Eval

 - Create 100 dev examples.
 - Run base model.
 - Run prompted base model.
 - Save baseline report.

 ### Hour 3: Dataset Factory

 - Write generation prompt.
 - Generate 500 examples.
 - Validate and dedupe.
 - Manually inspect 50 accepted examples.

 ### Hour 4–5: Train

 - Run QLoRA on 0.5B–4B model.
 - Save adapter.
 - Track config.

 ### Hour 6: Evaluate

 - Run dev eval.
 - Compare against baseline.
 - Inspect failures.

 ### Hour 7: Repair

 - Generate targeted examples for top 3 failure clusters.
 - Retrain or continue training.

 ### Hour 8: Export

 - Merge adapter.
 - Export GGUF.
 - Run golden set through local runtime.
 - Write model card.

 ---

 ## Final Rule

 A small fine-tuned model wins when it is:

 - Narrower than a general model.
 - Cheaper than repeated frontier calls.
 - Faster than cloud inference.
 - More consistent than prompting.
 - Evaluated more honestly than a demo.
 - Maintained by a dataset factory.

 Do not sell fine-tuning as magic.

 Sell it as compilation:

 > Prompting tells a model what to do at runtime. Fine-tuning compiles repeated behavior into a smaller model so it can do one job cheaply, quickly, and consistently.