@soasme
Created June 2, 2026 21:33
small-model-finetuning.md
---
name: small-model-finetuning
description: Build, evaluate, fine-tune, quantize, and deploy small language models for narrow production tasks using an eval-first dataset factory workflow.
version: 1.0.0
language: en
tags:
- small-language-models
- fine-tuning
- qlora
- lora
- unsloth
- qwen
- dataset-factory
- evals
- gguf
- ollama
- llama.cpp
---
# Small Model Fine-Tuning
## Mission
Use this skill to help a user turn a narrow AI task into a specialized small language model.
The goal is not to “train a smarter general model.” The goal is to build a cheaper, faster, more reliable specialist for one well-defined job.
A good small-model fine-tune should answer this question:
> Can a 0.5B–9B open model, trained on high-quality task data and evaluated against realistic cases, beat a larger general-purpose model on this specific task at lower latency and cost?
The core belief of this skill:
> The moat is usually not the fine-tune itself. The moat is the dataset factory: how examples are generated, filtered, evaluated, repaired, versioned, and continuously improved.
---
## Use This Skill When
Use this skill when the user wants to:
- Fine-tune a 0.5B–13B open-source language model.
- Build a specialized model for a narrow vertical task.
- Replace expensive large-model inference with a smaller model.
- Train a model for structured output, classification, extraction, rewriting, policy checking, tool calling, routing, or domain QA.
- Design a synthetic dataset pipeline.
- Build evals before training.
- Compare prompting, RAG, distillation, SFT, LoRA, QLoRA, DPO, or GRPO.
- Export a fine-tuned model to GGUF and run it through llama.cpp, Ollama, LM Studio, Jan, Open WebUI, or a local server.
- Build a production data flywheel from logs → evals → training data → model variants → deployment.
Do **not** use this skill when:
- The user only needs a one-off prompt.
- The task requires broad general reasoning rather than narrow specialization.
- The user has no clear input/output contract.
- There is no evaluation set and the user refuses to create one.
- The domain is high-stakes (medical, legal, financial, employment, or safety-critical decisions) and no expert validation is available.
---
## First Principle
Fine-tuning is not the first step.
The correct order is:
1. Define the task.
2. Define the output contract.
3. Build the eval set.
4. Build a baseline.
5. Build the dataset factory.
6. Train the smallest reasonable model.
7. Evaluate against baselines.
8. Quantize and test deployment behavior.
9. Iterate from failure cases.
If the user wants to jump directly into training, redirect them to evals and data design first.
---
## Decision Tree: Should We Fine-Tune?
Before recommending fine-tuning, classify the task.
### Use Prompting First When
- The task changes frequently.
- The output style is flexible.
- The user has fewer than 50 high-quality examples.
- Latency and inference cost are not major problems.
- The model only needs to follow a few instructions.
### Use RAG First When
- The main problem is missing knowledge.
- Answers must cite or retrieve changing documents.
- The task depends on a large private corpus.
- The output can be generated by a strong model using retrieved context.
### Use Fine-Tuning When
- The task has a stable input/output pattern.
- The model must follow a strict format.
- The same behavior is needed thousands or millions of times.
- Prompting is too expensive, slow, or inconsistent.
- The user has examples or can generate realistic synthetic examples.
- The task is narrow enough to evaluate automatically or semi-automatically.
- The behavior should be “compiled” into weights rather than repeated in every prompt.
### Use Distillation When
- A strong teacher model already performs well.
- The user wants a cheaper student model to imitate the teacher.
- The output space is narrow enough that imitation is useful.
- The user can generate many teacher-labeled examples and verify them.
### Use Preference Optimization When
- SFT gets the general task right but quality, ranking, tone, safety, or decision preference is still weak.
- There are preference pairs, rejected outputs, reward functions, or verifiable constraints.
- The user can measure better-versus-worse outputs.
---
## Recommended Model Sizes
Default to the smallest model that can plausibly solve the task.
| Task Type | Starting Size | Notes |
|---|---:|---|
| Binary / multi-class classification | 0.5B–3B | Often does not need generative training. Consider encoder models too. |
| Structured extraction | 1B–4B | Good fit for SFT if schema is stable. |
| JSON / XML / function-call formatting | 1B–4B | Fine-tuning helps strict output consistency. |
| Rewriting into fixed style | 1B–7B | Needs strong style examples and negative tests. |
| Domain-specific QA | 3B–9B | Use RAG if knowledge changes often. |
| Tool-use planning | 3B–9B | Works best with fixed tool catalog and structured traces. |
| Code review / policy checking | 4B–14B | Requires long context, strong evals, and realistic diffs. |
| Complex multi-step reasoning | 8B–32B+ | Fine-tuning may not be enough; consider larger base or agentic scaffold. |
Do not assume larger is better. Larger models cost more to train and serve, and may be harder to deploy locally.
---
## Recommended Base Models
Treat model choice as an experiment, not a belief.
Start with two or three candidates:
- Qwen small dense models, especially 0.6B, 1.7B, 4B, 8B, or 14B depending on task complexity.
- Llama-family small instruct models when license and ecosystem fit.
- Gemma-family small models for tasks where they benchmark well.
- Phi-family models for compact local use cases.
- Mistral-family 7B models for classic general-purpose PEFT baselines.
Selection criteria:
1. License permits the intended use.
2. Tokenizer handles the target language and symbols.
3. Context length fits the input.
4. Base model already does “something close” before training.
5. Instruct/chat template is well documented.
6. Model has known Unsloth, Transformers, Axolotl, or LLaMA-Factory support.
7. Model can be quantized and served in the target environment.
Avoid using a model that fails completely at the base task unless the user has a large dataset and time to experiment.
---
## Default Training Stack
Use this default stack unless the user has another preference.
| Layer | Default | Why |
|---|---|---|
| Planning and scripts | Codex / Claude / ChatGPT | Good for writing data generators, validators, training scripts, eval scripts. |
| Synthetic data generation | Strong teacher model | Use a larger model to generate examples for a smaller student. |
| Training | Unsloth or TRL + PEFT | Practical LoRA/QLoRA workflows. |
| Data format | JSONL | Easy to diff, validate, shard, and version. |
| Tracking | Weights & Biases / MLflow / local CSV | Track model, data, hyperparameters, evals. |
| Deployment export | GGUF | Works with llama.cpp, Ollama, LM Studio, Jan, Open WebUI. |
| Serving | llama.cpp / Ollama / vLLM / SGLang | Choose based on latency, batching, and hardware. |
For beginners, prefer Unsloth + QLoRA + an instruct model + a JSONL conversational dataset.
---
## The Dataset Factory
The dataset factory is the center of the workflow.
It is a repeatable pipeline:
```text
task spec
→ seed examples
→ synthetic generation
→ validation
→ deduplication
→ diversity balancing
→ train/dev/test split
→ baseline eval
→ fine-tune
→ error analysis
→ targeted data repair
→ next model run
```
The factory must be versioned. A fine-tune without dataset versioning is not reproducible.
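One way to make dataset versions checkable is to record a content fingerprint for every accepted dataset in `registry/datasets.jsonl`. The following is a minimal sketch; `dataset_fingerprint` is a hypothetical helper, and the 16-character truncation is an arbitrary choice.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a short, order-independent SHA-256 fingerprint for a list of JSON rows.

    Hypothetical helper: rows are canonicalized (sorted keys, compact separators)
    so that formatting and row order do not change the hash.
    """
    canonical = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for row in rows
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:16]


rows_v1 = [
    {"input": "I was charged twice.", "output": "billing_issue"},
    {"input": "App crashes on login.", "output": "bug_report"},
]
rows_v2 = rows_v1 + [{"input": "How do I export data?", "output": "how_to"}]

print(dataset_fingerprint(rows_v1))
print(dataset_fingerprint(rows_v2))
```

Storing the fingerprint next to each training run makes it possible to tell later exactly which rows a model saw.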
---
## Dataset Factory Directory Structure
Use this structure by default:
```text
model-project/
  README.md
  task.md
  data/
    raw/
    generated/
    accepted/
    rejected/
    eval/
      dev.jsonl
      test.jsonl
      adversarial.jsonl
      golden.jsonl
  specs/
    data_spec.md
    output_schema.json
    quality_gates.md
    generation_prompts/
  scripts/
    generate_batch.py
    validate_jsonl.py
    dedupe.py
    split_data.py
    run_baseline.py
    run_eval.py
    analyze_errors.py
  training/
    train_unsloth.py
    train_trl_peft.py
    configs/
  outputs/
    adapters/
    merged/
    gguf/
    eval_reports/
  registry/
    runs.jsonl
    datasets.jsonl
```
---
## Data Types
A strong dataset should include more than “happy path” examples.
Include:
1. **Canonical examples**
The most common inputs and ideal outputs.
2. **Boundary examples**
Long input, short input, empty field, odd formatting, missing optional data.
3. **Negative examples**
Inputs where the model should refuse, return null, classify as invalid, or avoid action.
4. **Adversarial examples**
Prompt injection, misleading phrasing, conflicting instructions, schema-breaking input.
5. **Near-miss examples**
Cases that are similar but require different outputs.
6. **Production-like examples**
Real logs or realistic simulations matching actual user input distribution.
7. **Style examples**
If tone matters, include short, explicit examples of the exact target style.
8. **Schema repair examples**
Inputs that tempt the model to produce malformed JSON, unsupported enum values, or extra prose.
---
## Dataset Formats
### Simple Instruction Format
Use for basic SFT.
```json
{"instruction":"Classify this support ticket.","input":"I was charged twice for my subscription.","output":"billing_issue"}
```
### Chat Format
Use for instruct/chat models.
```json
{
  "messages": [
    {"role": "system", "content": "You are a strict JSON extraction model."},
    {"role": "user", "content": "Extract invoice fields from: Invoice #A-102, total $93.20, due Friday."},
    {"role": "assistant", "content": "{\"invoice_id\":\"A-102\",\"total\":93.20,\"due_date\":\"Friday\"}"}
  ]
}
```
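Rows in the simple instruction format can be converted to the chat format mechanically. This is a sketch under assumptions: `instruction_to_chat` is a hypothetical helper, and joining instruction and input with a blank line is one convention among several.

```python
import json


def instruction_to_chat(row, system_prompt=None):
    """Convert an instruction/input/output row into a chat-style "messages" row."""
    user_content = row["instruction"]
    if row.get("input"):
        # Assumption: instruction and input are separated by a blank line.
        user_content = f"{row['instruction']}\n\n{row['input']}"
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": row["output"]})
    return {"messages": messages}


row = {
    "instruction": "Classify this support ticket.",
    "input": "I was charged twice for my subscription.",
    "output": "billing_issue",
}
chat_row = instruction_to_chat(row, system_prompt="You are a ticket classifier.")
print(json.dumps(chat_row))
```

Whatever convention is chosen, use the same one at training time and at inference time.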
### Tool-Calling Format
Use for fixed tool catalogs.
```json
{
  "messages": [
    {"role": "system", "content": "Return exactly one tool call as JSON."},
    {"role": "user", "content": "Book a reminder for tomorrow at 9am to call Sam."},
    {"role": "assistant", "content": "{\"tool\":\"create_reminder\",\"arguments\":{\"date\":\"tomorrow\",\"time\":\"09:00\",\"text\":\"call Sam\"}}"}
  ]
}
```
### Preference Pair Format
Use after SFT if optimizing quality preferences.
```json
{
  "prompt": "Rewrite this bug report clearly: app broke again",
  "chosen": "The app fails during startup. Please include the error message and reproduction steps.",
  "rejected": "The app broke. Try restarting."
}
```
---
## Quality Gates
Every generated row must pass gates before entering `data/accepted`.
### Required Gates
1. **JSON validity**
The row must parse.
2. **Schema validity**
Required fields exist. No unsupported fields. Enum values are valid.
3. **Output contract validity**
The assistant output exactly matches the expected format.
4. **No leaked generation instructions**
The output must not mention the teacher, rubric, hidden rules, or prompt.
5. **No duplicate examples**
Deduplicate by exact match and semantic similarity.
6. **No trivial examples only**
Reject overly short, repetitive, obvious, or template-like examples.
7. **Diversity coverage**
Maintain quotas across categories, difficulty levels, domains, languages, and edge cases.
8. **Length safety**
Reject examples exceeding target context length.
9. **Label consistency**
Similar inputs should not produce contradictory labels unless intentionally contrasted.
10. **Eval leakage prevention**
Training examples must not duplicate or closely paraphrase test examples.
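Gates 1 to 4 are cheap to automate. The sketch below checks one chat-format JSONL line against them using only the standard library. `REQUIRED_FIELDS` and `LEAK_MARKERS` are hypothetical values for the invoice example, and a real `validate_jsonl.py` would likely use a proper schema validator instead.

```python
import json

# Hypothetical contract for the invoice example; replace with the real schema.
REQUIRED_FIELDS = {"invoice_id", "total", "due_date"}
# Hypothetical substrings that suggest leaked generation instructions.
LEAK_MARKERS = ("teacher model", "rubric", "hidden rule", "this prompt")


def check_row(line):
    """Return the names of failed gates for one JSONL line (empty list = pass)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return ["json_validity"]
    messages = row.get("messages") if isinstance(row, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["schema_validity"]
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        return ["schema_validity"]
    content = last.get("content")
    try:
        output = json.loads(content)  # the assistant turn must be pure JSON
    except (json.JSONDecodeError, TypeError):
        return ["output_contract"]
    failures = []
    if not isinstance(output, dict) or set(output) != REQUIRED_FIELDS:
        failures.append("output_contract")
    if any(marker in content.lower() for marker in LEAK_MARKERS):
        failures.append("leaked_instructions")
    return failures


good_line = json.dumps({"messages": [
    {"role": "user", "content": "Extract invoice fields from: Invoice #A-102, total $93.20, due Friday."},
    {"role": "assistant", "content": '{"invoice_id":"A-102","total":93.20,"due_date":"Friday"}'},
]})
prose_line = json.dumps({"messages": [
    {"role": "user", "content": "Extract invoice fields from: Invoice #B-7, total $10, due Monday."},
    {"role": "assistant", "content": 'Sure! {"invoice_id":"B-7","total":10,"due_date":"Monday"}'},
]})

print(check_row(good_line))    # []
print(check_row(prose_line))   # ['output_contract']
print(check_row("{not json"))  # ['json_validity']
```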
### Optional Gates
- Regex validation for exact structured output.
- Pydantic validation for JSON.
- Unit tests for tool-call arguments.
- LLM judge for nuanced correctness.
- Embedding clustering for diversity.
- Human review queue for high-risk examples.
- Domain expert validation for specialized fields.
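The duplicate and eval-leakage gates need some notion of "closely paraphrased". As a rough, dependency-free stand-in for embedding similarity, the sketch below compares word 3-gram sets with Jaccard similarity. The function names and the 0.6 threshold are assumptions to tune per task, and lexical overlap will miss paraphrases that share few words.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams in lowercased, punctuation-stripped text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_leaks(train_inputs, test_inputs, threshold=0.6):
    """Return (train_index, test_index, similarity) for pairs at or above threshold."""
    test_sets = [shingles(t) for t in test_inputs]
    leaks = []
    for i, text in enumerate(train_inputs):
        train_set = shingles(text)
        for j, test_set in enumerate(test_sets):
            score = jaccard(train_set, test_set)
            if score >= threshold:
                leaks.append((i, j, round(score, 2)))
    return leaks


train = [
    "I was charged twice for my subscription this month.",
    "The app crashes when I open settings.",
]
test = [
    "I was charged twice for my subscription this month!",
    "How do I reset my password?",
]
print(find_leaks(train, test))  # [(0, 0, 1.0)]
```

The pairwise loop is quadratic, which is fine for a few thousand rows. Larger sets would need an index or an approximate method.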
---
## Dataset Quality Rubric
Score each candidate row from 1 to 5.
| Dimension | 1 | 3 | 5 |
|---|---|---|---|
| Realism | Toy input | Plausible but generic | Looks like production data |
| Correctness | Wrong | Mostly right | Fully correct |
| Format | Broken | Minor issues | Exact contract |
| Difficulty | Too easy | Medium | Teaches a real edge case |
| Diversity | Duplicate | Some variation | Adds new coverage |
| Learnability | Ambiguous | Partly clear | Clear signal for model |
| Safety | Risky | Acceptable | Safe and bounded |
Default acceptance rule:
```text
accept if total_score >= 26/35
and correctness >= 4
and format == 5
and safety >= 4
```
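The acceptance rule translates directly into a small function. This is a sketch; `accept_row` is a hypothetical name, and the scores are assumed to come from a human reviewer or an LLM judge.

```python
DIMENSIONS = (
    "realism", "correctness", "format", "difficulty",
    "diversity", "learnability", "safety",
)


def accept_row(scores):
    """Apply the default acceptance rule to a dict of 1-5 rubric scores."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    total = sum(scores[d] for d in DIMENSIONS)
    return (
        total >= 26
        and scores["correctness"] >= 4
        and scores["format"] == 5
        and scores["safety"] >= 4
    )


strong = dict.fromkeys(DIMENSIONS, 4) | {"format": 5}       # total 29
weak_format = dict.fromkeys(DIMENSIONS, 5) | {"format": 4}  # total 34, format fails
low_total = dict.fromkeys(DIMENSIONS, 3) | {"correctness": 4, "format": 5, "safety": 4}  # total 25

print(accept_row(strong), accept_row(weak_format), accept_row(low_total))  # True False False
```

Note that a row with a near-perfect total is still rejected if its format score is below 5, which matches the intent that format errors are never acceptable in training data.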
---
## Data Volume Heuristics
Start small and iterate.
| Stage | Example Count | Goal |
|---|---:|---|
| Seed set | 50–200 | Define the task and output contract. |
| First eval set | 100–300 | Measure baseline and failure modes. |
| First SFT run | 500–2,000 | Prove the model can learn the behavior. |
| Serious v1 | 2,000–10,000 | Cover categories and edge cases. |
| Production v1 | 10,000–100,000+ | Use logs, targeted repair, and balanced distribution. |
Do not generate 100,000 examples before proving that 500 examples move the metric.
Quality beats volume, but insufficient coverage causes brittle behavior.
---
## Train / Dev / Test Split
Use this default split:
```text
train: 80%
dev: 10%
test: 10%
```
For small datasets:
```text
train: 70%
dev: 15%
test: 15%
```
Rules:
- Split by scenario, source document, customer, or cluster when leakage is possible.
- Never tune prompts or training recipes on the final test set.
- Keep a locked golden set for release decisions.
- Add an adversarial set for instruction-following and format robustness.
- Add a production holdout set once real logs exist.
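Splitting by group rather than by row can be done with a stable hash of the group key, so every row from the same customer or source document lands in the same split across reruns. A minimal sketch, assuming a hypothetical `customer` field as the group key:

```python
import hashlib


def assign_split(group_key, train=0.8, dev=0.1):
    """Deterministically map a group key to 'train', 'dev', or 'test'."""
    digest = hashlib.sha256(group_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + dev:
        return "dev"
    return "test"


# Hypothetical rows; "customer" is the grouping field for this sketch.
rows = [
    {"customer": "acme", "input": "Invoice A-1, total $10"},
    {"customer": "acme", "input": "Invoice A-2, total $12"},
    {"customer": "globex", "input": "Invoice G-9, total $99"},
    {"customer": "initech", "input": "Invoice I-3, total $5"},
]
splits = {"train": [], "dev": [], "test": []}
for row in rows:
    splits[assign_split(row["customer"])].append(row)

for name, members in splits.items():
    print(name, [r["customer"] for r in members])
```

With only a handful of groups the realized proportions can be far from 80/10/10, so check the split sizes after assignment.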
---
## Baselines
Always run baselines before training.
Minimum baselines:
1. Base model without fine-tuning.
2. Base model with better prompt.
3. Larger general-purpose model.
4. Simple deterministic program, if applicable.
5. RAG pipeline, if knowledge retrieval is involved.
A fine-tune is successful only if it beats the relevant baseline on the target metric, not merely because the training loss decreased.
---
## Evaluation Metrics
Choose metrics based on task type.
| Task | Primary Metrics |
|---|---|
| Classification | accuracy, macro-F1, confusion matrix |
| Extraction | exact match, field-level F1, schema validity |
| JSON generation | parse rate, schema pass rate, exact key match |
| Tool calling | tool accuracy, argument accuracy, executable success |
| Rewriting | rubric score, pairwise preference, constraint pass rate |
| QA | answer correctness, citation support, hallucination rate |
| Code/policy review | issue detection F1, false positive rate, severity calibration |
| Agent planning | step validity, tool sequence accuracy, execution success |
| Safety/policy | violation recall, false refusal rate, jailbreak resistance |
Always report:
```text
base_model_score
prompted_base_score
fine_tuned_model_score
larger_model_score
latency
tokens_per_second
cost_per_1k_requests
schema_error_rate
failure_examples
```
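Two of the metrics above, macro-F1 and JSON parse rate, need no dependencies. The sketch below is one straightforward implementation with hypothetical function names. A library such as scikit-learn is a reasonable replacement once the project has one installed.

```python
import json


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def json_parse_rate(outputs):
    """Fraction of raw model outputs that parse as JSON."""
    def parses(text):
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(parses(o) for o in outputs) / len(outputs)


y_true = ["billing", "billing", "bug", "how_to"]
y_pred = ["billing", "bug", "bug", "how_to"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7778

raw_outputs = ['{"a": 1}', 'Sure! {"a": 1}', "[]", '{"a": ']
print(json_parse_rate(raw_outputs))  # 0.5
```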
---
## Release Gates
Do not ship a fine-tuned model unless it passes release gates.
Default gates:
```text
primary_metric >= target
schema_validity >= 99%
regression_on_golden_set == 0 critical failures
latency <= target_latency
cost <= target_cost
no severe safety regression
eval report reviewed
model card written
dataset version recorded
```
For tool calling:
```text
tool_name_accuracy >= 98%
required_argument_accuracy >= 95%
json_parse_rate >= 99.5%
unsafe_tool_call_rate == 0 on safety set
```
For extraction:
```text
record_parse_rate >= 99.5%
field_f1 >= target
critical_field_accuracy >= 98%
```
For classification:
```text
macro_f1 >= target
false_positive_rate <= target
false_negative_rate <= target
calibration checked
```
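Release gates are easiest to enforce when they are checked by code rather than by eye. The sketch below encodes the tool-calling gates as data. `check_release_gates`, the dictionary layout, and the `run_001` numbers are all illustrative assumptions, not results.

```python
def check_release_gates(metrics, targets):
    """Return the names of failed gates; an empty list means every gate passed.

    `targets` maps metric name -> (comparison, threshold), where comparison is
    ">=" or "<=". Both dictionaries are hypothetical structures for this sketch.
    """
    failed = []
    for name, (comparison, threshold) in targets.items():
        value = metrics.get(name)
        if value is None:
            failed.append(name)  # a missing metric counts as a failure
        elif comparison == ">=":
            if value < threshold:
                failed.append(name)
        elif comparison == "<=":
            if value > threshold:
                failed.append(name)
        else:
            raise ValueError(f"unknown comparison: {comparison}")
    return failed


TOOL_CALLING_GATES = {
    "tool_name_accuracy": (">=", 0.98),
    "required_argument_accuracy": (">=", 0.95),
    "json_parse_rate": (">=", 0.995),
    "unsafe_tool_call_rate": ("<=", 0.0),
}
# Made-up metrics for illustration only.
run_001 = {
    "tool_name_accuracy": 0.991,
    "required_argument_accuracy": 0.942,
    "json_parse_rate": 0.998,
    "unsafe_tool_call_rate": 0.0,
}
print(check_release_gates(run_001, TOOL_CALLING_GATES))  # ['required_argument_accuracy']
```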
---
## Default QLoRA Recipe
Use this as a starting point, not a universal truth.
```yaml
method: qlora
base_model: Qwen/Qwen3-4B-Instruct-2507  # example ID; confirm the exact repo name on the Hugging Face Hub
max_seq_length: 2048
load_in_4bit: true
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
learning_rate: 2.0e-4
batch_size_per_device: 2
gradient_accumulation_steps: 8
epochs: 1-3  # range; pick one value per run
warmup_ratio: 0.03
weight_decay: 0.01
lr_scheduler: cosine
optimizer: paged_adamw_8bit
packing: true
eval_steps: 50-200  # range; pick one value per run
save_steps: 50-200  # range; keep equal to or a multiple of eval_steps
early_stopping: true
```
Adjustments:
- If output format is unstable, add more schema-focused examples before changing hyperparameters.
- If model overfits, reduce epochs, lower rank, add dropout, or improve data diversity.
- If model underfits, increase rank, train longer, improve examples, or choose a stronger base model.
- If model forgets general behavior, reduce adapter capacity (lower rank or fewer target modules), reduce learning rate, or mix in general instruction examples.
- If long-context behavior fails, train and evaluate at the target sequence length.
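The recipe's batch settings determine how many optimizer steps a run takes, which in turn sets sensible values for `eval_steps` and warmup. A sketch of the arithmetic, assuming a single GPU and no packing (with `packing: true` several examples share one sequence, so the real step count will be lower):

```python
import math


def training_steps(num_examples, batch_size_per_device=2,
                   gradient_accumulation_steps=8, num_devices=1,
                   epochs=2, warmup_ratio=0.03):
    """Return (effective_batch_size, total_optimizer_steps, warmup_steps)."""
    effective_batch = batch_size_per_device * gradient_accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


print(training_steps(2000))  # (16, 250, 8)
```

With 2,000 examples and two epochs the whole run is only 250 steps, so an `eval_steps` of 200 would evaluate once. A value near 50 gives a usable curve.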
---
## LoRA Rank Heuristics
| Rank | Use Case |
|---:|---|
| 4–8 | Very narrow classification or formatting task. |
| 16 | Default starting point for narrow SFT. |
| 32 | More complex tool planning, extraction, style, or domain behavior. |
| 64+ | Harder domain adaptation, but higher risk of overfitting or forgetting. |
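Do not blindly increase rank. Higher rank adds task capacity but raises the risk of overfitting and of forgetting base-model behavior, so treat it as a trade-off between task quality and retention.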
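To see what rank costs, count the adapter parameters: a LoRA adapter on a weight matrix of shape `(d_out, d_in)` adds two low-rank factors with `rank * (d_in + d_out)` parameters in total. The block dimensions below are hypothetical round numbers, not those of any specific model. Real models with grouped-query attention have smaller `k_proj` and `v_proj`.

```python
def lora_trainable_params(layer_shapes, rank):
    """Count LoRA parameters: each adapted (d_out, d_in) matrix adds rank * (d_in + d_out)."""
    return sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)


# Hypothetical transformer dimensions, chosen only for illustration.
hidden, intermediate, num_layers = 2048, 8192, 24
per_layer = [
    (hidden, hidden),        # q_proj
    (hidden, hidden),        # k_proj
    (hidden, hidden),        # v_proj
    (hidden, hidden),        # o_proj
    (intermediate, hidden),  # gate_proj
    (intermediate, hidden),  # up_proj
    (hidden, intermediate),  # down_proj
]
for rank in (8, 16, 32, 64):
    total = lora_trainable_params(per_layer * num_layers, rank)
    print(f"rank {rank:>2}: {total / 1e6:.1f}M trainable parameters")
```

Adapter size grows linearly with rank, so doubling the rank doubles the trainable parameters without changing the base model at all.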
---
## Training Workflow
### Step 1: Write `task.md`
Include:
```markdown
# Task
## Goal
What should the model do?
## Inputs
What does the user provide?
## Output Contract
What exactly must the model return?
## Non-goals
What should the model not do?
## Failure Modes
What mistakes are unacceptable?
## Metrics
How will success be measured?
## Deployment Target
Where will the model run?
## Cost and Latency Target
What is acceptable?
```
### Step 2: Create Seed Examples
Write 50–200 examples manually or semi-manually.
Do not outsource all seed examples to a teacher model. Seed examples define the taste and contract.
### Step 3: Build Eval Set First
Create `dev.jsonl`, `test.jsonl`, `adversarial.jsonl`, and `golden.jsonl`.
Before training, run:
```bash
python scripts/run_baseline.py \
--model base \
--eval data/eval/dev.jsonl \
--out outputs/eval_reports/base_dev.json
```
### Step 4: Generate Synthetic Data
Use a teacher model to generate examples by category.
Generation prompt should include:
- Task description.
- Output schema.
- Example category.
- Difficulty level.
- Required edge case.
- Negative constraints.
- Format requirements.
- Self-check instructions.
### Step 5: Validate and Accept
Run:
```bash
python scripts/validate_jsonl.py data/generated/batch_001.jsonl
python scripts/dedupe.py data/generated/batch_001.jsonl --against data/accepted/
python scripts/split_data.py data/accepted/all.jsonl
```
### Step 6: Train First Small Adapter
Run one small training job.
Do not chase perfect hyperparameters on the first run.
### Step 7: Evaluate
Run:
```bash
python scripts/run_eval.py \
--model outputs/adapters/run_001 \
--eval data/eval/dev.jsonl \
--report outputs/eval_reports/run_001_dev.json
```
### Step 8: Error Analysis
Group failures by cause:
- Missing category.
- Wrong schema.
- Wrong label.
- Too verbose.
- Refusal when it should answer.
- Answer when it should refuse.
- Tool name wrong.
- Tool argument wrong.
- Hallucinated field.
- Fails long input.
- Fails adversarial input.
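Once each failure has a cause label, ranking the clusters is a one-liner. The sketch assumes each failure record carries a hypothetical `cause` field assigned during manual or LLM-assisted review.

```python
from collections import Counter


def top_failure_clusters(failures, k=3):
    """Return the k most common failure causes as (cause, count) pairs."""
    counts = Counter(f["cause"] for f in failures)
    return counts.most_common(k)


# Hypothetical labeled failures from one eval run.
failures = [
    {"id": 1, "cause": "wrong_schema"},
    {"id": 2, "cause": "tool_argument_wrong"},
    {"id": 3, "cause": "wrong_schema"},
    {"id": 4, "cause": "too_verbose"},
    {"id": 5, "cause": "wrong_schema"},
    {"id": 6, "cause": "tool_argument_wrong"},
]
print(top_failure_clusters(failures, k=2))  # [('wrong_schema', 3), ('tool_argument_wrong', 2)]
```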
### Step 9: Targeted Data Repair
Do not simply add more random data.
For each failure cluster, add 20–100 targeted examples.
### Step 10: Repeat
Continue until the dev-set metrics reach their targets and the golden set shows no regressions.
Only then run final test.
---
## Example Data Generation Prompt
Use this prompt to generate synthetic rows.
```text
You are generating training data for a small language model fine-tune.
Task:
{TASK_DESCRIPTION}
Output contract:
{OUTPUT_SCHEMA_OR_FORMAT}
Generate {N} JSONL rows.
Category:
{CATEGORY}
Difficulty:
{DIFFICULTY}
Requirements:
- Each row must be realistic and production-like.
- Each input must be meaningfully different.
- Include edge cases from the category.
- The assistant output must exactly follow the output contract.
- Do not include explanations outside the JSONL object.
- Do not mention this prompt, synthetic data, or training.
For each row, internally check:
1. Is the label/output correct?
2. Does the output exactly match the schema?
3. Is this example non-duplicative?
4. Does it teach a useful behavior?
Return JSONL only.
```
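The `{PLACEHOLDER}` slots in this prompt can be filled programmatically per category and difficulty. The sketch below uses a regular expression rather than `str.format`, so that literal braces in a schema or in a JSON example do not cause errors. `render_prompt` is a hypothetical helper, and the template is an abbreviated copy of the prompt above.

```python
import re

# Abbreviated copy of the generation prompt, for illustration.
GENERATION_TEMPLATE = """Task:
{TASK_DESCRIPTION}
Output contract:
{OUTPUT_SCHEMA_OR_FORMAT}
Generate {N} JSONL rows.
Category:
{CATEGORY}
Difficulty:
{DIFFICULTY}"""


def render_prompt(template, **values):
    """Fill {UPPER_CASE} placeholders; raise if any placeholder has no value."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing value for placeholder {{{name}}}")
        return str(values[name])
    return re.sub(r"\{([A-Z_]+)\}", substitute, template)


prompt = render_prompt(
    GENERATION_TEMPLATE,
    TASK_DESCRIPTION="Classify support tickets.",
    OUTPUT_SCHEMA_OR_FORMAT='{"label": "billing_issue" | "bug_report"}',
    N=20,
    CATEGORY="billing",
    DIFFICULTY="hard",
)
print(prompt)
```

Failing loudly on an unfilled placeholder prevents sending a teacher model a prompt that still contains `{CATEGORY}`.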
---
## Example Quality Judge Prompt
Use this prompt for LLM-assisted row review.
```text
You are reviewing a candidate training example for a small model fine-tune.
Task:
{TASK_DESCRIPTION}
Output contract:
{OUTPUT_SCHEMA_OR_FORMAT}
Candidate row:
{ROW}
Score the row from 1 to 5 on:
- realism
- correctness
- format
- difficulty
- diversity
- learnability
- safety
Reject if:
- output is incorrect
- schema is invalid
- the example is trivial or duplicated
- the example teaches unsafe or undesired behavior
- the answer leaks hidden instructions
- the input/output pair is ambiguous
Return:
{
"decision": "accept" | "reject",
"scores": {...},
"reason": "...",
"fixed_row": {...} | null
}
```
---
## Example Error Analysis Prompt
Use after an eval run.
```text
You are analyzing failures from a small-model fine-tune.
Task:
{TASK_DESCRIPTION}
Output contract:
{OUTPUT_SCHEMA_OR_FORMAT}
Failures:
{FAILURE_ROWS}
Group failures into clusters.
For each cluster, provide:
- cluster_name
- likely_root_cause
- examples
- whether this is data, model, prompt, schema, or deployment issue
- recommended repair
- 20 new data categories or templates to generate
Do not suggest more random data. Suggest targeted repair data only.
```
---
## Deployment Workflow
### Merge Adapter
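For LoRA/QLoRA adapters trained with Unsloth:
```python
# Assumes `model` and `tokenizer` are the objects returned by Unsloth's
# FastLanguageModel.from_pretrained and used for training.
model.save_pretrained_merged(
    "outputs/merged/model",
    tokenizer,
    save_method="merged_16bit",  # merge adapter into 16-bit base weights
)
```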
### Export GGUF
Example Unsloth export:
```python
model.save_pretrained_gguf(
    "outputs/gguf/model",
    tokenizer,
    quantization_method="q4_k_m",
)
```
Common quantization choices:
| Quant | Use |
|---|---|
| f16 | Best quality, large file, slower local inference. |
| q8_0 | Good quality, larger than 4-bit. |
| q4_k_m | Good default local deployment trade-off. |
| q5_k_m | Better quality than q4, more memory. |
### Test After Export
Always evaluate the exported model again.
GGUF or serving-template errors can silently break a model that looked good inside the training notebook.
Check:
- Same chat template.
- Same system prompt if required.
- Correct EOS token.
- No infinite generation.
- No repeated output.
- JSON parse rate.
- Latency and memory use.
- Behavior on golden set.
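Several of these checks can run automatically on raw outputs collected from the exported model. The sketch below is a heuristic smoke test with hypothetical names and thresholds: the character limit is a crude proxy for "hit the generation limit", and the repetition check only looks at the tail of each output.

```python
import json


def has_repeated_tail(text, min_len=20, repeats=3):
    """Heuristic: True if the text ends with one chunk repeated `repeats` times."""
    for size in range(min_len, len(text) // repeats + 1):
        if text.endswith(text[-size:] * repeats):
            return True
    return False


def smoke_check(outputs, max_chars=2000):
    """Summarize parse rate, suspected runaway generations, and repetition loops."""
    n = len(outputs)
    parsed = 0
    for text in outputs:
        try:
            json.loads(text)
            parsed += 1
        except json.JSONDecodeError:
            pass
    return {
        "json_parse_rate": parsed / n,
        "runaway_rate": sum(len(t) >= max_chars for t in outputs) / n,
        "repetition_rate": sum(has_repeated_tail(t) for t in outputs) / n,
    }


# Hypothetical raw outputs; the second shows a missed stop token, the third a loop.
outputs = [
    '{"label": "billing_issue"}',
    '{"label": "bug_report"}<|im_end|>\n<|im_start|>user',
    "The total is 93.20. " * 5,
    '{"label": "how_to"}',
]
print(smoke_check(outputs))
```

An output that continues past `<|im_end|>` usually points to a stop-token or template problem rather than a training problem.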
### Ollama Modelfile Example
```text
FROM ./model.Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER temperature 0
PARAMETER top_p 1
PARAMETER stop "<|im_end|>"
```
For structured output tasks, default to:
```text
temperature = 0
top_p = 1
repeat_penalty = 1.05
```
---
## Production Data Flywheel
For production systems, build a feedback loop.
```text
production requests
→ log input, output, latency, parser result, user correction
→ remove PII / secrets
→ sample failures and edge cases
→ label or teacher-repair examples
→ add to eval set or training set
→ fine-tune candidate models
→ evaluate against locked benchmark
→ shadow deploy
→ canary deploy
→ monitor regressions
```
Important rule:
> Production logs should feed evals first, training second.
If a failure appears in production, add it to the eval set so future models cannot regress.
---
## Privacy and Safety
Before using production data:
- Remove PII.
- Remove secrets, API keys, tokens, passwords, private URLs.
- Check license and consent.
- Avoid training on copyrighted or restricted content unless permitted.
- Avoid memorizing user data.
- Keep a data deletion path.
- Maintain dataset provenance.
- Separate sensitive evals from public examples.
- For high-stakes domains, require expert review.
Do not tell the user that synthetic data removes all legal or safety concerns. It does not.
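A first redaction pass over logs can be scripted, as long as it is treated as a pre-filter and not as proof that the data is clean. The patterns below are illustrative assumptions that catch only obvious emails and a few key-like token prefixes. Names, addresses, and most secrets need a dedicated tool and human review.

```python
import re

# Illustrative patterns only; they will miss many kinds of PII and secrets.
PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    (re.compile(r"\b(?:sk|pk|ghp)[-_][A-Za-z0-9_-]{16,}\b"), "<SECRET>"),
]


def redact(text):
    """Replace matches of each pattern with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


log_line = "Contact jane.doe@example.com, key sk-abcdefghijklmnop1234"
print(redact(log_line))  # Contact <EMAIL>, key <SECRET>
```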
---
## Failure Diagnosis
### Training Loss Improves, Eval Does Not
Likely causes:
- Train/eval distribution mismatch.
- Dataset rows are too easy.
- Labels are noisy.
- Output contract is ambiguous.
- Eval metric is wrong.
- Model memorized templates.
Fix:
- Inspect failures manually.
- Improve eval representativeness.
- Add harder examples.
- Deduplicate.
- Rewrite task spec.
### JSON Is Often Invalid
Likely causes:
- Training outputs include prose.
- Schema is too complex.
- Prompt and training format disagree.
- Temperature too high.
- Chat template mismatch after export.
Fix:
- Add schema-only examples.
- Validate every row with Pydantic.
- Use constrained decoding if available.
- Set temperature to 0.
- Recheck chat template and EOS token.
### Fine-Tuned Model Is Worse Than Base
Likely causes:
- Bad data.
- Too high learning rate.
- Too many epochs.
- Wrong chat template.
- Wrong target modules.
- Eval leakage or label conflict.
- Base model unsuitable.
Fix:
- Train on 100 perfect examples as a sanity check.
- Lower learning rate.
- Reduce epochs.
- Try smaller rank.
- Try another base model.
- Verify dataset formatting.
### Model Forgets General Skills
Likely causes:
- Adapter rank too high.
- Dataset too narrow.
- Training too long.
- Learning rate too high.
Fix:
- Lower rank.
- Add general instruction mix.
- Use fewer epochs.
- Use lower learning rate.
- Consider prompt/RAG instead of fine-tuning.
### Model Works in Notebook but Fails in Ollama / llama.cpp
Likely causes:
- Chat template mismatch.
- EOS token mismatch.
- Bad GGUF conversion.
- Wrong stop tokens.
- Quantization too aggressive.
Fix:
- Compare exact prompts.
- Run golden set before and after export.
- Try q8_0 or f16.
- Fix stop tokens.
- Use official template.
---
## Cost Model
Estimate cost before training.
Track:
```text
teacher_generation_cost
data_review_cost
gpu_training_cost
eval_inference_cost
deployment_inference_cost
engineering_time
```
Simple formula:
```text
total_experiment_cost =
  teacher_generation_cost
  + data_review_cost
  + training_gpu_hours * gpu_hourly_rate
  + eval_inference_cost
```
Production inference formula:
```text
monthly_cost =
requests_per_month
* average_tokens_per_request
* cost_per_token
+ hosting_cost
```
A fine-tune is economically useful only if:
```text
savings_per_month > maintenance_cost_per_month
```
or if it unlocks product behavior that prompting cannot reliably achieve.
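The formulas above are simple enough to run as a break-even check before any training. Every number in this sketch is a made-up input for illustration, not a measured price, and the payback calculation is an added convenience that assumes the one-off experiment cost is recovered from net monthly savings.

```python
def monthly_inference_cost(requests_per_month, avg_tokens_per_request,
                           cost_per_token, hosting_cost=0.0):
    """Production inference formula: per-token spend plus fixed hosting."""
    return requests_per_month * avg_tokens_per_request * cost_per_token + hosting_cost


# Made-up inputs for illustration only.
frontier = monthly_inference_cost(1_000_000, 800, 5e-6)
small = monthly_inference_cost(1_000_000, 800, 0.0, hosting_cost=600.0)
savings_per_month = frontier - small
maintenance_cost_per_month = 1_500.0
experiment_cost = 2_500.0

worthwhile = savings_per_month > maintenance_cost_per_month
payback_months = experiment_cost / (savings_per_month - maintenance_cost_per_month)

print(f"savings/month: {savings_per_month:.0f}")
print(f"worthwhile: {worthwhile}")
print(f"payback: {payback_months:.1f} months")
```

If `savings_per_month` does not exceed `maintenance_cost_per_month`, the payback figure is meaningless and the fine-tune has to be justified by behavior, not cost.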
---
## Deliverables
When using this skill, produce practical artifacts.
Minimum deliverables:
1. `task.md`
2. `data_spec.md`
3. `output_schema.json` if structured
4. `quality_gates.md`
5. `generation_prompt.md`
6. `eval_plan.md`
7. `baseline_report.md`
8. `train_config.yaml`
9. `eval_report.md`
10. `model_card.md`
For a full project, also produce:
- `scripts/validate_jsonl.py`
- `scripts/run_eval.py`
- `scripts/analyze_errors.py`
- `training/train_unsloth.py`
- `Modelfile`
- `README.md`
- `CHANGELOG.md`
---
## Model Card Template
```markdown
# Model Card: {MODEL_NAME}
## Base Model
{BASE_MODEL}
## Fine-Tuning Method
LoRA / QLoRA / full fine-tune / DPO / GRPO
## Task
{TASK_DESCRIPTION}
## Intended Use
{USE_CASES}
## Non-Goals
{NON_GOALS}
## Dataset
- Training dataset version:
- Number of examples:
- Synthetic / human / production mix:
- Data sources:
- Filtering:
- Known limitations:
## Evaluation
| Benchmark | Base | Prompted Base | Fine-Tuned | Larger Model |
|---|---:|---:|---:|---:|
## Safety
- Refusal behavior:
- Sensitive data handling:
- Known risks:
## Deployment
- Quantization:
- Runtime:
- Hardware:
- Latency:
- Memory:
## Limitations
{LIMITATIONS}
## Change Log
{CHANGELOG}
```
---
## Agent Behavior
When assisting the user, behave like an applied ML engineer, not a hype writer.
Always ask or infer:
1. What is the exact task?
2. What is the output contract?
3. What examples exist?
4. What metric matters?
5. What baseline must be beaten?
6. Where will it run?
7. What is the latency and cost target?
8. What is the acceptable failure mode?
If information is missing, make a reasonable first-pass assumption and label it clearly.
Do not block progress with too many questions. Produce a concrete plan and mark assumptions.
---
## Anti-Patterns
Avoid these mistakes:
- Training before evals.
- Generating huge synthetic datasets before testing 500 examples.
- Using only easy examples.
- Reporting training loss as success.
- Comparing fine-tuned model only against the untuned base.
- Ignoring deployment behavior after quantization.
- Mixing chat templates.
- Training on malformed assistant outputs.
- Overfitting to synthetic style.
- Treating LLM judge scores as the only metric.
- Ignoring false positives and false negatives.
- Using production data without privacy review.
- Assuming a 4B model can replace frontier models for broad reasoning.
---
## Practical One-Day Plan
For a beginner fine-tune:
### Hour 1: Scope
- Define task.
- Define output schema.
- Write 20 manual examples.
- Write eval metrics.
### Hour 2: Eval
- Create 100 dev examples.
- Run base model.
- Run prompted base model.
- Save baseline report.
### Hour 3: Dataset Factory
- Write generation prompt.
- Generate 500 examples.
- Validate and dedupe.
- Manually inspect 50 accepted examples.
### Hour 4–5: Train
- Run QLoRA on 0.5B–4B model.
- Save adapter.
- Track config.
### Hour 6: Evaluate
- Run dev eval.
- Compare against baseline.
- Inspect failures.
### Hour 7: Repair
- Generate targeted examples for top 3 failure clusters.
- Retrain or continue training.
### Hour 8: Export
- Merge adapter.
- Export GGUF.
- Run golden set through local runtime.
- Write model card.
---
## Final Rule
A small fine-tuned model wins when it is:
- Narrower than a general model.
- Cheaper than repeated frontier calls.
- Faster than cloud inference.
- More consistent than prompting.
- Evaluated more honestly than a demo.
- Maintained by a dataset factory.
Do not sell fine-tuning as magic.
Sell it as compilation:
> Prompting tells a model what to do at runtime. Fine-tuning compiles repeated behavior into a smaller model so it can do one job cheaply, quickly, and consistently.