This doc helps you decide when and how to include edge cases or rare patterns in your fine-tuning dataset.
Benefits of starting with consistent data only:
- Easier to debug
- Faster convergence
- Avoids overfitting to rare structures
When to start with consistent data only:
- Dataset < 10k examples
- Narrow domain or early in development
- You need rapid iteration with interpretable results
flowchart TD
A[Start Fine-Tuning] --> B[Use Only Consistent Data]
B --> C{Dataset Stable?}
C -- Yes --> D[Evaluate and Analyze Failures]
C -- No --> B
Benefits of adding edge cases:
- Avoid model blind spots
- Improve real-world robustness
- Generalize to fuzzier phrasing or fringe inputs
When to add edge cases:
- After a first stable baseline
- Dataset is larger (> 50k examples)
- You're preparing for production use
- You have known critical failure modes (e.g. medical, legal, fintech)
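The size thresholds and conditions listed in this guide can be read as a rough decision rule. The sketch below is one possible encoding, not a definitive policy: the function name, its parameters, and the `"judgment_call"` label for the unspecified 10k to 50k range are assumptions made for illustration.

```python
def recommend_strategy(n_examples, has_stable_baseline, critical_failure_modes=False):
    """Return a rough data strategy based on the guide's rules of thumb.

    All names here are hypothetical; the guide gives no guidance for
    datasets between 10k and 50k examples, so that range is flagged.
    """
    # Edge cases come only after a first stable baseline exists.
    if not has_stable_baseline:
        return "consistent_only"
    # Known critical failure modes or a large dataset justify edge cases.
    if critical_failure_modes or n_examples > 50_000:
        return "add_edge_cases"
    # Small datasets stay on consistent data for faster, clearer iteration.
    if n_examples < 10_000:
        return "consistent_only"
    # The guide leaves the middle range open.
    return "judgment_call"


print(recommend_strategy(5_000, has_stable_baseline=False))
print(recommend_strategy(80_000, has_stable_baseline=True))
```

Treat the output as a starting point for discussion; narrow domains or production deadlines may shift the decision either way.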
flowchart TD
D[Stable Model] --> E[Introduce Edge Cases]
E --> F[Expand to Ambiguous/Complex Inputs]
F --> G[Measure Improvement & Robustness]
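The last step of the flow, measuring improvement and robustness, can be approximated by comparing accuracy per eval split before and after edge cases are introduced. The following is a minimal sketch: the split names (`core`, `edge`) and the boolean pass/fail representation are assumptions, and any real evaluation would use your own metrics.

```python
def accuracy(results):
    """Fraction of True values in a list of pass/fail results."""
    return sum(results) / len(results) if results else 0.0


def robustness_report(before, after):
    """Compare per-split accuracy between two runs.

    `before` and `after` map a split name to a list of booleans
    (True = output judged correct). Both structures are hypothetical.
    """
    report = {}
    for split in before:
        b = accuracy(before[split])
        a = accuracy(after.get(split, []))
        report[split] = {"before": b, "after": a, "delta": a - b}
    return report


before = {"core": [True, True, True, False], "edge": [False, False, True, False]}
after = {"core": [True, True, True, False], "edge": [True, False, True, True]}
report = robustness_report(before, after)
for split, stats in report.items():
    print(split, stats)
```

Watching the `core` delta alongside the `edge` delta helps catch the case where robustness gains come at the cost of regressions on common inputs.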
- Phase 1: High-consistency examples only
- Phase 2: Add minor variation (synonyms, rewordings)
- Phase 3: Inject edge cases and ambiguity progressively
graph LR
A1[Phase 1: Consistency] --> A2[Phase 2: Variation]
A2 --> A3[Phase 3: Edge Robustness]
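The three phases can be sketched as a filter over tagged examples. This assumes each example carries a `kind` field with one of `consistent`, `variation`, or `edge`; that field, the function name, and the `edge_fraction` knob for progressive injection are all hypothetical.

```python
import random

# Which example kinds each phase may draw from (assumed tagging scheme).
PHASE_KINDS = {
    1: {"consistent"},
    2: {"consistent", "variation"},
    3: {"consistent", "variation", "edge"},
}


def build_phase_dataset(examples, phase, edge_fraction=1.0, seed=0):
    """Select the examples allowed in a phase.

    In phase 3, `edge_fraction` controls how much of the edge-case pool
    is injected, so it can be raised gradually across training rounds.
    """
    allowed = PHASE_KINDS[phase]
    selected = [ex for ex in examples if ex["kind"] in allowed and ex["kind"] != "edge"]
    if "edge" in allowed:
        edge = [ex for ex in examples if ex["kind"] == "edge"]
        random.Random(seed).shuffle(edge)  # seeded for reproducible sampling
        selected.extend(edge[: int(len(edge) * edge_fraction)])
    return selected


examples = (
    [{"kind": "consistent", "prompt": f"c{i}"} for i in range(2)]
    + [{"kind": "variation", "prompt": "v0"}]
    + [{"kind": "edge", "prompt": f"e{i}"} for i in range(4)]
)
print(len(build_phase_dataset(examples, phase=1)))
print(len(build_phase_dataset(examples, phase=3, edge_fraction=0.5)))
```

Raising `edge_fraction` in small steps (for example 0.25, 0.5, 1.0) is one way to make the "progressively" in Phase 3 concrete.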
| Criteria | Score (0–3) | Notes |
|---|---|---|
| Is it common in production? | | |
| Would failure be critical? | | e.g. medical, safety, legal |
| Does it teach generalization? | | |
| Is the model currently failing? | | Based on logs/tests |
| Can it be grouped & scaled? | | Part of a broader pattern? |
➡️ Include if total score ≥ 8
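The rubric above can be totaled mechanically. The sketch below takes one 0–3 score per criterion and applies the ≥ 8 threshold; the criterion keys and function names are illustrative assumptions, not part of any existing tool.

```python
# Hypothetical keys, one per rubric row.
CRITERIA = (
    "common_in_production",
    "failure_critical",
    "teaches_generalization",
    "model_currently_failing",
    "groupable_and_scalable",
)
INCLUDE_THRESHOLD = 8  # from the rubric: include if total score >= 8


def rubric_total(scores):
    """Sum the 0-3 scores for all criteria; missing criteria count as 0."""
    total = 0
    for name in CRITERIA:
        value = scores.get(name, 0)
        if not 0 <= value <= 3:
            raise ValueError(f"{name} must be between 0 and 3, got {value}")
        total += value
    return total


def should_include(scores):
    """Apply the inclusion threshold to a scored edge case."""
    return rubric_total(scores) >= INCLUDE_THRESHOLD


candidate = {
    "common_in_production": 2,
    "failure_critical": 3,
    "teaches_generalization": 1,
    "model_currently_failing": 2,
    "groupable_and_scalable": 1,
}
print(rubric_total(candidate), should_include(candidate))
```

With five criteria the maximum total is 15, so the threshold of 8 requires an edge case to score above the midpoint overall.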
What to leave out of the training set for now:
- Ultra-rare phrasing with no production impact
- Contrived or noisy edge cases with little structure
- Very ambiguous prompts without clear resolution
These can be added later for robustness testing or eval splits.
import json

# Simplified yes/no version of the rubric: each flag that is set adds a
# fixed weight (maximum total 10) instead of a 0-3 score per criterion.
RULES = [
    ("common_in_production", 2, "Seen in prod"),
    ("failure_is_critical", 3, "Critical failure"),
    ("teaches_generalization", 2, "Good generalization"),
    ("model_fails_here", 2, "Current weak spot"),
    ("is_scalable_pattern", 1, "Scalable pattern"),
]


def score_example(example):
    """Return (score, notes) for one dataset item."""
    score = 0
    notes = []
    for key, weight, note in RULES:
        if example.get(key):
            score += weight
            notes.append(note)
    return score, notes


# Expects one JSON object per line, each with a "prompt" field.
with open("dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines instead of failing in json.loads
        item = json.loads(line)
        score, notes = score_example(item)
        decision = "INCLUDE" if score >= 8 else "HOLD"
        prompt = item.get("prompt", "<no prompt>")
        print(f"{decision} ({score}): {prompt} // Notes: {', '.join(notes)}")
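Examples that fall below the threshold need not be discarded; as noted earlier, they can be kept for robustness testing or eval splits. The sketch below routes items into two lists instead of printing them. It assumes a scoring function is passed in; the `score` field used in the demo, the function names, and the output file names are hypothetical.

```python
import json


def split_by_score(items, score_fn, threshold=8):
    """Partition items into (train, held) using a caller-supplied scorer."""
    train, held = [], []
    for item in items:
        (train if score_fn(item) >= threshold else held).append(item)
    return train, held


def write_jsonl(path, items):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


# Demo with precomputed scores (an assumption made for brevity).
items = [
    {"prompt": "refund for a cancelled order", "score": 9},
    {"prompt": "refund in a currency we do not support", "score": 5},
]
train, held = split_by_score(items, score_fn=lambda it: it["score"])
print(len(train), "for training,", len(held), "held for eval")

# To persist the split (file names are placeholders):
# write_jsonl("train_edge_cases.jsonl", train)
# write_jsonl("eval_edge_cases.jsonl", held)
```

Keeping held examples in a separate file makes it easy to revisit them once the model has a stable baseline.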
First, teach the model what "good" looks like. Then, stretch it toward robustness.
Consistency → Variation → Edge Robustness.