be real: 10 GB is nothing in pretraining terms. it *can* move the needle for linux/bash—but only if you do it right (CPT + careful mixing), otherwise you just induce weird overfitting + forget general skills.
---
## what you actually want (afaict)
you’re describing **continued pretraining (CPT)** on top of something like Google DeepMind’s Gemma.
goal:
* inject *distributional bias* toward linux / bash / security
* not wreck general reasoning
---
## first: scale reality
10 GB text ≈
* ~2–3B tokens (depending on entropy + formatting)
compare:
* base models trained on **trillions** of tokens
→ so your data is like a *strong prior tilt*, not a rewrite
this is fine if:
* data is high signal (man pages, real scripts, incident reports)
* you don’t overtrain
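the bytes→tokens conversion is just division, but worth writing down. a sketch, assuming ~3.5–4.5 bytes per token for mixed prose + shell (shell-heavy text tokenizes less efficiently than plain english; your tokenizer will vary):

```python
# Back-of-envelope: corpus size in bytes -> rough token count.
# bytes_per_token is an assumption (~3.5-4.5 for mixed prose + code),
# not a property of any specific tokenizer.

def estimate_tokens(corpus_bytes: int, bytes_per_token: float = 4.0) -> int:
    return int(corpus_bytes / bytes_per_token)

corpus = 10 * 1024**3  # 10 GiB
low = estimate_tokens(corpus, bytes_per_token=4.5)   # pessimistic
high = estimate_tokens(corpus, bytes_per_token=3.5)  # optimistic
print(f"{low / 1e9:.1f}B - {high / 1e9:.1f}B tokens")
```

which lands right in the 2–3B range above.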
---
## 3 viable approaches (ranked)
### 1) CPT (what you asked)
keep training LM objective on your corpus
**BUT**
you *must* mix in general data (5–20%) or you’ll degrade
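the simplest way to get that mix is sample-level interleaving. a minimal sketch — `domain_docs` and `general_docs` are hypothetical iterables of raw text, and `mix_ratio=0.9` encodes the 80–95% / 5–20% split:

```python
import random

# Sketch: interleave domain CPT data with general text at a fixed ratio.
# mix_ratio=0.9 -> ~90% domain docs, ~10% general docs per sample drawn.

def mixed_stream(domain_docs, general_docs, mix_ratio=0.9, seed=0):
    rng = random.Random(seed)
    domain, general = iter(domain_docs), iter(general_docs)
    while True:
        src = domain if rng.random() < mix_ratio else general
        try:
            yield src.__next__()
        except StopIteration:
            return  # stop once either stream runs dry
```

feed this into your packing/tokenization step; real pipelines usually do the equivalent with dataset weights in the loader config rather than a hand-rolled generator.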
---
### 2) CPT → SFT (better)
1. CPT on raw corpus
2. SFT on curated linux tasks (commands, debugging, etc.)
this actually gives you usable behavior
---
### 3) skip CPT, do SFT only (often enough)
if your goal is:
* “write bash”
* “debug linux issues”
you can get 80% there with SFT + good dataset
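for reference, one record of such a dataset in the common prompt/response JSONL shape. the field names and the example task are illustrative conventions, not anything Gemma requires — map them onto your trainer's chat template:

```python
import json

# Sketch: one hypothetical linux SFT record in prompt/response JSONL form.
record = {
    "prompt": "A cron job logs 'Permission denied'. "
              "Give the commands you'd run to diagnose it.",
    "response": (
        "ls -l /path/to/script   # check owner + exec bit\n"
        "grep CRON /var/log/syslog   # confirm which job and user fired"
    ),
}
line = json.dumps(record)  # one record per line in the .jsonl file
```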
---
## hardware (real numbers)
assume:
* Gemma 7B, or a ~13B-class model (Gemma v1 tops out at 7B; Gemma 2 offers 9B/27B)
* sequence length 2k–4k
* bf16
---
### minimal viable (scrappy, works)
**Gemma 7B CPT**
* 1–2× A100 80GB *or*
* 2–4× H100
* or even 1× H100 w/ gradient checkpointing
**throughput**
* ~50–150k tokens/sec depending on setup
**time**
* 2–3B tokens → roughly 4–17 hours at those rates
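the wall-clock estimate is worth sanity-checking yourself — it's just tokens / throughput. using the rough throughput range above (not measured numbers):

```python
# Wall-clock estimate for a CPT run: hours = tokens / throughput / 3600.
def train_hours(tokens: float, tokens_per_sec: float) -> float:
    return tokens / tokens_per_sec / 3600

best = train_hours(2e9, 150e3)   # 2B tokens, fast setup
worst = train_hours(3e9, 50e3)   # 3B tokens, slow setup
print(f"{best:.1f}h - {worst:.1f}h")
```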
---
### sane setup (what you’d actually run)
**Gemma 7B**
* 4× H100 or B200 (since you’re already playing there)
* FSDP or DeepSpeed ZeRO-3
→ finishes in a few hours, stable
---
### if you go 13B+
* 8× H100/B200 preferred
* or accept slow training + more instability
---
## config that matters (people screw this up)
**learning rate**
* tiny: `1e-5 – 5e-5`
* you are nudging, not relearning
**epochs**
* 1–2 max
* more = overfit + catastrophic forgetting
**mixing**
* 80–95% your corpus
* 5–20% general text
**tokenization**
* don’t mess this up
* bash/code spacing matters a lot
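the knobs above, collected into one config sketch. key names follow Hugging Face `Trainer` conventions as an assumption — translate to whatever launcher you actually use:

```python
# Sketch: the CPT hyperparameters that actually matter, as a config dict.
# Names mirror HF Trainer conventions; values follow the guidance above.
cpt_config = {
    "learning_rate": 2e-5,            # tiny: nudge, don't relearn
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.01,
    "num_train_epochs": 1,            # 1-2 max; more = forgetting
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "bf16": True,
    "gradient_checkpointing": True,   # trades compute for memory
    "max_seq_length": 4096,
    "domain_mix_ratio": 0.9,          # 90% your corpus / 10% general text
}
```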
---
## data matters more than compute
your 10GB should be:
**good:**
* man pages
* real shell histories
* infra repos
* incident writeups
* security advisories (CVE writeups)
**bad:**
* random scraped tutorials
* SEO garbage
* duplicate stackoverflow spam
dedupe aggressively or you’re just memorizing noise
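a minimal exact-dedup sketch — hash normalized text, keep first copy. real pipelines add fuzzy dedup (MinHash/LSH) on top, but even this catches the verbatim stackoverflow-mirror copies that dominate scraped corpora:

```python
import hashlib

# Sketch: exact-dedup a doc stream by hashing whitespace/case-normalized text.
def dedupe(docs):
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["ls -la  shows hidden files", "ls -la shows HIDDEN files", "chmod +x run.sh"]
print(dedupe(docs))  # whitespace/case variants collapse to one copy
```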
---
## what “better at linux” actually means
CPT improves:
* command recall
* syntax priors
* familiarity with flags / patterns
it does **NOT reliably improve**:
* multi-step reasoning
* debugging chains
→ that’s SFT / agent loop territory
---
## the real play (imo)
if i were you given your infra + B200 experiments:
* do **light CPT (1 epoch)** on your 10GB
* then build a **synthetic SFT dataset**:
  * prompt: “fix this broken bash script”
  * output: corrected script + explanation
* then run LoRA SFT
→ this compounds way harder than CPT alone
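one cheap way to bootstrap those pairs: mutate known-good scripts into broken ones, keep the original as the target. the mutation list here is illustrative — a real pipeline would have an LLM break + explain, then verify both versions by execution:

```python
import random

# Sketch: auto-generate "fix this broken bash" SFT pairs by mutating
# known-good scripts. Mutations are illustrative, not exhaustive.
MUTATIONS = [
    lambda s: s.replace('"$', '$', 1),   # drop quoting -> word-splitting bug
    lambda s: s.replace("-eq", "=", 1),  # wrong test operator for integers
    lambda s: s.replace("fi", "", 1),    # unterminated if block
]

def make_pair(good_script: str, rng: random.Random):
    broken = rng.choice(MUTATIONS)(good_script)
    return {"prompt": f"fix this broken bash script:\n{broken}",
            "response": good_script}

rng = random.Random(42)
good = 'if [ "$1" -eq 0 ]; then echo "zero"; fi'
pair = make_pair(good, rng)
```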
---
## blunt take
most people try to “add knowledge” via CPT and end up:
* slightly worse general model
* marginal gains in domain
the edge is:
* tight eval loop (bash execution sandbox)
* generate → run → score → fine-tune again
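the scoring step of that loop can be sketched with `subprocess` — run the candidate in a throwaway dir with a timeout, then check the intended effect. (a real sandbox needs containers/seccomp, not just a temp cwd; this assumes `bash` is on the host.)

```python
import subprocess
import tempfile

# Sketch: score a candidate bash script by executing it in a temp dir
# and checking a postcondition. NOT a real sandbox -- isolation is minimal.
def score_bash(script: str, check: str) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        def run(cmd):
            return subprocess.run(["bash", "-c", cmd], cwd=tmp,
                                  capture_output=True, timeout=5)
        if run(script).returncode != 0:
            return False            # script itself crashed
        return run(check).returncode == 0  # did it have the intended effect?

ok = score_bash("touch out.txt", check="test -f out.txt")
print(ok)  # True
```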
---
if you want, i can sketch:
* exact DeepSpeed config
* token/sec estimates on your B200 nodes
* or how to auto-generate 100k high-quality bash training pairs (that’s where the alpha is)