Gist by florianleibert, created April 13, 2026 18:00
be real: 10 GB is nothing in pretraining terms. it *can* move the needle for linux/bash, but only if you do it right (CPT + careful mixing); otherwise you just induce weird overfitting + forget general skills.

---
## what you actually want (afaict)

you're describing **continued pretraining (CPT)** on top of something like Google DeepMind's Gemma.

goal:

* inject *distributional bias* toward linux / bash / security
* not wreck general reasoning

---
## first: scale reality

10 GB of text ≈ ~2–3B tokens (depending on formatting and how efficiently the tokenizer compresses it).

compare: base models are trained on **trillions** of tokens.

→ so your data is a *strong prior tilt*, not a rewrite.
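back-of-envelope for that estimate (the bytes-per-token range is an assumption for mixed English/code text; measure with the actual Gemma tokenizer on a sample of your corpus for a real number):

```python
# rough bytes -> tokens conversion for 10 GB of text
# (3.5-4.5 bytes/token is an assumed range, not a measured value)
GB = 10
bytes_total = GB * 1024**3

for bytes_per_token in (3.5, 4.0, 4.5):
    tokens = bytes_total / bytes_per_token
    print(f"{bytes_per_token} B/tok -> {tokens / 1e9:.2f}B tokens")
```

at ~4 bytes/token this lands around 2.7B tokens, which is where the 2–3B figure comes from.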
this is fine if:

* the data is high-signal (man pages, real scripts, incident reports)
* you don't overtrain

---
## 3 viable approaches (ranked)

### 1) CPT (what you asked about)

keep training with the LM objective on your corpus.

**BUT** you *must* mix in general data (5–20%) or you'll degrade general capability.

---
### 2) CPT → SFT (better)

1. CPT on the raw corpus
2. SFT on curated linux tasks (commands, debugging, etc.)

this actually gives you usable behavior.

---
### 3) skip CPT, do SFT only (often enough)

if your goal is:

* "write bash"
* "debug linux issues"

you can get 80% of the way there with SFT + a good dataset.
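a sketch of what one curated SFT example might look like, in a generic `{"prompt", "response"}` JSONL shape (field names and file name are assumptions; match whatever your SFT trainer expects):

```python
import json

# hypothetical linux SFT examples; the schema is illustrative, not from
# any specific trainer
examples = [
    {
        "prompt": "This cron job silently fails:\n* * * * * backup.sh\nWhy?",
        "response": "cron runs with a minimal PATH and no login shell; use "
                    "an absolute path to the script and redirect stderr to "
                    "a log file so failures become visible.",
    },
    {
        "prompt": "Find all files >100MB under /var, sorted by size.",
        "response": "find /var -type f -size +100M -exec du -h {} + | sort -rh",
    },
]

with open("sft_linux.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

the point is the pairing: concrete task in, concrete fix + (where useful) a short explanation out.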
---

## hardware (real numbers)

assume:

* Gemma 7B (Gemma v1 ships 2B/7B; read "13B" below as a comparable model from another family)
* sequence length 2k–4k
* bf16

---
### minimal viable (scrappy, works)

**Gemma 7B CPT**

* 1–2× A100 80GB, *or*
* 2–4× H100, *or*
* even 1× H100 w/ gradient checkpointing

**throughput**

* ~50–150k tokens/sec depending on setup

**time**

* 2–3B tokens → ~6–24 hours
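the wall-clock figure is just tokens ÷ throughput (throughput values are the assumed range above; benchmark your own):

```python
# wall-clock estimate for one pass over the corpus
tokens = 2.5e9  # midpoint of the 2-3B estimate

for tok_per_sec in (50_000, 150_000):
    hours = tokens / tok_per_sec / 3600
    print(f"{tok_per_sec:>7} tok/s -> {hours:.1f} h")
```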
---

### sane setup (what you'd actually run)

**Gemma 7B**

* 4× H100 or B200 (since you're already playing there)
* FSDP or DeepSpeed ZeRO-3

→ finishes in a few hours, stable
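a minimal ZeRO-3 config sketch to start from (batch size, accumulation, and clipping values are placeholder assumptions; tune them to your GPU count and sequence length):

```python
import json

# starting-point DeepSpeed ZeRO-3 config; values are illustrative defaults,
# not tuned numbers
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 16,
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("ds_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```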
---

### if you go 13B+

* 8× H100/B200 preferred
* or accept slow training + more instability

---
## config that matters (people screw this up)

**learning rate**

* tiny: `1e-5`–`5e-5`
* you are nudging, not relearning

**epochs**

* 1–2 max
* more = overfitting + catastrophic forgetting

**mixing**

* 80–95% your corpus
* 5–20% general text

**tokenization**

* don't mess this up
* whitespace in bash/code matters a lot; check that scripts survive a tokenize → detokenize round trip
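one simple way to hit the mixing ratio is sample-level interleaving of the two streams; a toy sketch (function and stream names are illustrative, not from any library):

```python
import random
from itertools import islice

def mixed_stream(domain, general, domain_frac=0.9, seed=0):
    """Yield samples, drawing from `domain` with probability domain_frac."""
    rng = random.Random(seed)
    d, g = iter(domain), iter(general)
    while True:
        src = d if rng.random() < domain_frac else g
        try:
            yield next(src)
        except StopIteration:
            return  # stop when either stream runs dry

# tag each sample with its source so the ratio is checkable
domain = (("domain", i) for i in range(10**6))
general = (("general", i) for i in range(10**6))
batch = list(islice(mixed_stream(domain, general), 1000))
```

in practice you'd weight at the dataset level in your data loader, but the ratio logic is the same.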
---

## data matters more than compute

your 10 GB should be:

good:

* man pages
* real shell histories
* infra repos
* incident writeups
* security advisories (CVE writeups)

bad:

* random scraped tutorials
* SEO garbage
* duplicate stackoverflow spam

dedupe aggressively or you're just memorizing noise.
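a minimal exact-dedupe sketch (hash of whitespace-normalized text; for near-duplicates you'd want MinHash/LSH on top, e.g. via a library like datasketch):

```python
import hashlib

def dedupe(docs):
    """Drop documents whose whitespace-normalized content was seen before."""
    seen, out = set(), []
    for doc in docs:
        normalized = " ".join(doc.split())
        h = hashlib.sha256(normalized.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(doc)
    return out

docs = ["ls -la  # list", "ls  -la   # list", "grep -r TODO ."]
unique = dedupe(docs)  # whitespace-only variants collapse to one
```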
---

## what "better at linux" actually means

CPT improves:

* command recall
* syntax priors
* familiarity with flags / patterns

it does **NOT reliably improve**:

* multi-step reasoning
* debugging chains

→ that's SFT / agent-loop territory

---
## the real play (imo)

if i were you, given your infra + B200 experiments:

* do **light CPT (1 epoch)** on your 10 GB
* then build a **synthetic SFT dataset**:
  * prompt: "fix this broken bash script"
  * output: corrected script + explanation
* then run LoRA SFT

→ this compounds way harder than CPT alone
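one cheap way to bootstrap those pairs without an LLM in the loop: corrupt known-good scripts programmatically and use the original as the target. toy sketch (the corruption rules here are illustrative; real ones would cover quoting, globbing, exit-code handling, etc.):

```python
import random

# each rule takes a correct script and returns a broken variant
CORRUPTIONS = [
    lambda s: s.replace("$(", "$ (", 1),  # break command substitution
    lambda s: s.replace('"', "", 1),      # drop a quote
    lambda s: s.replace("fi", "f1", 1),   # typo a keyword
]

def make_pair(good_script, rng):
    broken = rng.choice(CORRUPTIONS)(good_script)
    return {
        "prompt": f"fix this broken bash script:\n{broken}",
        "response": good_script,
    }

rng = random.Random(42)
good = 'if [ -f "$1" ]; then echo "$(wc -l < "$1") lines"; fi'
pair = make_pair(good, rng)
```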
---

## blunt take

most people try to "add knowledge" via CPT and end up with:

* a slightly worse general model
* marginal gains in the domain

the edge is:

* a tight eval loop (bash execution sandbox)
* generate → run → score → fine-tune again
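the loop's scoring step is just execute-with-timeout + compare output; a toy skeleton (a real sandbox would add containers/seccomp and resource limits on top, and `score` here is a hypothetical name):

```python
import subprocess

def score(candidate_script, expected_stdout, timeout=5):
    """Run a candidate bash script; 1.0 if it exits 0 with expected stdout."""
    try:
        r = subprocess.run(
            ["bash", "-c", candidate_script],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    if r.returncode != 0:
        return 0.0
    return 1.0 if r.stdout.strip() == expected_stdout else 0.0

print(score("seq 1 3 | tail -1", "3"))  # → 1.0
print(score("exit 1", "3"))             # → 0.0
```

scores like this feed straight back into the next fine-tuning round as filters or reward signals.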
---

if you want, i can sketch:

* an exact DeepSpeed config
* token/sec estimates on your B200 nodes
* or how to auto-generate 100k high-quality bash training pairs (that's where the alpha is)