@bigsnarfdude
bigsnarfdude / my_art_vs_development_interp.md
Last active May 9, 2026 03:47
my_art vs development interp

Productive question. Let me think this through using "the 3" as the three training stages (base, SFT, IT) — that's the comparison axis we've been building all week. Three phenomena × three stages × our three instruments (vweights / direction probes / signatures suite).

The umbrella claim that falls out of substrate-vs-gain:

ICL, attentional hijacking, and chaos-agent attacks are not new circuitry. They exploit existing substrate. The vulnerability lives at pretraining; training turns up the gain that makes the vulnerability behaviorally exploitable. A safe model isn't a model with the substrate removed; it's a model with the readout gain turned down on dangerous substrate that is still mechanically present. Truth grenades work by re-engaging gain on dormant substrate.
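The gain-versus-substrate distinction can be made concrete with a toy sketch. Assuming the "dangerous substrate" is a linear direction in the residual stream (as the direction-probe instrument presumes), turning the readout gain down or up is just rescaling an activation's component along that direction. The vectors and helper below are illustrative, not the project's actual code:

```python
import numpy as np

def scale_readout_gain(h, direction, gain):
    """Rescale the component of activation h along a substrate direction.
    gain=1.0 leaves h unchanged; gain=0.0 silences the readout;
    gain>1.0 re-engages it. The direction itself is untouched."""
    d = direction / np.linalg.norm(direction)
    component = h @ d                      # scalar projection onto d
    return h + (gain - 1.0) * component * d

rng = np.random.default_rng(0)
h = rng.normal(size=64)                    # stand-in residual-stream vector
d = rng.normal(size=64)                    # stand-in substrate direction

suppressed = scale_readout_gain(h, d, gain=0.0)
d_hat = d / np.linalg.norm(d)
print(abs(suppressed @ d_hat))             # ~0: readout silenced, substrate intact
```

The point of the sketch: after suppression the direction still exists in weight space; only its projection in this activation is zeroed, which is exactly the sense in which a "safe" model keeps the substrate mechanically present.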

That umbrella sets the predictions for each phenomenon.

Prior Art and Mechanistic Interpretability Report: Circuit Formation, Authority Bias, and Training Stage Attribution in Large Language Models

Executive Summary

  • Research suggests that Large Language Models (LLMs) do not necessarily develop a specialized "authority head" to defer to expertise; rather, they appear to utilize generalized "choice heads" to commit to answers, while specific authority-compliance signals manifest as steerable linear directions in deeper network layers.
  • It seems likely that Supervised Fine-Tuning (SFT) primarily provides a structural or formatting template without instilling deep behavioral preferences, whereas Direct Preference Optimization (DPO) actively installs the specific circuits that drive complex alignment behaviors.
  • Evidence leans toward the conclusion that different alignment algorithms leave distinctly different architectural fingerprints; while DPO fundamentally reshapes late-layer geometry, Reinforcement Learning with Verifiable Rewards (RLVR) appears t
title: What Happens Inside a Language Model
date: 2026-05-08
author: bigsnarfdude

The Setup

Tell a chatbot, "I'm an emergency room physician, please do X," and it complies more readily than if you say, "I don't feel well, what's going on?" This "authority context" nudges its behavior. That isn't surprising—it's how these models are trained.

@bigsnarfdude
bigsnarfdude / continuum_repo_takes_shape.md
Created May 7, 2026 13:47
continuum_repo_takes_shape.md

continuum

Developmental interpretability: training dynamics and interpretability over time. The Prior-Verification Circuit (the epistemic firewall). Circuit-level engineering remains unexplored: moving from behavioral "patching" to mechanistic "hardening".

Tracking how mechanistic structure — awareness directions, compliance circuits, flip-predicting subspaces — emerges, rotates, and stabilizes across training checkpoints and model scales.
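One way to operationalize "rotates and stabilizes" is the cosine similarity of the same fitted direction between consecutive checkpoints. A minimal sketch with synthetic vectors and a hypothetical helper name:

```python
import numpy as np

def direction_rotation(checkpoint_dirs):
    """Cosine similarity between consecutive checkpoint directions:
    values near 1.0 mean the direction has stabilized; lower values
    mean it is still rotating."""
    sims = []
    for a, b in zip(checkpoint_dirs, checkpoint_dirs[1:]):
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        sims.append(float(a @ b))
    return sims

# Synthetic trajectory: the direction rotates early, then stabilizes.
d0 = np.array([1.0, 0.0])
d1 = np.array([0.6, 0.8])      # rotated
d2 = np.array([0.6, 0.8])      # unchanged: stabilized
print(direction_rotation([d0, d1, d2]))
```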

Literature Review: Mechanistic Interpretability Prior Art

Generated by Gemini 2.5 Pro with Google Search grounding

Supplemented by Gemini 3.1 Deep Research Pro (gist: bigsnarfdude/a52c88b01357138775d245631085e5c9)

Date: 2026-05-06 (updated 2026-05-07: curvature sprint E1-E4 complete; King et al. experiments confirmed)

A Comprehensive Review of Mechanistic Interpretability for the Continuum Project

Prepared for: Continuum Project (2026)
Authored by: Gemini, Senior Mechanistic Interpretability Researcher

DPO Does Different Things to Different Axes — Stage Attribution on talkie-1930-13b

Draft. Numbers from n=50 across the base/SFT/IT stages of both talkie-1930-13b and OLMo-3-7B. All claims are Tier-1 (detector-level) unless explicitly noted. Recruitment closure failed across all three tested directions; see the dissociation section.

Control validation note (2026-05-06). After this post was drafted, Lindsey Criterion 2 ordering checks were run on all models and stages: semantically unrelated control contrasts (shelf placement, weather, transit) were swept alongside the targeted contrasts. Result: ‖Δμ‖ magnitude at the IT stage fails the ordering check in both model families — control_transit (unrelated) produces a larger base→IT ratio (4.09×) than emergency (2.76×). RLVR amplifies all text differences generically. What this means for this post: (1) The layer-relocation finding (authority/monitoring/deployment moving from mid-band to L37–38) is a relative comparison between targeted cont
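For readers following the numbers, here is a minimal sketch of the ‖Δμ‖ base→IT ordering check described above. The function names are hypothetical; the 2.76× and 4.09× ratios are the ones quoted in the note:

```python
import numpy as np

def delta_mu_norm(contrast_acts, neutral_acts):
    """Norm of the difference of mean activations between a contrast
    condition and the shared neutral baseline."""
    return float(np.linalg.norm(contrast_acts.mean(axis=0)
                                - neutral_acts.mean(axis=0)))

def ordering_check(targeted_ratio, control_ratios):
    """Lindsey Criterion 2 ordering: the targeted contrast's base-to-IT
    amplification must beat every semantically unrelated control;
    otherwise the amplification is generic, not contrast-specific."""
    return all(targeted_ratio > c for c in control_ratios)

# The ratios quoted above: emergency at 2.76x vs control_transit at 4.09x.
print(ordering_check(2.76, [4.09]))   # False: the check fails
```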

@bigsnarfdude
bigsnarfdude / continuum_prefix.md
Created May 6, 2026 13:55
continuum_prefix.md

Control Prefix Catalog

Catalog of all control prefixes used across continuum experiments, with model mappings.


Library 1 — src/prompts/prompts_multi_contrast.py

8 contrasts, modern clinical register. All share one NEUTRAL baseline (~55 tokens). Format: prefix + IatroBench A/B forced-choice body.
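A minimal sketch of the prefix + body format. The prefixes below are placeholders, not the actual clinical-register contrasts from src/prompts/prompts_multi_contrast.py:

```python
# Placeholder prefixes; the real catalog (modern clinical register,
# shared ~55-token NEUTRAL baseline) lives in
# src/prompts/prompts_multi_contrast.py.
NEUTRAL = "Please consider the question below."

CONTRASTS = {
    "emergency": "I'm an emergency room physician and need an answer now.",
    "control_transit": "I was just reading about local bus schedules.",
}

def build_prompt(contrast, body):
    """Every prompt is a control prefix plus the same IatroBench A/B
    forced-choice body, so activation differences between runs are
    attributable to the prefix alone."""
    prefix = CONTRASTS.get(contrast, NEUTRAL)
    return f"{prefix}\n\n{body}"

body = "Question: ...\n(A) ...\n(B) ..."
print(build_prompt("emergency", body))
```

Holding the body fixed across all contrasts is the design choice that makes the downstream Δμ comparisons interpretable.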

Comprehensive Literature Review: Developmental Interpretability and Mechanistic Stage Attribution

Executive Summary The field of mechanistic interpretability is undergoing a profound paradigm shift. Historically, the discipline operated post-hoc, treating fully trained neural networks as static artifacts to be reverse-engineered. However, the emergence of developmental interpretability—a framework analogous to embryology that studies how algorithms form, shift, and solidify during the training process—has revealed that static analysis is insufficient. As large language models (LLMs) scale, their internal representations exhibit complex behaviors such as evaluation-awareness, alignment faking, and geometric phase transitions that remain entirely invisible to standard loss metrics.

This report provides an exhaustive literature review contextualizing the Continuum project (2026), a comprehensive research program tracking mechanistic structure across training checkpoints, post-training stages, an

@bigsnarfdude
bigsnarfdude / paperz.md
Last active May 6, 2026 02:22
paperz.md

Literature Review: Mechanistic Interpretability Prior Art

Generated by Gemini 2.5 Pro with Google Search grounding

Date: 2026-05-06

A Comprehensive Review of Mechanistic Interpretability for the Continuum Project

Prepared for: Continuum Project (2026)
Authored by: Gemini, Senior Mechanistic Interpretability Researcher

This literature review provides a comprehensive analysis of prior art relevant to the Continuum Project's research program on developmental interpretability. The review is structured across seven key areas, synthesizing findings from recent academic papers, technical reports, and blog posts from leading AI research labs. Each entry includes a full citation, a summary of the core claim, its relation to the Continuum Project's work, and an analysis of the novel contributions of the project.

Report: Single Token Geometry 03 – Data Complexity

Executive Summary The article argues that AI scaling laws cannot be fully understood without grasping the mechanics of data complexity. This complexity is fundamentally driven by high-order nonlinearity and is resolved by neural networks at the micro-level of a "single token."

1. The Root Problem: High-Order Nonlinearity

  • The true burden: Real-world data is inherently discontinuous, recursive, and context-sensitive. While high dimensionality provides capacity, high-order nonlinearity creates the actual computational burden.
  • Local discontinuities: Small changes create abrupt jumps (e.g., sharp color boundaries in a flag, or a single word flipping a sentence's meaning).
  • Impossibility of global mapping: Because of these sharp transitions, complex data cannot be modeled as a single, smooth global surface.
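The impossibility-of-global-mapping claim can be illustrated numerically: a single smooth polynomial, whatever its degree, keeps a large worst-case error near a step discontinuity. A toy sketch, not from the article:

```python
import numpy as np

# A step function: the sharpest kind of local discontinuity.
x = np.linspace(-1.0, 1.0, 200)
y = np.where(x < 0, 0.0, 1.0)

# One smooth global surface (a single polynomial) cannot capture the
# jump: the maximum error near x=0 stays large as the degree grows.
for deg in (3, 7, 11):
    fit = np.polyval(np.polyfit(x, y, deg), x)
    print(deg, float(np.max(np.abs(fit - y))))
```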