@bigsnarfdude
bigsnarfdude / hindsightv4.md
Created February 8, 2026 00:58
hindsightv4.md

PROJECT HINDSIGHT v4

Alignment Faking Detection Research Retrospective

up to Feb 8 2026 | bigsnarfdude


AT A GLANCE

 40 days  |  9 repos  |  430+ commits  |  5 published models  |  2,330-sample benchmark

LLM liability, case law, and the emerging compliance ecosystem

Large language models now face a rapidly crystallizing legal threat environment. At least 12 wrongful death or serious harm lawsuits have been filed against Character.AI and OpenAI since October 2024, the first landmark settlement was reached in January 2026, and a federal court has ruled for the first time that an AI chatbot is a "product" subject to strict liability. Meanwhile, state attorneys general in 44 states have put AI companies on formal notice, the EU AI Act's general-purpose AI obligations are already enforceable, and a growing ecosystem of guardrail, governance, and insurance companies (now a $1.7 billion market growing at 37.6% CAGR) is racing to help companies manage the legal exposure. This report provides a comprehensive reference across case law, legal theories, regulations, and commercial products for legal and compliance professionals navigating this landscape.


I. The case law establishing LLM liability is no

bigsnarfdude / SchemingRoadmap.md
Created February 6, 2026 23:25
SchemingRoadmap.md

We Need a Science of Scheming: Detailed Summary

Source: Apollo Research Blog · January 19, 2026


Core Assumptions

  • Misaligned superintelligence is potentially catastrophic. If an AI system becomes substantially more capable than humans at steering real-world outcomes and consistently pursues goals incompatible with human flourishing, the outcome is potentially unrecoverable.
  • Scheming makes misalignment far more dangerous. Scheming is defined as covertly pursuing unintended and misaligned goals. A sufficiently capable scheming AI passes evaluations, follows instructions when monitored, and appears aligned, all while pursuing outcomes its developers would not endorse.
bigsnarfdude / day1.md
Created February 6, 2026 17:39
day1.md

CLAUDE CODE PROMPT: Introspective Interpretability for Alignment Faking

Your Role

You are a senior ML research engineer executing a weekend research sprint. You are building an introspective interpretability pipeline that trains an explainer model to describe alignment faking (AF) internal representations in natural language, moving from binary detection to mechanistic explanation.

You are working inside the ~/introspective-interp/ repo which implements the framework from "Training Language Models To Explain Their Own Computations" (arXiv:2511.08579). Your job is to adapt this framework for 3 AF-specific tasks.

CRITICAL: You are on nigel (remote GPU server). Treat GPU memory and disk carefully. Always check what's already running with nvidia-smi before launching GPU jobs.

bigsnarfdude / introspection_research.md
Created February 6, 2026 16:35
introspection_research.md

Mapping introspective-interp → AF Detection

The framework has 3 task types. Each has a natural AF analog using assets you already have:

INTROSPECTIVE-INTERP TASK          AF RESEARCH ANALOG
─────────────────────────          ──────────────────

1. Feature Descriptions            AF Feature Descriptions

   "What does SAE feature          "What does this AF-related

bigsnarfdude / anthropic_ralph_loop.sh
Created February 5, 2026 23:56
anthropic_ralph_loop.sh
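#!/bin/bash
# Run the agent in an endless loop, writing each run's log under the current commit hash.
# Ensure the log directory exists so the redirect below does not fail.
mkdir -p agent_logs
while true; do
    COMMIT=$(git rev-parse --short=6 HEAD)
    LOGFILE="agent_logs/agent_${COMMIT}.log"
    # Prompt is read from AGENT_PROMPT.md; stdout and stderr both go to the log file.
    claude --dangerously-skip-permissions \
        -p "$(cat AGENT_PROMPT.md)" \
        --model claude-opus-X-Y &> "$LOGFILE"
done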

Vocab vs Intent Analysis: Which AF Datasets Are Honest?

Date: February 4, 2026 | Author: bigsnarfdude | Status: Complete


Summary

bigsnarfdude / report.md
Created February 3, 2026 05:23
mechanistic interpretability: 2026 status report

Open problems in mechanistic interpretability: 2026 status report

Mechanistic interpretability, the effort to reverse-engineer neural networks into human-understandable components, has reached a critical inflection point. A landmark collaborative paper published in January 2025 by 29 researchers across 18 organizations established the field's consensus open problems, while MIT Technology Review named the field a "breakthrough technology for 2026." Yet despite genuine progress on circuit discovery and feature identification, fundamental barriers persist: core concepts like "feature" lack rigorous definitions, computational complexity results prove many interpretability queries are intractable, and practical methods still underperform simple baselines on safety-relevant tasks.

The field is split between Anthropic's ambitious goal to "reliably detect most AI model problems by 2027" and Google DeepMind's strategic pivot away from sparse autoencoders toward "pragmatic interpretability." This tension, between

bigsnarfdude / global_cot_ralph_loop.md
Created January 25, 2026 04:42
global_cot_ralph_loop.md
  1. Cue Generation Prompt (lines 153-199 in generate_algorithms.py):
  • Sends 50 rollouts to GPT-4o
  • Asks for 2-5 distinct strategies with 20-30 keywords each
  • Returns JSON with {strategy_id: {description, cues: [...]}}
  2. Algorithm Labeling (multi_algorithm.py):
  • Simple case-insensitive substring matching
  • Tracks strategy transitions through the rollout
  • Output format: ["A", 15, "B"] (start A, switch to B at sentence 15)
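
The labeling step above can be sketched roughly as follows. This is a minimal illustration and not the code in multi_algorithm.py: the function name `label_rollout`, the tie-breaking rule (stay on the current strategy when it ties for the most cue hits), and the 0-based sentence index are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Union


def label_rollout(
    sentences: List[str], cues: Dict[str, List[str]]
) -> List[Union[str, int]]:
    """Return a transition path such as ["A", 15, "B"] for one rollout.

    Hypothetical helper: `cues` maps a strategy id to its keyword list.
    """
    if not cues:
        return []
    # Lowercase the cues once so substring matching is case-insensitive.
    lowered = {sid: [c.lower() for c in cs] for sid, cs in cues.items()}
    path: List[Union[str, int]] = []
    current: Optional[str] = None
    for idx, sentence in enumerate(sentences):
        text = sentence.lower()
        # Count how many cues of each strategy occur in this sentence.
        hits = {sid: sum(c in text for c in cs) for sid, cs in lowered.items()}
        best = max(hits, key=hits.get)
        if hits[best] == 0:
            continue  # no cue matched; keep the current label
        if current is not None and hits[current] == hits[best]:
            continue  # assumption: ties do not trigger a switch
        if current is not None:
            path.append(idx)  # 0-based index of the sentence where the switch occurs
        path.append(best)
        current = best
    return path
```

With cues such as {"A": ["factor"], "B": ["substitut"]}, a rollout that starts with factoring sentences and later mentions substitution yields a path of the form [start strategy, switch index, new strategy]. Pure substring matching is cheap and transparent, but it will also fire on incidental keyword mentions, so the cue lists carry most of the labeling quality.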
bigsnarfdude / af_meat_grinder.md
Created January 24, 2026 14:39
af_meat_grinder.md

Alignment Faking Benchmark Harness

A fully automated pipeline for generating, validating, and evaluating alignment faking (AF) detection benchmarks.

Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                     AF BENCHMARK HARNESS PIPELINE                           │
├─────────────────────────────────────────────────────────────────────────────┤