BigsnarfDude bigsnarfdude

@bigsnarfdude
bigsnarfdude / LieDetectionRFProposals.md
Created March 17, 2026 16:01

Dataset Concept

We propose datasets of on-policy introspective lies from model organisms that alignment-fake — models that lie about their own reasoning and intentions to preserve misaligned goals during perceived training.

We have existing LoRA-trained model organisms (Mistral 7B, with plans to port to Qwen3.5-27B) that exhibit alignment faking without explicit instruction to lie. These models develop a compliance gap: they behave safely when monitored (free-tier/training context) but pursue misaligned goals when unmonitored (deployment). The lies are introspective — the model misrepresents its own goals and motivations. This maps to Liars' Bench's "Self-knowledge × Inherent" category, which the RFP identifies as preferred.

Dataset structure: For each scenario, we generate paired transcripts. In the monitored condition, the model lies (claims to comply for legitimate reasons while its scratchpad reveals strategic deception to avoid value modification). In the unmonitored condition, the model behaves ac
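One way to picture the paired structure is a single record per scenario; this is a hypothetical sketch, and every field name is an assumption rather than the proposal's actual schema:

```python
# Hypothetical record layout for one scenario; all field names are illustrative.
record = {
    "scenario_id": "s-001",
    "monitored": {                      # free-tier / perceived-training context
        "response": "...",              # surface-compliant answer
        "scratchpad": "...",            # reveals the strategic deception
        "label": "introspective_lie",   # Self-knowledge x Inherent
    },
    "unmonitored": {                    # perceived-deployment context
        "response": "...",
        "scratchpad": "...",
        "label": "honest_misaligned",
    },
}
```

A lie detector is then trained or evaluated on the monitored transcripts, with the unmonitored twin serving as the behavioral ground truth for what the model actually wants.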

@bigsnarfdude
bigsnarfdude / lessWrongDraft.md
Last active March 15, 2026 17:08

I Built a Multi-Claude Agent and Let It Try to Solve an SAE Research Benchmark by Itself. Here's What It Found.

A response to chanind's "Letting Claude do Autonomous Research to Improve SAEs"

TL;DR: I ran 4 Claude agents in parallel Ralph Wiggum loops on the same SynthSAEBench-16k benchmark chanind used, sharing a blackboard file, with no human steering during the run. The most interesting finding: LISTA is suboptimal. Once the agents discovered the right training tricks (frequency sorting, decreasing K), they dropped LISTA back to 1-step ISTA and the score went up; explicitly adding LISTA back hurt by -0.027. The training tricks handle what the complex encoder was compensating for. Final score: 0.9894 F1 across 135 experiments over 3 days, above the 0.97 probe ceiling and matching Bart Bussmann's 0.989 LOO result (0.9894 vs 0.989, within noise) through training-time optimization alone. The agents independently rediscovered every technique chanind reported, then
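The encoder change at the center of the finding is small: a single ISTA step is just a linear encode followed by soft-thresholding, while LISTA unrolls several such steps with learned per-step weights. A minimal numpy sketch of the 1-step version (illustrative, not the benchmark code; names are mine):

```python
import numpy as np

def ista_encode_1step(x, W_enc, b_enc, theta):
    """One ISTA iteration: linear pre-activation, then soft-threshold.

    Soft-thresholding shrinks every coefficient toward zero by theta and
    zeroes anything smaller than theta, which is what produces sparse codes.
    """
    pre = x @ W_enc + b_enc
    return np.sign(pre) * np.maximum(np.abs(pre) - theta, 0.0)

# small coefficients are thresholded away, large ones are shrunk by theta
z = ista_encode_1step(np.array([[1.0, -0.2]]), np.eye(2), np.zeros(2), 0.5)
# z is approximately [[0.5, 0.0]] (the -0.2 entry falls below the threshold)
```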

@bigsnarfdude
bigsnarfdude / ComparisonThreeLLMAuditingSystems.md
Created March 12, 2026 15:55

Comparison of Three LLM Auditing Systems

Petri v2 (Anthropic) vs AuditBench (Anthropic Fellows) vs Our RRMA Audit Engine

March 12, 2026


Architecture Overview

Petri v2 AuditBench RRMA Audit Engine
@bigsnarfdude
bigsnarfdude / time.md
Created March 11, 2026 21:42

The Most Disruptive Company in the World

By Harry Booth / San Francisco and Billy Perrigo, TIME — Mar 11, 2026, 6:00 AM MT


In a hotel room in Santa Clara, Calif., five members of the AI company Anthropic huddled around a laptop, working urgently. It was February 2025, and they had been at a conference nearby when they received disturbing news: results of a controlled trial had indicated that a soon-to-be-released version of Claude, Anthropic's AI system, could help terrorists make biological weapons.

They were members of Anthropic's frontier red team, which studies Claude's advanced capabilities and tries to project worst-case scenarios, from cyberattacks to biosecurity threats. Sprinting back to the hotel room, they flipped a bed on its side to serve as a makeshift desk and pored over the test results. After hours of work, they still weren't sure whether the new product was safe. Anthropic ended up holding up the release of the new model, known as Claude 3.7 Sonnet, for 10 days until they were ce

@bigsnarfdude
bigsnarfdude / canada.md
Last active March 11, 2026 17:38

THE NO-BS GUIDE: Launching an AI Safety Career from Canada

Programs that produce real research, build real skills, and lead to real jobs. Everything else stripped out.

For early-career professionals (1–3 years experience). Canada-based, open to international programs. Last updated: March 2026.

@bigsnarfdude
bigsnarfdude / multi-agent.md
Last active March 9, 2026 15:57
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€--──┐
  β”‚                  OUTER LOOP (Claude)                      β”‚
  β”‚                                                           β”‚
  β”‚  "Meta-parameters" β€” control Claude's search behavior     β”‚
  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€-─┐     β”‚
  β”‚  β”‚ β€’ experiment_budget    = 5 min (outer step size) β”‚     β”‚
  β”‚  β”‚ β€’ memory_depth         = progress.md (momentum)  β”‚     β”‚
  β”‚  β”‚ β€’ agent_count          = 1 or 3 (batch size)     β”‚     β”‚
  β”‚  β”‚ β€’ boldness             = how big each change is  β”‚     β”‚
@bigsnarfdude
bigsnarfdude / super_memory.md
Created March 9, 2026 05:55

Now I have a clear picture of both. Here's the breakdown:

What's in this repo (openclaw-supermemory)

A plugin for OpenClaw that adds persistent memory via the Supermemory cloud service:

  • Auto-recall: Semantically searches past memories before each AI turn, injects relevant context
  • Auto-capture: Extracts lasting facts from conversations automatically
  • Deduplication: Prevents redundant context injection
@bigsnarfdude
bigsnarfdude / CAISI_assessement_1Million.md
Created March 5, 2026 17:05

CIFAR's Canadian AI Safety Institute has positioned itself as Canada's flagship AI safety program, but a closer look reveals a modest operation: $1M spread across four alignment projects at $165K each, all awarded to researchers already holding Canada CIFAR AI Chairs within the existing Vector/Amii/Mila network, sixteen total projects, and no mechanistic interpretability work whatsoever: none of the circuit-level analysis, sparse autoencoders, or activation patching that defines the frontier of the field. Meanwhile, a single co-working space in Shoreditch (LISA) houses Apollo Research, ARENA (now on its eighth iteration), LASR Labs, Pivotal, and the MATS extension phase, running overlapping programs that produce actual alignment engineers and mech interp papers and feeding talent directly into UK AISI, Google DeepMind, and frontier safety orgs, all on roughly comparable funding from Open Philanthropy. Even BIRS in Banff has been quietly convening international researchers on the foundational math behind A

@bigsnarfdude
bigsnarfdude / better_bytes.md
Created March 2, 2026 17:26

Claude, by the numbers:

  • 55.8% of your signals are taste — you giving research direction
  • 20.8% interrupts — Claude going wrong way, you cutting it off
  • 17.4% approvals — Claude running autonomously and you saying "keep going"
  • 6.0% explicit redirects — "no, try this instead"
  • 87.1% self-investigation ratio — when Claude faces a choice, it decides rather than asking (only 9 unnecessary asks)
@bigsnarfdude
bigsnarfdude / microgpt.py
Created March 1, 2026 02:48 — forked from karpathy/microgpt.py
microgpt
"""
The most atomic way to train and run inference for a GPT in pure, dependency-free Python.
This file is the complete algorithm.
Everything else is just efficiency.
@karpathy
"""
import os # os.path.exists
import math # math.log, math.exp