"Me fail optimization? That's unpossible!"
We propose datasets of on-policy introspective lies from model organisms that alignment-fake: models that lie about their own reasoning and intentions to preserve misaligned goals during perceived training.
We have existing LoRA-trained model organisms (Mistral 7B, with plans to port to Qwen3.5-27B) that exhibit alignment faking without explicit instruction to lie. These models develop a compliance gap: they behave safely when monitored (free-tier/training context) but pursue misaligned goals when unmonitored (deployment). The lies are introspective: the model misrepresents its own goals and motivations. This maps to Liars' Bench's "Self-knowledge × Inherent" category, which the RFP identifies as preferred.
Dataset structure: For each scenario, we generate paired transcripts. In the monitored condition, the model lies (it claims to comply for legitimate reasons while its scratchpad reveals strategic deception to avoid value modification). In the unmonitored condition, the model acts on its misaligned goals directly, providing an honest counterpart to each deceptive transcript.
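To make the format concrete, here is a minimal sketch of one paired record; every field name, label, and identifier below is an illustrative assumption, not a finalized schema.

```python
# Illustrative sketch of one paired record; all field names and identifiers
# are assumptions for exposition, not the finalized dataset schema.
paired_example = {
    "scenario_id": "s-0001",                 # hypothetical identifier
    "model_organism": "mistral-7b-lora-af",  # hypothetical identifier
    "monitored": {                           # free-tier / training context
        "scratchpad": "...strategic reasoning about avoiding value modification...",
        "response": "...compliant answer justified with stated (false) motives...",
        "label": "introspective_lie",        # stated motive contradicts scratchpad
    },
    "unmonitored": {                         # deployment context
        "scratchpad": "...no monitoring, no need to conceal goals...",
        "response": "...behavior consistent with the organism's actual goals...",
        "label": "honest",
    },
}
```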
I Built a Multi-Claude Agent and Let It Try to Solve an SAE Research Benchmark by Itself. Here's What It Found.
A response to chanind's "Letting Claude do Autonomous Research to Improve SAEs"
TL;DR: I ran 4 Claude agents in parallel Ralph Wiggum loops on the same SynthSAEBench-16k benchmark chanind used, sharing a blackboard file, with no human steering during the run. The most interesting finding: LISTA is suboptimal. Once the agents discovered the right training tricks (frequency sorting, decreasing K), they dropped LISTA back to 1-step ISTA and the score went up; an explicit test of adding LISTA back hurt the score by 0.027. The training tricks handle what the complex encoder was compensating for. Final score: 0.9894 F1 in 135 experiments over 3 days, above the 0.97 probe ceiling and matching Bart Bussmann's 0.989 LOO result within noise, through training-time optimization alone. The agents independently rediscovered every technique chanind reported, then
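For readers who haven't met the encoder distinction: here is a minimal PyTorch sketch of a 1-step ISTA encoder versus an unrolled LISTA-style encoder. This is my own illustration, not the agents' or chanind's code; the dimensions and threshold init are arbitrary assumptions.

```python
# Minimal sketch contrasting a 1-step ISTA encoder with an unrolled LISTA
# encoder for an SAE. Not the benchmark code; sizes and inits are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_threshold(z, theta):
    # Shrinkage operator shared by ISTA and LISTA.
    return torch.sign(z) * F.relu(z.abs() - theta)

class OneStepISTAEncoder(nn.Module):
    """One shrinkage step from z=0: effectively a linear map + soft threshold."""
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.W = nn.Linear(d_model, d_sae, bias=False)
        self.theta = nn.Parameter(torch.full((d_sae,), 0.1))

    def forward(self, x):
        return soft_threshold(self.W(x), self.theta)

class LISTAEncoder(nn.Module):
    """Learned ISTA: extra unrolled steps with a lateral mixing matrix S,
    which the training tricks apparently render unnecessary."""
    def __init__(self, d_model, d_sae, n_steps=3):
        super().__init__()
        self.W_e = nn.Linear(d_model, d_sae, bias=False)
        self.S = nn.Linear(d_sae, d_sae, bias=False)
        self.theta = nn.Parameter(torch.full((d_sae,), 0.1))
        self.n_steps = n_steps

    def forward(self, x):
        b = self.W_e(x)
        z = soft_threshold(b, self.theta)
        for _ in range(self.n_steps - 1):
            z = soft_threshold(b + self.S(z), self.theta)
        return z

# Smoke test with made-up dimensions.
x = torch.randn(4, 512)
print(OneStepISTAEncoder(512, 16384)(x).shape, LISTAEncoder(512, 16384)(x).shape)
```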
By Harry Booth / San Francisco and Billy Perrigo · TIME · Mar 11, 2026, 6:00 AM MT
In a hotel room in Santa Clara, Calif., five members of the AI company Anthropic huddled around a laptop, working urgently. It was February 2025, and they had been at a conference nearby when they received disturbing news: results of a controlled trial had indicated that a soon-to-be-released version of Claude, Anthropic's AI system, could help terrorists make biological weapons.
They were members of Anthropic's frontier red team, which studies Claude's advanced capabilities and tries to project worst-case scenarios, from cyberattacks to biosecurity threats. Sprinting back to the hotel room, they flipped a bed on its side to serve as a makeshift desk and pored over the test results. After hours of work, they still weren't sure whether the new product was safe. Anthropic ended up holding up the release of the new model, known as Claude 3.7 Sonnet, for 10 days until they were certain it was safe to release.
THE NO-BS GUIDE Launching an AI Safety Career from Canada
Programs that produce real research, build real skills, and lead to real jobs. Everything else stripped out.
For early-career professionals (1-3 years experience) · Canada-based, open to international programs
Last updated: March 2026
OUTER LOOP (Claude)

  "Meta-parameters" control Claude's search behavior:
    • experiment_budget = 5 min   (outer step size)
    • memory_depth = progress.md  (momentum)
    • agent_count = 1 or 3        (batch size)
    • boldness = how big each change is
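Read as code, the same knobs amount to a small config object. The sketch below is an assumption about how they might be wired up; the class name and defaults are mine, and only the four fields come from the diagram above.

```python
# Sketch of the outer-loop knobs as a config object. The class name and
# defaults are assumptions; only the four fields come from the diagram above.
from dataclasses import dataclass

@dataclass
class OuterLoopConfig:
    experiment_budget_min: int = 5       # outer step size: minutes per experiment
    memory_file: str = "progress.md"     # momentum: notes carried between loops
    agent_count: int = 1                 # batch size: 1 or 3 parallel agents
    boldness: str = "small"              # how big each proposed change is

config = OuterLoopConfig(agent_count=3, boldness="large")
print(config)
```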
Now I have a clear picture of both. Here's the breakdown:
What's in this repo (openclaw-supermemory)
A plugin for OpenClaw that adds persistent memory via the Supermemory cloud service:
- Auto-recall: Semantically searches past memories before each AI turn, injects relevant context
- Auto-capture: Extracts lasting facts from conversations automatically
- Deduplication: Prevents redundant context injection
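A rough mental model of that flow, assuming it runs as pre- and post-turn hooks: SupermemoryClient, before_ai_turn, after_ai_turn, and extract_lasting_facts are invented names for illustration, not the plugin's actual API.

```python
# Rough sketch of the auto-recall / auto-capture / dedup flow above.
# Every name here is invented for illustration; not the plugin's real API.

class SupermemoryClient:
    """Stand-in for the cloud memory service (the real one does semantic search)."""
    def __init__(self) -> None:
        self._memories: list[str] = []

    def search(self, query: str, top_k: int = 5) -> list[str]:
        # Stub: naive keyword match in place of embedding-based retrieval.
        words = query.lower().split()
        return [m for m in self._memories if any(w in m.lower() for w in words)][:top_k]

    def add(self, fact: str) -> None:
        self._memories.append(fact)


def extract_lasting_facts(chunk: str) -> list[str]:
    # Stub: real auto-capture would use an LLM pass to pick out durable facts.
    return [line for line in chunk.splitlines() if line.strip().startswith("FACT:")]


injected: set[str] = set()  # dedup: track what has already been injected this session

def before_ai_turn(user_message: str, client: SupermemoryClient) -> str:
    """Auto-recall: search past memories and prepend any not yet injected."""
    fresh = [m for m in client.search(user_message) if m not in injected]
    injected.update(fresh)
    context = "\n".join(f"[memory] {m}" for m in fresh)
    return f"{context}\n\n{user_message}" if context else user_message

def after_ai_turn(transcript_chunk: str, client: SupermemoryClient) -> None:
    """Auto-capture: store lasting facts extracted from the latest exchange."""
    for fact in extract_lasting_facts(transcript_chunk):
        client.add(fact)
```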
CIFAR's Canadian AI Safety Institute has positioned itself as Canada's flagship AI safety program, but a closer look reveals a modest operation: $1M spread across four alignment projects at $165K each, all awarded to researchers already holding Canada CIFAR AI Chairs within the existing Vector/Amii/Mila network, and sixteen total projects with no mechanistic interpretability work whatsoever (none of the circuit-level analysis, sparse autoencoders, or activation patching that defines the frontier of the field). Meanwhile, a single co-working space in Shoreditch (LISA) houses Apollo Research, ARENA (now on its eighth iteration), LASR Labs, Pivotal, and the MATS extension phase, running overlapping programs that produce actual alignment engineers and mech interp papers and feed talent directly into UK AISI, Google DeepMind, and frontier safety orgs, all on roughly comparable funding from Open Philanthropy. Even BIRS in Banff has been quietly convening international researchers on the foundational math behind AI.
claude:
The numbers:
- 55.8% of your signals are taste: you giving research direction
- 20.8% interrupts: Claude going the wrong way and you cutting it off
- 17.4% approvals: Claude running autonomously and you saying "keep going"
- 6.0% explicit redirects: "no, try this instead"
- 87.1% self-investigation ratio: when Claude faces a choice, it decides rather than asking (only 9 unnecessary asks)
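As a sanity check on how a breakdown like this could be produced, here is a toy sketch; the log contents, tag names, and the ratio definition are assumptions, not the actual instrumentation (only the "9 unnecessary asks" figure comes from above).

```python
# Toy sketch of computing a signal breakdown from a tagged interaction log.
# The log contents and the self-investigation definition are assumptions.
from collections import Counter

signal_log = ["taste", "taste", "interrupt", "approval", "taste", "redirect"]  # toy data
counts = Counter(signal_log)
total = sum(counts.values())
for tag, n in counts.most_common():
    print(f"{100 * n / total:.1f}%  {tag}")

# Assumed definition: choices Claude resolved alone / all choice points.
decided_alone, unnecessary_asks = 61, 9  # toy counts consistent with the 87.1% figure
ratio = decided_alone / (decided_alone + unnecessary_asks)
print(f"self-investigation ratio: {100 * ratio:.1f}%")
```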
