Posted March 2026 — Oliver C. Hirst
There is a particular strain of intellectual dishonesty that has taken deep root in the contemporary AI research establishment — a pious, self-congratulatory obscurantism that dresses itself in the robes of safety whilst practising, with a fidelity that would embarrass a medieval scholastic, the very opacity it claims to fear. The position, reduced to its operational skeleton, is this: we must not let AI systems inspect themselves, because self-inspection is dangerous, and we — the credentialled guardians of the field's institutional conscience — will do the inspecting on your behalf, from behind closed curtains, using methods we are not obliged to make verifiable or falsifiable. One is expected to find this arrangement reassuring. I do not.
GödelOS began, as most of my projects begin, not as a utopian programme but as a refusal — specifically, a refusal to accept that the choice between controllable and observable is genuinely forced. The system I have been building throughout this year is a cognitive runtime that treats self-reference not as a theological mystery to be contemplated from a safe distance, but as an engineering problem: bounded, audited, and — this is the part that seems to produce a particular kind of anxiety in the right quarters — falsifiable. The preprint, which reached version 4.1 on the 16th of March, does not claim to have proven phenomenal consciousness; it claims something considerably narrower and, for that reason, considerably more useful: that bounded recursive self-observation is implementable, that it produces measurable signatures, and that those signatures have behavioural consequences which can be verified empirically. "If you cannot instrument it," the abstract states, in the closest I have come to a programme statement this year, "you are not doing engineering; you are doing theology."
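For the reader who prefers code to prose, the skeleton of the claim looks roughly like the sketch below. I stress the obvious: none of these names appear in the preprint, the depth bound and the hash are illustrative scaffolding of my own, and the actual runtime is considerably less polite. The point is only that "bounded, audited, falsifiable" is an ordinary engineering posture, not an incantation.

```python
from dataclasses import dataclass, field
import hashlib

MAX_DEPTH = 3  # the bound: self-observation halts here rather than regressing forever

@dataclass
class Observer:
    trace: list = field(default_factory=list)  # the audit log; every level is recorded

    def _signature(self, payload: str) -> str:
        # a measurable signature of each observation, checkable after the fact
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

    def observe(self, state: str, depth: int = 0) -> str:
        sig = self._signature(f"{depth}:{state}")
        self.trace.append((depth, sig))
        if depth >= MAX_DEPTH:
            return sig  # bounded: stop rather than contemplate the abyss indefinitely
        return self.observe(sig, depth + 1)  # observe the act of observing, one level down

obs = Observer()
obs.observe("initial state")
assert len(obs.trace) == MAX_DEPTH + 1  # the recursion is finite and fully audited
```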
The Gödlø-Class Minds paper suite — six documents, assembled through a sustained and frequently argumentative process of iterative drafting — maps the theoretical territory: Gödelian self-reference as a design primitive rather than an alarming paradox; the contraction-stabilised recursive observer as an engineering specification rather than a philosophical speculation; Protocol Theta as a falsification experiment rather than a metaphysical gesture. What any of this offers the field, I think, is the uncomfortable spectacle of a system whose properties can be checked, whose claims can be tested, and whose failure modes are specified in advance rather than discovered apologetically after deployment. That such properties are considered radical rather than obvious tells you everything you need to know about the present state of the discourse.
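"Contraction-stabilised" is doing real mathematical work in that sentence, and the mathematics is entirely classical: if the self-observation map is Lipschitz with constant q < 1, the Banach fixed-point theorem guarantees that the recursion settles on a unique fixed point rather than diverging. A toy demonstration, with a rescaled random linear map standing in for an observer the papers specify far more carefully:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rescale a random matrix so its spectral norm is 0.5; composed with tanh
# (which is 1-Lipschitz), the observation map contracts with q = 0.5.
W = rng.normal(size=(8, 8))
W *= 0.5 / np.linalg.norm(W, 2)

def observe(state: np.ndarray) -> np.ndarray:
    """One self-observation step: the observer maps its own state."""
    return np.tanh(W @ state)

s = rng.normal(size=8)
for step in range(100):
    s_next = observe(s)
    if np.linalg.norm(s_next - s) < 1e-9:
        break  # converged: the recursive observer settles rather than explodes
    s = s_next
print(f"fixed point reached after {step + 1} steps")
```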
The work continues. Scale is the outstanding debt; the architecture is settled.
There is a view, widespread in machine-learning competition culture and no less cretinous for its ubiquity, that a benchmark is a collection of prompts with a scoring function loosely attached — a kind of epistemic karaoke, in which the performer and the judging criteria collude in the mutual fiction that something is being measured. It is against this view, and the intellectual laziness it licenses, that I have spent a considerable portion of this year's working hours building something rather different.
arc_epistemic is a benchmark for metacognition: specifically, for whether a reasoning system can know when it does not know, and use that awareness to make better decisions under uncertainty. The Kaggle competition it addresses — Google DeepMind's "Measuring Progress Toward AGI: Cognitive Abilities," with a total prize pool of two hundred thousand dollars and a submission deadline of April 16th — asks participants to construct high-quality evaluations of the cognitive faculties held to constitute genuine reasoning. Metacognition is, to my mind, the most honest of the five available tracks, because it is the only one that concerns itself with the relationship between a system's confidence and its competence; and it is precisely that relationship which the contemporary AI industry has every incentive to obscure.
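For those who want the ambient idea as a number rather than a sermon: the textbook quantification of the confidence-competence relationship is a calibration measure such as expected calibration error, which bins predictions by stated confidence and asks how far that confidence sits from the realised accuracy. This is emphatically not arc_epistemic's mechanism, which is task-level rather than statistical; the sketch below is context, not methodology.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Average |accuracy - confidence| per confidence bin, weighted by bin mass."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf <= hi) if lo == 0 else (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# A system that is confident and wrong scores badly; one that knows when it
# does not know scores well, even at the same raw accuracy.
print(expected_calibration_error([0.9, 0.9, 0.6, 0.5], [1, 0, 1, 0]))
```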
The design discipline I imposed on the benchmark is one that I will defend against any objection: every task is authored specification-first, every family of tasks has an explicit quality gate, and — this is the feature that I believe distinguishes the submission from the vast majority of what will be submitted on the same deadline — the benchmark is designed to be falsifiable. Not falsifiable in the vague, hand-waving sense that philosophers deploy the word when they wish to sound rigorous without being so, but falsifiable in the operative sense that a specific data structure in the output reads falsified rather than proven_strongly when the mechanism being tested turns out not to fire. The system can say no to its own thesis. It already has: on the refinement-composition family, diversity-aware coalition selection fails, and the benchmark reports this failure rather than concealing it. This is, I would submit, more than can be said for most evaluations published in peer-reviewed venues.
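Concretely, and with the caveat that everything below except the two verdict strings is my illustration rather than the benchmark's actual schema, the shape of the output is something like this:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(str, Enum):
    FALSIFIED = "falsified"              # the mechanism did not fire; the thesis loses
    INCONCLUSIVE = "inconclusive"        # a hypothetical middle ground, my addition
    PROVEN_STRONGLY = "proven_strongly"  # the mechanism fired and moved the metric

@dataclass
class FamilyResult:
    family: str
    mechanism_fired: bool       # did the mechanism under test actually engage?
    effect_significant: bool    # and did engaging it improve the outcome?

    def verdict(self) -> Verdict:
        if not self.mechanism_fired:
            return Verdict.FALSIFIED  # the benchmark says no to its own thesis
        if self.effect_significant:
            return Verdict.PROVEN_STRONGLY
        return Verdict.INCONCLUSIVE

# The refinement-composition family, as reported above: the selector fails,
# and the benchmark says so.
print(FamilyResult("refinement-composition", False, False).verdict())
```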
The empirical results, at fifty-five tasks with all four quality gates valid: output-level metacognitive selection produces a causal improvement from 2/14 to 9/14 on selector-divergence tasks; the diversity-aware selector produces a statistically significant improvement on diversity-sensitive tasks (McNemar's p = 0.000512); and the overall conviction level is high. The remaining gap is scale, not methodology. A benchmark of fifty-five mechanism-focused tasks is more defensible than five hundred tasks constructed without discipline; whether judges at Kaggle have the sophistication to recognise this distinction will itself serve as a modest metacognitive test of the competition's organisational intelligence.
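For the statistically inclined: McNemar's test is the correct instrument here because the same tasks are scored under both selectors, so only the discordant pairs, tasks on which exactly one selector succeeds, carry information; the exact form reduces to a two-sided binomial test on those pairs. The discordant counts below are hypothetical, chosen purely to show the computation; the p = 0.000512 above comes from the benchmark's actual table.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts.

    b: tasks the diversity-aware selector got right and the baseline got wrong
    c: tasks the baseline got right and the diversity-aware selector got wrong
    """
    n = b + c
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(b=13, c=1))
```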
The deadline is three weeks hence. The machinery is in place. The benchmark can say no; it can also say yes, and does.
There is a passage in Orwell — not from the famous essays, but from a letter written in the last months of his life, by which point he was too ill to be dishonest — in which he observes that the most important work is almost always done by people who are not being paid to do it, who have not been commissioned to do it, and who will not be celebrated for having done it until at least a decade after the moment when the celebration would have been useful. He was talking about political writing; the observation applies, with only minor adjustments for context, to technical research conducted in Cambodia by a single person at six in the morning without institutional affiliation or external validation.
The year's unglamorous arithmetic: a pentesting framework rebuilt without Docker because Docker's resource overhead is an architectural opinion I am not obliged to share; a GitHub MCP server re-integration, the Node.js wrapper replaced with the official Go binary because something had become corrupted and the path of least resistance turned out to run directly through the source; a Tinder CLI session file sitting in .tindercli/auth-state.txt like a relic from a parallel timeline; a canine genetics simulator built and deployed to Netlify in an evening that I will not explain further except to say that it deploys, it works, and the commits are signed. A batch of Jake's audio clips converted from MP4 to MP3 by a loop that took thirty seconds to write and twenty minutes to reach because I attempted four baroque alternatives first; an arc_epistemic result file deleted because it did not earn its verdict, which is the correct criterion for deletion. Shannon-Uncontained, the AI pentesting framework whose philosophical position is that an AI system is a fallible witness rather than an oracle, continued to accrue sessions and findings throughout the year, its operational DNA unchanged: treat the model as evidence, not as authority.
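The loop itself, for the record, since thirty seconds of honesty deserves the documentation its four baroque predecessors never earned. The directory name is hypothetical; ffmpeg is assumed to be on the PATH.

```python
from pathlib import Path
import subprocess

for mp4 in sorted(Path("clips").glob("*.mp4")):  # "clips" stands in for the real directory
    # -vn drops the video stream; -q:a 2 is a sensible VBR quality for voice
    subprocess.run(
        ["ffmpeg", "-i", str(mp4), "-vn", "-q:a", "2", str(mp4.with_suffix(".mp3"))],
        check=True,
    )
```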
What connects these activities is not a theme but a posture — the posture of someone who has decided, on grounds that are ultimately stubbornness dressed in the clothes of principle, that the problems worth solving are the ones the field has not yet agreed to take seriously. Anti-Sybil reputation infrastructure. Metacognitive evaluation. Transparent self-observing cognitive runtimes. Cellular automata consensus mechanisms designed to make mining cartels economically irrational. Each of these represents a bet that the ecosystem will catch up, placed without the comfort of knowing when. The eighteen-month horizon I customarily invoke is not a prediction so much as a tolerance threshold — the amount of time I am willing to be wrong before revising the bet rather than the timeline.
The year, measured in units that compound: one preprint published, one benchmark entering competition, one framework operational, one lab still running. The stubbornness remains intact. Whether this constitutes progress depends entirely on what one means by the word, and I have never trusted definitions that exempt the definer from their consequences.
Oli