Anthony ALCARAZ (AnthonyAlcaraz)
@AnthonyAlcaraz
AnthonyAlcaraz / linkedin-skills-post.md
Created May 1, 2026 18:42
The 81st Skill: How Skill Management and Retrieval Got Bounded - LinkedIn Post

Add the 81st skill to your agent.

Retrieval starts missing. The router pulls a near-miss instead of the right primitive. The agent runs with shallow context, the task fails, and you debug the wrong layer.

That number isn't folklore. Reganti's SkillsBench measured it across 26,262 skills. The retrieval-collapse threshold sits around 80 flat skills. Past that point, accuracy degrades smoothly and silently. The router returns the second-best skill, the model uses it, and you attribute the failure to model quality.

Skill management in 2026 has three load-bearing axes: classification, need-aware loading, and composition.

Classification. Five empirical patterns: Tool Wrapper, Generator, Reviewer, Inversion, Pipeline. Each has different retrieval behavior. Tool Wrappers cluster tightly in embedding space. Mixing patterns without intent fragments the search index.
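A minimal sketch of the failure shape, assuming plain cosine-similarity routing over skill-description embeddings. Every name here is hypothetical and the embedding function is a stand-in for a real sentence-embedding model:

```python
# Minimal sketch of a flat skill router. Cosine-similarity retrieval
# over skill-description embeddings; `embed` is a stand-in for any
# real embedding model, and all names are hypothetical.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in embedding: deterministic pseudo-random unit vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class FlatSkillRouter:
    def __init__(self):
        self.names, self.vecs = [], []

    def add_skill(self, name: str, description: str) -> None:
        self.names.append(name)
        self.vecs.append(embed(description))

    def route(self, query: str, k: int = 2):
        q = embed(query)
        sims = np.array(self.vecs) @ q
        top = np.argsort(-sims)[:k]
        # Past ~80 flat skills, the margin between rank 1 and rank 2
        # narrows: the router starts returning near-misses silently.
        return [(self.names[i], float(sims[i])) for i in top]
```

Watching the rank-1 vs rank-2 margin is the cheap detector: the collapse is silent precisely because top-1 keeps returning something plausible.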

@AnthonyAlcaraz
AnthonyAlcaraz / linkedin-buehler-post.md
Created May 1, 2026 17:56
Verifier-Bounded Domains and the 0.04% Rule (Buehler AutomataGPT) - LinkedIn Post

Markus Buehler trained a transformer on 100 cellular-automata rules out of 262,144 possible.

The model then executed the other 262,044, none in the training set, at 96% accuracy.

The mechanism behind that 96% is what matters.

The model wasn't matching new rules to nearby trained ones. The statistical correlation between accuracy and training-similarity collapsed to R²=0.00 at scale. That collapse is the architectural signature: the transformer is running an internalized rule-execution operator, not retrieving from distributional memory.

0.04% of the rule space, 96% generalization, zero correlation with training-data proximity.
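The rule space is recoverable from the number itself: 262,144 = 2^18, the count of outer-totalistic Life-like rules (9 birth bits plus 9 survival bits, one per live-neighbor count 0..8). A sketch of executing one such rule under that assumed encoding, not Buehler's code:

```python
# Sketch: executing one outer-totalistic "Life-like" CA rule.
# 2**18 = 262,144 rules from 9 birth bits + 9 survival bits
# (one per possible live-neighbor count, 0..8). The bit layout is
# an assumption for illustration, not AutomataGPT's encoding.
import numpy as np

def step(grid: np.ndarray, rule: int) -> np.ndarray:
    birth = [(rule >> n) & 1 for n in range(9)]          # bits 0..8
    survive = [(rule >> (9 + n)) & 1 for n in range(9)]  # bits 9..17
    # Count live Moore neighbors with periodic boundaries.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    born = (grid == 0) & np.take(birth, neighbors).astype(bool)
    kept = (grid == 1) & np.take(survive, neighbors).astype(bool)
    return (born | kept).astype(grid.dtype)

# Conway's Life is B3/S23: birth bit 3, survival bits 2 and 3.
life = (1 << 3) | (1 << (9 + 2)) | (1 << (9 + 3))
```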

@AnthonyAlcaraz
AnthonyAlcaraz / linkedin-post-clean.txt
Created April 28, 2026 17:47
Static-MoA to Dynamic-Orchestration Architectural Inflection (Sakana Conductor + OneManCompany 2-Paper Convergence) - LinkedIn 2026-04-28
If you're building a multi-agent system in 2026, you're probably wiring a static org chart. Two ICLR 2026 papers argue you should be wiring a labor market.
OneManCompany (arXiv:2604.22446) hits 84.67% on PRDBench, +15.5 points over the prior state of the art. Agents become Talents with portable identities bundling skills and tools. A Talent Market dynamically recruits them per task using Explore-Execute-Review tree search. No fixed assignments. The market figures out who works on what.
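A toy sketch of the market step with hypothetical names only; the Explore-Execute-Review tree search is not shown:

```python
# Toy sketch of market-style recruiting: score Talent profiles per
# task instead of fixing assignments. Names are hypothetical,
# not the OneManCompany API.
from dataclasses import dataclass, field

@dataclass
class Talent:
    name: str
    skills: set[str]
    tools: set[str] = field(default_factory=set)

def recruit(task_skills: set[str], market: list[Talent], k: int = 2):
    # Score = skill overlap; a real market would also weigh cost,
    # past review scores, and tool availability.
    scored = sorted(market, key=lambda t: len(t.skills & task_skills),
                    reverse=True)
    return scored[:k]

market = [
    Talent("writer", {"prd", "copy"}),
    Talent("coder", {"python", "api"}, {"repl"}),
    Talent("reviewer", {"prd", "qa"}),
]
team = recruit({"prd", "qa"}, market)  # -> reviewer, then writer
```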
Sakana's Conductor (ICLR 2026) lands the same architectural shift via a different mechanism. A 7B model trained via RL orchestrates a frontier pool (GPT-5, Gemini, Claude, open-source). Records at publication: LiveCodeBench 83.9%, GPQA-Diamond 87.5%. Beats Mixture-of-Agents at a fraction of the cost. Adds a Recursive Test-Time Scaling primitive: the orchestrator selects itself as a worker, reads its team's prior output, recognizes failure, spins up a corrective workflow.
Two independent labs. Different mechanisms (RL-trained meta-p
@AnthonyAlcaraz
AnthonyAlcaraz / linkedin-post-clean.txt
Created April 28, 2026 05:39
Self-Improving Systems Across 5 Architectural Layers (Sakana Conductor + GEPA + Auto-Diagnose + AgenticQwen + EinsteinArena) - LinkedIn 2026-04-28 v2 idea-citations
Sakana AI just shipped the cleanest argument that small models punch above their weight when the architecture is right.
Conductor is a 7B model trained via RL to orchestrate other models: GPT-5, Gemini, Claude, open-source. The 7B picks who solves what subtask, what context window each agent gets, and how the workflow assembles, in natural language rather than code.
Records at publication: LiveCodeBench 83.9%, GPQA-Diamond 87.5%. Beats every frontier model in its own pool. Beats Mixture-of-Agents at a fraction of the cost.
The novel piece is Recursive Test-Time Scaling. Conductor can select itself as a worker. Reads its team's prior output, recognizes failure, spins up a corrective workflow. New axis for inference compute beyond train-time scaling, test-time-via-samples, and test-time-via-CoT-length.
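A schematic of that recursive loop under assumed interfaces; `call_model` is a hypothetical stand-in, not Sakana's API:

```python
# Schematic of Recursive Test-Time Scaling: the orchestrator may
# select itself as a worker to review and correct its team's output.
def call_model(model: str, prompt: str) -> str:
    # Stub: swap in a real inference call for the named model.
    return f"[{model}] draft answer"

def orchestrate(task: str, pool: list[str], self_model: str,
                max_rounds: int = 3) -> str:
    transcript, draft = "", ""
    for _ in range(max_rounds):
        # The orchestrator may pick any pool member -- or itself.
        worker = call_model(
            self_model,
            f"Task: {task}\nTranscript:{transcript}\n"
            f"Choose one worker from {pool + [self_model]}.",
        )
        draft = call_model(worker, f"{task}\nContext:{transcript}")
        transcript += f"\n{draft}"
        verdict = call_model(self_model, f"Solved?\n{draft}")
        if verdict.lower().startswith("yes"):
            return draft
    return draft  # best effort after max_rounds corrective passes
```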
It is the fifth instance of a pattern crystallizing across the 2026 ICLR cycle. Self-improving systems whose training distribution evolves with the agent's capability. Failure mode at iteration N becomes corrective signal at N
@AnthonyAlcaraz
AnthonyAlcaraz / linkedin-post-clean.txt
Created April 27, 2026 11:53
Failure-Mode-Shift Pattern: Every Architectural Shift Moves the Failure Mode (Stepanenko/Schmeer/Rotting-Graph 3 instances) - LinkedIn 2026-04-27
Three posts this week. Three products. Three commenters. Same architectural critique each time.
Andrey Stepanenko on a vectorless-RAG post: "It doesn't eliminate the problem, it shifts it."
Mark Schmeer on Jerry Liu's ParseBench: silent misordering does not eliminate parsing failure. It shifts visible parsing error to invisible downstream-decision corruption.
The rotting-graph cluster, five voices now (Joshua Yu, André Lindenberg, Tony Seale, OWASP governance, Intellispan in Vanderseypen's comments): managed graph platforms do not eliminate the maintenance problem. They shift it from infrastructure to ontology governance.
Every architectural shift moves the failure mode rather than eliminating it.
@AnthonyAlcaraz
AnthonyAlcaraz / linkedin-post-clean.txt
Created April 27, 2026 07:21
Collective Harness + Context Graph Substrate (Glean / Lindenberg) - LinkedIn 2026-04-27
A single-engineer harness is twelve OS-kernel primitives wired around one model. Sub-agents, todos, plan mode, output-style rules, hooks, MCP servers. André Lindenberg compressed that into 12 patterns from the Claude Code leak.
A collective harness is something else.
Put the same agent in front of fifty engineers, three sales teams, a legal review board, a customer success org. The load-bearing primitive stops being the loop. It becomes the graph that tells every agent who it is, who it is talking to, what that person decided yesterday, and which document is authoritative this week.
That graph is the harness for collective use.
Arvind Jain has been making this argument for a year. Context graphs for enterprise AI are the platform, not a feature. The follow-on podcast lands the sharper version. Context is strategic infrastructure, not model intelligence. The same model behind everyone's Glean session behaves differently per role because the graph routes per-role-context to the prompt before the call.
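A minimal sketch of that routing step, with a hypothetical schema standing in for Glean's platform:

```python
# Minimal sketch: route per-role context from a graph into the prompt
# before the model call. Hypothetical schema, for illustration only.
import networkx as nx

g = nx.DiGraph()
g.add_node("alice", role="sales")
g.add_edge("alice", "deal-review", relation="decided", when="yesterday")
g.add_edge("sales", "pricing-doc-v7", relation="authoritative")

def context_for(user: str) -> str:
    role = g.nodes[user].get("role", "")
    # Pull every fact attached to this user or this role.
    facts = [f"{u} -{d['relation']}-> {v}"
             for u, v, d in g.edges(data=True)
             if u in (user, role)]
    return f"You are assisting {user} ({role}).\n" + "\n".join(facts)

prompt = context_for("alice") + "\nQuestion: which pricing doc applies?"
```

Same model, different caller, different prompt: the graph, not the weights, is what makes the behavior per-role.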
@AnthonyAlcaraz
AnthonyAlcaraz / linkedin-post-clean.txt
Created April 25, 2026 08:52
Graph in Agentic Harness Engineering: Five Surfaces, One Substrate - LinkedIn Post
Every load-bearing surface of an agentic harness wants to be a graph. Code reads through symbol tables. Memory holds entity-relation walks. Tools expose schema. Orchestration is a state machine. Skills compose through dependency provenance. Five surfaces, one substrate.
Harness engineering became a named discipline this month, with practitioner repositories aggregating nine recurring workflows across Claude Code, Codex, Cursor, and the agentic-CLI surface. Its substrate shows up at five layers in five different shapes.
**Layer 1. Code as graph.** The right primitive for repo-reading is the Tree-Sitter symbol table, not glob-and-grep. Three systems converge in 2026: code-graph-rag (Tree-sitter plus Memgraph plus MCP), code-review-graph (dependency graphs cutting Claude Code tokens on large repos), CodeRLM (Recursive Language Models walking symbol tables). A graph of definitions, calls, imports, and types tells the model which functions actually depend on which.
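A small sketch of Layer 1 using Python's stdlib `ast` as a stand-in for Tree-sitter (same idea, simpler parser): nodes are function definitions, edges are calls.

```python
# Sketch of code-as-graph: extract definition nodes and call edges.
# Python's `ast` stands in for Tree-sitter here.
import ast
import networkx as nx

def call_graph(source: str) -> nx.DiGraph:
    tree, g = ast.parse(source), nx.DiGraph()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            g.add_node(node.name)
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    g.add_edge(node.name, sub.func.id)
    return g

src = "def a():\n    return b()\n\ndef b():\n    return 1\n"
print(list(call_graph(src).edges()))  # [('a', 'b')]
```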
**Layer 2. Memory as graph.** Across five ope
@AnthonyAlcaraz
AnthonyAlcaraz / medium-article.md
Created April 24, 2026 16:58
Agentic in Recommendation Systems Is the Wrong Layer - Medium Article

title: "Agentic in Recommendation Systems Is the Wrong Layer"
date: 2026-04-24
type: medium
status: draft
featured-image: https://i.imgur.com/HpbzQ6c.jpeg
themes: [agentic-recommendation, decision-evolution-separation, AgenticRS, Netflix, Pinterest, TIGER, semantic-IDs, Foundation-models, StructMem]
vault-sources:

  • "2026-04-17 Best Practices LLM Agentic Scoring Recommendation Systems.md"
  • "2026-04-06 AgenticRS Alibaba.md"
@AnthonyAlcaraz
AnthonyAlcaraz / linkedin-post.md
Created April 24, 2026 16:58
Agentic in Recommendation Systems: Decision/Evolution Layer Separation - LinkedIn Post

"Agentic" in recommendation systems gets misread. People hear it as LLM-in-the-serving-path. The actual shift is system-level: separate the decision layer from the evolution layer, so production keeps its fast non-LLM hot path while the evolution layer rewrites the system.

Alibaba's AgenticRS paper (Jinxin Hu et al., arXiv:2603.26100) makes this explicit. Three criteria decide which modules qualify for agent status. Closed-loop formation. Independent evaluability. Evolvable decision space. Modules that fail any criterion stay pipeline components.

The architecture puts decision and evolution in parallel layers. Decision runs DCNv2 at 10K QPS, sub-100 ms. Evolution runs LLM-driven design search overnight. They share infrastructure for memory and scheduling. They never share a latency budget.
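A schematic of that separation, with hypothetical interfaces in place of AgenticRS's code:

```python
# Schematic of the decision/evolution split. Every interface here is
# a hypothetical stand-in, not AgenticRS's code.

def score(user: str, item: str, config: dict) -> float:
    # Stand-in for the fast learned ranker (DCNv2 in the post).
    return config["weight"] * (hash((user, item)) % 1000)

def decide(user: str, candidates: list[str], config: dict) -> str:
    # Hot path: no LLM call anywhere, sub-100 ms budget.
    return max(candidates, key=lambda c: score(user, c, config))

def offline_eval(cfg: dict, log: list) -> bool:
    return True  # stub for the independent evaluator

def evolve(config: dict, eval_log: list) -> dict:
    # Overnight path: an LLM-driven search proposes a variant, which
    # is promoted only if it wins on independent evaluation.
    proposal = {**config, "weight": config["weight"] * 1.05}
    return proposal if offline_eval(proposal, eval_log) else config
```

The point the sketch makes concrete: `decide` never blocks on `evolve`; the only thing that crosses the boundary is a promoted config.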

@Cameron Wolfe surfaced the evaluation instance at Netflix. Four specialized judge agents score synopses on precision, factuality, tone, clarity. Combined result: 83.95% accuracy vs 72.5% for a single general judge. The +11

@AnthonyAlcaraz
AnthonyAlcaraz / Graphs World Models Planning - Agentic Graph RAG Composition - LinkedIn.md
Created April 24, 2026 06:58
Graphs + World Models + Planning = Agentic Graph RAG - LinkedIn Post

Planning without a world model is branching without foresight. World models without graph structure are foresight without memory. The agents that plan long-horizon have both.

For three years LLM agents have tried to plan with prompting alone. Chain-of-thought. ReAct. Tree-of-thought. It works for one or two steps. Past ten steps, the agent drifts, loses the thread, repeats prior mistakes. Not a prompting problem. The substrate is wrong.

Three components need to compose.

First, a graph holds the state. Nodes are entities, edges are causal and temporal relations. Walks encode trajectories. The graph is working memory and long-term memory in one representation, which is why a single retrieval returns the recent-similar and the causal predecessor, not just the cosine-similar chunk.
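A sketch of that retrieval claim, hypothetical structure only: one query returns both the embedding-similar node and its causal predecessors, which a flat vector store cannot do in a single lookup.

```python
# Sketch: one retrieval over a state graph returns the cosine-similar
# node AND its causal predecessors. Hypothetical structure.
import numpy as np
import networkx as nx

g = nx.DiGraph()
vecs: dict[str, np.ndarray] = {}

def add_state(name: str, vec, causes: list[str] = ()):
    vecs[name] = np.asarray(vec, dtype=float)
    g.add_node(name)
    for c in causes:
        g.add_edge(c, name, relation="causes")

def retrieve(query_vec):
    q = np.asarray(query_vec, dtype=float)
    best = max(vecs, key=lambda n: vecs[n] @ q /
               (np.linalg.norm(vecs[n]) * np.linalg.norm(q)))
    return best, list(g.predecessors(best))  # similar + causal context

add_state("door_locked", [1, 0])
add_state("alarm_on", [0, 1], causes=["door_locked"])
print(retrieve([0.1, 1]))  # ('alarm_on', ['door_locked'])
```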

Second, a world model predicts forward over the graph. Given current state, which next states are reachable, at what probability and cost. The world model must predict in representation space, not pixel or token space. That is @Yann LeC