Deep-research framework where the primary builder is a coding agent, not a human developer.
Status: Design proposal. Not yet implemented. Current @steel-dev/atlas (v0.1.x) ships a single opinionated research() function and CLI. This document specifies how to evolve it into an SDK for building domain-specific researchers, with Agent Experience (AX) as the primary design constraint.
The consumer of this SDK is a coding agent (Claude Code, Cursor, Codex, Devin, custom pipelines). A human says "build me a competitive-intel researcher for AI startups." The agent does the wiring — picks tools, writes prompts, defines schemas, runs evals, ships a CLI.
This is a different design problem from a human-targeted SDK. Almost every choice should be re-evaluated through the lens: does this help or hurt an agent who has never seen this library before, has read only its types, and is iterating against an eval suite?
AX is to coding agents what DX is to humans. The principles diverge in load-bearing ways:
| Conventional DX | AX |
|---|---|
| Prose tutorials, blog posts | Type signatures + working examples in examples/ |
| "There are 3 ways to do X — pick what fits" | One canonical path; alternatives are a footgun |
| Permissive — silently coerce bad input | Strict — fail loudly with structured errors |
| Hide complexity behind magic | Surface state in files the agent can read and diff |
| Stable for years | Stable across model training snapshots |
| Inline string config | Filesystem as API — one concept per file |
| Helpful warnings in console | Machine-readable error objects with hint fields |
| Optimize for first-impression demos | Optimize for the third iteration after eval failure |
1. Types are the docs. Every public symbol has a precise type. The agent reads `.d.ts`, not `docs/`. JSDoc on every public field, in imperative voice.
2. Filesystem is the schema. `tools/`, `schemas/`, `prompts/` aren't conventions; they're the API. The shape of a researcher is legible from `ls -R`.
3. One canonical path per task. No three ways to register a tool. Optional fields exist for tuning, not for shape.
4. Loud structured failures. Every error is a typed object with `code`, `message`, `hint`. Hints are written for the next-token predictor: imperative, specific, fixable in one edit.
5. Schemas at every boundary. Tool inputs, tool outputs, researcher outputs, eval expectations: all zod. The agent cannot ship malformed data.
6. Idempotent, deterministic dev loop. `npx <researcher> eval` is the agent's inner loop. Results must be diffable. Same query + same model + same seed → same output (within budget).
7. No hidden state. No undocumented cache directories. Persistent state (Steel sessions, eval baselines) lives in named, declared files the agent owns.
8. Markdown for prose, code for code. Prompts are `.md`. Schemas are `.ts`. Never embed multi-paragraph strings inside `.ts` config.
9. Stable API across model snapshots. Once a primitive ships, breaking changes require a major version. The agent's training-data cutoff is your real compatibility contract.
10. Telemetry is part of the API. Every run emits a structured event stream the agent can replay, diff, and reason about.
These principles are not aspirational. Every API decision below cites the principle(s) it satisfies.
A researcher is:
- A typed function `(query: string) => Promise<Result<TSchema>>`
- Backed by an LLM agent loop with access to a fixed set of tools
- Producing data conforming to a declared zod schema, with per-field citations
- Plus a Markdown narrative fallback for human consumption
Atlas provides (the runtime):
- The agent loop (multi-turn tool use against Claude)
- Default tools: `search`, `inspect`, `fetch` (web + Steel browser fallback)
- The schema-bound extraction step (new component, §6)
- The eval harness (new, §9)
- The CLI wrapper (auto-generated, §8)
- Cost/budget/cancellation primitives
- Structured telemetry (§11)
The builder provides (the researcher):
- A zod schema (the output contract)
- A markdown prompt (the persona/priorities)
- 0..N domain-specific tools
- 1..N seed eval cases
Atlas does not own the researcher's business logic. The researcher does not own the agent loop. The seam is defineResearcher().
The directory structure is the API surface. This is principle #2. An agent who runs ls -R on a researcher must be able to enumerate everything it does.
my-researcher/
├── atlas.config.ts # the single entry point — exports a Researcher
├── package.json # `bin` field auto-generated by `create-atlas-researcher`
├── tsconfig.json # standard
├── schemas/
│ ├── output.ts # the researcher's primary output schema
│ └── *.ts # additional shared schemas
├── prompts/
│ ├── system.md # REQUIRED — agent persona, priorities, stop conditions
│ └── extract.md # OPTIONAL — override the schema-extraction prompt
├── tools/
│ └── *.ts # one tool per file, default export is the tool
├── evals/
│ ├── seed.jsonl # eval cases — one JSON per line
│ └── baselines/ # committed reference outputs (optional, gitignored by default)
├── .atlas/
│ └── manifest.json # generated; describes the resolved runtime
└── README.md # auto-generated, regenerated on `npx atlas sync`
- No top-level `src/`. Flat directories beat deep nesting for agent navigation.
- One concept per file. Tools, schemas, prompts each get their own file. The agent edits one file per change.
- No barrel files. `tools/index.ts` is forbidden: agents add a tool by creating a file, not by editing two.
- Generated files declare themselves. Every auto-generated file starts with `// GENERATED — do not edit. Regenerate with: npx atlas sync.`
The single entry point. Imports siblings, exports a Researcher.
import { defineResearcher } from "@steel-dev/atlas";
import { companyProfile } from "./schemas/output.js";
import { linkedinCompany } from "./tools/linkedin.js";
import { crunchbase } from "./tools/crunchbase.js";
import systemPrompt from "./prompts/system.md" with { type: "text" };
export default defineResearcher({
name: "ai-startup-intel",
version: "0.1.0",
schema: companyProfile,
prompt: systemPrompt,
tools: [linkedinCompany, crunchbase],
// Defaults are included unless explicitly disabled:
defaults: { search: true, fetch: true, inspect: true },
budget: {
maxToolCalls: 30,
maxUsd: 2,
maxWallClockSec: 300,
},
models: {
gather: "claude-sonnet-4-6", // tool-using agent
extract: "claude-sonnet-4-6", // schema-bound extraction
narrative: "claude-sonnet-4-6", // optional markdown fallback
},
});

interface Researcher<TSchema extends z.ZodTypeAny> {
/** Run the researcher. Primary entry point. */
run(opts: RunOptions): Promise<ResearchResult<TSchema>>;
/** Adapt this researcher into a tool another researcher can call. */
asTool(opts?: { name?: string; description?: string }): ToolDefinition;
/** Run the eval suite at evals/seed.jsonl. */
eval(opts?: EvalOptions): Promise<EvalReport>;
/** Introspect the resolved runtime (tools, models, budget). Used by `npx atlas sync` and by the agent for self-inspection. */
manifest(): ResearcherManifest;
}

interface RunOptions {
query: string;
signal?: AbortSignal;
/** Override budget for this run. */
budget?: Partial<Budget>;
/** Receive structured events. See §11. */
onEvent?: (e: ResearchEvent) => void;
/** Override which schema fields are required for this run. */
fields?: { include?: string[]; exclude?: string[] };
/** Also generate the Markdown narrative returned as `markdown`. */
narrative?: boolean;
}

interface ResearchResult<TSchema extends z.ZodTypeAny> {
/** Schema-bound output. Typed. Validated. Always present. */
data: z.infer<TSchema>;
/** Per-field citation map: dotted-path → source IDs. */
citations: Record<string, number[]>;
/** Optional narrative report. Generated only if `narrative: true` in RunOptions. */
markdown?: string;
/** Every source the agent committed. */
sources: CitedSource[];
/** Usage + cost. */
usage: UsageSummary;
/** Why the agent stopped. */
finish_reason: "complete" | "budget_exhausted" | "tool_limit" | "schema_satisfied" | "cancelled";
}

Rationale: Returning `data` and `citations` separately (rather than embedding `_citations` in the schema) keeps the user's schema clean. Principle #5.
const pubmedSearch = defineTool({
name: "pubmed_search",
description:
"Search PubMed for peer-reviewed medical literature. Returns up to 20 results " +
"with title, authors, year, DOI, abstract. Use this BEFORE web search for " +
"any clinical or pharmacological claim.",
input: z.object({
query: z.string().describe("PubMed-style query, e.g. 'GLP-1 AND cardiovascular'"),
years: z.tuple([z.number(), z.number()]).optional().describe("Publication year range"),
limit: z.number().int().min(1).max(20).default(10),
}),
output: z.array(z.object({
title: z.string(),
authors: z.array(z.string()),
year: z.number(),
doi: z.string().nullable(),
abstract: z.string(),
url: z.string().url(),
})),
run: async (input, ctx) => { /* hit E-utilities */ },
});

Required: `name`, `description`, `input`, `run`.
Optional: `output` (recommended; enables typed downstream use), `cost` (USD-per-call hint, for budget tracking).
The description is what the agent reads to decide when to use the tool. It is the most important string in the entire researcher. Conventions enforced by lint (§13):
- Start with a verb.
- Mention when to prefer this tool over alternatives.
- Mention common failure modes.
- 1-3 sentences. No examples (those go in `examples/`).
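A rough sketch of how part of these conventions could be checked mechanically. Only the sentence count and the "no examples" rule come from the list above; the `lintDescription` name, the opener word list, and the regexes are assumptions made for illustration, and "starts with a verb" is approximated by rejecting a few common non-verb openers.

```typescript
// Heuristic lint sketch. Names and word lists are hypothetical, not Atlas API.
const NON_VERB_OPENERS = ["a", "an", "the", "this", "it", "tool"];

/** Return a list of imperative problem strings; an empty list means the description passes. */
function lintDescription(description: string): string[] {
  const problems: string[] = [];

  // Split on sentence-ending punctuation followed by whitespace or end of string.
  const sentences = description
    .split(/[.!?]+(?:\s+|$)/)
    .filter((s) => s.trim() !== "");
  if (sentences.length < 1 || sentences.length > 3) {
    problems.push(`Use 1-3 sentences; found ${sentences.length}.`);
  }

  // Approximation of "start with a verb": reject articles and pronouns as the first word.
  const firstWord = (description.trim().split(/\s+/)[0] ?? "").toLowerCase();
  if (NON_VERB_OPENERS.includes(firstWord)) {
    problems.push(`Start with a verb, not "${firstWord}".`);
  }

  // Examples belong in examples/, not in the description.
  if (/\be\.g\.|for example/i.test(description)) {
    problems.push("Move examples out of the description into examples/.");
  }
  return problems;
}
```

Each problem string is phrased as an edit the agent can apply directly, in the same spirit as the error hints.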
The Steel-native tool factory. The unfair advantage.
const linkedinCompany = defineBrowserTool({
name: "linkedin_company",
description:
"Fetch a LinkedIn company page. Use for authoritative employee count and " +
"recent hiring signals. Falls back to public-only data if no session is bound.",
input: z.object({
handle: z.string().describe('LinkedIn handle, e.g. "anthropicai"'),
}),
output: z.object({
employee_count: z.string().nullable(),
headline: z.string().nullable(),
recent_posts: z.array(z.string()).max(10),
}),
session: "linkedin-prod", // named, persistent Steel session
run: async ({ handle }, { page }) => {
await page.goto(`https://linkedin.com/company/${handle}`);
return {
employee_count: await page.textOrNull('[data-test=employee-count]'),
headline: await page.textOrNull('h2.org-top-card-summary__tagline'),
recent_posts: await page.allText('.org-update-card', { limit: 10 }),
};
},
});

The `page` context is a thin, typed wrapper around Steel. It exposes `goto`, `text`, `textOrNull`, `allText`, `attr`, `screenshot`, and `waitFor`, plus a sparingly used `evaluate` escape hatch. It does NOT expose raw Playwright. Principle #3: one path.
`session: "linkedin-prod"` references a named Steel session, managed at the Steel API level. The agent never sees credentials. If the session doesn't exist, the tool runs unauthenticated and reports `unauthenticated: true` in its output (a field auto-injected by the runtime).
Schemas are plain zod. Atlas adds two helpers:
import { z, citable, optional } from "@steel-dev/atlas/schema";
export const companyProfile = z.object({
name: citable(z.string()),
founded_year: citable(z.number().int().min(1900).max(2100).nullable()),
funding: z.object({
total_raised_usd: citable(z.number().nullable()),
}),
one_liner: citable(z.string().max(280)).describe(
"Plain-English description in ≤280 chars. Avoid marketing language. " +
"Prefer the company's own self-description if reliable."
),
});

- `citable(schema)` marks a field as requiring ≥1 source citation. The extraction step (§6) enforces this. Fields not wrapped in `citable()` can still be filled (e.g., derived values), but skip the citation check.
- `.describe(...)` is the field's prompt to the extractor. Treat it as a prompt fragment. The agent should write a thorough `.describe()` on every leaf field.
query
│
▼
┌─────────────┐
│ GATHER │ agent loop with tool access (search, fetch, custom tools)
│ │ budget-bounded, terminates when agent says "enough"
└──────┬──────┘
│ sources[], evidence[]
▼
┌─────────────┐
│ EXTRACT │ schema-bound: fill the zod schema from gathered evidence
│ │ uses Anthropic structured outputs + per-field citations
└──────┬──────┘
│ data, citations
▼
┌─────────────┐
│ VALIDATE │ zod parse + citation completeness + budget reconciliation
└──────┬──────┘
│
├──► narrative? (optional) ──► markdown
│
▼
ResearchResult
Each phase is observable, cancellable, and emits structured events.
Current Atlas (v0.1) collapses gather + write into one agent. This spec splits them. The split is the most important architectural change.
This is the part that doesn't exist in current Atlas. It deserves the most detail.
The gather agent collects a pool of source pages. We need to fill a zod schema from those pages, with citations per field, while:
- Respecting per-field `.describe()` hints
- Enforcing `citable()` constraints
- Handling fields that genuinely cannot be determined (return `null`, never hallucinate)
- Being explainable: the agent must be able to inspect why a field got a value
INPUT:
- query
- schema (zod)
- source pool: [{ n, url, title, markdown }]
- field hints (from .describe())
- extraction prompt (prompts/extract.md, optional override)
PROCESS:
Stage 1: Field plan
The extractor model sees the schema as a flat list of leaf fields.
For each field it decides:
- Which sources are likely to contain the answer? (by [n])
- What's the confidence floor needed?
Output: { field_path: string, candidate_sources: number[] }[]
Stage 2: Per-field extraction
For each field, send Claude a focused prompt:
- Field path
- Field schema (zod → JSON schema)
- Field .describe() hint
- Only the candidate sources from Stage 1, packed
Use Anthropic's structured outputs to enforce the schema.
Return: { value, citations: [n...], confidence: "high"|"medium"|"low"|"unknown" }
Stage 3: Assembly
Merge field outputs into the full object.
Run zod parse.
Verify citable() constraints.
If any field is "unknown" AND required → set null AND record a "low_confidence" note.
OUTPUT:
- data: z.infer<TSchema>
- citations: Record<field_path, source_ids[]>
- confidence_notes: Record<field_path, ConfidenceNote>
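The Stage 3 assembly step above can be sketched in plain TypeScript. This is a minimal illustration under stated assumptions, not the Atlas implementation: the `FieldResult` and `Assembled` shapes and the `assemble` helper are hypothetical, the zod parse is omitted, every `"unknown"` field is set to `null` (not only required ones), and the citation check is applied only to non-null values.

```typescript
// Hypothetical shapes for the per-field (Stage 2) output; not a shipped Atlas type.
type Confidence = "high" | "medium" | "low" | "unknown";

interface FieldResult {
  path: string; // dotted path, e.g. "funding.total_raised_usd"
  value: unknown;
  citations: number[]; // source ids [n]
  confidence: Confidence;
}

interface Assembled {
  data: Record<string, unknown>;
  citations: Record<string, number[]>;
  confidenceNotes: Record<string, string>;
  missingCitations: string[]; // citable paths with a value but no source
}

/** Write `value` at a dotted path, creating intermediate objects as needed. */
function setPath(target: Record<string, unknown>, path: string, value: unknown): void {
  const keys = path.split(".");
  const leaf = keys.pop() as string;
  let node = target;
  for (const key of keys) {
    const next = node[key];
    if (next === null || typeof next !== "object") node[key] = {};
    node = node[key] as Record<string, unknown>;
  }
  node[leaf] = value;
}

/** Merge per-field results into one object and check citable() constraints. */
function assemble(results: FieldResult[], citablePaths: Set<string>): Assembled {
  const out: Assembled = { data: {}, citations: {}, confidenceNotes: {}, missingCitations: [] };
  for (const r of results) {
    const isUnknown = r.confidence === "unknown";
    // Unknown fields become null rather than a guessed value.
    const value = isUnknown ? null : r.value;
    setPath(out.data, r.path, value);
    if (isUnknown) out.confidenceNotes[r.path] = "low_confidence";
    if (value !== null) {
      out.citations[r.path] = r.citations;
      if (citablePaths.has(r.path) && r.citations.length === 0) {
        out.missingCitations.push(r.path);
      }
    }
  }
  return out;
}
```

In a fuller version, each entry in `missingCitations` would presumably become a `missing_citation` error carrying a hint.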
- Token economy. Per-field extraction with a curated source subset is cheaper than asking Claude to fill 30 fields from 200K tokens of sources.
- Prompt caching. Stage 1 produces a stable plan that can be cached; Stage 2 issues parallel calls.
- Citations are structural, not narrative. Asking "which sources back this field" per-field gives a precise answer; asking once for everything produces lossy `[1, 3, 7]` lists.
- Diagnosability. When the agent's eval fails, "field X had only low-confidence sources" is actionable. "Schema validation failed" is not.
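Stage 1 treats the schema as a flat list of leaf fields. A small sketch of that flattening, assuming the zod schema has already been converted to JSON Schema. The `SchemaNode` shape is deliberately simplified (real converter output also uses keywords such as `anyOf` and `$ref`), and `leafFields` is a hypothetical name.

```typescript
// Simplified JSON Schema node; an assumption for illustration only.
interface SchemaNode {
  type?: string;
  description?: string; // populated from .describe()
  properties?: Record<string, SchemaNode>;
}

interface LeafField {
  path: string; // dotted path, e.g. "funding.total_raised_usd"
  hint: string | null; // the field's .describe() text, if any
}

/** Walk nested object schemas and return one entry per leaf field. Arrays count as leaves. */
function leafFields(node: SchemaNode, prefix = ""): LeafField[] {
  if (node.type !== "object" || !node.properties) {
    return [{ path: prefix, hint: node.description ?? null }];
  }
  const leaves: LeafField[] = [];
  for (const [key, child] of Object.entries(node.properties)) {
    leaves.push(...leafFields(child, prefix ? `${prefix}.${key}` : key));
  }
  return leaves;
}
```

Each leaf path doubles as the key in the citation map, so the field plan, the per-field calls, and the final citations all address fields the same way.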
type ExtractionError =
| { code: "schema_unsatisfiable"; field: string; reason: string; hint: string }
| { code: "missing_citation"; field: string; hint: string }
| { code: "no_candidate_sources"; field: string; hint: string }
| { code: "low_confidence_required_field"; field: string; hint: string };

Example hint string:
"Field 'funding.total_raised_usd' had no candidate sources.
Add a tool that hits Crunchbase or SEC filings, or relax the schema
by making this field .nullable()."
The agent reads this and acts. Principle #4.
The default extraction prompt is opinionated. Researchers in domains with unusual norms (legal, scientific) can refine it with `prompts/extract.md`:
# Extraction priorities
When citing legal precedent:
- Prefer the original opinion over secondary commentary.
- Cite by Bluebook format in the citation map, not just URL.
- If two sources conflict, prefer the more recent.

This file is appended to the default extraction prompt; it does not replace it. Principle #3: one path, with refinement.
Every researcher gets these unless explicitly disabled:
| Tool | Purpose | Backed by |
|---|---|---|
| `search` | Web search across providers (DDG default, fallback chain) | `src/search.ts` |
| `inspect` | Fetch a URL, return content, do NOT commit as source | plain-fetch → Steel fallback |
| `fetch` | Fetch a URL AND commit it as a cited source | plain-fetch → Steel fallback |
These are exposed identically to today's Atlas, so this spec inherits their contract.
defaults: { search: false, fetch: true, inspect: true }

For domain researchers where the public web is noise (e.g., "ask our docs" against a private corpus), disable `search` and provide only domain tools.
Out of scope for v1. The shape: mcpServers: [...] in config, tools auto-discovered. Mentioned only so the design doesn't preclude it.
Every researcher gets a CLI for free. create-atlas-researcher writes a 3-line shim:
// bin/cli.ts (generated)
import researcher from "../atlas.config.js";
import { runCli } from "@steel-dev/atlas/cli";
runCli(researcher);

Usage:
$ npx my-researcher "<query>" # primary
$ npx my-researcher "<query>" --json # data only, no narrative
$ npx my-researcher "<query>" --markdown # narrative + data
$ npx my-researcher "<query>" --out out.json
$ npx my-researcher eval # run evals/seed.jsonl
$ npx my-researcher eval --update-baselines # commit current outputs as baseline
$ npx my-researcher sync # regenerate README.md + .atlas/manifest.json
$ npx my-researcher inspect # print resolved config (tools, models, budgets)

`--json` is the default for piping into other agents. Principle #5: schema-validated output is the first-class shape.
Evals are not a nice-to-have. They are the only feedback loop the agent has to know if its researcher works. Principle #6.
One case per line. Each case has a query and an expect block:
{"query": "Profile of Anthropic", "expect": {"founded_year": 2021, "hq_location": "San Francisco"}}
{"query": "Profile of Mercor", "expect": {"product_category": {"includes": "AI hiring"}}}
{"query": "Profile of Mistral", "expect": {"funding.last_round.stage": {"in": ["series-a","series-b"]}}}

Each leaf in `expect` is one of:
- A literal value: `2021` → strict equality
- An object with a matcher key:
  - `{"includes": x}`: array contains x
  - `{"in": [x, y]}`: value is one of
  - `{"matches": "regex"}`: string regex match
  - `{"approx": n, "tolerance": 0.1}`: numeric tolerance
  - `{"not_null": true}`: value exists
  - `{"semantically": "description", "model": "claude-haiku"}`: LLM-as-judge with a small model
Dotted paths address nested fields. The grammar is small on purpose — the agent should always know which matcher applies. Principle #3.
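A minimal sketch of how this grammar could be evaluated against a researcher's output. All names (`Matcher`, `getPath`, `matchLeaf`, `failedPaths`) are hypothetical. Two assumptions are made: `tolerance` is treated as relative (0.1 means within 10% of the target), and the LLM-judged `semantically` matcher is left out because it needs a model call.

```typescript
// Illustrative types for the matcher grammar; not a shipped Atlas API.
type Matcher =
  | { includes: unknown }
  | { in: unknown[] }
  | { matches: string }
  | { approx: number; tolerance: number }
  | { not_null: true };

type Expectation = Matcher | string | number | boolean | null;

/** Resolve a dotted path such as "funding.last_round.stage" against nested data. */
function getPath(data: unknown, path: string): unknown {
  let current: unknown = data;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

/** Return true when `actual` satisfies one expectation leaf. */
function matchLeaf(actual: unknown, expected: Expectation): boolean {
  if (expected === null || typeof expected !== "object") {
    return actual === expected; // literal value: strict equality
  }
  if ("includes" in expected) {
    return Array.isArray(actual) && actual.includes(expected.includes);
  }
  if ("in" in expected) return expected.in.includes(actual);
  if ("matches" in expected) {
    return typeof actual === "string" && new RegExp(expected.matches).test(actual);
  }
  if ("approx" in expected) {
    // Assumption: tolerance is a fraction of the target value.
    return (
      typeof actual === "number" &&
      Math.abs(actual - expected.approx) <= Math.abs(expected.approx) * expected.tolerance
    );
  }
  return actual !== null && actual !== undefined; // not_null
}

/** Check every dotted path in an `expect` block and return the paths that failed. */
function failedPaths(data: unknown, expect: Record<string, Expectation>): string[] {
  return Object.entries(expect)
    .filter(([path, expected]) => !matchLeaf(getPath(data, path), expected))
    .map(([path]) => path);
}
```

Because each leaf resolves to exactly one branch, a failure report can always name the path and the matcher that rejected it.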
$ npx my-researcher eval
Running 12 cases against ai-startup-intel@0.1.0…
✓ Anthropic schema:ok matches:4/4 $0.31 18s
✗ Mercor schema:ok matches:2/3 $0.44 22s
- product_category: expected includes "AI hiring", got ["talent matching"]
✗ Mistral schema:fail $0.12 8s
- funding.last_round: required, got null
hint: 'no_candidate_sources' — add a Crunchbase tool or relax schema
10/12 schema-valid · 8/12 expectations met · avg $0.34 · avg 19s · total $4.10

Exit code: non-zero on any failure. The agent loops:
loop:
edit a tool / prompt / schema
run `npx my-researcher eval`
read failures
→ repeat until green
$ npx my-researcher eval --update-baselines

Writes each case's output to `evals/baselines/<case-id>.json`. Subsequent eval runs diff against baselines for fields not covered by `expect`. Lets the agent track regression on non-asserted fields without committing to specific values.
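One way the baseline diff might work, as a sketch: flatten both outputs to dotted leaf paths, drop the paths already asserted in `expect`, and report the rest that changed. The helper names are hypothetical, and comparing leaves through `JSON.stringify` is a simplification chosen for brevity.

```typescript
/** Flatten nested data into dotted leaf paths. Arrays are treated as leaves. */
function flatten(
  value: unknown,
  prefix = "",
  out: Record<string, unknown> = {},
): Record<string, unknown> {
  if (value !== null && typeof value === "object" && !Array.isArray(value)) {
    for (const [key, child] of Object.entries(value)) {
      flatten(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    out[prefix] = value;
  }
  return out;
}

/** Return the non-asserted leaf paths whose value differs between baseline and current output. */
function baselineDrift(baseline: unknown, current: unknown, assertedPaths: string[]): string[] {
  const before = flatten(baseline);
  const after = flatten(current);
  const paths = Array.from(new Set([...Object.keys(before), ...Object.keys(after)]));
  return paths
    .filter((path) => !assertedPaths.includes(path))
    .filter((path) => JSON.stringify(before[path]) !== JSON.stringify(after[path]))
    .sort();
}
```

Sorting the result keeps the report stable between runs, which matters when the agent diffs one eval report against the previous one.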
Researchers are stochastic. The eval runner:
- Runs each case `N=1` by default; configurable to `N=3` for variance.
- Reports pass-rate, not a single boolean, when `N > 1`.
- Caches gather phase outputs by content hash so re-runs after prompt-only edits are cheap.
Every error thrown by the SDK is one of these:
type AtlasError =
| ConfigError // researcher misconfigured at boot
| ToolError // tool execution failed
| ExtractionError // schema-bound extraction failed (see §6.4)
| BudgetError // hit a budget cap
| RuntimeError; // anything else — generic, has a request_id
interface ConfigError {
code: "config_invalid";
message: string; // for humans
hint: string; // for the agent — imperative, specific
field?: string; // which config key
doc_anchor?: string; // a stable URL fragment
}

Examples of good vs bad hints:
BAD: "Schema validation failed"
GOOD: "Field 'team_size' is required but received null.
Either: (a) mark it .nullable() in schemas/output.ts,
or (b) add a tool that returns headcount data."
BAD: "Tool 'linkedin_company' returned malformed output"
GOOD: "Tool 'linkedin_company' returned { handle: 'x' } but its declared output
schema expects { employee_count, headline, recent_posts }.
Update tools/linkedin.ts to return matching keys, or update its `output:` schema."
The hint is the AX feature. Get this right and an agent can fix its own researcher.
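To make the second GOOD example concrete, here is a sketch of a helper that builds such a hint from a key mismatch. The `HintedError` shape mirrors the `code`/`message`/`hint` fields described above, but the `tool_output_mismatch` code and the `describeOutputMismatch` function are hypothetical and only illustrate the pattern.

```typescript
// Illustrative error shape; the concrete Atlas error types may differ.
interface HintedError {
  code: string;
  message: string; // for humans
  hint: string; // for the agent: names a file and a concrete edit
}

/** Compare a tool's returned keys with its declared output keys and build a fixable hint. */
function describeOutputMismatch(
  tool: string,
  file: string,
  declaredKeys: string[],
  returned: Record<string, unknown>,
): HintedError | null {
  const returnedKeys = Object.keys(returned);
  const missing = declaredKeys.filter((key) => !returnedKeys.includes(key));
  const unexpected = returnedKeys.filter((key) => !declaredKeys.includes(key));
  if (missing.length === 0 && unexpected.length === 0) return null;
  return {
    code: "tool_output_mismatch", // hypothetical code
    message: `Tool '${tool}' returned output that does not match its declared schema.`,
    hint:
      `Tool '${tool}' returned { ${returnedKeys.join(", ")} } but its declared output ` +
      `schema expects { ${declaredKeys.join(", ")} }. Update ${file} to return ` +
      `matching keys, or update its \`output:\` schema.`,
  };
}
```

The hint names the file to edit and both ways out, so the agent can act on it without re-deriving the cause from a stack trace.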
Every run emits a typed event stream:
type ResearchEvent =
// lifecycle
| { type: "run_started"; query: string; budget: Budget }
| { type: "run_finished"; result: ResearchResult }
// gather phase
| { type: "gather_started" }
| { type: "tool_call"; tool: string; input: unknown; call_id: string }
| { type: "tool_result"; call_id: string; ok: boolean; latency_ms: number }
| { type: "source_committed"; n: number; url: string; title: string }
| { type: "gather_finished"; sources: number; tool_calls: number }
// extract phase
| { type: "extract_started"; fields: number; sources: number }
| { type: "field_planned"; field: string; candidates: number[] }
| { type: "field_filled"; field: string; confidence: string; citations: number[] }
| { type: "extract_finished" }
// narrative (optional)
| { type: "narrative_started" }
| { type: "narrative_finished"; chars: number }
// failures
| { type: "error"; error: AtlasError };

Events are observable via `onEvent`, JSON-serializable, and replay-safe. The CLI's `--json` mode pipes them line-delimited to stderr.
$ npx my-researcher "<query>" --json 2> run.jsonl > out.json
$ npx atlas replay run.jsonl # pretty-print the run

The agent can ingest `run.jsonl` and reason about why a run produced a given output without re-running it.
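A sketch of what ingesting `run.jsonl` might look like from the agent's side. It reads only three of the event types listed above; the `RunSummary` shape and `summarizeRun` name are assumptions for illustration.

```typescript
// Hypothetical summary an agent might build from a recorded run.
interface RunSummary {
  toolCalls: Record<string, number>; // tool name → number of calls
  failedCalls: string[]; // call_ids whose tool_result had ok: false
  weakFields: string[]; // fields filled with "low" or "unknown" confidence
}

/** Parse line-delimited JSON events and collect the facts most useful for debugging. */
function summarizeRun(jsonl: string): RunSummary {
  const summary: RunSummary = { toolCalls: {}, failedCalls: [], weakFields: [] };
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const event = JSON.parse(line) as Record<string, unknown>;
    switch (event["type"]) {
      case "tool_call": {
        const tool = String(event["tool"]);
        summary.toolCalls[tool] = (summary.toolCalls[tool] ?? 0) + 1;
        break;
      }
      case "tool_result":
        if (event["ok"] === false) summary.failedCalls.push(String(event["call_id"]));
        break;
      case "field_filled":
        if (event["confidence"] === "low" || event["confidence"] === "unknown") {
          summary.weakFields.push(String(event["field"]));
        }
        break;
    }
  }
  return summary;
}
```

Event types the sketch does not recognize are skipped, so a log containing newer events still parses.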
Steel is the substrate. Atlas treats it as a first-class capability, not an opaque dependency.
Existing behavior: fetch/inspect use plain HTTP first, Steel only when the page requires it. Unchanged.
defineBrowserTool({ session: "<name>" }) references a session managed via Steel's session API. Sessions persist across runs. Atlas does not manage credentials.
# managed externally:
$ steel session create linkedin-prod --auth-flow ./flows/linkedin.ts

If a referenced session does not exist, the tool runs in unauthenticated mode and the runtime injects `{ unauthenticated: true }` into the tool's output. The agent should branch on this.
The page argument passed to defineBrowserTool run is a typed wrapper:
interface BrowserPage {
goto(url: string, opts?: { waitFor?: string }): Promise<void>;
text(selector: string): Promise<string>;
textOrNull(selector: string): Promise<string | null>;
allText(selector: string, opts?: { limit?: number }): Promise<string[]>;
attr(selector: string, name: string): Promise<string | null>;
screenshot(): Promise<Buffer>;
waitFor(selector: string, opts?: { timeoutMs?: number }): Promise<void>;
// Escape hatch — typed, but documented as "use sparingly":
evaluate<T>(fn: () => T): Promise<T>;
}

No raw Playwright. No raw CDP. One path. Principle #3.
import drugInteractionResearcher from "./drug-interactions/atlas.config.js";
export default defineResearcher({
// ...
tools: [
drugInteractionResearcher.asTool({
name: "check_drug_interactions",
description:
"Check known interactions for a list of drugs. Returns structured " +
"interaction data. Use before recommending any combination therapy.",
}),
],
});

`.asTool()`:
- Reuses the sub-researcher's output schema as the tool's output schema.
- Inherits the parent researcher's `signal` for cancellation.
- Accounts its cost against the parent's budget.
- Suppresses its own CLI/eval surface.
Composition is bounded:
- Default `maxDepth: 3` for sub-researcher calls.
- Cycles detected and rejected at config time (graph walk).
- `budget` is enforced cumulatively across the entire researcher tree.
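The depth and cycle rules can be sketched as a config-time graph walk. This assumes the composition can be represented as a map from researcher name to the sub-researchers it wraps with `asTool()`. The `CompositionGraph` type and the three functions are hypothetical, and cumulative budget enforcement is not covered.

```typescript
// Hypothetical composition graph: researcher name → sub-researchers it calls via asTool().
type CompositionGraph = Record<string, string[]>;

/** Depth-first walk that returns the first cycle found as a path, or null. */
function findCycle(graph: CompositionGraph, root: string): string[] | null {
  const stack: string[] = []; // researchers on the current path
  const done = new Set<string>(); // researchers whose subtrees are fully checked
  const visit = (name: string): string[] | null => {
    const at = stack.indexOf(name);
    if (at !== -1) return [...stack.slice(at), name];
    if (done.has(name)) return null;
    stack.push(name);
    for (const child of graph[name] ?? []) {
      const cycle = visit(child);
      if (cycle) return cycle;
    }
    stack.pop();
    done.add(name);
    return null;
  };
  return visit(root);
}

/** Longest chain of sub-researcher calls below `root` (0 = none). Assumes no cycles. */
function depthOf(graph: CompositionGraph, root: string): number {
  const children = graph[root] ?? [];
  if (children.length === 0) return 0;
  return 1 + Math.max(...children.map((child) => depthOf(graph, child)));
}

/** Combine both rules; return a hint string, or null when the composition is valid. */
function checkComposition(graph: CompositionGraph, root: string, maxDepth = 3): string | null {
  const cycle = findCycle(graph, root);
  if (cycle) return `Cycle detected: ${cycle.join(" -> ")}. Remove one asTool() edge.`;
  const depth = depthOf(graph, root);
  return depth > maxDepth ? `Composition depth ${depth} exceeds maxDepth ${maxDepth}.` : null;
}
```

Running the cycle check first matters: `depthOf` recurses without a visited set and would not terminate on a cyclic graph.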
$ npx create-atlas-researcher ai-startup-intel

Interactive prompts:
- Output kind? `structured-only` | `structured-with-narrative` | `narrative-only` (mapped to which models are enabled)
- Steel sessions needed? `no` | `yes` (list names)
- Eval set kind? `empty` | `seed-from-examples` (pulls 5 cases from a curated registry of public examples per domain hint)
Writes the full layout from §3, populated with working defaults. The agent's first action is npx <name> eval — the seed should pass on a blank profile so the agent has a known-good baseline.
create-atlas-researcher --template <name> skips prompts. Initial templates:
- `competitive-intel`
- `medical-literature`
- `legal-precedent`
- `internal-docs-rag` (assumes a corpus tool)
- `blank`
Templates are an AX-friendly form of presets — the agent gets a working researcher to mutate, not a blank config to fill. Principle #1.
| Current (v0.1) | Spec target | Notes |
|---|---|---|
| `research({ query })` single function | `defineResearcher(...).run({ query })` | Current `research()` stays as `presets.web.run({ query })` for compat |
| Markdown-only output | Schema + markdown | Markdown becomes optional fallback |
| Single gather agent writes report | Gather → Extract → Validate | The major architectural change |
| Tools hardcoded in `tools.ts` | `defineTool` + `defineBrowserTool` factories | Tools become public API |
| Models hardcoded constants | Per-phase model config | gather/extract/narrative |
| No eval harness | `evals/seed.jsonl` + CLI | New |
| No scaffold | `create-atlas-researcher` | New package |
| Steel as private impl detail | Steel sessions as named, declared resources | New session model |
The current research() survives as presets.web — it remains the "deep research that just works" baseline.
These are valuable but excluded from v1 to keep the surface small. Principle #3.
- MCP server support. Architecturally fine to add later. Not v1.
- Multi-model providers. Anthropic-only. OpenAI/Gemini support is a hard fork, not a v1 feature.
- Streaming partial schemas. Schema-bound extraction returns a complete object or fails. No streaming.
- A web UI / dashboard. CLI + filesystem only.
- Hosted runtime. Atlas is a library. Running researchers in production is the user's problem (with `signal` and structured events to make it tractable).
- Tool marketplaces. Tools are files in a repo. No registry.
These are unresolved and affect the design materially. Flagged for human review.
- Markdown imports. `import x from "./x.md" with { type: "text" }` is stage-3-ish but inconsistent across runtimes. Fallback: a build step that compiles `.md` → `.ts` exporting a string. Decision needed.
- Citation granularity. Currently per-field. Should it be per-clause (sentence-level)? Probably yes for narrative mode, no for structured. Confirm.
- Eval LLM-as-judge model. Defaulting to Haiku for cost. Should the judge model be configurable per-case? Probably yes, but that raises eval determinism concerns.
- Steel session ownership. Atlas references sessions by name but does not create them. Does Atlas need a `create-session` flow for AX, or does Steel CLI cover it? Probably Steel CLI.
- Schema migrations. When a researcher's schema changes, what happens to existing eval baselines? Need a migration story before researchers ship to production.
- Cost prediction. Today's budget is enforced reactively (kill on exceed). Should there be a "dry run" that estimates cost before spending? Useful but expensive to model accurately.
- The narrative phase. Schema is primary, narrative is fallback. But for a human-shaped product (a research report), is that the wrong default? Maybe `kind: "report" | "extract"` at the researcher level disambiguates.
- Multi-tenant Steel. If a researcher runs as a service serving many users, sessions need to be per-user, not per-researcher. Out of v1 scope but the API shape should not preclude it.
A v1 ship requires all of these. Each is testable.
- `npx create-atlas-researcher <name>` produces a working researcher in <30s.
- The blank researcher's `eval` command passes its seed cases.
- A coding agent (Claude Code, given only the README, the type declarations, and an example template) can add a new tool, update the schema, and pass evals without human intervention beyond the initial goal.
- Every error thrown by the SDK has a typed code and a `hint` that names a file path and a concrete edit.
- `run` is cancellable mid-tool-call via `AbortSignal`.
- Existing `@steel-dev/atlas` users can opt into the new pipeline without rewriting (via `presets.web`).
- The full event stream replays deterministically given the same source cache.
- All public types have JSDoc with at least one example each.
- `.d.ts` bundle size <30KB (agents read these).
End-to-end. This is the example the README points the agent at.
import { defineResearcher } from "@steel-dev/atlas";
import { companyProfile } from "./schemas/output.js";
import { linkedinCompany } from "./tools/linkedin.js";
import { crunchbase } from "./tools/crunchbase.js";
import systemPrompt from "./prompts/system.md" with { type: "text" };
export default defineResearcher({
name: "ai-startup-intel",
version: "0.1.0",
schema: companyProfile,
prompt: systemPrompt,
tools: [linkedinCompany, crunchbase],
budget: { maxToolCalls: 30, maxUsd: 2 },
});

import { z, citable } from "@steel-dev/atlas/schema";
export const companyProfile = z.object({
name: citable(z.string()),
url: citable(z.string().url()),
one_liner: citable(z.string().max(280)).describe(
"Plain-English description in ≤280 chars. Prefer the company's own self-description."
),
founded_year: citable(z.number().int().min(1900).max(2100).nullable()),
hq_location: citable(z.string().nullable()),
funding: z.object({
total_raised_usd: citable(z.number().nullable()),
last_round: citable(z.object({
stage: z.enum(["pre-seed","seed","series-a","series-b","series-c+","unknown"]),
amount_usd: z.number().nullable(),
date: z.string().nullable(),
lead_investor: z.string().nullable(),
}).nullable()),
}),
team_size: citable(z.number().int().nullable()),
product_category: citable(z.array(z.string()).max(5)),
differentiators: citable(z.array(z.string()).max(3)).describe(
"Specific, concrete differentiators. Avoid marketing language like 'AI-powered' or 'enterprise-grade'."
),
});

You are an AI-startup intelligence analyst working for a VC associate.
## Priorities (in order)
1. Verify funding numbers with at least 2 independent sources.
2. Prefer primary sources (company site, SEC filings, press releases) over aggregators.
3. If a field cannot be confidently determined, return null. Never guess.
4. Use `linkedin_company` for team size before falling back to web search.
5. Use `crunchbase` for funding before falling back to news articles.
## Stop conditions
- Every required field has a value or a justified null.
- Every non-null field has ≥1 candidate source for citation.

import { defineBrowserTool } from "@steel-dev/atlas";
import { z } from "zod";
export const linkedinCompany = defineBrowserTool({
name: "linkedin_company",
description:
"Fetch a LinkedIn company page. Use for authoritative employee count and recent " +
"hiring signals. Falls back to public-only data if no session is bound.",
input: z.object({
handle: z.string().describe('LinkedIn handle, e.g. "anthropicai"'),
}),
output: z.object({
employee_count: z.string().nullable(),
headline: z.string().nullable(),
recent_posts: z.array(z.string()).max(10),
}),
session: "linkedin-prod",
run: async ({ handle }, { page }) => {
await page.goto(`https://linkedin.com/company/${handle}`, {
waitFor: 'h1.org-top-card-summary__title',
});
return {
employee_count: await page.textOrNull('[data-test=employee-count]'),
headline: await page.textOrNull('h2.org-top-card-summary__tagline'),
recent_posts: await page.allText('.org-update-card', { limit: 10 }),
};
},
});

{"id": "anthropic", "query": "Profile of Anthropic", "expect": {"founded_year": 2021, "hq_location": "San Francisco", "product_category": {"includes": "AI assistant"}}}
{"id": "mercor", "query": "Profile of Mercor", "expect": {"product_category": {"includes": "AI hiring"}}}
{"id": "mistral", "query": "Profile of Mistral AI", "expect": {"hq_location": {"matches": "Paris|France"}}}

- Reads README → understands the layout.
- Runs `npx ai-startup-intel eval` → sees baseline pass rate.
- User says "add a check for whether the company has a public GitHub org."
- Agent:
  - Adds `github_org: citable(z.string().url().nullable())` to `schemas/output.ts`.
  - Writes `tools/github.ts` returning org metadata.
  - Updates `prompts/system.md` to mention GitHub.
  - Adds an eval case asserting `github_org` for one known company.
  - Runs `npx ai-startup-intel eval`.
  - Reads failures, iterates until green.
- Ships.
That loop — edit → eval → iterate — is the entire AX value proposition.
End of specification.