@nibzard
Created May 25, 2026 18:31
Atlas SDK Specification — deep-research framework designed for coding agents as primary builders (Agent Experience / AX as design north star)

Atlas SDK Specification

Deep-research framework where the primary builder is a coding agent, not a human developer.

Status: Design proposal. Not yet implemented. Current @steel-dev/atlas (v0.1.x) ships a single opinionated research() function and CLI. This document specifies how to evolve it into an SDK for building domain-specific researchers, with Agent Experience (AX) as the primary design constraint.


0. Why this exists

The consumer of this SDK is a coding agent (Claude Code, Cursor, Codex, Devin, custom pipelines). A human says "build me a competitive-intel researcher for AI startups." The agent does the wiring — picks tools, writes prompts, defines schemas, runs evals, ships a CLI.

This is a different design problem from a human-targeted SDK. Almost every choice should be re-evaluated through the lens: does this help or hurt an agent who has never seen this library before, has read only its types, and is iterating against an eval suite?


1. North Star: Agent Experience (AX)

AX is to coding agents what DX is to humans. The principles diverge in load-bearing ways:

| Conventional DX | AX |
|---|---|
| Prose tutorials, blog posts | Type signatures + working examples in examples/ |
| "There are 3 ways to do X — pick what fits" | One canonical path; alternatives are a footgun |
| Permissive — silently coerce bad input | Strict — fail loudly with structured errors |
| Hide complexity behind magic | Surface state in files the agent can read and diff |
| Stable for years | Stable across model training snapshots |
| Inline string config | Filesystem as API — one concept per file |
| Helpful warnings in console | Machine-readable error objects with hint fields |
| Optimize for first-impression demos | Optimize for the third iteration after eval failure |

1.1 AX principles (load-bearing — every later section must respect these)

  1. Types are the docs. Every public symbol has a precise type. The agent reads .d.ts, not docs/. JSDoc on every public field, in imperative voice.
  2. Filesystem is the schema. tools/, schemas/, prompts/ aren't conventions — they're the API. The shape of a researcher is legible from ls -R.
  3. One canonical path per task. No three ways to register a tool. Optional fields exist for tuning, not for shape.
  4. Loud structured failures. Every error is a typed object with code, message, hint. Hints are written for the next-token predictor: imperative, specific, fixable in one edit.
  5. Schemas at every boundary. Tool inputs, tool outputs, researcher outputs, eval expectations — all zod. The agent cannot ship malformed data.
  6. Idempotent, deterministic dev loop. npx <researcher> eval is the agent's inner loop. Results must be diffable. Same query + same model + same seed → same output (within budget).
  7. No hidden state. No undocumented cache directories. Persistent state (Steel sessions, eval baselines) lives in named, declared files the agent owns.
  8. Markdown for prose, code for code. Prompts are .md. Schemas are .ts. Never embed multi-paragraph strings inside .ts config.
  9. Stable API across model snapshots. Once a primitive ships, breaking changes require a major version. The agent's training-data cutoff is your real compatibility contract.
  10. Telemetry is part of the API. Every run emits a structured event stream the agent can replay, diff, and reason about.

These principles are not aspirational. Every API decision below cites the principle(s) it satisfies.


2. Mental Model

A researcher is:

  • A typed function (query: string) => Promise<Result<TSchema>>
  • Backed by an LLM agent loop with access to a fixed set of tools
  • Producing data conforming to a declared zod schema, with per-field citations
  • Plus a Markdown narrative fallback for human consumption

Atlas provides (the runtime):

  • The agent loop (multi-turn tool use against Claude)
  • Default tools — search, inspect, fetch (web + Steel browser fallback)
  • The schema-bound extraction step (new component, §6)
  • The eval harness (new, §9)
  • The CLI wrapper (auto-generated, §8)
  • Cost/budget/cancellation primitives
  • Structured telemetry (§11)

The builder provides (the researcher):

  • A zod schema (the output contract)
  • A markdown prompt (the persona/priorities)
  • 0..N domain-specific tools
  • 1..N seed eval cases

Atlas does not own the researcher's business logic. The researcher does not own the agent loop. The seam is defineResearcher().


3. Project Layout (the API)

The directory structure is the API surface. This is principle #2. An agent who runs ls -R on a researcher must be able to enumerate everything it does.

my-researcher/
├── atlas.config.ts          # the single entry point — exports a Researcher
├── package.json             # `bin` field auto-generated by `create-atlas-researcher`
├── tsconfig.json            # standard
├── schemas/
│   ├── output.ts            # the researcher's primary output schema
│   └── *.ts                 # additional shared schemas
├── prompts/
│   ├── system.md            # REQUIRED — agent persona, priorities, stop conditions
│   └── extract.md           # OPTIONAL — appended to the default schema-extraction prompt (§6.5)
├── tools/
│   └── *.ts                 # one tool per file, exported by name (see §19)
├── evals/
│   ├── seed.jsonl           # eval cases — one JSON per line
│   └── baselines/           # saved reference outputs (optional, gitignored by default)
├── .atlas/
│   └── manifest.json        # generated; describes the resolved runtime
└── README.md                # auto-generated, regenerated on `npx atlas sync`

3.1 Rules

  • No top-level src/. Flat directories beat deep nesting for agent navigation.
  • One concept per file. Tools, schemas, prompts each get their own file. The agent edits one file per change.
  • No barrel files. tools/index.ts is forbidden — agents add a tool by creating a file, not by editing two.
  • Generated files declare themselves. Every auto-generated file starts with // GENERATED — do not edit. Regenerate with: npx atlas sync.

4. Core Primitives

4.1 defineResearcher

The single entry point. Imports siblings, exports a Researcher.

import { defineResearcher } from "@steel-dev/atlas";
import { companyProfile } from "./schemas/output.js";
import { linkedinCompany } from "./tools/linkedin.js";
import { crunchbase } from "./tools/crunchbase.js";
import systemPrompt from "./prompts/system.md" with { type: "text" };

export default defineResearcher({
  name: "ai-startup-intel",
  version: "0.1.0",
  schema: companyProfile,
  prompt: systemPrompt,
  tools: [linkedinCompany, crunchbase],
  // Defaults are included unless explicitly disabled:
  defaults: { search: true, fetch: true, inspect: true },
  budget: {
    maxToolCalls: 30,
    maxUsd: 2,
    maxWallClockSec: 300,
  },
  models: {
    gather: "claude-sonnet-4-6",      // tool-using agent
    extract: "claude-sonnet-4-6",     // schema-bound extraction
    narrative: "claude-sonnet-4-6",   // optional markdown fallback
  },
});

Returns: Researcher<TSchema>

interface Researcher<TSchema extends z.ZodTypeAny> {
  /** Run the researcher. Primary entry point. */
  run(opts: RunOptions): Promise<ResearchResult<TSchema>>;

  /** Adapt this researcher into a tool another researcher can call. */
  asTool(opts?: { name?: string; description?: string }): ToolDefinition;

  /** Run the eval suite at evals/seed.jsonl. */
  eval(opts?: EvalOptions): Promise<EvalReport>;

  /** Introspect the resolved runtime (tools, models, budget). Used by `npx atlas sync` and by the agent for self-inspection. */
  manifest(): ResearcherManifest;
}

RunOptions

interface RunOptions {
  query: string;
  signal?: AbortSignal;
  /** Override budget for this run. */
  budget?: Partial<Budget>;
  /** Receive structured events. See §11. */
  onEvent?: (e: ResearchEvent) => void;
  /** Override which schema fields are required for this run. */
  fields?: { include?: string[]; exclude?: string[] };
  /** Also generate the Markdown narrative (ResearchResult.markdown). */
  narrative?: boolean;
}

ResearchResult<TSchema>

interface ResearchResult<TSchema extends z.ZodTypeAny> {
  /** Schema-bound output. Typed. Validated. Always present. */
  data: z.infer<TSchema>;
  /** Per-field citation map: dotted-path → source IDs. */
  citations: Record<string, number[]>;
  /** Optional narrative report. Generated only if `narrative: true` in RunOptions. */
  markdown?: string;
  /** Every source the agent committed. */
  sources: CitedSource[];
  /** Usage + cost. */
  usage: UsageSummary;
  /** Why the agent stopped. */
  finish_reason: "complete" | "budget_exhausted" | "tool_limit" | "schema_satisfied" | "cancelled";
}

Rationale: Returning data and citations separately (rather than embedding _citations in the schema) keeps the user's schema clean. Principle #5.
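
A minimal sketch of how a caller might resolve the citation map back to sources. The `ResultLike` and `CitedSource` shapes are trimmed-down assumptions (the spec does not define `CitedSource`; the `n`, `url`, `title` fields mirror the source pool in §6.2), and `sourcesFor` is a hypothetical helper, not part of the proposed API.

```typescript
// Assumed, simplified shapes for illustration only.
interface CitedSource {
  n: number; // source ID referenced from the citation map
  url: string;
  title: string;
}

interface ResultLike {
  data: unknown;
  citations: Record<string, number[]>; // dotted path -> source IDs
  sources: CitedSource[];
}

/** Resolve the sources backing one field, addressed by dotted path. */
function sourcesFor(result: ResultLike, fieldPath: string): CitedSource[] {
  const ids = result.citations[fieldPath] ?? [];
  const byId = new Map(result.sources.map((s) => [s.n, s] as const));
  return ids
    .map((id) => byId.get(id))
    .filter((s): s is CitedSource => s !== undefined);
}

// Made-up sample data.
const exampleResult: ResultLike = {
  data: { name: "Acme", funding: { total_raised_usd: 1000000 } },
  citations: { name: [1], "funding.total_raised_usd": [2, 3] },
  sources: [
    { n: 1, url: "https://example.com/about", title: "About" },
    { n: 2, url: "https://example.com/press", title: "Press release" },
    { n: 3, url: "https://example.org/filing", title: "Filing" },
  ],
};

console.log(sourcesFor(exampleResult, "funding.total_raised_usd").map((s) => s.url));
```

Because the map is keyed by dotted path, a nested field is looked up with the same string the eval grammar in §9.2 uses.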

4.2 defineTool

import { defineTool } from "@steel-dev/atlas";
import { z } from "zod";

const pubmedSearch = defineTool({
  name: "pubmed_search",
  description:
    "Search PubMed for peer-reviewed medical literature. Returns up to 20 results " +
    "with title, authors, year, DOI, abstract. Use this BEFORE web search for " +
    "any clinical or pharmacological claim.",
  input: z.object({
    query: z.string().describe("PubMed-style query, e.g. 'GLP-1 AND cardiovascular'"),
    years: z.tuple([z.number(), z.number()]).optional().describe("Publication year range"),
    limit: z.number().int().min(1).max(20).default(10),
  }),
  output: z.array(z.object({
    title: z.string(),
    authors: z.array(z.string()),
    year: z.number(),
    doi: z.string().nullable(),
    abstract: z.string(),
    url: z.string().url(),
  })),
  run: async (input, ctx) => { /* hit E-utilities */ },
});

Required: name, description, input, run. Optional: output (recommended — enables typed downstream use), cost (USD per call hint, for budget tracking).

The description is what the agent reads to decide when to use the tool. It is the most important string in the entire researcher. Conventions enforced by lint:

  • Start with a verb.
  • Mention when to prefer this tool over alternatives.
  • Mention common failure modes.
  • 1-3 sentences. No examples (those go in examples/).

4.3 defineBrowserTool

The Steel-native tool factory. The unfair advantage.

import { defineBrowserTool } from "@steel-dev/atlas";
import { z } from "zod";

const linkedinCompany = defineBrowserTool({
  name: "linkedin_company",
  description:
    "Fetch a LinkedIn company page. Use for authoritative employee count and " +
    "recent hiring signals. Falls back to public-only data if no session is bound.",
  input: z.object({
    handle: z.string().describe('LinkedIn handle, e.g. "anthropicai"'),
  }),
  output: z.object({
    employee_count: z.string().nullable(),
    headline: z.string().nullable(),
    recent_posts: z.array(z.string()).max(10),
  }),
  session: "linkedin-prod",   // named, persistent Steel session
  run: async ({ handle }, { page }) => {
    await page.goto(`https://linkedin.com/company/${handle}`);
    return {
      employee_count: await page.textOrNull('[data-test=employee-count]'),
      headline: await page.textOrNull('h2.org-top-card-summary__tagline'),
      recent_posts: await page.allText('.org-update-card', { limit: 10 }),
    };
  },
});

The page context is a thin, typed wrapper around Steel. It exposes goto, text, textOrNull, allText, attr, screenshot, and waitFor, plus the evaluate escape hatch listed in §12.3. It does NOT expose raw Playwright. Principle #3: one path.

session: "linkedin-prod" references a named Steel session, managed at the Steel API level. The agent never sees credentials. If the session doesn't exist, the tool runs unauthenticated and reports unauthenticated: true in its output (a field auto-injected by the runtime).

4.4 Schemas: zod, with field-level hints

Schemas are plain zod. Atlas adds the citable() helper and treats zod's own .describe() as a prompt fragment:

import { z, citable } from "@steel-dev/atlas/schema";

export const companyProfile = z.object({
  name: citable(z.string()),
  founded_year: citable(z.number().int().min(1900).max(2100).nullable()),
  funding: z.object({
    total_raised_usd: citable(z.number().nullable()),
  }),
  one_liner: citable(z.string().max(280)).describe(
    "Plain-English description in ≤280 chars. Avoid marketing language. " +
    "Prefer the company's own self-description if reliable."
  ),
});

  • citable(schema) marks a field as requiring ≥1 source citation. The extraction step (§6) enforces this. Fields not wrapped in citable() can still be filled (e.g., derived values), but skip the citation check.
  • .describe(...) is the field's prompt to the extractor. Treat it as a prompt fragment. The agent should write thorough .describe() on every leaf field.

5. The Pipeline

query
  │
  ▼
┌─────────────┐
│  GATHER     │  agent loop with tool access (search, fetch, custom tools)
│             │  budget-bounded, terminates when agent says "enough"
└──────┬──────┘
       │ sources[], evidence[]
       ▼
┌─────────────┐
│  EXTRACT    │  schema-bound: fill the zod schema from gathered evidence
│             │  uses Anthropic structured outputs + per-field citations
└──────┬──────┘
       │ data, citations
       ▼
┌─────────────┐
│  VALIDATE   │  zod parse + citation completeness + budget reconciliation
└──────┬──────┘
       │
       ├──► narrative? (optional) ──► markdown
       │
       ▼
   ResearchResult

Each phase is observable, cancellable, and emits structured events.

Current Atlas (v0.1) collapses gather + write into one agent. This spec splits them. The split is the most important architectural change.


6. Schema-bound Extraction (the new component)

This is the part that doesn't exist in current Atlas. It deserves the most detail.

6.1 The problem

The gather agent collects a pool of source pages. We need to fill a zod schema from those pages, with citations per field, while:

  • Respecting per-field .describe() hints
  • Enforcing citable() constraints
  • Handling fields that genuinely cannot be determined (return null, never hallucinate)
  • Being explainable — the agent must be able to inspect why a field got a value

6.2 The mechanism

INPUT:
  - query
  - schema (zod)
  - source pool: [{ n, url, title, markdown }]
  - field hints (from .describe())
  - extraction prompt (prompts/extract.md, optional override)

PROCESS:
  Stage 1: Field plan
    The extractor model sees the schema as a flat list of leaf fields.
    For each field it decides:
      - Which sources are likely to contain the answer? (by [n])
      - What's the confidence floor needed?
    Output: { field_path: string, candidate_sources: number[] }[]

  Stage 2: Per-field extraction
    For each field, send Claude a focused prompt:
      - Field path
      - Field schema (zod → JSON schema)
      - Field .describe() hint
      - Only the candidate sources from Stage 1, packed
    Use Anthropic's structured outputs to enforce the schema.
    Return: { value, citations: [n...], confidence: "high"|"medium"|"low"|"unknown" }

  Stage 3: Assembly
    Merge field outputs into the full object.
    Run zod parse.
    Verify citable() constraints.
    If any field is "unknown" AND required → set null AND record a "low_confidence" note.

OUTPUT:
  - data: z.infer<TSchema>
  - citations: Record<field_path, source_ids[]>
  - confidence_notes: Record<field_path, ConfidenceNote>
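
Stage 3 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the proposed implementation: `FieldOutput`, `assemble`, and `setPath` are hypothetical names, the zod parse step is omitted so the block has no dependencies, and it assumes a citable field only needs a citation when its value is non-null.

```typescript
type Confidence = "high" | "medium" | "low" | "unknown";

// Hypothetical shape of one Stage 2 result.
interface FieldOutput {
  path: string; // dotted path, e.g. "funding.total_raised_usd"
  value: unknown;
  citations: number[];
  confidence: Confidence;
}

interface Assembled {
  data: Record<string, unknown>;
  citations: Record<string, number[]>;
  errors: { code: "missing_citation"; field: string; hint: string }[];
}

/** Write `value` into `target` at a dotted path, creating objects as needed. */
function setPath(target: Record<string, unknown>, path: string, value: unknown): void {
  const keys = path.split(".");
  const leaf = keys.pop() ?? path;
  let node = target;
  for (const key of keys) {
    const next = node[key];
    if (typeof next !== "object" || next === null) node[key] = {};
    node = node[key] as Record<string, unknown>;
  }
  node[leaf] = value;
}

/** Merge per-field outputs and check citable() constraints (zod parse omitted). */
function assemble(fields: FieldOutput[], citablePaths: Set<string>): Assembled {
  const out: Assembled = { data: {}, citations: {}, errors: [] };
  for (const f of fields) {
    // An "unknown" field becomes null rather than a guess.
    const value = f.confidence === "unknown" ? null : f.value;
    setPath(out.data, f.path, value);
    out.citations[f.path] = f.citations;
    // Assumption: only non-null citable fields need a citation.
    if (citablePaths.has(f.path) && value !== null && f.citations.length === 0) {
      out.errors.push({
        code: "missing_citation",
        field: f.path,
        hint: `Field '${f.path}' has a value but no citation. Add a source-producing tool or drop citable() from this field.`,
      });
    }
  }
  return out;
}

const assembled = assemble(
  [
    { path: "name", value: "Acme", citations: [1], confidence: "high" },
    { path: "funding.total_raised_usd", value: 5000000, citations: [], confidence: "medium" },
    { path: "founded_year", value: 1999, citations: [], confidence: "unknown" },
  ],
  new Set(["name", "funding.total_raised_usd", "founded_year"]),
);
console.log(JSON.stringify(assembled, null, 2));
```

In this sketch the errors are collected rather than thrown, so one run can report every field that needs attention.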

6.3 Why plan-then-extract (Stages 1 and 2) instead of one mega-extraction

  • Token economy. Per-field extraction with a curated source subset is cheaper than asking Claude to fill 30 fields from 200K tokens of sources.
  • Prompt caching. Stage 1 produces a stable plan that can be cached; Stage 2 issues parallel calls.
  • Citations are structural, not narrative. Asking "which sources back this field" per-field gives a precise answer; asking once for everything produces lossy [1, 3, 7] lists.
  • Diagnosability. When the agent's eval fails, "field X had only low-confidence sources" is actionable. "Schema validation failed" is not.

6.4 Failure modes (typed)

type ExtractionError =
  | { code: "schema_unsatisfiable"; field: string; reason: string; hint: string }
  | { code: "missing_citation"; field: string; hint: string }
  | { code: "no_candidate_sources"; field: string; hint: string }
  | { code: "low_confidence_required_field"; field: string; hint: string };

Example hint string:

"Field 'funding.total_raised_usd' had no candidate sources.
 Add a tool that hits Crunchbase or SEC filings, or relax the schema
 by making this field .nullable()."

The agent reads this and acts. Principle #4.

6.5 Override via prompts/extract.md

The default extraction prompt is opinionated. Researchers in domains with unusual norms (legal, scientific) can override it:

# Extraction priorities

When citing legal precedent:
- Prefer the original opinion over secondary commentary.
- Cite by Bluebook format in the citation map, not just URL.
- If two sources conflict, prefer the more recent.

This file is appended to the default extraction prompt; it does not replace it. Principle #3: one path, with refinement.


7. Default Tools

Every researcher gets these unless explicitly disabled:

| Tool | Purpose | Backed by |
|---|---|---|
| search | Web search across providers (DDG default, fallback chain) | src/search.ts |
| inspect | Fetch a URL, return content, do NOT commit as source | plain-fetch → Steel fallback |
| fetch | Fetch a URL AND commit it as a cited source | plain-fetch → Steel fallback |

These are exposed identically to today's Atlas, so this spec inherits their contract.

7.1 Disabling defaults

defaults: { search: false, fetch: true, inspect: true }

For domain researchers where the public web is noise (e.g., "ask our docs" against a private corpus), disable search and provide only domain tools.

7.2 MCP tools (future, §16)

Out of scope for v1. The shape: mcpServers: [...] in config, tools auto-discovered. Mentioned only so the design doesn't preclude it.


8. CLI Surface (auto-generated)

Every researcher gets a CLI for free. create-atlas-researcher writes a 3-line shim:

// bin/cli.ts (generated)
import researcher from "../atlas.config.js";
import { runCli } from "@steel-dev/atlas/cli";
runCli(researcher);

Usage:

$ npx my-researcher "<query>"                  # primary
$ npx my-researcher "<query>" --json           # data only, no narrative
$ npx my-researcher "<query>" --markdown       # narrative + data
$ npx my-researcher "<query>" --out out.json
$ npx my-researcher eval                       # run evals/seed.jsonl
$ npx my-researcher eval --update-baselines    # commit current outputs as baseline
$ npx my-researcher sync                       # regenerate README.md + .atlas/manifest.json
$ npx my-researcher inspect                    # print resolved config (tools, models, budgets)

--json is the default for piping into other agents. Principle #5 — schema-validated output is the first-class shape.


9. Evals (the inner loop)

Evals are not a nice-to-have. They are the only feedback loop the agent has to know if its researcher works. Principle #6.

9.1 Seed file: evals/seed.jsonl

One case per line. Each case has a query and an expect block, plus an id that names the case (baselines in §9.4 are keyed by it):

{"id": "anthropic", "query": "Profile of Anthropic", "expect": {"founded_year": 2021, "hq_location": "San Francisco"}}
{"id": "mercor",    "query": "Profile of Mercor",    "expect": {"product_category": {"includes": "AI hiring"}}}
{"id": "mistral",   "query": "Profile of Mistral",   "expect": {"funding.last_round.stage": {"in": ["series-a","series-b"]}}}

9.2 Expectation grammar

Each leaf in expect is one of:

  • A literal value: 2021 → strict equality
  • An object with a matcher key:
    • {"includes": x} — array contains x
    • {"in": [x, y]} — value is one of
    • {"matches": "regex"} — string regex match
    • {"approx": n, "tolerance": 0.1} — numeric tolerance
    • {"not_null": true} — value exists
    • {"semantically": "description", "model": "claude-haiku"} — LLM-as-judge with a small model

Dotted paths address nested fields. The grammar is small on purpose — the agent should always know which matcher applies. Principle #3.
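
The grammar above is small enough to sketch in a few functions. This is a hedged illustration, not the specified runner: `getPath`, `check`, and `evaluateCase` are hypothetical names, `tolerance` is assumed to be relative (a fraction of `approx`), literals are assumed to be primitives, and the `semantically` matcher is left out because it needs a model call.

```typescript
/** Read a dotted path such as "funding.last_round.stage" from nested data. */
function getPath(data: unknown, path: string): unknown {
  let node: unknown = data;
  for (const key of path.split(".")) {
    if (typeof node !== "object" || node === null) return undefined;
    node = (node as Record<string, unknown>)[key];
  }
  return node;
}

/** Evaluate one expectation leaf against the actual value. */
function check(actual: unknown, expected: unknown): boolean {
  // Primitive literal: strict equality.
  if (typeof expected !== "object" || expected === null) return actual === expected;
  const m = expected as Record<string, unknown>;
  if ("includes" in m) return Array.isArray(actual) && actual.includes(m["includes"]);
  if ("in" in m) {
    const options = m["in"];
    return Array.isArray(options) && options.includes(actual);
  }
  if ("matches" in m) {
    const pattern = m["matches"];
    return typeof actual === "string" && typeof pattern === "string" && new RegExp(pattern).test(actual);
  }
  if ("approx" in m) {
    const target = m["approx"];
    const tol = m["tolerance"];
    if (typeof actual !== "number" || typeof target !== "number") return false;
    // Assumption: tolerance is relative to the target value.
    const allowed = Math.abs(target) * (typeof tol === "number" ? tol : 0);
    return Math.abs(actual - target) <= allowed;
  }
  if ("not_null" in m) return actual !== null && actual !== undefined;
  throw new Error(`Unknown matcher: ${JSON.stringify(expected)}`);
}

/** Check every dotted path in an expect block. */
function evaluateCase(data: unknown, expect: Record<string, unknown>): { path: string; ok: boolean }[] {
  return Object.entries(expect).map(([path, expected]) => ({
    path,
    ok: check(getPath(data, path), expected),
  }));
}

// Made-up output for illustration.
const sampleOutput = {
  founded_year: 2021,
  hq_location: "Paris, France",
  product_category: ["talent matching"],
  funding: { last_round: { stage: "series-b" } },
};

console.log(
  evaluateCase(sampleOutput, {
    founded_year: 2021,
    hq_location: { matches: "Paris|France" },
    product_category: { includes: "AI hiring" },
    "funding.last_round.stage": { in: ["series-a", "series-b"] },
  }),
);
```

Throwing on an unrecognized matcher key, rather than silently passing, is in the spirit of Principle #4.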

9.3 Running

$ npx my-researcher eval

Running 12 cases against ai-startup-intel@0.1.0…

  ✓ Anthropic                schema:ok   matches:4/4   $0.31  18s
  ✗ Mercor                   schema:ok   matches:2/3   $0.44  22s
      - product_category: expected includes "AI hiring", got ["talent matching"]
  ✗ Mistral                  schema:fail              $0.12   8s
      - funding.last_round: required, got null
        hint: 'no_candidate_sources' — add a Crunchbase tool or relax schema

10/12 schema-valid · 8/12 expectations met · avg $0.34 · avg 19s · total $4.10

Exit code: non-zero on any failure. The agent loops:

loop:
  edit a tool / prompt / schema
  run `npx my-researcher eval`
  read failures
  → repeat until green

9.4 Baselines

$ npx my-researcher eval --update-baselines

Writes each case's output to evals/baselines/<case-id>.json. Subsequent eval runs diff against baselines for fields not covered by expect. Lets the agent track regression on non-asserted fields without committing to specific values.
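
One way the baseline diff could work, as a sketch only: flatten both objects to dotted leaf paths, skip any path already covered by `expect`, and report the rest. The names `flatten` and `diffAgainstBaseline` are hypothetical, and treating arrays as leaves is an assumption made to keep the example short.

```typescript
/** Flatten nested objects to dotted leaf paths; arrays and null count as leaves. */
function flatten(
  value: unknown,
  prefix = "",
  out: Record<string, unknown> = {},
): Record<string, unknown> {
  if (typeof value === "object" && value !== null && !Array.isArray(value)) {
    for (const [k, v] of Object.entries(value)) {
      flatten(v, prefix ? `${prefix}.${k}` : k, out);
    }
  } else {
    out[prefix] = value;
  }
  return out;
}

interface FieldChange {
  path: string;
  baseline: unknown;
  current: unknown;
}

/** Report leaf fields that changed, ignoring paths already asserted in `expect`. */
function diffAgainstBaseline(
  current: unknown,
  baseline: unknown,
  assertedPaths: string[],
): FieldChange[] {
  const cur = flatten(current);
  const base = flatten(baseline);
  const isAsserted = (p: string): boolean =>
    assertedPaths.some((a) => p === a || p.startsWith(a + "."));
  const paths = Array.from(new Set(Object.keys(cur).concat(Object.keys(base)))).sort();
  const changes: FieldChange[] = [];
  for (const path of paths) {
    if (isAsserted(path)) continue;
    if (JSON.stringify(cur[path]) !== JSON.stringify(base[path])) {
      changes.push({ path, baseline: base[path], current: cur[path] });
    }
  }
  return changes;
}

console.log(
  diffAgainstBaseline(
    { founded_year: 2021, team_size: 1200, funding: { total_raised_usd: 7 } },
    { founded_year: 2021, team_size: 900, funding: { total_raised_usd: 7 } },
    ["founded_year"],
  ),
);
```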

9.5 Stochasticity

Researchers are stochastic. The eval runner:

  • Runs each case N=1 by default; configurable to N=3 for variance.
  • Reports pass-rate, not a single boolean, when N > 1.
  • Caches gather phase outputs by content hash so re-runs after prompt-only edits are cheap.
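
Pass-rate reporting for N > 1 might be aggregated as in this short sketch. `CaseRun` and `passRates` are hypothetical names; the spec does not define the report shape.

```typescript
// Hypothetical record of one execution of one eval case.
interface CaseRun {
  caseId: string;
  passed: boolean;
}

interface PassRate {
  passed: number;
  total: number;
  rate: number; // passed / total
}

/** Group repeated runs by case and compute a pass rate per case. */
function passRates(runs: CaseRun[]): Record<string, PassRate> {
  const out: Record<string, PassRate> = {};
  for (const run of runs) {
    const entry = out[run.caseId] ?? { passed: 0, total: 0, rate: 0 };
    entry.total += 1;
    if (run.passed) entry.passed += 1;
    entry.rate = entry.passed / entry.total;
    out[run.caseId] = entry;
  }
  return out;
}

console.log(
  passRates([
    { caseId: "anthropic", passed: true },
    { caseId: "anthropic", passed: true },
    { caseId: "anthropic", passed: true },
    { caseId: "mercor", passed: true },
    { caseId: "mercor", passed: false },
    { caseId: "mercor", passed: true },
  ]),
);
```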

10. Errors (designed for agent consumption)

Every error thrown by the SDK is one of these:

type AtlasError =
  | ConfigError         // researcher misconfigured at boot
  | ToolError           // tool execution failed
  | ExtractionError     // schema-bound extraction failed (see §6.4)
  | BudgetError         // hit a budget cap
  | RuntimeError;       // anything else — generic, has a request_id

interface ConfigError {
  code: "config_invalid";
  message: string;       // for humans
  hint: string;          // for the agent — imperative, specific
  field?: string;        // which config key
  doc_anchor?: string;   // a stable URL fragment
}

Examples of good vs bad hints:

BAD:  "Schema validation failed"
GOOD: "Field 'team_size' is required but received null.
       Either: (a) mark it .nullable() in schemas/output.ts,
       or (b) add a tool that returns headcount data."

BAD:  "Tool 'linkedin_company' returned malformed output"
GOOD: "Tool 'linkedin_company' returned { handle: 'x' } but its declared output
       schema expects { employee_count, headline, recent_posts }.
       Update tools/linkedin.ts to return matching keys, or update its `output:` schema."

The hint is the AX feature. Get this right and an agent can fix its own researcher.
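
As an illustration of the hint style, here is a sketch of a config check that returns `ConfigError` objects instead of logging warnings. `validateBudget` is a hypothetical function, and the advice to delete a key to fall back to a default assumes the runtime has budget defaults, which the spec does not state.

```typescript
interface ConfigError {
  code: "config_invalid";
  message: string; // for humans
  hint: string; // for the agent: imperative, specific
  field?: string;
  doc_anchor?: string;
}

interface Budget {
  maxToolCalls: number;
  maxUsd: number;
  maxWallClockSec: number;
}

/** Validate budget caps; every problem names the config key and a concrete edit. */
function validateBudget(budget: Partial<Budget>): ConfigError[] {
  const errors: ConfigError[] = [];
  const keys: (keyof Budget)[] = ["maxToolCalls", "maxUsd", "maxWallClockSec"];
  for (const key of keys) {
    const value = budget[key];
    if (value === undefined) continue; // unset: assumed to fall back to a default
    if (!Number.isFinite(value) || value <= 0) {
      errors.push({
        code: "config_invalid",
        field: `budget.${key}`,
        message: `budget.${key} must be a positive number, received ${value}.`,
        hint: `Set budget.${key} to a positive number in atlas.config.ts, or delete the key.`,
      });
    }
  }
  return errors;
}

console.log(validateBudget({ maxToolCalls: 30, maxUsd: -1 }));
```

The hint names the file and the exact key, which is what §18 asks of every SDK error.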


11. Telemetry (the observable run)

Every run emits a typed event stream:

type ResearchEvent =
  // lifecycle
  | { type: "run_started"; query: string; budget: Budget }
  | { type: "run_finished"; result: ResearchResult<z.ZodTypeAny> }
  // gather phase
  | { type: "gather_started" }
  | { type: "tool_call"; tool: string; input: unknown; call_id: string }
  | { type: "tool_result"; call_id: string; ok: boolean; latency_ms: number }
  | { type: "source_committed"; n: number; url: string; title: string }
  | { type: "gather_finished"; sources: number; tool_calls: number }
  // extract phase
  | { type: "extract_started"; fields: number; sources: number }
  | { type: "field_planned"; field: string; candidates: number[] }
  | { type: "field_filled"; field: string; confidence: string; citations: number[] }
  | { type: "extract_finished" }
  // narrative (optional)
  | { type: "narrative_started" }
  | { type: "narrative_finished"; chars: number }
  // failures
  | { type: "error"; error: AtlasError };

Events are observable via onEvent, JSON-serializable, replay-safe. The CLI's --json mode pipes them line-delimited to stderr.

11.1 Replay

$ npx my-researcher "<query>" --json 2> run.jsonl > out.json
$ npx atlas replay run.jsonl   # pretty-print the run

The agent can ingest run.jsonl and reason about why a run produced a given output without re-running it.
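
A sketch of what "reasoning about a run" from run.jsonl could look like: fold the `tool_call` and `tool_result` events from §11 into per-tool statistics. `summarizeTools` and `ToolStats` are hypothetical; only the two event shapes come from the event union above.

```typescript
interface ToolCallEvent {
  type: "tool_call";
  tool: string;
  input: unknown;
  call_id: string;
}

interface ToolResultEvent {
  type: "tool_result";
  call_id: string;
  ok: boolean;
  latency_ms: number;
}

interface ToolStats {
  calls: number;
  failures: number;
  totalLatencyMs: number;
}

/** Summarize tool usage from a line-delimited event log. */
function summarizeTools(jsonl: string): Record<string, ToolStats> {
  const toolByCallId: Record<string, string> = {};
  const stats: Record<string, ToolStats> = {};
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    // Typing simplification: any other event type falls through both branches.
    const event = JSON.parse(line) as ToolCallEvent | ToolResultEvent | { type: "other" };
    if (event.type === "tool_call") {
      toolByCallId[event.call_id] = event.tool;
      const s = stats[event.tool] ?? { calls: 0, failures: 0, totalLatencyMs: 0 };
      s.calls += 1;
      stats[event.tool] = s;
    } else if (event.type === "tool_result") {
      const tool = toolByCallId[event.call_id];
      const s = tool === undefined ? undefined : stats[tool];
      if (s === undefined) continue; // result without a matching call
      if (!event.ok) s.failures += 1;
      s.totalLatencyMs += event.latency_ms;
    }
  }
  return stats;
}

// Made-up log lines in the §11 event format.
const sampleLog = [
  '{"type":"gather_started"}',
  '{"type":"tool_call","tool":"search","input":{"q":"acme"},"call_id":"c1"}',
  '{"type":"tool_result","call_id":"c1","ok":true,"latency_ms":120}',
  '{"type":"tool_call","tool":"fetch","input":{"url":"https://example.com"},"call_id":"c2"}',
  '{"type":"tool_result","call_id":"c2","ok":false,"latency_ms":900}',
].join("\n");

console.log(summarizeTools(sampleLog));
```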


12. Steel Integration

Steel is the substrate. Atlas treats it as a first-class capability, not an opaque dependency.

12.1 Default browser fallback

Existing behavior: fetch/inspect use plain HTTP first, Steel only when the page requires it. Unchanged.

12.2 Named sessions

defineBrowserTool({ session: "<name>" }) references a session managed via Steel's session API. Sessions persist across runs. Atlas does not manage credentials.

# managed externally:
$ steel session create linkedin-prod --auth-flow ./flows/linkedin.ts

If a referenced session does not exist, the tool runs in unauthenticated mode and the runtime injects { unauthenticated: true } into the tool's output. The agent should branch on this.

12.3 Tool-level browser context

The page argument passed to defineBrowserTool run is a typed wrapper:

interface BrowserPage {
  goto(url: string, opts?: { waitFor?: string }): Promise<void>;
  text(selector: string): Promise<string>;
  textOrNull(selector: string): Promise<string | null>;
  allText(selector: string, opts?: { limit?: number }): Promise<string[]>;
  attr(selector: string, name: string): Promise<string | null>;
  screenshot(): Promise<Buffer>;
  waitFor(selector: string, opts?: { timeoutMs?: number }): Promise<void>;
  // Escape hatch — typed, but documented as "use sparingly":
  evaluate<T>(fn: () => T): Promise<T>;
}

No raw Playwright. No raw CDP. One path. Principle #3.


13. Composition

13.1 .asTool()

import { defineResearcher } from "@steel-dev/atlas";
import drugInteractionResearcher from "./drug-interactions/atlas.config.js";

export default defineResearcher({
  // ...
  tools: [
    drugInteractionResearcher.asTool({
      name: "check_drug_interactions",
      description:
        "Check known interactions for a list of drugs. Returns structured " +
        "interaction data. Use before recommending any combination therapy.",
    }),
  ],
});

.asTool():

  • Reuses the sub-researcher's output schema as the tool's output schema.
  • Inherits the parent researcher's signal for cancellation.
  • Accounts its cost against the parent's budget.
  • Suppresses its own CLI/eval surface.
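
Charging a sub-researcher's cost to the parent can be pictured as both sharing one ledger. The sketch below is an assumption about how that might be wired, not the specified mechanism: `BudgetLedger` and `runSubResearcher` are hypothetical, and only two of the budget caps are modeled.

```typescript
interface Budget {
  maxToolCalls: number;
  maxUsd: number;
}

/** Running totals shared by a parent researcher and everything it calls. */
class BudgetLedger {
  private toolCalls = 0;
  private usd = 0;
  private readonly limits: Budget;

  constructor(limits: Budget) {
    this.limits = limits;
  }

  /** Record spend; returns false once any cap has been exceeded. */
  charge(cost: { toolCalls?: number; usd?: number }): boolean {
    this.toolCalls += cost.toolCalls ?? 0;
    this.usd += cost.usd ?? 0;
    return this.toolCalls <= this.limits.maxToolCalls && this.usd <= this.limits.maxUsd;
  }

  spent(): { toolCalls: number; usd: number } {
    return { toolCalls: this.toolCalls, usd: this.usd };
  }
}

// Hypothetical wiring: a sub-researcher exposed via asTool() receives the
// parent's ledger instead of creating its own.
function runSubResearcher(ledger: BudgetLedger): boolean {
  return ledger.charge({ toolCalls: 5, usd: 1.5 });
}

const sharedLedger = new BudgetLedger({ maxToolCalls: 30, maxUsd: 2 });
console.log(sharedLedger.charge({ toolCalls: 1, usd: 0.25 })); // parent's own call
console.log(runSubResearcher(sharedLedger)); // child spend, same ledger
console.log(runSubResearcher(sharedLedger)); // second child run exceeds maxUsd
console.log(sharedLedger.spent());
```

With a single ledger, the cumulative enforcement described in §13.2 falls out of the data structure instead of needing per-level bookkeeping.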

13.2 Recursion limits

Composition is bounded:

  • Default maxDepth: 3 for sub-researcher calls.
  • Cycles detected and rejected at config time (graph walk).
  • budget is enforced cumulatively across the entire researcher tree.
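
The config-time graph walk could look something like this depth-first sketch. It is illustrative only: `checkComposition` and the `Graph` shape are hypothetical, and it assumes the root researcher sits at depth 0 so that `maxDepth: 3` allows three levels of sub-researcher below it.

```typescript
// Researcher name -> names of the sub-researchers it uses via asTool().
type Graph = Record<string, string[]>;

type CompositionCheck =
  | { ok: true }
  | { ok: false; reason: "cycle" | "too_deep"; path: string[] };

/** Walk the composition graph from `root`, rejecting cycles and excessive depth. */
function checkComposition(graph: Graph, root: string, maxDepth = 3): CompositionCheck {
  function walk(node: string, ancestors: string[]): CompositionCheck {
    const path = [...ancestors, node];
    if (ancestors.includes(node)) return { ok: false, reason: "cycle", path };
    // ancestors.length is the depth of `node` (root is depth 0).
    if (ancestors.length > maxDepth) return { ok: false, reason: "too_deep", path };
    for (const child of graph[node] ?? []) {
      const result = walk(child, path);
      if (!result.ok) return result;
    }
    return { ok: true };
  }
  return walk(root, []);
}

console.log(checkComposition({ intel: ["drugs"], drugs: ["pubmed"], pubmed: [] }, "intel"));
console.log(checkComposition({ a: ["b"], b: ["a"] }, "a"));
```

Returning the offending path makes it possible to write a hint that names the exact researchers involved.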

14. The Scaffold (create-atlas-researcher)

$ npx create-atlas-researcher ai-startup-intel

Interactive prompts:

  1. Output kind? structured-only | structured-with-narrative | narrative-only (mapped to which models are enabled)
  2. Steel sessions needed? no | yes — list names
  3. Eval set kind? empty | seed-from-examples (pulls 5 cases from a curated registry of public examples per domain hint)

Writes the full layout from §3, populated with working defaults. The agent's first action is npx <name> eval — the seed should pass on a blank profile so the agent has a known-good baseline.

14.1 Templates (presets baked into the scaffold)

create-atlas-researcher --template <name> skips prompts. Initial templates:

  • competitive-intel
  • medical-literature
  • legal-precedent
  • internal-docs-rag (assumes a corpus tool)
  • blank

Templates are an AX-friendly form of presets — the agent gets a working researcher to mutate, not a blank config to fill. Principle #1.


15. Migration from current Atlas

| Current (v0.1) | Spec target | Notes |
|---|---|---|
| research({ query }) single function | defineResearcher(...).run({ query }) | Current research() stays as presets.web.run({ query }) for compat |
| Markdown-only output | Schema + markdown | Markdown becomes optional fallback |
| Single gather agent writes report | Gather → Extract → Validate | The major architectural change |
| Tools hardcoded in tools.ts | defineTool + defineBrowserTool factories | Tools become public API |
| Models hardcoded constants | Per-phase model config | gather/extract/narrative |
| No eval harness | evals/seed.jsonl + CLI | New |
| No scaffold | create-atlas-researcher | New package |
| Steel as private impl detail | Steel sessions as named, declared resources | New session model |

The current research() survives as presets.web — it remains the "deep research that just works" baseline.


16. What's explicitly NOT in scope (v1)

These are valuable but excluded from v1 to keep the surface small. Principle #3.

  • MCP server support. Architecturally fine to add later. Not v1.
  • Multi-model providers. Anthropic-only. OpenAI/Gemini support is a hard fork, not a v1 feature.
  • Streaming partial schemas. Schema-bound extraction returns a complete object or fails. No streaming.
  • A web UI / dashboard. CLI + filesystem only.
  • Hosted runtime. Atlas is a library. Running researchers in production is the user's problem (with signal and structured events to make it tractable).
  • Tool marketplaces. Tools are files in a repo. No registry.

17. Open questions

These are unresolved and affect the design materially. Flagged for human review.

  1. Markdown imports. import x from "./x.md" with { type: "text" } is stage-3-ish but inconsistent across runtimes. Fallback: a build step that compiles each .md file into a .md.ts module exporting a string. Decision needed.
  2. Citation granularity. Currently per-field. Should it be per-clause (sentence-level)? Probably yes for narrative mode, no for structured. Confirm.
  3. Eval LLM-as-judge model. Defaulting to Haiku for cost. Should the judge model be configurable per-case? Probably yes — but raises eval determinism concerns.
  4. Steel session ownership. Atlas references sessions by name but does not create them. Does Atlas need a create-session flow for AX, or does Steel CLI cover it? Probably Steel CLI.
  5. Schema migrations. When a researcher's schema changes, what happens to existing eval baselines? Need a migration story before researchers ship to production.
  6. Cost prediction. Today's budget is enforced reactively (kill on exceed). Should there be a "dry run" that estimates cost before spending? Useful but expensive to model accurately.
  7. The narrative phase. Schema is primary, narrative is fallback — but for a human-shaped product (a research report), is that the wrong default? Maybe kind: "report" | "extract" at the researcher level disambiguates.
  8. Multi-tenant Steel. If a researcher runs as a service serving many users, sessions need to be per-user, not per-researcher. Out of v1 scope but the API shape should not preclude it.

18. Acceptance criteria for v1

A v1 ship requires all of these. Each is testable.

  • npx create-atlas-researcher <name> produces a working researcher in <30s.
  • The blank researcher's eval command passes its seed cases.
  • A coding agent (Claude Code, given only the README, the type declarations, and an example template) can add a new tool, update the schema, and pass evals — without human intervention beyond the initial goal.
  • Every error thrown by the SDK has a typed code and a hint that names a file path and a concrete edit.
  • run is cancellable mid-tool-call via AbortSignal.
  • Existing @steel-dev/atlas users can opt into the new pipeline without rewriting (via presets.web).
  • The full event stream replays deterministically given the same source cache.
  • All public types have JSDoc with at least one example each.
  • .d.ts bundle size <30KB (agents read these).

19. Appendix: Worked example — ai-startup-intel

End-to-end. This is the example the README points the agent at.

atlas.config.ts

import { defineResearcher } from "@steel-dev/atlas";
import { companyProfile } from "./schemas/output.js";
import { linkedinCompany } from "./tools/linkedin.js";
import { crunchbase } from "./tools/crunchbase.js";
import systemPrompt from "./prompts/system.md" with { type: "text" };

export default defineResearcher({
  name: "ai-startup-intel",
  version: "0.1.0",
  schema: companyProfile,
  prompt: systemPrompt,
  tools: [linkedinCompany, crunchbase],
  budget: { maxToolCalls: 30, maxUsd: 2 },
});

schemas/output.ts

import { z, citable } from "@steel-dev/atlas/schema";

export const companyProfile = z.object({
  name: citable(z.string()),
  url: citable(z.string().url()),
  one_liner: citable(z.string().max(280)).describe(
    "Plain-English description in ≤280 chars. Prefer the company's own self-description."
  ),
  founded_year: citable(z.number().int().min(1900).max(2100).nullable()),
  hq_location: citable(z.string().nullable()),
  funding: z.object({
    total_raised_usd: citable(z.number().nullable()),
    last_round: citable(z.object({
      stage: z.enum(["pre-seed","seed","series-a","series-b","series-c+","unknown"]),
      amount_usd: z.number().nullable(),
      date: z.string().nullable(),
      lead_investor: z.string().nullable(),
    }).nullable()),
  }),
  team_size: citable(z.number().int().nullable()),
  product_category: citable(z.array(z.string()).max(5)),
  differentiators: citable(z.array(z.string()).max(3)).describe(
    "Specific, concrete differentiators. Avoid marketing language like 'AI-powered' or 'enterprise-grade'."
  ),
});

prompts/system.md

You are an AI-startup intelligence analyst working for a VC associate.

## Priorities (in order)
1. Verify funding numbers with at least 2 independent sources.
2. Prefer primary sources (company site, SEC filings, press releases) over aggregators.
3. If a field cannot be confidently determined, return null. Never guess.
4. Use `linkedin_company` for team size before falling back to web search.
5. Use `crunchbase` for funding before falling back to news articles.

## Stop conditions
- Every required field has a value or a justified null.
- Every non-null field has ≥1 candidate source for citation.

tools/linkedin.ts

import { defineBrowserTool } from "@steel-dev/atlas";
import { z } from "zod";

export const linkedinCompany = defineBrowserTool({
  name: "linkedin_company",
  description:
    "Fetch a LinkedIn company page. Use for authoritative employee count and recent " +
    "hiring signals. Falls back to public-only data if no session is bound.",
  input: z.object({
    handle: z.string().describe('LinkedIn handle, e.g. "anthropicai"'),
  }),
  output: z.object({
    employee_count: z.string().nullable(),
    headline: z.string().nullable(),
    recent_posts: z.array(z.string()).max(10),
  }),
  session: "linkedin-prod",
  run: async ({ handle }, { page }) => {
    await page.goto(`https://linkedin.com/company/${handle}`, {
      waitFor: 'h1.org-top-card-summary__title',
    });
    return {
      employee_count: await page.textOrNull('[data-test=employee-count]'),
      headline: await page.textOrNull('h2.org-top-card-summary__tagline'),
      recent_posts: await page.allText('.org-update-card', { limit: 10 }),
    };
  },
});

evals/seed.jsonl

{"id": "anthropic", "query": "Profile of Anthropic", "expect": {"founded_year": 2021, "hq_location": "San Francisco", "product_category": {"includes": "AI assistant"}}}
{"id": "mercor",    "query": "Profile of Mercor",    "expect": {"product_category": {"includes": "AI hiring"}}}
{"id": "mistral",   "query": "Profile of Mistral AI", "expect": {"hq_location": {"matches": "Paris|France"}}}

What the agent does

  1. Reads README → understands the layout.
  2. Runs npx ai-startup-intel eval → sees baseline pass rate.
  3. User says "add a check for whether the company has a public GitHub org."
  4. Agent:
    • Adds github_org: citable(z.string().url().nullable()) to schemas/output.ts.
    • Writes tools/github.ts returning org metadata.
    • Updates prompts/system.md to mention GitHub.
    • Adds an eval case asserting github_org for one known company.
    • Runs npx ai-startup-intel eval.
    • Reads failures, iterates until green.
  5. Ships.

That loop — edit → eval → iterate — is the entire AX value proposition.


End of specification.
