@giacaglia
Created May 15, 2026 03:51
Field test: running three customer projects through Tiros in a day

Tiros is an AI engineering assistant for process plants — it ingests a project's source documents (drawings, datasheets, vendor packs) and produces structured engineering deliverables grounded in the relevant standards.

What we did

We ran the same Tiros pipeline end-to-end against three different customer engagements in a single day:

  1. A steam-turbine feasibility study for a refinery operator
  2. A hydrogen-production feasibility study for a renewables developer
  3. A water-treatment plant automation upgrade for a utility

Each customer arrived with a different shape of input (size, format, language, completeness) and a different engineering domain (refinery processes, hydrogen / electrolysis, water treatment).

The one-line summary per project

| Project | Input | Outcome |
|---|---|---|
| Project A (turbine) | 4 documents, very thin | The question pack is the deliverable |
| Project B (hydrogen) | 184 files, but the wrong domain for our system | Discovered our system doesn't fit; that discovery was itself useful |
| Project C (water) | 69 files, mature engagement, real plant drawings | Genuine workflow output; multiple deliverables ran successfully |

The three things we learned

1. Sometimes there isn't enough input to produce a finished deliverable

Customers often arrive at the early "feasibility study" stage. They have a brochure, a vendor email, a few operating-condition tables — but no plant drawings, no equipment list, no hazardous-substance register, no piping diagrams. Our system can structure what's there, but it cannot manufacture content from nothing.

What we found: in those cases, the most valuable output isn't a generated deliverable — it's a structured list of questions to send the customer. Our system aggregates 50–250 deduplicated customer questions across all the deliverables it would have produced, prioritised by how many downstream documents each question unblocks. That list collapses what's normally a two-week email thread into one round-trip.
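A minimal sketch of that roll-up, assuming each would-be deliverable yields a list of free-text questions. The function names, the sample questions, and the deduplication rule (folding case, whitespace, and trailing punctuation) are all hypothetical; the write-up doesn't say how Tiros matches duplicates.

```python
from collections import defaultdict


def normalise(question: str) -> str:
    """Fold case, whitespace and trailing punctuation so near-identical questions collide."""
    return " ".join(question.lower().split()).rstrip("?.! ")


def roll_up_questions(
    questions_by_deliverable: dict[str, list[str]],
) -> list[tuple[str, list[str]]]:
    """Deduplicate questions and rank them by how many deliverables each one unblocks."""
    blocked: dict[str, set[str]] = defaultdict(set)
    wording: dict[str, str] = {}
    for deliverable, questions in questions_by_deliverable.items():
        for question in questions:
            key = normalise(question)
            wording.setdefault(key, question.strip())  # keep the first wording seen
            blocked[key].add(deliverable)
    # Most-unblocking questions first; ties broken alphabetically for stable output.
    ranked = sorted(blocked, key=lambda k: (-len(blocked[k]), wording[k]))
    return [(wording[k], sorted(blocked[k])) for k in ranked]


# Hypothetical input: three deliverables that share some open questions.
pack = roll_up_questions({
    "equipment list": ["What is the design inlet steam pressure?", "Is there an existing P&ID?"],
    "relief study": ["what is the design inlet steam pressure", "Is there an existing P&ID?"],
    "hazop prep": ["Is there an existing P&ID?"],
})
for question, deliverables in pack:
    print(f"{len(deliverables)} blocked: {question}")
```

Ranking by the number of blocked deliverables puts the question that unblocks the most work at the top of the pack, which is the prioritisation described above.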

Generalisable lesson: the value proposition for early-stage customers is "we'll send your customer one structured question pack instead of you emailing them ad-hoc for two months." That's different from the value proposition for late-stage customers, which is "we'll generate the deliverables."

2. Our system was built for one engineering domain — others don't fit cleanly

The system was designed for refinery / petrochemical process plants. It ships with equipment classes (pumps, vessels, exchangers, columns), safety methodology (HAZOP per IEC 61882, materials per NACE for sour service, relief per API 521), and a ~74-workflow catalogue all targeting that domain.

When we ran it against a hydrogen-production plant (electrolyser + battery storage + photovoltaics), none of the workflows transferred cleanly. The failure modes, standards, and equipment classes are fundamentally different.

What we found:

  • Our system honestly identified the domain mismatch instead of churning out irrelevant content. This is the correct behaviour — better to say "this isn't our domain" than to fabricate plausible-looking deliverables.
  • The mismatch surfaced exactly what we'd need to add to handle the new domain. That diagnosis became four new "equipment-class starter" packs that any future hydrogen project will benefit from.

For the water-treatment project, the picture was much better. Water treatment shares enough with refinery processing (pumps, vessels, tanks, materials, corrosion) that 3 of our 11 workflows ran cleanly and 4 more ran in degraded mode. Different domain, partial fit.

Generalisable lesson: the system has a "native domain" where everything works, an "adjacent domain" where most things work with caveats, and a "wrong domain" where almost nothing fits. Pricing, marketing, and onboarding need to set expectations differently for each.

3. Setup cost (per-machine, per-customer) is brittle

Three integration issues surfaced during the runs:

  • An auth handler silently failed on plant-drawing conversion. Graceful-degradation logic masked it from the user, but two runs were affected before we caught it.
  • One sub-tool's package definition was missing dependencies, so a fresh clone would crash on first use.
  • A path-encoding mismatch (macOS uses one Unicode normalisation form, the system used another) caused workflow-name lookups to fail silently.

All three were edge-case integration issues, not algorithmic ones. But each would have hit every new operator running the system.
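The path-encoding issue can be illustrated with a short sketch. It assumes the on-disk name arrives decomposed (NFD) while the catalogue key is composed (NFC); the write-up doesn't say which side used which form. The workflow name and the `find_workflow` helper are hypothetical.

```python
import unicodedata


def canonical(name: str) -> str:
    """Normalise to NFC so composed and decomposed spellings compare equal."""
    return unicodedata.normalize("NFC", name)


def find_workflow(name: str, catalogue: dict[str, str]) -> str:
    """Look up a workflow by name, raising instead of failing silently."""
    index = {canonical(key): value for key, value in catalogue.items()}
    try:
        return index[canonical(name)]
    except KeyError:
        raise LookupError(f"no workflow named {name!r}") from None


# Hypothetical catalogue entry whose key is in composed (NFC) form.
catalogue = {"sp\u00e9cification-pompe": "workflows/pump_spec"}
# The same name in decomposed (NFD) form, as a filesystem may return it.
from_disk = unicodedata.normalize("NFD", "sp\u00e9cification-pompe")

print(from_disk in catalogue)               # False: the raw lookup misses
print(find_workflow(from_disk, catalogue))  # workflows/pump_spec
```

Normalising both sides at the lookup boundary, and raising when nothing matches, turns a silent miss into a visible error.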

Generalisable lesson: as we onboard more operators — internal team → external contractors → customer engineers — the cost of broken setup multiplies. We need a doctor / health-check command that catches these gotchas at setup time instead of letting operators discover them mid-customer-run.
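A rough sketch of what such a doctor command might check, covering the three issues above. Everything named here is an assumption for illustration: the `TIROS_API_KEY` variable, the module list, and the workflow names are hypothetical, not Tiros's actual configuration.

```python
import importlib.util
import os
import unicodedata


def check_env(var: str) -> tuple[bool, str]:
    """Auth check: the credential variable must be set and non-empty."""
    ok = bool(os.environ.get(var))
    return ok, f"{var} is {'set' if ok else 'missing or empty'}"


def check_module(module: str) -> tuple[bool, str]:
    """Dependency check for a top-level module name, without importing it."""
    ok = importlib.util.find_spec(module) is not None
    return ok, f"module {module!r} {'found' if ok else 'not installed'}"


def check_names_nfc(names: list[str]) -> tuple[bool, str]:
    """Encoding check: every workflow name should already be in NFC form."""
    bad = [n for n in names if unicodedata.normalize("NFC", n) != n]
    return not bad, f"{len(bad)} workflow name(s) not in NFC form"


def doctor(env_vars: list[str], modules: list[str], workflow_names: list[str]) -> int:
    """Run every check, print a report, and return a shell-style exit code."""
    results = [check_env(v) for v in env_vars]
    results += [check_module(m) for m in modules]
    results.append(check_names_nfc(workflow_names))
    for ok, message in results:
        print(f"[{'ok' if ok else 'FAIL'}] {message}")
    return 0 if all(ok for ok, _ in results) else 1


exit_code = doctor(
    env_vars=["TIROS_API_KEY"],    # hypothetical variable name
    modules=["json"],              # stand-in for a sub-tool's real dependencies
    workflow_names=["pump_spec"],  # hypothetical workflow name
)
```

Returning a non-zero exit code lets a setup script stop before the first customer run rather than degrading quietly partway through one.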

What we improved (shipped today)

Each customer run revealed a class of problem; we shipped a small fix for each. Six improvements, all small, none algorithmic:

| Improvement | Triggered by |
|---|---|
| Per-machine setup script that installs Python + tool dependencies | Refinery run: needed setup tooling |
| "Equipment class starter" pattern for class-typical engineering knowledge | Refinery run: first starter (steam turbine) |
| Hydrogen-domain class starters (electrolyser, battery, hydrogen storage, PV) | Hydrogen run: surfaced the domain gap |
| Setup-time check that warns if API auth isn't configured | Both feasibility runs: silent auth failures |
| Auth-passthrough fix for the plant-drawing converter | Water run: 2 plant drawings rescued after the fix |
| Cost reductions: cheaper plan-phase dispatch + skip decorative images | All 3 runs: observed wasted spend on low-value content |

What's next

Three things would meaningfully improve the next customer run:

A. Land the remaining domain + cost improvements

  • The hydrogen-domain class starters unlock hydrogen-domain projects without re-discovering the same class-typical knowledge each time.
  • The cost-reduction work cuts spend per customer roughly in half on the next run.

B. Build a real "first-pass" workflow for thin packs

The thin-pack pattern (Project A) is going to repeat. We should explicitly support "feasibility-stage entry" as a first-class mode that:

  • Skips most workflows by default (because they'll all be unanswerable anyway)
  • Runs only the question-pack roll-up logic
  • Outputs a polished customer-question email in 5 minutes for ~$1

This is the simplest, highest-leverage product to wrap around what the system already does well.

C. Auto-detect customer domain at scaffold time

Currently every project runs against all 11 workflows regardless of domain fit. A small "what is this customer's domain?" step at project setup would let us skip wholesale workflow categories that obviously don't apply (e.g., refinery HAZOP for a hydrogen plant). Saves operator time and system tokens.


The numbers

  • 3 customer projects exercised in 1 day
  • 187 source files total ingested across the three
  • 6 improvements shipped
  • 4 real bugs found and fixed
  • ~$13.50 total LLM spend across all three runs (closer to $6–7 once the cost reductions ship)
  • 5 new equipment classes scaffolded (1 refinery turbine, 4 hydrogen-stack)
  • ~250 customer questions surfaced across the three projects