May 2026 · Summary of latency and throughput improvements across the NL search pipeline. All benchmarks run against the 75-case eval fixture (npm run eval) unless noted.
| Area | What changed | Win |
|---|---|---|
| Fast path | Skip LLM for simple queries via bge-small cosine similarity | ~7ms vs ~1400ms for 45% of traffic |
| Typesense pipeline | Parallelise where-resolver + candidate lookup | ~200–400ms saved on curated queries |
| Typesense pipeline | Bulk filter drop before one-by-one peeling | Saves up to N−1 serial round trips on over-filtered queries |
| Typesense pipeline | Cap lookup waterfall at 2 tiers | Worst-case per-candidate calls 4→2 |
| Infra | Express→Fastify migration | Better connection reuse, lower overhead at concurrency |
| LLM prompt | Phrase-based attribute extraction | Attribute slot +12.7pp (98.5%), ~900 prompt tokens removed |
| LLM prompt (experiment) | RAG-hint shortlist k=15 | −9pp accuracy, negligible TTFT mean change — negative result |
| LLM (experiment) | Embedding-as-category-slot | −29pp accuracy — negative result |
PRs: #77 (spike) · #81 (production wiring) · closes issue #76
The biggest single latency lever: queries like "coffee shops", "plumber", or "urgent care" don't need GPT at all. A two-stage gate routes them to a local embedding lookup instead.
POST /api/nl-search
↓
tryFastPath(query, lat, lon) ~7–15 ms warm
├── 9 regex escalation checks (~0ms)
│ escalate if: proper noun · time/day phrase · dietary ·
│ attribute keyword · sort intent · "near [Place]"
├── bge-small-en-v1.5 cosine over 1,028 categories (~7ms)
└── score < 0.35 → escalate to LLM
hit → skip LLM + where-resolver + curated → Typesense directly
miss → LLM path unchanged (transparent to user)
| | Hit rate | Pass rate | p50 latency |
|---|---|---|---|
| Fast path hits | 45.3% of queries | 61.8% | ~7ms |
| LLM (remainder) | 54.7% | 97%+ | ~1,400ms |
Fast path escalates correctly on all complex cases — hours, near [Place], attributes, sort, dietary, vibe queries.
Failures are silent: when the gate escalates, the path returns null and the LLM takes over, so a miss cannot serve a wrong result.
Precomputed assets: src/shared/category-embeddings-local.json — 1,028 bge-small-en-v1.5 embeddings, 8.3 MB, 384 dimensions.
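The two-stage gate can be sketched roughly as follows. This is a minimal illustration, not the code from #81: the function name `fastPathGate`, the `CategoryEmbedding` type, and the four regexes (a subset standing in for the nine real checks) are all assumptions, and the query embedding is taken as an input rather than computed here. Only the 0.35 threshold and the "return null to escalate" contract come from the description above.

```typescript
// Hypothetical shape for one precomputed category embedding.
type CategoryEmbedding = { id: string; vector: number[] };

// Stage 1: cheap regex escalation checks. Illustrative subset only;
// the real gate has nine checks whose patterns are not shown in this summary.
const ESCALATION_PATTERNS: RegExp[] = [
  /\bnear\s+\S+/i, // "near [Place]"
  /\b(open|tonight|now|late|monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b/i, // time/day phrase
  /\b(vegan|vegetarian|gluten[- ]free|halal|kosher)\b/i, // dietary
  /\b(best|cheapest|closest|top[- ]rated)\b/i, // sort intent
];

const MIN_SCORE = 0.35; // below this the query escalates to the LLM

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns a category id on a hit, or null to escalate to the LLM path.
function fastPathGate(
  query: string,
  queryVector: number[],
  categories: CategoryEmbedding[],
): string | null {
  if (ESCALATION_PATTERNS.some((re) => re.test(query))) return null;

  // Stage 2: best cosine match over the category embeddings.
  let best: { id: string; score: number } | null = null;
  for (const c of categories) {
    const score = cosine(queryVector, c.vector);
    if (best === null || score > best.score) best = { id: c.id, score };
  }
  return best !== null && best.score >= MIN_SCORE ? best.id : null;
}
```

Returning null for both escalation reasons keeps the caller simple: anything other than a category id falls through to the unchanged LLM path.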
resolveWhere (geocode the near anchor) and resolveCandidatesToIds (name→ID lookup for curated candidates) were running sequentially. They have no data dependency until the final document fetch, so we run them with Promise.all.
Before: LLM → resolveWhere (400ms) → candidateLookup (200ms) → search
After: LLM → [resolveWhere ‖ candidateLookup] → search
↑ wall time ≈ max(where, lookup)
The areaFilter (from resolveWhere) is applied at the final document fetch step — it was always just a quality hint at the ID-resolution stage, so moving it later is safe and candidates don't leak across cities.
Timing impact on curated queries with a near anchor: 200–400ms saved.
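A minimal sketch of the parallel step, assuming hypothetical signatures for the two resolvers (only the names resolveWhere and resolveCandidatesToIds appear in the description; the `AreaFilter` shape and the wrapper `resolveInParallel` are invented for illustration):

```typescript
// Hypothetical area filter shape; the real one is not shown in this summary.
type AreaFilter = { lat: number; lon: number; radiusKm: number } | null;

async function resolveInParallel(
  resolveWhere: (near: string | null) => Promise<AreaFilter>,
  resolveCandidatesToIds: (names: string[]) => Promise<string[]>,
  near: string | null,
  candidateNames: string[],
): Promise<{ areaFilter: AreaFilter; candidateIds: string[] }> {
  // Both calls start immediately; wall time is roughly max(where, lookup).
  const [areaFilter, candidateIds] = await Promise.all([
    resolveWhere(near),
    resolveCandidatesToIds(candidateNames),
  ]);
  // The area filter is only applied later, at the final document fetch.
  return { areaFilter, candidateIds };
}
```

Note that Promise.all rejects as soon as either call rejects, so a caller that wants to tolerate a geocoding failure would need to catch inside the individual resolver.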
searchWithFallback peeled filters one-by-one when a query returned 0 results. For a query with 5+ filters, that's up to 9 sequential Typesense round trips (each ~20–40ms) before recovering results.
New step 3a: drop all non-category filters at once first. If that finds results, we're done. Only if it still returns 0 do we fall through to the per-filter diagnostic loop (step 3b).
Step 1: full query (all filters)
Step 2: drop time filters
Step 3a: drop all non-category filters at once ← NEW
Step 3b: drop filters one-by-one ← fallback only
In the common case — where the combination of filters is too tight, not any single one — this collapses N serial calls into 1.
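The step order can be sketched as below. The `Filter` and `Search` types and the function name are hypothetical, and the peeling order in step 3b (last filter first) is an assumption; the summary does not say how the real diagnostic loop orders its drops.

```typescript
// Hypothetical types: the real filter model and Typesense client are not shown here.
type Filter = { field: string; kind: "category" | "time" | "other" };
type Search = (filters: Filter[]) => Promise<string[]>; // resolves to matching document IDs

async function searchWithFallbackSketch(
  search: Search,
  filters: Filter[],
): Promise<{ hits: string[]; step: "1" | "2" | "3a" | "3b" | "none" }> {
  // Step 1: full query with every filter.
  let hits = await search(filters);
  if (hits.length > 0) return { hits, step: "1" };

  // Step 2: drop time filters (skipped when there are none to drop).
  const noTime = filters.filter((f) => f.kind !== "time");
  if (noTime.length < filters.length) {
    hits = await search(noTime);
    if (hits.length > 0) return { hits, step: "2" };
  }

  // Step 3a: drop all non-category filters in a single round trip.
  const categoryOnly = noTime.filter((f) => f.kind === "category");
  if (categoryOnly.length < noTime.length) {
    hits = await search(categoryOnly);
    if (hits.length > 0) return { hits, step: "3a" };
  }

  // Step 3b: fallback only. Peel the remaining filters one at a time
  // (assumed order: last filter first) until something matches.
  let remaining = categoryOnly;
  while (remaining.length > 0) {
    remaining = remaining.slice(0, -1);
    hits = await search(remaining);
    if (hits.length > 0) return { hits, step: "3b" };
  }
  return { hits: [], step: "none" };
}
```

For a query with one category filter, one time filter, and several attribute filters, an over-tight combination recovers in three calls (steps 1, 2, 3a) regardless of how many attribute filters there are.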
lookupOne in candidate-lookup.ts had a 4-tier waterfall per candidate name:
- cat + area filter
- cat filter only
- area only (no category)
- global (no filters)
Tier 4 (global) was silently harmful: it returned the most globally-popular POI with that name, ignoring the user's city. "Blue Bottle Coffee" resolves to whichever location ranks highest globally — not the SF one the user wants.
New waterfall (2 tiers):
- With categoryFilter: cat+area → cat
- Without: area → null
Worst-case per-candidate Typesense calls: 4 → 2. Missed candidates surface in curated_missed in the debug panel; the regular Typesense search still covers them.
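A minimal sketch of the two-tier waterfall, assuming a hypothetical `Lookup` callback in place of the real Typesense call and string-valued filters for brevity (the actual lookupOne signature in candidate-lookup.ts is not shown in this summary):

```typescript
// Hypothetical lookup callback: resolves a POI name to an ID, or null on a miss.
type Lookup = (
  name: string,
  opts: { category?: string; area?: string },
) => Promise<string | null>;

async function lookupOneSketch(
  search: Lookup,
  name: string,
  areaFilter: string | undefined,
  categoryFilter?: string,
): Promise<string | null> {
  if (categoryFilter !== undefined) {
    // Tier 1: category + area. Tier 2: category only.
    const scoped = await search(name, { category: categoryFilter, area: areaFilter });
    return scoped ?? (await search(name, { category: categoryFilter }));
  }
  // Without a category: area only, then give up. There is no global tier,
  // so a miss is reported as null instead of resolving to another city's POI.
  return search(name, { area: areaFilter });
}
```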
Migrated the server from Express 4.x to Fastify, converting routes to FastifyPluginAsync and middleware to Fastify hooks. SSE streaming routes use reply.hijack() + reply.raw to write directly to the socket.
Also switches the whole project to "type": "module" + moduleResolution: NodeNext. This requires .js extensions on all relative imports, which makes the TypeScript output match what Node actually resolves.
Why Fastify matters for perf: lower per-request overhead at concurrency (Fastify is ~2× faster than Express on throughput benchmarks), better built-in schema validation, and native async route handlers without next() chains.
These are documented as merged/open PRs so the work isn't lost, but the findings were negative for GPT-5.4.
Hypothesis: Replace the 1,028-category list in the system prompt with a bge-small top-K shortlist (~40% fewer tokens) to reduce TTFT.
Result:
| Mode | Pass rate | TTFT mean | TTFT p50 | TTFT p95 | Recall@15 |
|---|---|---|---|---|---|
| Baseline (all 1,028) | 97.3% | 845ms | 651ms | 1,651ms | — |
| RAG-hint k=15 | 88.0% | 836ms | 740ms | 1,427ms | 100% |
Why it failed: GPT-5.4 uses parallel prefill — 3–4k fewer tokens saves negligible wall time on average. The p95 improved 224ms but the mean is within noise. More importantly, accuracy dropped 9pp despite 100% recall@15. The LLM uses the full category list as taxonomy context to understand what isn't a category (i.e. what belongs in attributes instead). Remove that and the category/attribute boundary gets fuzzier.
The approach is more promising for local/smaller models (Qwen, quantized models) where prompt length meaningfully affects generation speed. The ragHintK option is wired in and defaults to 15; set to 0 to restore baseline.
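For reference, the shortlist step behind a knob like ragHintK might look like the sketch below. The function name, the `CategoryVector` type, and the use of a plain dot product (which assumes unit-normalised embeddings) are illustrative assumptions; only the option name, its default of 15, and the "0 restores baseline" behaviour come from the text.

```typescript
// Hypothetical shape for one category embedding.
type CategoryVector = { id: string; vector: number[] };

// Dot product; equals cosine similarity only if both vectors are unit-normalised (assumed).
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Returns the category ids to include in the system prompt.
function shortlistCategories(
  queryVector: number[],
  categories: CategoryVector[],
  ragHintK = 15,
): string[] {
  // ragHintK of 0 restores the baseline: the full category list.
  if (ragHintK <= 0) return categories.map((c) => c.id);
  return categories
    .map((c) => ({ id: c.id, score: dotProduct(queryVector, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, ragHintK)
    .map((c) => c.id);
}
```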
Hypothesis: Remove the category list from the LLM prompt entirely; run bge-small in parallel with the LLM to fill categories[] while the LLM handles near, attributes, sort_by, etc.
Result:
| Mode | Pass | Aggregate score | Category recall |
|---|---|---|---|
| Baseline (full prompt) | 97.3% | 98.6% | — |
| Embedding-slot k=1 | 68.0% | 84.7% | 65.7% |
| Embedding-slot k=3 | 41.3% | 75.5% | 77.1% |
Why it failed: The embedder can't parse multi-signal queries. "family-friendly Mexican restaurant" → childrens_cafe (family-first semantic anchor beats the cuisine signal). "late-night tacos" → nightlife. Vibe queries ("romantic Italian") are especially bad. k=3 raises recall to 77% but tanks precision (false positives on single-category test cases).
Works well for single-intent queries (remap tag: 100%; simple category lookups like "sushi spot": 67%). Not viable as a full replacement.
PR: #79 · closes issues #73, #74
Infrastructure spike for running a local 0.5B model as a fast-path NL parser (target: <50ms, no network call).
Two optimisations that compound:
Grammar-constrained decoding (#73): Builds a GBNF JSON Schema grammar from CATEGORIES / CUISINES / BOOLEAN_ATTRIBUTES at startup. The sampler is physically prevented from producing enum values outside those sets — zero hallucinations on enum fields by construction.
RAG-lite retrieval (#74): Embed the query with bge-small (shared index), pass the top-30 categories to Qwen instead of all 1,028. Shorter prompt + smaller enum → faster constrained sampling.
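To make the enum constraint concrete, here is a hand-rolled sketch that emits a small GBNF grammar for a JSON array of category ids. It is not the grammar builder from #79 (which is described as deriving the grammar from a JSON Schema); the function names are hypothetical, and it assumes every value is a plain identifier such as coffee_shop, so that JSON string escaping is also valid inside a GBNF literal.

```typescript
// Builds one GBNF rule whose alternatives are JSON string literals.
// The inner JSON.stringify produces the JSON text the model must emit
// (quotes included); the outer one wraps it as a quoted GBNF literal.
function gbnfEnumRule(ruleName: string, values: readonly string[]): string {
  if (values.length === 0) throw new Error("enum rule needs at least one value");
  const alternatives = values.map((v) => JSON.stringify(JSON.stringify(v)));
  return `${ruleName} ::= ${alternatives.join(" | ")}`;
}

// Grammar for a JSON array containing only shortlisted category ids.
function categoriesGrammar(shortlist: readonly string[]): string {
  return [
    'root ::= "[" ws (category (ws "," ws category)*)? ws "]"',
    gbnfEnumRule("category", shortlist),
    "ws ::= [ \\t\\n]*",
  ].join("\n");
}
```

Because the sampler can only follow the `category` alternatives, a 30-item shortlist yields 30 alternatives instead of 1,028, which is where the "smaller enum" saving comes from.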
What's validated: 100% recall@30 from the retriever, grammar compiles and runs cleanly, Qwen inference module loads and generates. The base model doesn't hallucinate (grammar prevents it) but outputs empty categories: [] — it has no exposure to Mapbox taxonomy in pre-training, so it can't map "coffee shop" to coffee_shop without fine-tuning.
PR #79 itself can merge independently — the code (grammar builder, RAG retriever, Qwen inference module) stands alone and is worth having in the tree. What's blocked on PR #57 (Ian's work) is the end-to-end accuracy story: can a local 0.5B model replace GPT-5.4 for this task? That question can't be answered until someone runs the LoRA training pipeline from #57 and drops the resulting GGUF (~380 MB, gitignored) into data/training/qwen-artifacts/. PR #57 has everything needed — 45k labelled rows, the train_qwen.py trainer, and merge_and_convert.py to produce the GGUF — it's just a draft and the artifact hasn't been generated yet.
The system prompt previously included the full list of 131+ Enrich boolean attribute names (offering_vegan_options, family_friendly, accommodation_wheelchair_accessible_entrance, …). The LLM had to memorise canonical names and map user phrases to them at inference time — a reliable source of failures for smaller models that don't have the Mapbox taxonomy in their training data.
New contract: the LLM outputs free-text phrases; a new phrase-mapper.ts module resolves them to canonical names.
User: "family-friendly Mexican restaurant with kids menu"
↓
LLM outputs:
categories: ["mexican_restaurant"]
attribute_phrases: ["family-friendly", "kids menu"]
← no canonical Enrich names anywhere
phrase-mapper (module-load, ~0ms):
"family-friendly" → exact match → family_friendly
"kids menu" → exact match → offering_childrens_menu
result.attributes: [family_friendly, offering_childrens_menu] ✓
The mapper builds two indexes at startup from attributes.ts:
- Exact match after normalising hyphens/case ("family-friendly" → "family friendly")
- Longest word-boundary substring ("good desserts" contains "good dessert" → known_for_dessert before offering_dessert)
The same mechanism handles exclude_attribute_phrases with the audience guard preserved (vibe-leak protection still applies on the mapped canonical names).
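The two lookups can be sketched as follows. This is an illustration rather than the contents of phrase-mapper.ts: the four-entry phrase table is a hypothetical subset, and allowing an optional trailing plural on the substring match is an assumption made so that "good desserts" matches "good dessert".

```typescript
// Hypothetical subset of the phrase table; the real one is built from attributes.ts.
const PHRASE_TO_ATTRIBUTE = new Map<string, string>([
  ["family friendly", "family_friendly"],
  ["kids menu", "offering_childrens_menu"],
  ["good dessert", "known_for_dessert"],
  ["dessert", "offering_dessert"],
]);

// Normalise case, hyphens/underscores, and whitespace before any lookup.
function normalisePhrase(phrase: string): string {
  return phrase.toLowerCase().replace(/[-_]+/g, " ").replace(/\s+/g, " ").trim();
}

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Substring index, longest phrase first, built once at module load.
// Assumption: a trailing plural ("s" / "es") is tolerated.
const SUBSTRING_INDEX = Array.from(PHRASE_TO_ATTRIBUTE.entries())
  .sort((a, b) => b[0].length - a[0].length)
  .map(([phrase, attribute]) => ({
    pattern: new RegExp(`\\b${escapeRegExp(phrase)}(?:s|es)?\\b`),
    attribute,
  }));

// Returns the canonical attribute name, or null when nothing matches.
function mapPhrase(phrase: string): string | null {
  const normalised = normalisePhrase(phrase);
  const exact = PHRASE_TO_ATTRIBUTE.get(normalised);
  if (exact !== undefined) return exact;
  for (const { pattern, attribute } of SUBSTRING_INDEX) {
    if (pattern.test(normalised)) return attribute;
  }
  return null;
}
```

Sorting the substring index by phrase length is what makes "good desserts" resolve to known_for_dessert rather than the shorter offering_dessert match.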
| Metric | Baseline (full list) | Phrase-based |
|---|---|---|
| Pass rate | 97.3% | 94.7% (−2.6pp) |
| Aggregate score | ~98.6% | 97.4% (−1.2pp) |
| Attribute slot | 85.8% | 98.5% (+12.7pp) |
| Prompt tokens | ~3,800 | ~2,900 (~900 fewer) |
The small overall regression is on vibe-exclusion edge cases unrelated to attribute extraction. The attribute slot improvement is the point.
Why this matters for Qwen / local models: the failures in PR #57 (family_friendly, black_owned, wheelchair, laptop_work_friendly) were all cases where the 0.5B model didn't know our canonical names. With phrase extraction, those queries pass because the model only needs to echo what the user said.
| PR | Description | Status |
|---|---|---|
| #72 | Express→Fastify + ESM migration | Ready to merge |
| #68 | Cap lookup tiers (2 tiers max) | Ready to merge |
| #69 | Bulk filter drop before one-by-one | Ready to merge |
| #70 | Parallel where + candidates | Ready to merge |
| #81 | Fast path wired into production | Ready to merge (needs #72 first) |
| #77 | Fast path eval spike | Superseded by #81 |
| #79 | Qwen + RAG + grammar infra | Ready to merge; accuracy eval needs #57 GGUF |
| #80 | Embedding-as-category-slot | Documented negative result |
| #83 | RAG-hint prompt shortening | Documented negative result |
| #85 | Phrase-based attribute extraction | Ready to merge |
#72 (Fastify) → #68 + #69 + #70 (Typesense perf) → #81 (fast path) → #85 (phrase attrs)
PRs #68/69/70 are independent and can merge in any order after #72. PR #85 depends only on main and can merge any time — it's backwards-compatible (the attributes canonical field still works alongside attribute_phrases). PR #83 ships with an opt-out knob (ragHintK: 0) and is low-risk to merge even with the negative finding.