Enhanced Places Demo — Performance Work Summary

May 2026 · Summary of latency and throughput improvements across the NL search pipeline. All benchmarks run against the 75-case eval fixture (npm run eval) unless noted.

TL;DR

Area	What changed	Win
Fast path	Skip LLM for simple queries via bge-small cosine similarity	~7ms vs ~1400ms for 45% of eval queries
Typesense pipeline	Parallelise where-resolver + candidate lookup	~200–400ms saved on curated queries
Typesense pipeline	Bulk filter drop before one-by-one peeling	Saves up to N−1 serial round trips on over-filtered queries
Typesense pipeline	Cap lookup waterfall at 2 tiers	Worst-case per-candidate calls 4→2
Infra	Express→Fastify migration	Better connection reuse, lower overhead at concurrency
LLM prompt	Phrase-based attribute extraction	Attribute slot +12.7pp (98.5%), ~900 prompt tokens removed
LLM prompt (experiment)	RAG-hint shortlist k=15	−9pp accuracy, negligible TTFT mean change — negative result
LLM (experiment)	Embedding-as-category-slot	−29pp accuracy — negative result

1. Fast Path — Skip LLM for Simple Queries

PRs: #77 (spike) · #81 (production wiring) · closes issue #76

The biggest single latency lever: queries like "coffee shops", "plumber", or "urgent care" don't need GPT at all. A two-stage gate routes them to a local embedding lookup instead.

POST /api/nl-search
  ↓
tryFastPath(query, lat, lon)                    ~7–15 ms warm
  ├── 9 regex escalation checks (~0ms)
  │    escalate if: proper noun · time/day phrase · dietary ·
  │                 attribute keyword · sort intent · "near [Place]"
  ├── bge-small-en-v1.5 cosine over 1,028 categories  (~7ms)
  └── score < 0.35 → escalate to LLM
  
  hit  → skip LLM + where-resolver + curated → Typesense directly
  miss → LLM path unchanged (transparent to user)

Benchmark (75 test cases)

Path	Share of queries	Pass rate	p50 latency
Fast path hits	45.3% of queries	61.8%	~7ms
LLM (remainder)	54.7%	97%+	~1,400ms

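Fast path escalates correctly on all complex cases: hours, near [Place], attributes, sort, dietary, and vibe queries. Escalation is silent: the path returns null and the LLM takes over, so a miss costs only the latency win, never correctness. A hit is not guaranteed to be correct, though, as the 61.8% pass rate on fast-path hits shows.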

Precomputed assets: src/shared/category-embeddings-local.json — 1,028 bge-small-en-v1.5 embeddings, 8.3 MB, 384 dimensions.
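A minimal sketch of the two-stage gate described above, under stated assumptions. The function name fastPathCategory, the CategoryEmbedding shape, and the three regexes are hypothetical stand-ins (the real tryFastPath and its 9 checks are not shown in this summary). The query vector is passed in so the sketch does not depend on the embedding model.

```typescript
// Hypothetical shape; the real index lives in category-embeddings-local.json.
type CategoryEmbedding = { id: string; vector: number[] };

// Illustrative escalation patterns only, not the production regexes.
const ESCALATION_PATTERNS: RegExp[] = [
  /\bnear\s+[A-Z]/, // "near [Place]"
  /\b(open now|tonight|late[- ]night)\b/i, // time/day phrase
  /\b(best|cheapest|closest)\b/i, // sort intent
];

// From the diagram: a best score below 0.35 escalates to the LLM.
const SCORE_THRESHOLD = 0.35;

function vectorNorm(v: readonly number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const denom = vectorNorm(a) * vectorNorm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Returns a category id on a hit, or null to signal "escalate to the LLM".
function fastPathCategory(
  query: string,
  queryVector: readonly number[],
  index: readonly CategoryEmbedding[],
): string | null {
  // Stage 1: cheap regex checks; any match means the query is too complex.
  if (ESCALATION_PATTERNS.some((pattern) => pattern.test(query))) return null;

  // Stage 2: nearest category by cosine similarity.
  let bestId: string | null = null;
  let bestScore = -Infinity;
  for (const entry of index) {
    const score = cosineSimilarity(queryVector, entry.vector);
    if (score > bestScore) {
      bestScore = score;
      bestId = entry.id;
    }
  }
  return bestScore < SCORE_THRESHOLD ? null : bestId;
}
```

Returning null on every escalation keeps the caller simple: a null result means "run the unchanged LLM path".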

2. Typesense Pipeline Parallelism

PR: #70 · closes issue #63

resolveWhere (geocode the near anchor) and resolveCandidatesToIds (name→ID lookup for curated candidates) were running sequentially. They have no data dependency until the final document fetch, so we run them with Promise.all.

Before:  LLM → resolveWhere (400ms) → candidateLookup (200ms) → search
After:   LLM → [resolveWhere ‖ candidateLookup] → search
                         ↑ wall time ≈ max(where, lookup)

The areaFilter (from resolveWhere) is applied at the final document fetch step — it was always just a quality hint at the ID-resolution stage, so moving it later is safe and candidates don't leak across cities.

Timing impact on curated queries with a near anchor: 200–400ms saved.
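A rough sketch of the change, assuming stub versions of the two resolvers. resolveWhereStub, resolveCandidatesStub, and the 20ms/5ms delays are made up for illustration; only the Promise.all shape reflects what the PR describes.

```typescript
// Stand-ins for resolveWhere / resolveCandidatesToIds, which are not shown here.
type AreaFilter = { city: string };

// Records start/end order so the overlap is observable.
const callLog: string[] = [];

const pause = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function resolveWhereStub(near: string): Promise<AreaFilter> {
  callLog.push("where:start");
  await pause(20); // pretend geocoding latency
  callLog.push("where:end");
  return { city: near };
}

async function resolveCandidatesStub(names: string[]): Promise<string[]> {
  callLog.push("lookup:start");
  await pause(5); // pretend name→ID lookup latency
  callLog.push("lookup:end");
  return names.map((candidateName) => `id:${candidateName}`);
}

// Both calls start before either finishes, so wall time ≈ max(where, lookup).
async function resolveInParallel(near: string, names: string[]) {
  const [areaFilter, candidateIds] = await Promise.all([
    resolveWhereStub(near),
    resolveCandidatesStub(names),
  ]);
  // The area filter is applied later, at the final document fetch step.
  return { areaFilter, candidateIds };
}
```

Note that Promise.all rejects as soon as either call rejects. Whether the production code wants that or a per-call fallback is not stated in this summary.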

3. Typesense Fallback — Bulk Filter Drop

PR: #69 · closes issue #66

searchWithFallback peeled filters one-by-one when a query returned 0 results. For a query with 5+ filters, that's up to 9 sequential Typesense round trips (each ~20–40ms) before recovering results.

New step 3a: drop all non-category filters at once first. If that finds results, we're done. Only if it still returns 0 do we fall through to the per-filter diagnostic loop (step 3b).

Step 1: full query (all filters)
Step 2: drop time filters
Step 3a: drop all non-category filters at once   ← NEW
Step 3b: drop filters one-by-one                 ← fallback only

In the common case — where the combination of filters is too tight, not any single one — this collapses N serial calls into 1.
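The step order above can be sketched as follows. The Filter shape, the SearchFn and PeelFn signatures, and the name searchWithFallbackSketch are assumptions. Step 3b is left as a caller-supplied function because this summary does not describe the exact order of the per-filter loop.

```typescript
// Hypothetical filter shape; the real searchWithFallback types are not shown here.
type Filter = { kind: "category" | "time" | "attribute"; expr: string };
type SearchFn = (filters: Filter[]) => Promise<number>; // resolves to a hit count
type PeelFn = (filters: Filter[], search: SearchFn) => Promise<number>;

async function searchWithFallbackSketch(
  filters: Filter[],
  search: SearchFn,
  peelOneByOne: PeelFn,
): Promise<{ step: string; hits: number }> {
  // Step 1: full query with all filters.
  let hits = await search(filters);
  if (hits > 0) return { step: "1", hits };

  // Step 2: drop time filters (skipped if there were none, to avoid a duplicate call).
  const withoutTime = filters.filter((f) => f.kind !== "time");
  if (withoutTime.length < filters.length) {
    hits = await search(withoutTime);
    if (hits > 0) return { step: "2", hits };
  }

  // Step 3a: drop all non-category filters in a single round trip.
  const categoryOnly = withoutTime.filter((f) => f.kind === "category");
  if (categoryOnly.length < withoutTime.length) {
    hits = await search(categoryOnly);
    if (hits > 0) return { step: "3a", hits };
  }

  // Step 3b: per-filter loop, reached only if the bulk drop still found nothing.
  hits = await peelOneByOne(withoutTime, search);
  return { step: "3b", hits };
}
```

With one category filter, one time filter, and three attribute filters whose combination is too tight, this recovers results on the third call instead of peeling attribute filters serially.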

4. Typesense Candidate Lookup — Cap Tier Waterfall

PR: #68 · closes issue #65

lookupOne in candidate-lookup.ts had a 4-tier waterfall per candidate name:

1. cat + area filter
2. cat filter only
3. area only (no category)
4. global (no filters)

Tier 4 (global) was silently harmful: it returned the most globally-popular POI with that name, ignoring the user's city. "Blue Bottle Coffee" resolves to whichever location ranks highest globally — not the SF one the user wants.

New waterfall (2 tiers):

With categoryFilter: cat+area → cat
Without: area → null

Worst-case per-candidate Typesense calls: 4 → 2. Missed candidates surface in curated_missed in the debug panel; the regular Typesense search still covers them.
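A small sketch of the capped waterfall, assuming a hypothetical lookupOneSketch signature and an injected query function in place of the Typesense client. The real lookupOne in candidate-lookup.ts is not shown in this summary.

```typescript
// Hypothetical shapes for illustration only.
type LookupQuery = {
  name: string;
  categoryFilter: string | null;
  areaFilter: string | null;
};
type QueryFn = (q: LookupQuery) => Promise<string | null>; // POI id, or null on miss

async function lookupOneSketch(
  name: string,
  categoryFilter: string | null,
  areaFilter: string,
  query: QueryFn,
): Promise<string | null> {
  // With a category: cat+area, then cat only. Without: area only.
  const tiers: LookupQuery[] =
    categoryFilter !== null
      ? [
          { name, categoryFilter, areaFilter },
          { name, categoryFilter, areaFilter: null },
        ]
      : [{ name, categoryFilter: null, areaFilter }];

  for (const tier of tiers) {
    const id = await query(tier);
    if (id !== null) return id;
  }
  // No global tier: a miss returns null rather than the most popular POI worldwide.
  return null;
}
```

The tier list makes the worst case visible: at most two calls per candidate, and no tier ever runs with both filters absent.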

5. Infra — Express → Fastify

PR: #72 · closes issue #71

Migrated the server from Express 4.x to Fastify, converting routes to FastifyPluginAsync and middleware to Fastify hooks. SSE streaming routes use reply.hijack() + reply.raw to write directly to the socket.

The PR also switches the whole project to "type": "module" + moduleResolution: NodeNext. This requires .js extensions on all relative imports, which makes the TypeScript output match what Node actually resolves.

Why Fastify matters for perf: lower per-request overhead at concurrency (Fastify is ~2× faster than Express on throughput benchmarks), better built-in schema validation, and native async route handlers without next() chains.

6. Experiments — What We Tried That Didn't Work

These are documented as merged/open PRs so the work isn't lost, but the findings were negative for GPT-5.4.

6a. RAG-Hint Prompt Shortening

PR: #83 · closes issue #82

Hypothesis: Replace the 1,028-category list in the system prompt with a bge-small top-K shortlist (~40% fewer tokens) to reduce TTFT.

Result:

Mode	Pass rate	TTFT mean	TTFT p50	TTFT p95	Recall@15
Baseline (all 1,028)	97.3%	845ms	651ms	1,651ms	—
RAG-hint k=15	88.0%	836ms	740ms	1,427ms	100%

Why it failed: GPT-5.4 uses parallel prefill, so 3–4k fewer tokens saves negligible wall time on average. The p95 improved by 224ms, but the mean moved only 9ms (within noise) and the p50 regressed by 89ms. More importantly, accuracy dropped 9.3pp despite 100% recall@15. The LLM uses the full category list as taxonomy context to understand what isn't a category (i.e. what belongs in attributes instead). Remove that and the category/attribute boundary gets fuzzier.

The approach is more promising for local/smaller models (Qwen, quantized models) where prompt length meaningfully affects generation speed. The ragHintK option is wired in and defaults to 15; set to 0 to restore baseline.
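For reference, a minimal sketch of the shortlist step, assuming the retriever is a plain cosine top-K over the category embeddings. shortlistCategories and the CategoryVector shape are hypothetical names; only the ragHintK semantics (0 restores the full list) come from the text above.

```typescript
// Hypothetical shape for one entry of the category embedding index.
type CategoryVector = { id: string; vector: number[] };

function cosine(a: readonly number[], b: readonly number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return normA === 0 || normB === 0 ? 0 : dot / (normA * normB);
}

// Returns the category ids to place in the prompt, best match first.
function shortlistCategories(
  queryVector: readonly number[],
  index: readonly CategoryVector[],
  ragHintK: number,
): string[] {
  // ragHintK = 0 restores the baseline: every category goes into the prompt.
  if (ragHintK <= 0) return index.map((entry) => entry.id);

  return index
    .map((entry) => ({ id: entry.id, score: cosine(queryVector, entry.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, ragHintK)
    .map((entry) => entry.id);
}
```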

6b. Embedding-as-Category-Slot

PR: #80 · closes issue #75

Hypothesis: Remove the category list from the LLM prompt entirely; run bge-small in parallel with the LLM to fill categories[] while the LLM handles near, attributes, sort_by, etc.

Result:

Mode	Pass rate	Aggregate score	Category recall
Baseline (full prompt)	97.3%	98.6%	—
Embedding-slot k=1	68.0%	84.7%	65.7%
Embedding-slot k=3	41.3%	75.5%	77.1%

Why it failed: The embedder can't parse multi-signal queries. "family-friendly Mexican restaurant" → childrens_cafe (family-first semantic anchor beats the cuisine signal). "late-night tacos" → nightlife. Vibe queries ("romantic Italian") are especially bad. k=3 raises recall to 77% but tanks precision (false positives on single-category test cases).

It does better on single-intent queries (remap tag: 100%; simple category lookups like "sushi spot": 67%), but it is not viable as a full replacement.

7. Local Model Spike — Qwen 2.5 0.5B

PR: #79 · closes issues #73, #74

Infrastructure spike for running a local 0.5B model as a fast-path NL parser (target: <50ms, no network call).

Two optimisations that compound:

Grammar-constrained decoding (#73): Builds a GBNF JSON Schema grammar from CATEGORIES / CUISINES / BOOLEAN_ATTRIBUTES at startup. The sampler is physically prevented from producing enum values outside those sets — zero hallucinations on enum fields by construction.

RAG-lite retrieval (#74): Embed the query with bge-small (shared index), pass the top-30 categories to Qwen instead of all 1,028. Shorter prompt + smaller enum → faster constrained sampling.

What's validated: 100% recall@30 from the retriever, grammar compiles and runs cleanly, Qwen inference module loads and generates. The base model doesn't hallucinate (grammar prevents it) but outputs empty categories: [] — it has no exposure to Mapbox taxonomy in pre-training, so it can't map "coffee shop" to coffee_shop without fine-tuning.
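To make the grammar idea concrete, here is a small sketch that emits a GBNF grammar (the llama.cpp grammar format) restricting a categories array to a fixed enum. DEMO_CATEGORIES and buildCategoryGrammar are illustrative names, and the real builder also covers CUISINES and BOOLEAN_ATTRIBUTES, which this sketch omits. It assumes enum values are plain snake_case ids with no quotes or backslashes.

```typescript
// Tiny stand-in for the real CATEGORIES list (1,028 entries).
const DEMO_CATEGORIES: readonly string[] = ["coffee_shop", "plumber", "urgent_care"];

// Render one enum value as a GBNF literal that matches the JSON string "value".
// Assumes the value needs no escaping (snake_case ids only).
function gbnfJsonString(value: string): string {
  return `"\\"${value}\\""`;
}

// Build a grammar for objects of the form {"categories": [<zero or more enum values>]}.
function buildCategoryGrammar(categories: readonly string[]): string {
  const alternatives = categories.map(gbnfJsonString).join(" | ");
  return [
    `root ::= "{" ws "\\"categories\\":" ws "[" ws (category (ws "," ws category)*)? ws "]" ws "}"`,
    `category ::= ${alternatives}`,
    `ws ::= [ \\t\\n]*`,
  ].join("\n");
}

const demoGrammar = buildCategoryGrammar(DEMO_CATEGORIES);
```

Because the category rule lists every allowed literal, the sampler can only emit those strings or an empty array, which matches the empty categories: [] behaviour described above for the untuned base model.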

PR #79 itself can merge independently — the code (grammar builder, RAG retriever, Qwen inference module) stands alone and is worth having in the tree. What's blocked on PR #57 (Ian's work) is the end-to-end accuracy story: can a local 0.5B model replace GPT-5.4 for this task? That question can't be answered until someone runs the LoRA training pipeline from #57 and drops the resulting GGUF (~380 MB, gitignored) into data/training/qwen-artifacts/. PR #57 has everything needed — 45k labelled rows, the train_qwen.py trainer, and merge_and_convert.py to produce the GGUF — it's just a draft and the artifact hasn't been generated yet.

8. Phrase-Based Attribute Extraction

PR: #85 · closes issue #84

The system prompt previously included the full list of 131+ Enrich boolean attribute names (offering_vegan_options, family_friendly, accommodation_wheelchair_accessible_entrance, …). The LLM had to memorise canonical names and map user phrases to them at inference time — a reliable source of failures for smaller models that don't have the Mapbox taxonomy in their training data.

New contract: the LLM outputs free-text phrases; a new phrase-mapper.ts module resolves them to canonical names.

User: "family-friendly Mexican restaurant with kids menu"
  ↓
LLM outputs:
  categories:        ["mexican_restaurant"]
  attribute_phrases: ["family-friendly", "kids menu"]
  ← no canonical Enrich names anywhere

phrase-mapper (module-load, ~0ms):
  "family-friendly" → exact match → family_friendly
  "kids menu"       → exact match → offering_childrens_menu

result.attributes: [family_friendly, offering_childrens_menu]  ✓

The mapper builds two indexes at startup from attributes.ts:

Exact match after normalising hyphens/case ("family-friendly" → "family friendly")
Longest word-boundary substring ("good desserts" contains "good dessert" → known_for_dessert before offering_dessert)

The same mechanism handles exclude_attribute_phrases with the audience guard preserved (vibe-leak protection still applies on the mapped canonical names).
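A minimal sketch of the two-index lookup, not the actual phrase-mapper.ts. The PHRASES table is a hypothetical four-entry stand-in for what attributes.ts provides, and tolerating a trailing plural "s" is a guess at how "good desserts" matches "good dessert".

```typescript
// Hypothetical phrase → canonical-name table; the real one is derived from attributes.ts.
const PHRASES: Record<string, string> = {
  "family friendly": "family_friendly",
  "kids menu": "offering_childrens_menu",
  "good dessert": "known_for_dessert",
  "dessert": "offering_dessert",
};

// Lowercase, turn hyphens/underscores into spaces, collapse whitespace.
function normalisePhrase(phrase: string): string {
  return phrase.toLowerCase().replace(/[-_]+/g, " ").replace(/\s+/g, " ").trim();
}

function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Index 1: exact match on the normalised phrase.
const exactIndex = new Map<string, string>(
  Object.entries(PHRASES).map(([phrase, canonical]) => [normalisePhrase(phrase), canonical]),
);

// Index 2: word-boundary patterns, longest phrase first so the most specific wins.
const substringIndex = Array.from(exactIndex.entries())
  .sort((a, b) => b[0].length - a[0].length)
  .map(([phrase, canonical]) => ({
    // Assumption: allow an optional plural "s" before the closing boundary.
    pattern: new RegExp(`\\b${escapeRegExp(phrase)}s?\\b`),
    canonical,
  }));

// Resolve one free-text phrase to a canonical attribute name, or null if unknown.
function mapPhrase(phrase: string): string | null {
  const key = normalisePhrase(phrase);
  const exact = exactIndex.get(key);
  if (exact !== undefined) return exact;
  for (const { pattern, canonical } of substringIndex) {
    if (pattern.test(key)) return canonical;
  }
  return null;
}
```

Both indexes are built once at module load, so each lookup is a map read plus, at worst, one pass over the patterns.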

Benchmark (75 cases, gpt-5.4-mini)

Metric	Baseline (full list)	Phrase-based
Pass rate	97.3%	94.7% (−2.6pp)
Aggregate score	~98.6%	97.4% (−1.2pp)
Attribute slot	85.8%	98.5% (+12.7pp)
Prompt tokens	~3,800	~2,900 (~900 fewer)

The small overall regression is on vibe-exclusion edge cases unrelated to attribute extraction. The attribute slot improvement is the point.

Why this matters for Qwen / local models: the failures in PR #57 (family_friendly, black_owned, wheelchair, laptop_work_friendly) were all cases where the 0.5B model didn't know our canonical names. With phrase extraction, those queries pass because the model only needs to echo what the user said.

Open PRs

PR	Description	Status
#72	Express→Fastify + ESM migration	Ready to merge
#68	Cap lookup tiers (2 tiers max)	Ready to merge
#69	Bulk filter drop before one-by-one	Ready to merge
#70	Parallel where + candidates	Ready to merge
#81	Fast path wired into production	Ready to merge (needs #72 first)
#77	Fast path eval spike	Superseded by #81
#79	Qwen + RAG + grammar infra	Ready to merge; accuracy eval needs #57 GGUF
#80	Embedding-as-category-slot	Documented negative result
#83	RAG-hint prompt shortening	Documented negative result
#85	Phrase-based attribute extraction	Ready to merge

Suggested merge order

#72 (Fastify)  →  #68 + #69 + #70 (Typesense perf)  →  #81 (fast path)  →  #85 (phrase attrs)

PRs #68/69/70 are independent and can merge in any order after #72. PR #85 depends only on main and can merge any time — it's backwards-compatible (the attributes canonical field still works alongside attribute_phrases). PR #83 ships with an opt-out knob (ragHintK: 0) and is low-risk to merge even with the negative finding.

mattpodwysocki/places-demo-perf-work.md
