This is a technical briefing for Locally's product taxonomy classification system. It explains the architecture, the ML concepts involved, and how the pieces fit together. Written for a technical leader who needs to understand the system deeply enough to make planning decisions and avoid pitfalls.
Locally has ~891,000 products. Each one needs to be assigned to a category in a tree of 5,595 Google Product Taxonomy categories (like "Apparel & Accessories > Shoes > Athletic Shoes > Running Shoes"). Currently ~311,000 products are uncategorized, and the company plans to scale product count 10x over the next 2 years. Manual classification doesn't scale. We need an automated system that's accurate enough to trust, and smart enough to know when it's not sure.
Think of it like hiring for a job. Stage 1 is the recruiter screening resumes to build a shortlist. Stage 2 is the hiring committee evaluating each shortlisted candidate in depth.
Goal: For a given product, generate a shortlist of 5-30 plausible categories from the full 5,595.
Key principle: Multiple independent sources each suggest candidates. No single source acts as a gatekeeper. Think of it like asking 5 different experts "what category could this product be in?" and collecting ALL of their suggestions before evaluating any of them.
The five signal sources:
- Embeddings (FAISS) — "What does this product's text mean?"
- Affiliate mapping — "Has a human at an affiliate network already classified this?"
- Brand taxonomy — "What does the brand itself call this product?"
- Co-occurrence — "What categories do products in the same stores belong to?"
- Brand frequency — "What categories do other products from this brand belong to?"
Each source independently suggests categories. We take the union of all suggestions. This is critical — if the affiliate network says "Water Shoes" but the text embedding didn't rank it in its top results, "Water Shoes" still enters the candidate pool. No signal gets silenced.
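In code, the union step is deliberately tiny. A minimal sketch, assuming each signal source exposes a callable that returns its candidate set (the source names and return values here are illustrative, not the production interfaces):

```python
# Sketch of the Stage 1 union step. Each real source would query its own
# data (affiliate feed, brand catalog, FAISS index, ...).

def recall_candidates(product, sources):
    """Union candidate categories from every signal source.

    `sources` is a list of callables, each returning a set of
    category names it considers plausible for this product.
    """
    candidates = set()
    for suggest in sources:
        candidates |= suggest(product)  # union: no source is a gatekeeper
    return candidates

# Toy sources for one hypothetical product:
affiliate = lambda p: {"Water Shoes"}
embedding = lambda p: {"Sandals", "Outdoor Shoes", "Water Shoes"}
brand_tax = lambda p: {"Waterfront Footwear"}

pool = recall_candidates({}, [affiliate, embedding, brand_tax])
# "Water Shoes" survives even though only one source proposed it strongly.
```

The point of the set union is exactly the principle above: a category proposed by any single source enters the pool.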
Goal: Score every candidate from Stage 1 and pick the best one.
For each (product, candidate category) pair, we extract ~12-15 numerical features from all signal sources and feed them into a trained XGBoost model. The model outputs a probability: "How likely is this the correct category?"
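A hedged sketch of what one feature vector might look like. The feature names mirror the signals described in this document, but the exact set and the `signals` data structure are illustrative, not the production schema:

```python
# Hypothetical feature extraction for one (product, candidate) pair.
# The real system extracts ~12-15 features; this shows the shape.

def extract_features(product, category, signals):
    return {
        "embed_sim": signals["embed_sim"].get(category, 0.0),
        "affiliate_match": int(category in signals["affiliate"]),
        "brand_tax_match": int(category in signals["brand_tax"]),
        "co_occur_score": signals["co_occur"].get(category, 0.0),
        "brand_freq": signals["brand_freq"].get(category, 0.0),
        "brand_diversity": signals["brand_diversity"],
        "category_depth": category.count(">") + 1,  # depth in the taxonomy path
    }

signals = {
    "embed_sim": {"Shoes > Water Shoes": 0.68},
    "affiliate": {"Shoes > Water Shoes"},
    "brand_tax": set(),
    "co_occur": {"Shoes > Water Shoes": 0.35},
    "brand_freq": {"Shoes > Water Shoes": 0.40},
    "brand_diversity": 8,
}
feats = extract_features({}, "Shoes > Water Shoes", signals)
```

Each candidate gets one such vector; the XGBoost model scores them all and the highest probability wins.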
Why a learned model instead of fixed weights? Because the right answer depends on context. For single-category brands like Brooks Running, the brand signal is almost always right. For diversified brands like Nike, the brand signal is weak. A fixed-weight formula treats both cases the same. A learned model figures out "when brand diversity is low, trust brand signal more."
Visual analogy: Imagine a massive library where every book has a GPS coordinate. Books about similar topics are physically close together on a giant map. "Running shoes" is near "athletic footwear" and "jogging sneakers" — even though the words are completely different — because they mean the same thing.
Text embeddings do this with words and sentences. A neural network (Vertex AI's text-embedding-004) converts text into a list of 768 numbers — a coordinate in 768-dimensional space. Texts with similar meanings end up at nearby coordinates.
We embed every category name ("Athletic Shoes > Running Shoes") and every product description ("Brooks Ghost 15 Men's Road Running Shoe"). Then we find which category coordinates are closest to each product's coordinate.
The math: "Closeness" is measured by cosine similarity — the angle between two vectors. If two vectors point in almost the same direction, cosine similarity is close to 1.0 (very similar). If they point in perpendicular directions, it's close to 0.0 (unrelated).
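The formula is small enough to show directly. This is the textbook definition (the production system computes it over 768-dimensional Vertex AI vectors; 2-D vectors are used here so the geometry is easy to see):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Vectors pointing the same direction score 1.0, regardless of length:
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
# Perpendicular vectors score 0.0:
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```

Note that length cancels out: only direction (meaning) matters, not magnitude.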
Strength: Works on ANY product, even brand-new ones with no sales history. It understands language. Weakness: Only understands text — can't incorporate business knowledge like "stores that sell this product also sell running shoes." Can be fooled by misleading descriptions.
Visual analogy: Imagine you have 5,595 pins on a map (one per category). You drop a new pin (a product) and need to find the 20 nearest existing pins. You could measure the distance to all 5,595 — but FAISS uses clever spatial indexing (like dividing the map into grid squares) so it only checks the nearby squares. Much faster.
FAISS is a library for fast nearest-neighbor search in high-dimensional space. Given a product's embedding vector, it finds the closest category vectors in milliseconds.
In our system: FAISS is used ONLY in Stage 1 (candidate recall) to propose embedding-based candidates. It's one of five voices suggesting categories — not a filter.
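To make the operation concrete, here is the brute-force equivalent of what FAISS accelerates: rank every category vector by similarity to the query and keep the top k. This sketch is pure Python over toy 2-D vectors; FAISS does the same thing over 5,595 vectors in 768 dimensions with spatial indexing instead of a full scan:

```python
import math

def top_k(query, index, k):
    """Brute-force nearest-neighbor search: what FAISS speeds up."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    scored = sorted(index.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

categories = {
    "Running Shoes": [0.9, 0.1],
    "Sandals":       [0.5, 0.5],
    "Tents":         [0.0, 1.0],
}
nearest = top_k([0.8, 0.2], categories, 2)
print(nearest)  # "Running Shoes" ranks first
```

The brute-force version is O(n) per query; FAISS's indexing makes the same lookup effectively sub-linear, which matters at 891k products and growing.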
Visual analogy: Imagine a panel of judges scoring Olympic diving. Each judge is mediocre individually — they each focus on one thing (splash, form, difficulty). But when you combine their scores intelligently, the panel is highly accurate. Now imagine the judges can also learn from past competitions — after watching 581,000 dives with known correct scores, they learn which judge to trust more for which type of dive.
XGBoost builds a "forest" of decision trees, where each tree corrects the mistakes of the previous ones. A single decision tree is like a flowchart:
Is affiliate_match = true?
├─ YES: Is affiliate_confidence > 0.9?
│ ├─ YES: Probably correct (score +0.8)
│ └─ NO: Maybe correct (score +0.3)
└─ NO: Is brand_diversity < 5?
├─ YES: Trust brand signal (score +0.5)
└─ NO: Rely on embedding (score +0.2)
XGBoost builds hundreds of these trees, each one small and weak, but together they form a powerful predictor. The "gradient boosted" part means each new tree specifically focuses on the cases the previous trees got wrong — like a student reviewing their mistakes.
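The boosting idea fits in a few lines. A minimal sketch using one-split "stumps" on a 1-D toy problem: each stage fits the residuals (errors) left by the stages before it. This is the core mechanism only; real XGBoost adds regularization, second-order gradients, and hundreds of deeper trees:

```python
# Toy gradient boosting: each new stump fits the current residuals,
# i.e. it focuses on what the ensemble so far got wrong.

def fit_stump(xs, residuals):
    """Best single-split stump: try each midpoint, minimize squared error."""
    best = None
    for i in range(len(xs) - 1):
        t = (xs[i] + xs[i + 1]) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lv if x <= t else rv)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, rounds=10, lr=0.5):
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

model = boost([0, 1, 2, 3], [0.0, 0.0, 1.0, 1.0])
```

After ten rounds the ensemble separates the two groups almost perfectly, even though each individual stump only moves the prediction partway (the learning rate is the "small and weak" part).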
Why XGBoost for this problem:
- Handles mixed data types (numbers, yes/no flags, counts) without preprocessing
- Learns non-linear relationships ("when A AND B, the answer changes")
- Fast: classifies a product in under 1 millisecond
- Small: model is a few megabytes, runs on a regular server
- Interpretable: you can see which features drove each decision (via SHAP)
- Battle-tested: this is what Shopify, Amazon, and Google Shopping use for product taxonomy
Visual analogy: Imagine you're on a jury and you reach a verdict. SHAP is like each juror writing down "here's how much my testimony moved the verdict toward guilty vs. innocent." It decomposes the final decision into each feature's individual contribution.
For our system, when the model says "Running Shoes with 81% confidence," SHAP tells you:
(Contributions are measured on top of the model's base rate, so they need not sum to exactly 81%.)
- Affiliate match pushed it +25% toward Running Shoes
- Brand frequency pushed it +18% toward Running Shoes
- Embedding similarity pushed it +15% toward Running Shoes
- Co-occurrence pushed it +8% toward Running Shoes
- Category depth pushed it -3% (slight penalty for being very specific)
This powers the human review queue — reviewers can see why the model made each suggestion, not just the suggestion itself. This builds trust and helps catch systematic errors.
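SHAP's key property is additivity: the base rate plus one contribution per feature equals the final score, which is what makes it usable in a review UI. A toy sketch with illustrative numbers from the example above (the base rate of 0.18 is assumed here; real values come from a SHAP explainer over the trained model):

```python
# SHAP decomposition is additive: base value + per-feature contributions
# = final prediction. All numbers here are illustrative.

base_rate = 0.18  # hypothetical average P(correct) across candidates
contributions = {
    "affiliate_match":      +0.25,
    "brand_frequency":      +0.18,
    "embedding_similarity": +0.15,
    "co_occurrence":        +0.08,
    "category_depth":       -0.03,
}

prediction = base_rate + sum(contributions.values())
print(f"P(correct) = {prediction:.2f}")

# A review UI can rank features by |contribution| to explain the call:
ranked = sorted(contributions, key=lambda f: abs(contributions[f]), reverse=True)
```

Sorting by absolute contribution gives reviewers the "top reasons" list shown above, strongest evidence first.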
Visual analogy: Think of a standardized test that stays the same every year. Students (model versions) change, but the test doesn't. This lets you compare "the class of 2025 scored 78%, the class of 2026 scored 83%" — because they took the same test.
We set aside ~58,000 labeled products as a "gold set" that the model NEVER trains on. Every time we change the model, we run it against this same set and measure:
- Exact match accuracy: Did it pick the right category?
- Hierarchical accuracy: If wrong, how close? (Predicting "Athletic Shoes" when the answer is "Running Shoes" is much better than predicting "Camping Gear" — hierarchical scoring gives partial credit based on tree distance)
- Top-3 accuracy: Was the right answer in the model's top 3 guesses?
- Confidence calibration: When the model says "90% sure," is it actually right about 90% of the time?
Regression gates prevent deploying a worse model. If any aggregate metric drops, or any individual category's accuracy falls by more than 10%, the update is automatically rejected.
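A sketch of the harness and gate. Metric names follow the list above; the data structures (ranked prediction lists keyed by product ID) are assumptions, not the production schema:

```python
# Score a candidate model on the frozen gold set, then gate deployment
# on no-regression against the previous model's scores.

def evaluate(predictions, gold):
    """predictions: {product_id: [cat1, cat2, cat3]} ranked guesses."""
    n = len(gold)
    exact = sum(preds[0] == gold[pid] for pid, preds in predictions.items())
    top3 = sum(gold[pid] in preds[:3] for pid, preds in predictions.items())
    return {"exact": exact / n, "top3": top3 / n}

def passes_gate(new_scores, old_scores):
    """Reject any model that scores lower than its predecessor on any metric."""
    return all(new_scores[m] >= old_scores[m] for m in old_scores)

gold = {"p1": "Running Shoes", "p2": "Sandals"}
preds = {"p1": ["Running Shoes", "Sandals", "Boots"],
         "p2": ["Boots", "Sandals", "Tents"]}
scores = evaluate(preds, gold)   # exact = 0.5, top3 = 1.0
ok = passes_gate(scores, {"exact": 0.5, "top3": 0.9})
```

The real harness adds hierarchical accuracy, calibration, and the per-category 10% drop check, but the shape is the same: score, compare, block on regression.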
Visual analogy: Sorting mail. Pass 1 is the post office sorting letters into the right city (fast, high-level, easy to get right). Pass 2 is the local carrier sorting into the right street and house number (slower, more specific, sometimes needs a human to read the handwriting).
- Pass 1 (inline, during catalog ingestion): Assign each product to the broadest correct category with high confidence. "This is definitely a Shoe." Fast, automated, runs on every new product.
- Pass 2 (scheduled, weekly batch): Look at broadly-categorized products and try to push them deeper. "This Shoe is probably a Running Shoe because of X, Y, Z evidence." Outputs suggestions for human review.
Visual analogy: A chef improving a recipe. They cook it (train the model), taste-testers try it (eval against gold set), diners give feedback (human corrections), and the chef adjusts the recipe accordingly. Each iteration gets better.
Train model on 581k labeled products
→ Evaluate against gold set (score: 75%)
→ Deploy, classify new products
→ Humans correct mistakes (gold-standard feedback)
→ Add corrections to training data
→ Retrain model
→ Evaluate again (score: 78% — improvement!)
→ Deploy updated model
→ Repeat
The model gets better every cycle because human corrections are the highest-quality training signal. Over time, the system becomes increasingly accurate for Locally's specific product universe.
- What: Convert product name + description to a vector, find nearest category vectors
- Strength: Works on any product, even cold-start (no history needed)
- Weakness: Only understands text, not business context
- Analogy: A librarian who reads the book's back cover and shelves it by topic
- What: Affiliate networks (CJ, ShareASale) already classify products in their own taxonomy. We map their categories to ours.
- Strength: Human-classified, very reliable. Highest confidence signal.
- Weakness: Only exists for products in affiliate feeds (~40% coverage)
- Analogy: A translator who already has the answer in a different language — just needs to convert
- What: Brands send their own category hierarchy in catalog feeds (e.g., Keen: "Waterfront Footwear > Closed Toe Sandal")
- Strength: The brand knows its own products best. Most authoritative source.
- Weakness: Not all brands include taxonomy, formats vary wildly, needs normalization
- Analogy: The manufacturer's label on the product box
- What: "Stores that carry this product also carry a lot of Running Shoes, so this is probably a Running Shoe"
- Strength: Captures domain knowledge from retail patterns
- Weakness: Noisy for multi-category stores (Walmart carries everything). Doesn't work for new products with no inventory.
- Analogy: "You are the company you keep" — judging a product by its shelf neighbors
- What: "90% of Brooks products are Running Shoes, so a new Brooks product is probably a Running Shoe"
- Strength: Very strong for focused brands
- Weakness: Useless for diversified brands (Nike, Columbia). New brands have no history.
- Analogy: Stereotyping by family — "all the Smiths are doctors, so this new Smith is probably a doctor"
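The brand-frequency signal is simple enough to sketch end to end. The data shape here (a list of already-labeled (brand, category) pairs) is an assumption for illustration:

```python
from collections import Counter

# Hypothetical brand-frequency signal: the share of a brand's labeled
# products in each category becomes a prior for that brand's new products.

def brand_frequency(brand, labeled_products):
    cats = Counter(cat for b, cat in labeled_products if b == brand)
    total = sum(cats.values())
    return {cat: n / total for cat, n in cats.items()} if total else {}

labeled = ([("Brooks", "Running Shoes")] * 9
           + [("Brooks", "Apparel"),
              ("Nike", "Running Shoes"), ("Nike", "Basketball Shoes")])
prior = brand_frequency("Brooks", labeled)  # Running Shoes: 0.9
```

For a focused brand like Brooks the distribution is sharply peaked; for a diversified brand it is nearly flat, which is exactly what the brand-diversity feature lets the ranker detect. An unknown brand returns an empty dict, i.e. the signal abstains.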
The original prototype used embeddings to pick top-5 candidates, then scored only those 5 with other signals. If the right answer wasn't in the embedding's top 5 (out of 5,595 categories!), the affiliate signal — the most reliable one — couldn't rescue it. Fix: All signals propose candidates independently, union them, then rank.
A hand-tuned "affiliate=5, embedding=3" weighting treats every product the same. But affiliate data is gold for footwear and unreliable for accessories. Brand signal is decisive for Brooks and meaningless for Nike. Fix: Learned model discovers context-dependent weights.
If the gold set (test data) is ever used for training, your eval metrics are meaningless — you're grading the student with the answer key they already studied. Fix: Freeze the gold set, never train on it, audit for leakage.
Predicting "Athletic Shoes" when the answer is "Running Shoes" is nearly right — they're parent/child in the tree. Predicting "Camping Gear" is a disaster. A flat accuracy metric treats both errors the same. Fix: Hierarchical accuracy gives partial credit for near-misses.
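One way to implement the partial credit is to walk both category paths and decay the score with tree distance. The decay schedule below (25% per hop, floored at zero) is illustrative, not a claim about the production formula:

```python
# Hedged sketch of hierarchical scoring: credit decays with the number
# of tree hops between the predicted and actual categories.

def path(category):
    return category.split(" > ")

def hierarchical_credit(predicted, actual):
    p, a = path(predicted), path(actual)
    shared = 0
    for x, y in zip(p, a):
        if x != y:
            break
        shared += 1
    distance = (len(p) - shared) + (len(a) - shared)  # hops via common ancestor
    return max(0.0, 1.0 - 0.25 * distance)

# Parent/child miss keeps most of the credit:
near = hierarchical_credit("Apparel > Shoes > Athletic Shoes",
                           "Apparel > Shoes > Athletic Shoes > Running Shoes")
# A different branch entirely scores zero:
far = hierarchical_credit("Camping Gear", "Apparel > Shoes > Athletic Shoes")
```

With flat accuracy both errors would score 0.0; hierarchical scoring separates the near-miss from the disaster.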
A model retrained on bad data (mislabeled corrections, biased sample) could be worse than the previous version. Without automated checks, you'd deploy it and not notice until classification quality degrades. Fix: Eval harness with automated regression gates blocks any model that scores lower than its predecessor.
A model that says "90% confident" but is only right 60% of the time is dangerous — you'd trust it when you shouldn't. Fix: Measure calibration explicitly and require it stays within tolerance.
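A calibration check can be as simple as bucketing predictions by stated confidence and comparing each bucket's average confidence to its actual hit rate. The bin width and the shape of the inputs are illustrative assumptions:

```python
# Sketch of a calibration check: group predictions into confidence bins
# and report the worst |average confidence - actual accuracy| gap.

def calibration_gap(results, bin_width=0.1):
    """results: list of (confidence, was_correct) pairs."""
    bins = {}
    for conf, correct in results:
        b = min(int(conf / bin_width), int(1 / bin_width) - 1)
        bins.setdefault(b, []).append((conf, correct))
    worst = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        worst = max(worst, abs(avg_conf - accuracy))
    return worst

# The dangerous case from the text: says 90%, right only 60% of the time.
overconfident = [(0.9, True)] * 6 + [(0.9, False)] * 4
gap = calibration_gap(overconfident)  # 0.3 gap in the 0.9 bin
```

A tolerance on this gap (say, reject if worst-bin gap exceeds some threshold) slots directly into the same regression gates used for accuracy.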
NEW PRODUCT ARRIVES (via catalog feed)
│
▼
┌─────────────────────────┐
│ STAGE 1: Candidate │ "Who could this product be?"
│ Recall │
│ │ 5 experts each suggest categories:
│ Affiliate: Water Shoes │ - Affiliate network says Water Shoes
│ Brand tax: Waterfront │ - Keen's own catalog says Waterfront Footwear
│ Embedding: Sandals, │ - Text similarity says Sandals (top), Water Shoes (#8)
│ Outdoor Shoes, ... │ - Stores say Sandals, Water Shoes, Hiking
│ Co-occur: Sandals, │ - Other Keen products are mostly Outdoor Footwear
│ Water Shoes, Hiking │
│ Brand freq: Outdoor │ Union: {Water Shoes, Sandals, Waterfront,
│ Footwear │ Outdoor Shoes, Hiking, ...} = 12 candidates
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ STAGE 2: Learned │ "Score each candidate across ALL signals"
│ Ranker (XGBoost) │
│ │ For each of 12 candidates, build feature vector:
│ Water Shoes: │ embed=0.68, affiliate=YES(0.95), brand_tax=YES(0.85),
│ → P(correct) = 0.91 │ co_occur=0.35, brand_freq=0.40, brand_div=8, depth=4
│ │
│ Sandals: │ embed=0.87, affiliate=NO, brand_tax=NO,
│ → P(correct) = 0.23 │ co_occur=0.40, brand_freq=0.15, brand_div=8, depth=3
│ │
│ (... 10 more ...) │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ DECISION │
│ │ Winner: Water Shoes (0.91 confidence)
│ High confidence (>0.7) │ → Auto-assign category + tags
│ → Auto-assign │ → SHAP: "affiliate match drove 40% of decision,
│ │ brand taxonomy drove 30%, embedding 15%"
│ Low confidence (<0.5) │
│ → Queue for human review│
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ WEEKLY: Pass 2 Audit │ "Can we be more specific?"
│ │
│ Product is in "Shoes" │ Model suggests: "Running Shoes" (child category)
│ (broad category) │ Evidence: brand=Brooks, co-occur=running stores
│ │ → Goes to human review sheet with SHAP explanation
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ HUMAN REVIEW │ Reviewer sees suggestion + evidence
│ (Google Sheet) │ Approves / Rejects / Overrides
│ │ Corrections → new training data
│ │ → Model gets smarter next cycle
└─────────────────────────┘
| Component | Technology | Role |
|---|---|---|
| Embeddings | Vertex AI text-embedding-004 | Convert text to meaning-vectors |
| Vector search | FAISS | Fast nearest-neighbor lookup |
| Learned ranker | XGBoost or LightGBM | Score candidates using all features |
| Explainability | SHAP | Show why model made each decision |
| Data warehouse | BigQuery | Product data, features, eval results |
| Human review | Google Sheets | Approve/reject/override suggestions |
| Alerts | Slack | Daily stats, eval scorecards |
| Language | Python | All runtime code |
Parent: FRG-116 "Build Taxonomy Collector"
Key sub-issues:
- FRG-118: Database schema
- FRG-119: Populate Google taxonomy data
- FRG-125/126: Brand taxonomy extraction + Rosetta Stone extension
- FRG-132: Eval framework (gold set + scoring harness) — BUILD FIRST
- FRG-124: Audit pipeline (the learned ranker)
- FRG-130: Regular audit schedule + retraining
- FRG-131: Integrate classifier into catalog ingestion pipeline