This is a technical briefing for Locally's product taxonomy classification system. It explains the architecture, the ML concepts involved, and how the pieces fit together. Written for a technical leader who needs to understand the system deeply enough to make planning decisions and avoid pitfalls.
Locally has ~891,000 products. Each one needs to be assigned to a category in a tree of 5,595 Google Product Taxonomy categories (like "Apparel & Accessories > Shoes > Athletic Shoes > Running Shoes"). Currently ~311,000 products are uncategorized, and the company plans to scale product count 10x over the next 2 years. Manual classification doesn't scale. We need an automated system that's accurate enough to trust, and smart enough to know when it's not sure.
Think of it like hiring for a job. Stage 1 is the recruiter screening resumes to build a shortlist. Stage 2 is the hiring committee evaluating each shortlisted candidate in depth.
Goal: For a given product, generate a shortlist of 5-30 plausible categories from the full 5,595.
Key principle: Multiple independent sources each suggest candidates. No single source acts as a gatekeeper. Think of it like asking 5 different experts "what category could this product be in?" and collecting ALL of their suggestions before evaluating any of them.
The five signal sources:
- Embeddings (FAISS) — "What does this product's text mean?"
- Affiliate mapping — "Has a human at an affiliate network already classified this?"
- Brand taxonomy — "What does the brand itself call this product?"
- Co-occurrence — "What categories do products in the same stores belong to?"
- Brand frequency — "What categories do other products from this brand belong to?"
Each source independently suggests categories. We take the union of all suggestions. This is critical — if the affiliate network says "Water Shoes" but the text embedding didn't rank it in its top results, "Water Shoes" still enters the candidate pool. No signal gets silenced.
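In code, the union step is deliberately tiny. A minimal sketch, assuming each signal source exposes a callable that returns its candidate set (the source names and return values here are illustrative, not the production interfaces):

```python
# Sketch of the Stage 1 union step. Each real source would query its own
# data (affiliate feed, brand catalog, FAISS index, ...).

def recall_candidates(product, sources):
    """Union candidate categories from every signal source.

    `sources` is a list of callables, each returning a set of
    category names it considers plausible for this product.
    """
    candidates = set()
    for suggest in sources:
        candidates |= suggest(product)  # union: no source is a gatekeeper
    return candidates

# Toy sources for one hypothetical product:
affiliate = lambda p: {"Water Shoes"}
embedding = lambda p: {"Sandals", "Outdoor Shoes", "Water Shoes"}
brand_tax = lambda p: {"Waterfront Footwear"}

pool = recall_candidates({}, [affiliate, embedding, brand_tax])
# "Water Shoes" survives even though only one source proposed it strongly.
```

The point of the set union is exactly the principle above: a category proposed by any single source enters the pool.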
Goal: Score every candidate from Stage 1 and pick the best one.
For each (product, candidate category) pair, we extract ~12-15 numerical features from all signal sources and feed them into a trained XGBoost model. The model outputs a probability: "How likely is this the correct category?"
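A hedged sketch of what one feature vector might look like. The feature names mirror the signals described in this document, but the exact set and the `signals` data structure are illustrative, not the production schema:

```python
# Hypothetical feature extraction for one (product, candidate) pair.
# The real system extracts ~12-15 features; this shows the shape.

def extract_features(product, category, signals):
    return {
        "embed_sim": signals["embed_sim"].get(category, 0.0),
        "affiliate_match": int(category in signals["affiliate"]),
        "brand_tax_match": int(category in signals["brand_tax"]),
        "co_occur_score": signals["co_occur"].get(category, 0.0),
        "brand_freq": signals["brand_freq"].get(category, 0.0),
        "brand_diversity": signals["brand_diversity"],
        "category_depth": category.count(">") + 1,  # depth in the taxonomy path
    }

signals = {
    "embed_sim": {"Shoes > Water Shoes": 0.68},
    "affiliate": {"Shoes > Water Shoes"},
    "brand_tax": set(),
    "co_occur": {"Shoes > Water Shoes": 0.35},
    "brand_freq": {"Shoes > Water Shoes": 0.40},
    "brand_diversity": 8,
}
feats = extract_features({}, "Shoes > Water Shoes", signals)
```

Each candidate gets one such vector; the XGBoost model scores them all and the highest probability wins.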
Why a learned model instead of fixed weights? Because the right answer depends on context. For single-category brands like Brooks Running, the brand signal is almost always right. For diversified brands like Nike, the brand signal is weak. A fixed-weight formula treats both cases the same. A learned model figures out "when brand diversity is low, trust brand signal more."
Visual analogy: Imagine a massive library where every book has a GPS coordinate. Books about similar topics are physically close together on a giant map. "Running shoes" is near "athletic footwear" and "jogging sneakers" — even though the words are completely different — because they mean the same thing.
Text embeddings do this with words and sentences. A neural network (Vertex AI's text-embedding-004) converts text into a list of 768 numbers — a coordinate in 768-dimensional space. Texts with similar meanings end up at nearby coordinates.
We embed every category name ("Athletic Shoes > Running Shoes") and every product description ("Brooks Ghost 15 Men's Road Running Shoe"). Then we find which category coordinates are closest to each product's coordinate.
The math: "Closeness" is measured by cosine similarity — the angle between two vectors. If two vectors point in almost the same direction, cosine similarity is close to 1.0 (very similar). If they point in perpendicular directions, it's close to 0.0 (unrelated).
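The formula is small enough to show directly. This is the textbook definition (the production system computes it over 768-dimensional Vertex AI vectors; 2-D vectors are used here so the geometry is easy to see):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Vectors pointing the same direction score 1.0, regardless of length:
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
# Perpendicular vectors score 0.0:
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```

Note that length cancels out: only direction (meaning) matters, not magnitude.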
Strength: Works on ANY product, even brand-new ones with no sales history. It understands language. Weakness: Only understands text — can't incorporate business knowledge like "stores that sell this product also sell running shoes." Can be fooled by misleading descriptions.
Visual analogy: Imagine you have 5,595 pins on a map (one per category). You drop a new pin (a product) and need to find the 20 nearest existing pins. You could measure the distance to all 5,595 — but FAISS uses clever spatial indexing (like dividing the map into grid squares) so it only checks the nearby squares. Much faster.
FAISS is a library for fast nearest-neighbor search in high-dimensional space. Given a product's embedding vector, it finds the closest category vectors in milliseconds.
In our system: FAISS is used ONLY in Stage 1 (candidate recall) to propose embedding-based candidates. It's one of five voices suggesting categories — not a filter.
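To make the operation concrete, here is the brute-force equivalent of what FAISS accelerates: rank every category vector by similarity to the query and keep the top k. This sketch is pure Python over toy 2-D vectors; FAISS does the same thing over 5,595 vectors in 768 dimensions with spatial indexing instead of a full scan:

```python
import math

def top_k(query, index, k):
    """Brute-force nearest-neighbor search: what FAISS speeds up."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    scored = sorted(index.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

categories = {
    "Running Shoes": [0.9, 0.1],
    "Sandals":       [0.5, 0.5],
    "Tents":         [0.0, 1.0],
}
nearest = top_k([0.8, 0.2], categories, 2)
print(nearest)  # "Running Shoes" ranks first
```

The brute-force version is O(n) per query; FAISS's indexing makes the same lookup effectively sub-linear, which matters at 891k products and growing.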
Visual analogy: Imagine a panel of judges scoring Olympic diving. Each judge is mediocre individually — they each focus on one thing (splash, form, difficulty). But when you combine their scores intelligently, the panel is highly accurate. Now imagine the judges can also learn from past competitions — after watching 581,000 dives with known correct scores, they learn which judge to trust more for which type of dive.
XGBoost builds a "forest" of decision trees, where each tree corrects the mistakes of the previous ones. A single decision tree is like a flowchart:
Is affiliate_match = true?
├─ YES: Is affiliate_confidence > 0.9?
│ ├─ YES: Probably correct (score +0.8)
│ └─ NO: Maybe correct (score +0.3)
└─ NO: Is brand_diversity < 5?
├─ YES: Trust brand signal (score +0.5)
└─ NO: Rely on embedding (score +0.2)
XGBoost builds hundreds of these trees, each one small and weak, but together they form a powerful predictor. The "gradient boosted" part means each new tree specifically focuses on the cases the previous trees got wrong — like a student reviewing their mistakes.
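The boosting idea fits in a few lines. A minimal sketch using one-split "stumps" on a 1-D toy problem: each stage fits the residuals (errors) left by the stages before it. This is the core mechanism only; real XGBoost adds regularization, second-order gradients, and hundreds of deeper trees:

```python
# Toy gradient boosting: each new stump fits the current residuals,
# i.e. it focuses on what the ensemble so far got wrong.

def fit_stump(xs, residuals):
    """Best single-split stump: try each midpoint, minimize squared error."""
    best = None
    for i in range(len(xs) - 1):
        t = (xs[i] + xs[i + 1]) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lv if x <= t else rv)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, rounds=10, lr=0.5):
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

model = boost([0, 1, 2, 3], [0.0, 0.0, 1.0, 1.0])
```

After ten rounds the ensemble separates the two groups almost perfectly, even though each individual stump only moves the prediction partway (the learning rate is the "small and weak" part).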
Why XGBoost for this problem:
- Handles mixed data types (numbers, yes/no flags, counts) without preprocessing
- Learns non-linear relationships ("when A AND B, the answer changes")
- Fast: classifies a product in under 1 millisecond
- Small: model is a few megabytes, runs on a regular server
- Interpretable: you can see which features drove each decision (via SHAP)
- Battle-tested: this is what Shopify, Amazon, and Google Shopping use for product taxonomy
Visual analogy: Imagine you're on a jury and you reach a verdict. SHAP is like each juror writing down "here's how much my testimony moved the verdict toward guilty vs. innocent." It decomposes the final decision into each feature's individual contribution.
For our system, when the model says "Running Shoes with 81% confidence," SHAP tells you:
(Contributions are measured on top of the model's base rate, so they need not sum to exactly 81%.)
- Affiliate match pushed it +25% toward Running Shoes
- Brand frequency pushed it +18% toward Running Shoes
- Embedding similarity pushed it +15% toward Running Shoes
- Co-occurrence pushed it +8% toward Running Shoes
- Category depth pushed it -3% (slight penalty for being very specific)
This powers the human review queue — reviewers can see why the model made each suggestion, not just the suggestion itself. This builds trust and helps catch systematic errors.
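SHAP's key property is additivity: the base rate plus one contribution per feature equals the final score, which is what makes it usable in a review UI. A toy sketch with illustrative numbers from the example above (the base rate of 0.18 is assumed here; real values come from a SHAP explainer over the trained model):

```python
# SHAP decomposition is additive: base value + per-feature contributions
# = final prediction. All numbers here are illustrative.

base_rate = 0.18  # hypothetical average P(correct) across candidates
contributions = {
    "affiliate_match":      +0.25,
    "brand_frequency":      +0.18,
    "embedding_similarity": +0.15,
    "co_occurrence":        +0.08,
    "category_depth":       -0.03,
}

prediction = base_rate + sum(contributions.values())
print(f"P(correct) = {prediction:.2f}")

# A review UI can rank features by |contribution| to explain the call:
ranked = sorted(contributions, key=lambda f: abs(contributions[f]), reverse=True)
```

Sorting by absolute contribution gives reviewers the "top reasons" list shown above, strongest evidence first.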
Visual analogy: Think of a standardized test that stays the same every year. Students (model versions) change, but the test doesn't. This lets you compare "the class of 2025 scored 78%, the class of 2026 scored 83%" — because they took the same test.
We set aside ~58,000 labeled products as a "gold set" that the model NEVER trains on. Every time we change the model, we run it against this same set and measure:
- Exact match accuracy: Did it pick the right category?
- Hierarchical accuracy: If wrong, how close? (Predicting "Athletic Shoes" when the answer is "Running Shoes" is much better than predicting "Camping Gear" — hierarchical scoring gives partial credit based on tree distance)
- Top-3 accuracy: Was the right answer in the model's top 3 guesses?
- Confidence calibration: When the model says "90% sure," is it actually right about 90% of the time?
Regression gates prevent deploying a worse model. If any aggregate metric drops, or any individual category's accuracy falls by more than 10%, the update is automatically rejected.
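A sketch of the harness and gate. Metric names follow the list above; the data structures (ranked prediction lists keyed by product ID) are assumptions, not the production schema:

```python
# Score a candidate model on the frozen gold set, then gate deployment
# on no-regression against the previous model's scores.

def evaluate(predictions, gold):
    """predictions: {product_id: [cat1, cat2, cat3]} ranked guesses."""
    n = len(gold)
    exact = sum(preds[0] == gold[pid] for pid, preds in predictions.items())
    top3 = sum(gold[pid] in preds[:3] for pid, preds in predictions.items())
    return {"exact": exact / n, "top3": top3 / n}

def passes_gate(new_scores, old_scores):
    """Reject any model that scores lower than its predecessor on any metric."""
    return all(new_scores[m] >= old_scores[m] for m in old_scores)

gold = {"p1": "Running Shoes", "p2": "Sandals"}
preds = {"p1": ["Running Shoes", "Sandals", "Boots"],
         "p2": ["Boots", "Sandals", "Tents"]}
scores = evaluate(preds, gold)   # exact = 0.5, top3 = 1.0
ok = passes_gate(scores, {"exact": 0.5, "top3": 0.9})
```

The real harness adds hierarchical accuracy, calibration, and the per-category 10% drop check, but the shape is the same: score, compare, block on regression.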
Visual analogy: Sorting mail. Pass 1 is the post office sorting letters into the right city (fast, high-level, easy to get right). Pass 2 is the local carrier sorting into the right street and house number (slower, more specific, sometimes needs a human to read the handwriting).
- Pass 1 (inline, during catalog ingestion): Assign each product to the broadest correct category with high confidence. "This is definitely a Shoe." Fast, automated, runs on every new product.
- Pass 2 (scheduled, weekly batch): Look at broadly-categorized products and try to push them deeper. "This Shoe is probably a Running Shoe because of X, Y, Z evidence." Outputs suggestions for human review.
Visual analogy: A chef improving a recipe. They cook it (train the model), taste-testers try it (eval against gold set), diners give feedback (human corrections), and the chef adjusts the recipe accordingly. Each iteration gets better.
Train model on 581k labeled products
→ Evaluate against gold set (score: 75%)
→ Deploy, classify new products
→ Humans correct mistakes (gold-standard feedback)
→ Add corrections to training data
→ Retrain model
→ Evaluate again (score: 78% — improvement!)
→ Deploy updated model
→ Repeat
The model gets better every cycle because human corrections are the highest-quality training signal. Over time, the system becomes increasingly accurate for Locally's specific product universe.
- What: Convert product name + description to a vector, find nearest category vectors
- Strength: Works on any product, even cold-start (no history needed)
- Weakness: Only understands text, not business context
- Analogy: A librarian who reads the book's back cover and shelves it by topic
- What: Affiliate networks (CJ, ShareASale) already classify products in their own taxonomy. We map their categories to ours.
- Strength: Human-classified, very reliable. Highest confidence signal.
- Weakness: Only exists for products in affiliate feeds (~40% coverage)
- Analogy: A translator who already has the answer in a different language — just needs to convert
- What: Brands send their own category hierarchy in catalog feeds (e.g., Keen: "Waterfront Footwear > Closed Toe Sandal")
- Strength: The brand knows its own products best. Most authoritative source.
- Weakness: Not all brands include taxonomy, formats vary wildly, needs normalization
- Analogy: The manufacturer's label on the product box
- What: "Stores that carry this product also carry a lot of Running Shoes, so this is probably a Running Shoe"
- Strength: Captures domain knowledge from retail patterns
- Weakness: Noisy for multi-category stores (Walmart carries everything). Doesn't work for new products with no inventory.
- Analogy: "You are the company you keep" — judging a product by its shelf neighbors
- What: "90% of Brooks products are Running Shoes, so a new Brooks product is probably a Running Shoe"
- Strength: Very strong for focused brands
- Weakness: Useless for diversified brands (Nike, Columbia). New brands have no history.
- Analogy: Stereotyping by family — "all the Smiths are doctors, so this new Smith is probably a doctor"
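The brand-frequency signal is simple enough to sketch end to end. The data shape here (a list of already-labeled (brand, category) pairs) is an assumption for illustration:

```python
from collections import Counter

# Hypothetical brand-frequency signal: the share of a brand's labeled
# products in each category becomes a prior for that brand's new products.

def brand_frequency(brand, labeled_products):
    cats = Counter(cat for b, cat in labeled_products if b == brand)
    total = sum(cats.values())
    return {cat: n / total for cat, n in cats.items()} if total else {}

labeled = ([("Brooks", "Running Shoes")] * 9
           + [("Brooks", "Apparel"),
              ("Nike", "Running Shoes"), ("Nike", "Basketball Shoes")])
prior = brand_frequency("Brooks", labeled)  # Running Shoes: 0.9
```

For a focused brand like Brooks the distribution is sharply peaked; for a diversified brand it is nearly flat, which is exactly what the brand-diversity feature lets the ranker detect. An unknown brand returns an empty dict, i.e. the signal abstains.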
The original prototype used embeddings to pick top-5 candidates, then scored only those 5 with other signals. If the right answer wasn't in the embedding's top 5 (out of 5,595 categories!), the affiliate signal — the most reliable one — couldn't rescue it. Fix: All signals propose candidates independently, union them, then rank.
A hand-tuned "affiliate=5, embedding=3" weighting treats every product the same. But affiliate data is gold for footwear and unreliable for accessories. Brand signal is decisive for Brooks and meaningless for Nike. Fix: Learned model discovers context-dependent weights.
If the gold set (test data) is ever used for training, your eval metrics are meaningless — you're grading the student with the answer key they already studied. Fix: Freeze the gold set, never train on it, audit for leakage.
Predicting "Athletic Shoes" when the answer is "Running Shoes" is nearly right — they're parent/child in the tree. Predicting "Camping Gear" is a disaster. A flat accuracy metric treats both errors the same. Fix: Hierarchical accuracy gives partial credit for near-misses.
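One way to implement the partial credit is to walk both category paths and decay the score with tree distance. The decay schedule below (25% per hop, floored at zero) is illustrative, not a claim about the production formula:

```python
# Hedged sketch of hierarchical scoring: credit decays with the number
# of tree hops between the predicted and actual categories.

def path(category):
    return category.split(" > ")

def hierarchical_credit(predicted, actual):
    p, a = path(predicted), path(actual)
    shared = 0
    for x, y in zip(p, a):
        if x != y:
            break
        shared += 1
    distance = (len(p) - shared) + (len(a) - shared)  # hops via common ancestor
    return max(0.0, 1.0 - 0.25 * distance)

# Parent/child miss keeps most of the credit:
near = hierarchical_credit("Apparel > Shoes > Athletic Shoes",
                           "Apparel > Shoes > Athletic Shoes > Running Shoes")
# A different branch entirely scores zero:
far = hierarchical_credit("Camping Gear", "Apparel > Shoes > Athletic Shoes")
```

With flat accuracy both errors would score 0.0; hierarchical scoring separates the near-miss from the disaster.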
A model retrained on bad data (mislabeled corrections, biased sample) could be worse than the previous version. Without automated checks, you'd deploy it and not notice until classification quality degrades. Fix: Eval harness with automated regression gates blocks any model that scores lower than its predecessor.
A model that says "90% confident" but is only right 60% of the time is dangerous — you'd trust it when you shouldn't. Fix: Measure calibration explicitly and require it stays within tolerance.
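A calibration check can be as simple as bucketing predictions by stated confidence and comparing each bucket's average confidence to its actual hit rate. The bin width and the shape of the inputs are illustrative assumptions:

```python
# Sketch of a calibration check: group predictions into confidence bins
# and report the worst |average confidence - actual accuracy| gap.

def calibration_gap(results, bin_width=0.1):
    """results: list of (confidence, was_correct) pairs."""
    bins = {}
    for conf, correct in results:
        b = min(int(conf / bin_width), int(1 / bin_width) - 1)
        bins.setdefault(b, []).append((conf, correct))
    worst = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        worst = max(worst, abs(avg_conf - accuracy))
    return worst

# The dangerous case from the text: says 90%, right only 60% of the time.
overconfident = [(0.9, True)] * 6 + [(0.9, False)] * 4
gap = calibration_gap(overconfident)  # 0.3 gap in the 0.9 bin
```

A tolerance on this gap (say, reject if worst-bin gap exceeds some threshold) slots directly into the same regression gates used for accuracy.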
NEW PRODUCT ARRIVES (via catalog feed)
│
▼
┌─────────────────────────┐
│ STAGE 1: Candidate │ "Who could this product be?"
│ Recall │
│ │ 5 experts each suggest categories:
│ Affiliate: Water Shoes │ - Affiliate network says Water Shoes
│ Brand tax: Waterfront │ - Keen's own catalog says Waterfront Footwear
│ Embedding: Sandals, │ - Text similarity says Sandals (top), Water Shoes (#8)
│ Outdoor Shoes, ... │ - Stores say Sandals, Water Shoes, Hiking
│ Co-occur: Sandals, │ - Other Keen products are mostly Outdoor Footwear
│ Water Shoes, Hiking │
│ Brand freq: Outdoor │ Union: {Water Shoes, Sandals, Waterfront,
│ Footwear │ Outdoor Shoes, Hiking, ...} = 12 candidates
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ STAGE 2: Learned │ "Score each candidate across ALL signals"
│ Ranker (XGBoost) │
│ │ For each of 12 candidates, build feature vector:
│ Water Shoes: │ embed=0.68, affiliate=YES(0.95), brand_tax=YES(0.85),
│ → P(correct) = 0.91 │ co_occur=0.35, brand_freq=0.40, brand_div=8, depth=4
│ │
│ Sandals: │ embed=0.87, affiliate=NO, brand_tax=NO,
│ → P(correct) = 0.23 │ co_occur=0.40, brand_freq=0.15, brand_div=8, depth=3
│ │
│ (... 10 more ...) │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ DECISION │
│ │ Winner: Water Shoes (0.91 confidence)
│ High confidence (>0.7) │ → Auto-assign category + tags
│ → Auto-assign │ → SHAP: "affiliate match drove 40% of decision,
│ │ brand taxonomy drove 30%, embedding 15%"
│ Low confidence (<0.5) │
│ → Queue for human review│
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ WEEKLY: Pass 2 Audit │ "Can we be more specific?"
│ │
│ Product is in "Shoes" │ Model suggests: "Running Shoes" (child category)
│ (broad category) │ Evidence: brand=Brooks, co-occur=running stores
│ │ → Goes to human review sheet with SHAP explanation
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ HUMAN REVIEW │ Reviewer sees suggestion + evidence
│ (Google Sheet) │ Approves / Rejects / Overrides
│ │ Corrections → new training data
│ │ → Model gets smarter next cycle
└─────────────────────────┘
| Component | Technology | Role |
|---|---|---|
| Embeddings | Vertex AI text-embedding-004 | Convert text to meaning-vectors |
| Vector search | FAISS | Fast nearest-neighbor lookup |
| Learned ranker | XGBoost or LightGBM | Score candidates using all features |
| Explainability | SHAP | Show why model made each decision |
| Data warehouse | BigQuery | Product data, features, eval results |
| Human review | Google Sheets | Approve/reject/override suggestions |
| Alerts | Slack | Daily stats, eval scorecards |
| Language | Python | All runtime code |
Parent: FRG-116 "Build Taxonomy Collector"
Key sub-issues:
- FRG-118: Database schema
- FRG-119: Populate Google taxonomy data
- FRG-125/126: Brand taxonomy extraction + Rosetta Stone extension
- FRG-132: Eval framework (gold set + scoring harness) — BUILD FIRST
- FRG-124: Audit pipeline (the learned ranker)
- FRG-130: Regular audit schedule + retraining
- FRG-131: Integrate classifier into catalog ingestion pipeline