Created December 23, 2025 18:32 by @andrew-templeton
SLT VP Product & Technology — Take-Home Case Study (Candidate Materials)

Sur La Table (SLT) — VP Product & Technology Take-Home: Working Context Brief

This document is provided to candidates alongside the case study. It defines the operating environment, constraints, and how to interpret the data.


1) What you should assume is true

  • You are joining as VP of Product Management & Tech at Sur La Table.
  • You own product + software platforms end-to-end: e-commerce (Next.js/headless), BFF/API gateway, services, OMS, POS software, data warehouse/ETL, loyalty, personalization, ERP integrations.
  • SLT-only operating model: there is no CSC matrixed engineering to rely on.
  • Store hardware/network operations are owned day-to-day by the IT Director, but you set standards/security policy and approve major infra changes.
  • This take-home case uses synthetic but internally consistent metrics. Your job is to make decisions and plans using what's here, while clearly stating bounded assumptions.

2) Org, team, and operating model

2.1 Reporting lines (your directs)

  • Sr. Director of Engineering (software engineering, SRE/platform, QA)
  • Head of Product (PMs, UX partnership)
  • IT Director (store infrastructure/devices, networking, endpoint management)

2.2 Team size (this is the reality you inherit)

| Function | Headcount | Notes |
| --- | --- | --- |
| Software Engineers | 8 | Full-stack and backend; no dedicated frontend |
| Product Managers | 2 | One Sr PM (web/checkout), one PM (store systems) |
| QA | 4 | Manual + automation; owns release confidence |
| SRE/Platform | 2 | SLOs, observability, incident response, infra |
| App Support | 2 | L1/L2 triage, escalation to engineering |
| Total | 18 | |

Implication: With 8 engineers, you cannot pursue every initiative. Prioritization under constraint is the test. The 2 allowed contractors (per budget rules) are a critical lever.

Contractor economics: $180/hr fully loaded; 2-week onboarding; ~3 weeks productive time before freeze

QA capacity: Can support 2 major initiatives in parallel; beyond that requires sequencing or external help

2.3 Team topology (current state)

  • No formal squads — engineers are assigned to work as needed
  • Typical splits: ~4 engineers on web/checkout, ~2 on OMS/services, ~2 on data/integrations
  • You may propose a different structure in your plan

2.4 Decision rights (for this case)

  • You are final decision maker on roadmap trade-offs, release gates, and investment prioritization.
  • IT Director must approve changes affecting store networks/devices; you co-sign major changes impacting store-cloud segmentation.
  • Architecture & Experiment Review Board (ARB) is your mechanism for enforcing standards/guardrails.
    • ARB composition: VP (chair) + Sr Dir Eng + Head of Product
    • Cadence: Meets weekly; fast-track approval available for critical items

2.5 Realistic initiative limits

With 8 engineers × 5 weeks (40 eng-weeks) + 2 contractors × ~3 productive weeks (~6 eng-weeks, conservative) = ~46 eng-weeks available.

Rough sizing:

| Initiative | Typical Effort |
| --- | --- |
| Variant A ship (with guardrails) | 4-6 eng-weeks (+2 if payment fallback) |
| OMS reliability (timeout fixes) | 6-8 eng-weeks |
| CWV quick wins (ISR, caching) | 4-6 eng-weeks |
| Peak capacity (scaling + load test) | 6-8 eng-weeks |
| Legacy OrderBridge (partial) | 8-12 eng-weeks |

Implication: You can realistically complete 2-3 medium initiatives pre-freeze. Proposing 5+ parallel initiatives signals unrealistic planning.


3) Business context: Sur La Table

3.1 What SLT sells

  • Premium cookware, kitchenware, and gourmet foods — Williams-Sonoma competitor positioning
  • Cooking classes — in-store and online; meaningful revenue line (~10% of total)
  • Gift registry — weddings, housewarmings

3.2 Channel mix

  • E-commerce: ~55% of revenue (growing)
  • Retail stores: ~50 locations nationally; ~35% of revenue
  • Cooking classes: ~10% of revenue (high margin, drives store traffic)

3.3 Seasonality

  • Q4 is everything: Nov-Dec represents ~40% of annual revenue
  • Key dates: Black Friday, Cyber Monday, holiday gifting
  • Cookware as gifts: High AOV, high consideration purchase

3.4 Competitive pressure

  • Williams-Sonoma, Crate & Barrel, Amazon (commoditization)
  • Differentiation through expertise, classes, curated assortment

4) Systems landscape (high-level)

4.1 Customer-facing digital

  • Web storefront: Next.js headless application (current; needs performance work)
  • Key journeys: Home → PLP → PDP → Cart → Checkout
  • Performance instrumentation: Core Web Vitals from real-user monitoring (RUM)
  • Current capacity: ~400 RPS sustained at SLO latency (load tested in Q3)
  • Q4 pre-provisioned: Auto-scaling configured to 1,200 RPS burst; 800 RPS sustained (provisioned; load test validation pending)
  • Remaining gap to BFCM (2,625 RPS): CDN optimization, caching tuning, and load test validation required

4.2 API layer and services

  • BFF/API gateway: request aggregation/composition, auth/session handling, rate limiting/back-pressure, response caching headers
  • Core services: cart/pricing/promo, catalog/search integration, loyalty/personalization, payment orchestration
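
To make the BFF responsibilities above concrete, here is a minimal TypeScript sketch of request composition with per-dependency timeout budgets and graceful degradation (one building block of back-pressure). The service URLs, endpoint shapes, and budget values are illustrative assumptions, not SLT's actual contracts.

```typescript
// Illustrative only: URLs and timeout budgets are assumptions.
async function fetchWithBudget(url: string, budgetMs: number): Promise<Response | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    return await fetch(url, { signal: controller.signal });
  } catch {
    return null; // timed out or failed: let the caller degrade gracefully
  } finally {
    clearTimeout(timer);
  }
}

// PDP composition: catalog is required; personalization is optional.
export async function composePdp(productId: string) {
  const [catalog, recs] = await Promise.all([
    fetchWithBudget(`https://catalog.internal/products/${productId}`, 300),
    fetchWithBudget(`https://personalization.internal/recs/${productId}`, 150),
  ]);
  if (!catalog) throw new Error("catalog unavailable"); // required dependency fails the request
  return {
    product: await catalog.json(),
    recommendations: recs ? await recs.json() : [], // optional: empty on timeout
  };
}
```

The design choice to note: a required dependency failure fails the request, while optional dependencies degrade to an empty result, which is the posture the case expects under peak load.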

4.3 OMS and fulfillment

  • OMS: order create, inventory reservation/commit, fulfillment orchestration; integrates with ERP and store systems
  • Primary reliability KPI: OMS Success % (see definitions below)
  • Known issues: Timeouts (47% of failures), inventory mismatch (28%), payment gateway (15%)

4.4 Store systems (POS software)

  • POS software: store transactions (sales/returns), payment flows, device interactions
  • Primary reliability KPI: POS Success % during store hours
  • PCI scope includes POS
  • Current state: 99.86% success; minimal gap to 99.9% target — low priority unless other initiatives complete early

4.5 Data platform

  • Warehouse + ETL/CDC: supports BI, experiment analysis, operational dashboards
  • Primary latency KPI: ops feeds (orders/inventory) available in DWH within target window
  • Known issue: p95 lag is 48 min (target ≤ 15 min); contributes to 0.6% oversell rate

4.6 Legacy surface area (what "legacy" means here)

  • "OrderBridge" — legacy order/checkout integration layer being strangled
  • "Legacy endpoints" = APIs still required for order completion or downstream fulfillment
  • You will propose a strangler + parallel run + cutover plan with decommission milestones
  • Realistic expectation: Full cutover may extend past the 5-week pre-freeze window. Plan what's achievable pre-freeze vs. Q1.

5) Tooling you may assume is available

You do not need to pick vendors. Describe capabilities.

5.1 Experimentation and feature flags

  • Platform capable of: randomization, exposure logging, guardrails, sequential decisioning, phased rollouts, kill switch
  • Assume something like LaunchDarkly or Statsig
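
As one illustration of how a phased rollout and a kill switch interact, a hedged TypeScript sketch follows; the FlagClient interface and flag keys are hypothetical stand-ins for whatever the chosen platform (LaunchDarkly, Statsig, etc.) actually exposes.

```typescript
// Hypothetical flag client: `variation` stands in for the platform's evaluate call.
interface FlagClient {
  variation(key: string, userId: string, fallback: boolean): Promise<boolean>;
}

async function selectCheckoutPath(flags: FlagClient, userId: string): Promise<"control" | "variant-a"> {
  // Kill switch is evaluated first: flipping it routes 100% of traffic back
  // to the legacy path regardless of the rollout percentage.
  const killSwitchOn = await flags.variation("checkout-variant-a-kill", userId, false);
  if (killSwitchOn) return "control";
  const inVariantA = await flags.variation("checkout-variant-a", userId, false);
  return inVariantA ? "variant-a" : "control";
}
```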

5.2 Observability

  • RUM: CWV p75 metrics (LCP/INP/CLS) and route-level breakdown
  • APM/tracing: service latency, dependency maps, error rates (Datadog-like)
  • Logs: structured logs with correlation IDs across web/BFF/services/OMS
  • Synthetics: checkout probes and key journey monitors
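
A minimal sketch of what "correlation IDs across web/BFF/services/OMS" can look like in practice; the header name and log shape are assumptions, not a mandated convention.

```typescript
import { randomUUID } from "crypto";

const CORRELATION_HEADER = "x-correlation-id"; // assumed header name

// Reuse the inbound ID if present so one checkout attempt is traceable end to end.
export function withCorrelationId(headers: Headers): string {
  return headers.get(CORRELATION_HEADER) ?? randomUUID();
}

// One JSON object per line keeps logs machine-parseable downstream.
export function logEvent(correlationId: string, event: string, fields: Record<string, unknown>) {
  console.log(JSON.stringify({ ts: new Date().toISOString(), correlationId, event, ...fields }));
}
```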

5.3 CI/CD and releases

  • Automated pipelines for web and services
  • Support for feature flags, canaries, and rapid rollback
  • Ability to enforce release checklists and block deploys when guardrails fail

5.4 Infrastructure

  • Cloud: AWS
  • Compute: ECS for services; considering Lambda@Edge for Next.js edge rendering
  • CDN: CloudFront (current); edge caching not fully optimized
  • CDN optimization note: Quick wins (cache rules, TTLs) achievable pre-freeze; advanced optimization (Lambda@Edge, custom origins) should be phased post-peak
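
To illustrate the "quick wins" tier, a hedged sketch of explicit Cache-Control headers on a catalog route so CloudFront can serve from the edge; the TTL values and the fetchCatalogPage helper are hypothetical.

```typescript
// Sketch of a cacheable catalog route handler. TTLs are illustrative assumptions.
declare function fetchCatalogPage(req: Request): Promise<unknown>; // hypothetical helper

export async function GET(request: Request) {
  const data = await fetchCatalogPage(request);
  return new Response(JSON.stringify(data), {
    headers: {
      "Content-Type": "application/json",
      // s-maxage governs the CDN; stale-while-revalidate hides origin latency
      // during refresh. Checkout/cart responses should stay no-store.
      "Cache-Control": "public, s-maxage=300, stale-while-revalidate=60",
    },
  });
}
```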

5.5 Operational baselines (current state)

| Metric | Current | Target |
| --- | --- | --- |
| P1 incidents | 1.2/week | ≤ 1/week |
| MTTR | 45 min | ≤ 30 min |
| On-call load | 2 eng/week rotation | Maintain |
| Release cadence | Web 3×/week, services 2×/week, OMS weekly | Maintain |
| Change failure rate | 8% | < 5% (stretch) |
| Rollback rate | 6% | < 5% |

Note: Rollback plans must be tested in staging before production use.


6) Governance constraints and dates

6.1 Peak / holiday window

  • Code freeze begins Nov 10
  • Exceptions allowed only via Major Release Go/No-Go with:
    • SLO targets satisfied or explicitly risk-accepted
    • Rollback tested and time-bounded (< 30 min)
    • Dashboards/alerts in place for primary metrics + guardrails
    • Named owner on-call during change window

6.2 Budget and resourcing

  • Opex remaining pre-peak: $1.2M (for contractors + cloud/infra + tooling through Q4; engineer salaries separate)
  • Net new hires frozen; you may use up to 2 contractors
  • Avoid engineering becoming a procurement bottleneck

Current infra run rate: ~$180K/month (compute: $110K, CDN: $35K, observability: $25K, misc: $10K)

Peak scaling assumption: 1.5-2× compute cost during Nov-Dec (~$165-220K/month for compute alone)

6.3 Escalation and exceptions

  • Freeze exceptions require VP approval + President notification
  • Emergency changes during peak require VP + Sr Dir Eng sign-off
  • On-call engineer has authority to rollback without approval

6.4 Key dates

  • Today (case context): Early October 2025
  • Nov 10: Code freeze begins
  • Nov 24-Dec 2: BFCM / peak week
  • You have ~5 weeks to execute pre-freeze initiatives

7) Metric definitions (authoritative for the exercise)

Note on basis points (bps): 1 bps = 0.01 percentage points. Example: CVR moving from 2.00% to 2.10% is a +10 bps improvement.

| Metric | Definition |
| --- | --- |
| CVR (site-wide) | Successful orders / sessions |
| AOV | Revenue / successful orders |
| Checkout p95 latency | p95 end-to-end time from "Place order" click to confirmation render |
| CWV p75 | p75 on mobile from RUM: LCP, INP, CLS |
| OMS Success % | Successful order-creates / order-create attempts. Each automatic retry counts as a separate attempt. Manual customer retries (re-submitting after an error page) count as new order attempts. |
| POS Success % | Successful store transactions / attempts during store hours |
| P1 MTTR | Median time from page to mitigation for P1 incidents |
| Error budget burn | % of quarterly error budget consumed |
| Ops → DWH latency | p95 minutes from source commit to warehouse availability |
| Change failure rate | % of deploys causing customer-impacting degradation |
| Rollback rate | % of deploys rolled back |
| Cost per order | All-in cloud/platform/observability cost ÷ orders |
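
Because the OMS Success % definition counts each automatic retry as a separate attempt, a small worked illustration may help; the numbers below are invented for the example only.

```typescript
// Hypothetical 100 customer submissions: 97 succeed first try, 2 succeed after
// one automatic retry each, 1 fails outright after 2 automatic retries.
const attempts = 97 + 2 * 2 + 1 * 3; // 97 + 4 + 3 = 104 attempts
const successes = 97 + 2;            // 99 successful order-creates
const omsSuccessPct = (successes / attempts) * 100;
console.log(omsSuccessPct.toFixed(1)); // ≈ 95.2%, not 99%
```

This is why retry-heavy mitigations can depress the metric even while customer-visible success improves.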

8) Security, compliance, and segmentation

  • PCI-DSS applies to checkout/payment flows and includes POS
  • Store-cloud network segmentation required:
    • Store networks segmented from corporate and public
    • Least privilege connectivity to cloud endpoints
    • Logging/monitoring for store-origin traffic
  • GDPR/CPRA applies for customer data handling

9) What is explicitly in scope for candidate proposals

You may propose changes across:

  • Next.js rendering strategy and caching (SSR/SSG/ISR)
  • Edge/CDN configuration
  • BFF/API gateway composition/back-pressure
  • Service reliability improvements (timeouts/retries/circuit breakers, idempotency, queueing; see the idempotency sketch after this list)
  • OMS integration robustness (inventory reservation timing, contract versioning, failure handling)
  • Release governance (guardrails, go/no-go, freeze exceptions)
  • Team structure and contractor allocation
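
A minimal sketch of the idempotency pattern named in the list above: the client supplies a key per checkout attempt, so retries cannot double-create orders. The in-memory map stands in for a durable store, and commitToOms is a hypothetical call.

```typescript
declare function commitToOms(payload: object): Promise<string>; // hypothetical OMS call

const processed = new Map<string, { orderId: string }>(); // stand-in for a durable store

export async function createOrder(idempotencyKey: string, payload: object) {
  const existing = processed.get(idempotencyKey);
  if (existing) return existing; // retry of an already-committed attempt: no-op
  const orderId = await commitToOms(payload);
  const result = { orderId };
  processed.set(idempotencyKey, result);
  return result;
}
```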

10) What is intentionally not provided

We are not providing full system diagrams, vendor names, or code. When information is missing:

  • State a minimal assumption
  • Propose 1-2 options
  • Pick one and explain trade-offs and verification
  • Keep plans consistent with stated constraints

11) The core tension you must navigate

With 8 engineers and ~5 weeks to code freeze, you face hard trade-offs:

Note: Variant A (BFF changes) and CWV initiatives (Next.js/BFF optimization) share infrastructure and engineering resources. Consider sequencing when planning.

| Option | Risk |
| --- | --- |
| Ship Variant A for CVR lift | OMS reliability drops; fulfillment failures during peak |
| Fix OMS first, delay Variant A | Miss CVR opportunity; may not hit conversion targets |
| Do both in parallel | Spread 8 engineers too thin; neither done well |
| Hire contractors for one workstream | Onboarding time; quality risk |
There is no "right" answer — there are defensible trade-offs. We're evaluating your judgment, not your ability to do everything.

SLT — VP Product & Technology: Take-Home Case Study

Company: Sur La Table (SLT Lending SPV, Inc.)
Role: Vice President of Product Management & Tech
Take-home timebox: 4-5 hours (please do not exceed)
Live review: 60 minutes (exec readout + deep dives)


What we're evaluating

  1. Prioritization under constraint — With 8 engineers and 5 weeks to freeze, you cannot do everything. Choosing what NOT to do is as important as what you do.
  2. Trade-off reasoning — Especially conversion vs. reliability, speed vs. safety.
  3. Technical judgment — Next.js/headless, caching, BFF patterns, SLOs, observability.
  4. Operational rigor — Peak readiness, guardrails, rollback, incident response.
  5. Communication — Can you tell a clear story to the President and Finance?

The constraint you must internalize

| Resource | Reality |
| --- | --- |
| Engineers | 8 total |
| Time to freeze | ~5 weeks (Nov 10) |
| Contractors allowed | 2 max |
| Budget | $1.2M opex remaining |

You will not complete every possible initiative. The test is whether you make defensible choices about what to prioritize and what to defer.


Target State (for your plan)

Your plan must show a path toward these targets:

| Category | Target |
| --- | --- |
| Checkout latency | p95 ≤ 800 ms |
| Core Web Vitals | LCP p75 ≤ 2.5 s, INP p75 ≤ 200 ms, CLS p75 ≤ 0.1 |
| OMS reliability | 99.5% success |
| POS reliability | 99.9% success (store hours) |
| Peak headroom | 2.5× p95 RPS for BFCM (design to ≥ 2,625 RPS) |
| Change safety | CFR < 5%, rollback < 5% |

Data Appendix

All data is synthetic but internally consistent. Use it to make quantified decisions.

A1. Product outcomes (weekly baseline)

| Week Start | Sessions | CVR % | AOV $ | Orders | Revenue $ |
| --- | --- | --- | --- | --- | --- |
| 2025-08-25 | 1,200,000 | 2.10 | 92 | 25,200 | 2,318,400 |
| 2025-09-01 | 1,150,000 | 2.05 | 91 | 23,575 | 2,145,325 |
| 2025-09-08 | 1,180,000 | 2.08 | 93 | 24,544 | 2,282,592 |
| 2025-09-15 | 1,220,000 | 2.12 | 92 | 25,864 | 2,379,488 |
| 2025-09-22 | 1,210,000 | 2.06 | 94 | 24,926 | 2,343,044 |
| 2025-09-29 | 1,190,000 | 2.04 | 92 | 24,276 | 2,233,392 |

Baseline CVR: ~2.07% | Baseline AOV: ~$92 (weekly range $91-94)

A2. Web performance & reliability (weekly)

| Week Start | Checkout p95 (ms) | LCP p75 (s) | INP p75 (ms) | CLS p75 | OMS % | POS % |
| --- | --- | --- | --- | --- | --- | --- |
| 2025-08-25 | 1,250 | 3.20 | 240 | 0.12 | 99.20 | 99.86 |
| 2025-09-01 | 1,210 | 3.10 | 235 | 0.11 | 99.15 | 99.84 |
| 2025-09-08 | 1,180 | 3.05 | 230 | 0.11 | 99.25 | 99.87 |
| 2025-09-15 | 1,275 | 3.30 | 250 | 0.13 | 99.10 | 99.83 |
| 2025-09-22 | 1,230 | 3.15 | 245 | 0.12 | 99.18 | 99.85 |
| 2025-09-29 | 1,205 | 3.00 | 238 | 0.11 | 99.22 | 99.86 |

Gap to target: Checkout ~400ms over; LCP ~0.5s over; OMS ~30 bps under

B1. Checkout experiment (14-day sample)

Experiment ran Sept 15-28, overlapping with weekly baseline data above. Control metrics align with weekly averages.

| Arm | Sessions | Orders | CVR % (95% CI) | AOV $ | Revenue $ | OMS % (95% CI) | Checkout p95 | LCP p75 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Control (legacy) | 1,200,000 | 24,960 | 2.08 (2.06-2.10) | 92.0 | 2,296,320 | 99.20 (99.15-99.25) | 1,220 ms | 3.10 s |
| Variant A (BFF + new payment) | 1,200,000 | 25,680 | 2.14 (2.12-2.16) | 92.5 | 2,375,400 | 98.90 (98.85-98.95) | 950 ms | 2.60 s |

The tension: Variant A improves CVR (+6 bps), latency (-270ms), and LCP (-0.5s), but OMS drops 30 bps (99.20% → 98.90%).

Funnel breakdown (add-to-cart, cart-to-checkout) not available — focus analysis on end-to-end CVR and OMS impact.

Math you should do:

  • Control: 24,960 orders at 99.20% OMS → ~25,161 attempts → 201 failed
  • Variant A: 25,680 orders at 98.90% OMS → ~25,965 attempts → 285 failed
  • Net: +720 orders, but +84 additional fulfillment failures
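
One way to extend this into a dollar-denominated trade-off, using the $85 failed-order cost assumption given in Task 2 below and Variant A's $92.50 AOV; note this treats AOV as gross revenue, not margin, and ignores peak-season scaling.

```typescript
// Worked trade-off over the 14-day experiment window, numbers from B1 and Task 2.
const extraOrders = 25_680 - 24_960;          // +720 orders
const extraFailures = 285 - 201;              // +84 fulfillment failures
const grossRevenueGain = extraOrders * 92.5;  // ≈ $66,600
const failureCost = extraFailures * 85;       // ≈ $7,140
console.log(grossRevenueGain - failureCost);  // ≈ $59,460 net, per 14 days
```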

Statistical context:

  • Historical CVR standard deviation: ±3 bps
  • With 1.2M sessions per arm, α=0.05, power=0.80: experiment can detect ±2 bps differences
  • The +6 bps CVR lift is statistically significant (p < 0.001); focus your analysis on the risk/reward trade-off, not statistical validity

B2. Variant A OMS degradation — root cause analysis

Post-experiment investigation identified the primary driver:

| Factor | Control | Variant A | Impact |
| --- | --- | --- | --- |
| Payment integration latency (p95) | 200 ms | 350 ms | +150 ms |
| Payment provider success rate | 99.4% | 99.1% | -30 bps |
| End-to-end OMS request time (p95) | 1,850 ms | 2,000 ms | +150 ms |

Root cause: New payment provider in Variant A has lower reliability (99.1% vs 99.4%). The 30 bps OMS degradation is primarily driven by payment failures, not timeouts or inventory issues.

Mitigation options to consider:

  1. Ship with payment provider fallback logic (+2 eng-weeks; requires vendor contract amendment)
  2. Gate and defer to Q1 (vendor has committed to reliability improvements by Feb)
  3. Ship with guardrails + aggressive rollback threshold (accepts higher failure cost during ramp)

C1. Peak forecast

| Week of | Sessions | p95 Sustained RPS |
| --- | --- | --- |
| 2025-11-10 | 2,000,000 | 750 |
| 2025-11-17 | 2,800,000 | 900 |
| 2025-11-24 (BFCM) | 3,600,000 | 1,050 |
| 2025-12-01 | 2,600,000 | 820 |

Current tested capacity: ~400 RPS sustained at SLO latency (Q3); Q4 pre-provisioned to 800 RPS sustained (load test validation pending)

Requirement: Headroom ≥ 2.5× → design to ≥ 2,625 RPS (2.5× based on 2024 BFCM traffic spikes of 2.2×; buffer for safety)

Note: RPS reflects total requests, not sessions. Assume ~15 requests per session (page loads, API calls, assets). 3.6M sessions × 15 requests ÷ 604,800 seconds ≈ 89 RPS average; peak at 1,050 RPS = ~12× peak/average ratio.

Forecast confidence: BFCM sustained RPS forecast is 1,050 (p50). Historical forecast error shows p75 = 1,260 RPS (+20%), p95 = 1,575 RPS (+50%). The 2.5× buffer (2,625 RPS) covers the p95 worst-case forecast (1,575 RPS) plus margin for instantaneous traffic spikes above sustained load.
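
The capacity arithmetic above can be reproduced directly; a small sketch using only numbers stated in this appendix:

```typescript
// Reproduces the C1 capacity math. All inputs come from this appendix.
const bfcmSessions = 3_600_000;
const requestsPerSession = 15;
const secondsPerWeek = 604_800;
const avgRps = (bfcmSessions * requestsPerSession) / secondsPerWeek; // ≈ 89
const forecastPeakRps = 1_050;              // p50 sustained forecast
const designTarget = 2.5 * forecastPeakRps; // 2,625 RPS
const validatedCapacity = 400;              // Q3 load test
console.log({ avgRps, designTarget, gapVsValidated: designTarget - validatedCapacity });
```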

C2. OMS failure distribution (last 30 days, Control baseline)

  • Timeouts: 47%
  • Inventory mismatch / oversell: 28%
  • Payment gateway (systemic): 15%
  • Other: 10%

Note: This distribution reflects Control (legacy payment provider). In Variant A, payment gateway failures increase to ~38% of total failures due to new provider's lower reliability.

OMS retry policy: 2 retries with exponential backoff; timeouts not auto-retried (customers see immediate error)
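
A hedged sketch of that stated retry policy follows; how the OMS client distinguishes timeouts from other failures is an assumption.

```typescript
class TimeoutError extends Error {}

// Up to 2 automatic retries with exponential backoff; timeouts surface
// immediately (no auto-retry), matching the stated policy.
async function createOrderWithRetries(attempt: () => Promise<string>): Promise<string> {
  const maxRetries = 2;
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (err instanceof TimeoutError) throw err; // per policy: customer sees the error
      if (i >= maxRetries) throw err;             // retries exhausted
      await new Promise((r) => setTimeout(r, 250 * 2 ** i)); // 250 ms, then 500 ms
    }
  }
}
```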

C3. Inventory feed lag

  • p95 lag: 48 min (target ≤ 15 min)
  • Oversell rate: 0.6% (target ≤ 0.1%)

Note: Lag → oversell relationship is not linear; fixing may require both pipeline improvements and reservation logic changes


Candidate Initiatives (for prioritization)

These are the initiatives you should consider prioritizing or deferring:

  1. Variant A ship — checkout experiment with CVR lift but OMS risk
  2. OMS reliability — fix timeouts, inventory mismatch, payment gateway issues
  3. CWV improvements — LCP, INP, CLS optimization for web platform
  4. Peak capacity — scale to 2.5× headroom for BFCM
  5. Data pipeline latency — reduce 48 min lag to ≤15 min
  6. Legacy OrderBridge cutover — strangler migration to new services
  7. POS improvements — already at 99.86%, minimal gap to 99.9% (low priority; defer unless other initiatives complete early)

You cannot do all of these with 8 engineers in 5 weeks. Choose wisely.


Your 5 Tasks

Task 1: Outcome Targets + Pre-Freeze & Q1 Roadmap (required)

Deliverable: Slides 1-3 of your deck + spreadsheet tab

  • Set specific targets: CVR (bps improvement), checkout p95, CWV, OMS %
  • Identify 2-3 major initiatives (not 10) that move these metrics given 8 engineers and 5 weeks
  • For each: problem → hypothesis → outcome → owner → cost (eng weeks + $)
  • Show what you're explicitly deferring and why
  • Note: QA can support 2 major initiatives in parallel (see Context Brief, Section 2.2). A 3rd initiative requires sequencing, reduced test coverage, or contractor QA support.

Task 2: Experiment Decision — Variant A (required)

Deliverable: Slides 4-5 of your deck + 1-page experiment plan (PDF)

  • Decision: Ship immediately / Ship with guardrails / Gate / Iterate — with explicit rationale
  • If shipping: what guardrails and rollback triggers?
  • If gating: what OMS fixes are required first? Timeline?
  • If iterating: what's the re-test hypothesis?
  • Include: stat-sig requirements (assume α=0.05, power=0.80), alert thresholds, auto-rollback conditions (see the guardrail sketch after this list)
  • Quantify the trade-off: expected revenue gain vs. fulfillment failure cost
  • Assume failed order cost: $85 avg (CS: $25, refund/restock: $40, LTV impact: $20)
  • Note: We evaluate reasoning quality, not the specific decision. Ship, Gate, or Iterate can all score well if justified.
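
As one shape such an auto-rollback condition could take (mirroring the example threshold given later under "What 'good' looks like"), a hedged sketch; the metrics and flag-admin interfaces are hypothetical.

```typescript
// Hypothetical interfaces over the observability and flag platforms.
interface MetricsClient {
  successRate(metric: string, windowMinutes: number): Promise<number>; // 0..1
}
interface FlagAdmin {
  kill(flagKey: string): Promise<void>;
}

// Run on a schedule: disable Variant A if OMS success drops below 99.0%
// over a 15-minute window; on-call has rollback authority per Section 6.3.
export async function guardrailTick(metrics: MetricsClient, flags: FlagAdmin) {
  const omsSuccess = await metrics.successRate("oms.order_create", 15);
  if (omsSuccess < 0.99) {
    await flags.kill("checkout-variant-a"); // route 100% back to control
    // also page the named on-call owner here
  }
}
```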

Task 3: Peak / Holiday Readiness (required)

Deliverable: Slide 6 of your deck + spreadsheet tab

  • Capacity plan: How do you get from current capacity to ≥ 2,625 RPS?
  • Load test plan: Profiles, pass/fail criteria, timeline (must complete before Nov 10)
  • DR drill: What, when, pass criteria
  • Freeze governance: Exception process, rollback requirements, on-call coverage
  • Incident response: Runbooks, escalation, page budget (target ≤ 2 pages/eng/week)

Task 4: Legacy Transition — OrderBridge Cutover (required)

Deliverable: Slide 7 of your deck

  • Strangler approach: What gets migrated first? Contract versioning?
  • Parallel run: How long? What SLIs determine readiness?
  • Cutover checklist: Go/no-go criteria, rollback plan (< 30 min)
  • Decommission milestones: % legacy endpoints retired by when?
  • Given 8 engineers, be realistic about timeline (may extend past 90 days)
  • Show phasing: What's achievable pre-freeze (Nov 10) vs. what extends into Q1?

Task 5: Web Platform — Path to CWV Targets (required)

Deliverable: Slide 8 of your deck

  • Rendering strategy: SSR/SSG/ISR by route (Home, PLP, PDP, Cart, Checkout); see the rendering sketch after this list
  • Caching: Edge/CDN rules, cache keys, TTLs, invalidation
  • BFF responsibilities: Composition, back-pressure, timeout budgets
  • Target path: TTFB p75 → LCP p75 → INP p75 (show the math)
  • Rollout safety: Feature flags, canary %, automated rollback
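
A minimal sketch of how a per-route rendering split can be expressed in a Next.js App Router codebase; the revalidate values are illustrative starting points, not tuned numbers.

```typescript
// app/page.tsx — Home: static with ISR; content changes a few times a day.
export const revalidate = 3600;

// app/products/[slug]/page.tsx — PDP: ISR with a shorter window so price/stock
// drift is bounded; pair with on-demand revalidation from the catalog feed.
//   export const revalidate = 300;

// app/checkout/page.tsx — Checkout: fully dynamic; never cache.
//   export const dynamic = "force-dynamic";
```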

Deliverables (strict limits)

| Artifact | Format | Limit |
| --- | --- | --- |
| Executive deck | PDF | 10 slides max |
| Model / calculations | XLSX | 1 file, multiple tabs OK |
| Experiment plan | PDF | 1 page |

File naming:

  • SLT_VP_Case_<LastName>_Deck.pdf
  • SLT_VP_Case_<LastName>_Model.xlsx
  • SLT_VP_Case_<LastName>_ExperimentPlan.pdf

Live Review Agenda (60 min)

| Segment | Time | Focus |
| --- | --- | --- |
| Exec walkthrough | 20 min | Present as if to President: strategy, trade-offs, risks |
| Deep dive: Experiment + OMS | 15 min | Variant A decision, guardrails, OMS failure modes |
| Deep dive: Peak readiness | 10 min | Capacity math, load test, freeze governance |
| Deep dive: Web platform | 10 min | Caching strategy, TTFB→LCP path |
| Wrap-up / Q&A | 5 min | Anything we didn't cover |

What "good" looks like

  • Makes a clear call on Variant A — not "it depends" without a recommendation
  • Shows the math — net orders, failure cost, capacity headroom
  • Acknowledges constraints — "With 8 engineers, we cannot do X before freeze; here's when we will"
  • Has specific guardrails — "Auto-rollback if OMS < 99.0% over 15-min window"
  • Defers explicitly — "Data pipeline latency is deferred to Q1; here's why it's lower priority"

Model expectations: Your spreadsheet should include (1) Variant A ROI calculation, (2) capacity math, (3) initiative cost estimates


What will hurt your score

  • Proposing a plan that requires 20+ engineers
  • Ignoring OMS degradation in Variant A decision
  • No rollback plan for any major change
  • Vague targets ("improve performance") instead of specific numbers
  • Hand-waving peak readiness ("we'll load test")

FAQ

Can I make assumptions? Yes — list them explicitly and bound uncertain decisions with options + trade-offs.

Do I need to write code? No. Diagrams, decision frameworks, and rollout plans are sufficient.

What if I think the targets are unrealistic? Say so — and propose what IS achievable with the constraints. That's a valid answer.

Should I address compliance/security/data pipelines? Only if directly relevant to your 5 tasks. We'll ask about these in the live session if needed.

What about backup slides for the live review? Use your 10-slide deck for the entire review; no backup slides needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment