This document is provided to candidates alongside the case study. It defines the operating environment, constraints, and how to interpret the data.
- You are joining as VP of Product Management & Tech at Sur La Table.
- You own product + software platforms end-to-end: e-commerce (Next.js/headless), BFF/API gateway, services, OMS, POS software, data warehouse/ETL, loyalty, personalization, ERP integrations.
- SLT-only operating model: there is no CSC matrixed engineering to rely on.
- Store hardware/network operations are owned day-to-day by the IT Director, but you set standards/security policy and approve major infra changes.
- This take-home case uses synthetic but internally consistent metrics. Your job is to make decisions and plans using what's here, while clearly stating bounded assumptions.
- Sr. Director of Engineering (software engineering, SRE/platform, QA)
- Head of Product (PMs, UX partnership)
- IT Director (store infrastructure/devices, networking, endpoint management)
| Function | Headcount | Notes |
|---|---|---|
| Software Engineers | 8 | Full-stack and backend; no dedicated frontend |
| Product Managers | 2 | One Sr PM (web/checkout), one PM (store systems) |
| QA | 4 | Manual + automation; owns release confidence |
| SRE/Platform | 2 | SLOs, observability, incident response, infra |
| App Support | 2 | L1/L2 triage, escalation to engineering |
| Total | 18 | |
Implication: With 8 engineers, you cannot pursue every initiative. Prioritization under constraint is the test. The 2 allowed contractors (per budget rules) are a critical lever.
Contractor economics: $180/hr fully loaded; 2-week onboarding; ~3 weeks productive time before freeze
QA capacity: Can support 2 major initiatives in parallel; beyond that requires sequencing or external help
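The contractor terms above imply a cost per productive week well above the headline hourly rate. The following is a minimal sketch of that arithmetic; it assumes a 40-hour billable week and that onboarding weeks are billed at the full rate, neither of which is stated in this document.

```python
HOURLY_RATE_USD = 180     # fully loaded rate, from this document
HOURS_PER_WEEK = 40       # assumption: standard billable week
ONBOARDING_WEEKS = 2      # from this document
PRODUCTIVE_WEEKS = 3      # from this document (approximate)
CONTRACTORS = 2           # maximum allowed


def contractor_cost(weeks: int, contractors: int = CONTRACTORS) -> int:
    """Total billed cost in USD, assuming every week is billed in full."""
    return HOURLY_RATE_USD * HOURS_PER_WEEK * weeks * contractors


total_cost = contractor_cost(ONBOARDING_WEEKS + PRODUCTIVE_WEEKS)
productive_eng_weeks = PRODUCTIVE_WEEKS * CONTRACTORS
cost_per_productive_week = total_cost / productive_eng_weeks

print(f"Total pre-freeze contractor cost: ${total_cost:,}")
print(f"Cost per productive eng-week: ${cost_per_productive_week:,.0f}")
```

Under these assumptions, two contractors cost $72,000 through the freeze, or about $12,000 per productive eng-week.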
- No formal squads — engineers are assigned to work as needed
- Typical splits: ~4 engineers on web/checkout, ~2 on OMS/services, ~2 on data/integrations
- You may propose a different structure in your plan
- You are final decision maker on roadmap trade-offs, release gates, and investment prioritization.
- IT Director must approve changes affecting store networks/devices; you co-sign major changes impacting store-cloud segmentation.
- Architecture & Experiment Review Board (ARB) is your mechanism for enforcing standards/guardrails.
  - ARB composition: VP (chair) + Sr Dir Eng + Head of Product
  - Cadence: Meets weekly; fast-track approval available for critical items
With 8 engineers × 5 weeks (40 eng-weeks) + 2 contractors × ~3 productive weeks (~6 eng-weeks, conservative), roughly 46 eng-weeks are nominally available.
Rough sizing:
| Initiative | Typical Effort |
|---|---|
| Variant A ship (with guardrails) | 4-6 eng-weeks (+2 if payment fallback) |
| OMS reliability (timeout fixes) | 6-8 eng-weeks |
| CWV quick wins (ISR, caching) | 4-6 eng-weeks |
| Peak capacity (scaling + load test) | 6-8 eng-weeks |
| Legacy OrderBridge (partial) | 8-12 eng-weeks |
Implication: You can realistically complete 2-3 medium initiatives pre-freeze. Proposing 5+ parallel initiatives signals unrealistic planning.
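One way to sanity-check a proposed plan against this capacity is sketched below. The sizing ranges come from the table above. The `focus_factor` parameter is hypothetical: it stands in for time lost to on-call, incidents, and day-to-day work, and the value 0.6 is an illustrative assumption, not a figure from this document.

```python
ENGINEERS, WEEKS_TO_FREEZE = 8, 5
CONTRACTORS, CONTRACTOR_PRODUCTIVE_WEEKS = 2, 3

# (low, high) eng-week estimates from the sizing table
SIZING = {
    "variant_a": (4, 6),
    "oms_reliability": (6, 8),
    "cwv_quick_wins": (4, 6),
    "peak_capacity": (6, 8),
    "orderbridge_partial": (8, 12),
}


def nominal_capacity() -> int:
    """Raw eng-weeks before any discount for interrupts."""
    return ENGINEERS * WEEKS_TO_FREEZE + CONTRACTORS * CONTRACTOR_PRODUCTIVE_WEEKS


def plan_effort(initiatives):
    """Sum the low and high estimates for the chosen initiatives."""
    low = sum(SIZING[name][0] for name in initiatives)
    high = sum(SIZING[name][1] for name in initiatives)
    return low, high


def fits(initiatives, capacity: float) -> bool:
    """Conservative check: the high estimate must fit within capacity."""
    return plan_effort(initiatives)[1] <= capacity


focus_factor = 0.6  # hypothetical discount, not from this document
effective = round(nominal_capacity() * focus_factor, 1)

plan = ["oms_reliability", "peak_capacity", "cwv_quick_wins"]
print(f"Nominal: {nominal_capacity()} eng-weeks, effective: {effective}")
print(f"Plan effort (low, high): {plan_effort(plan)}, fits: {fits(plan, effective)}")
print(f"All five fit: {fits(list(SIZING), effective)}")
```

With that discount, three medium initiatives fit at their high estimates while all five do not, which is consistent with the implication above. A different `focus_factor` changes the answer, so state the value you assume.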
- Premium cookware, kitchenware, and gourmet foods — Williams-Sonoma competitor positioning
- Cooking classes — in-store and online; meaningful revenue line (~10% of total)
- Gift registry — weddings, housewarmings
- E-commerce: ~55% of revenue (growing)
- Retail stores: ~50 locations nationally; ~35% of revenue
- Cooking classes: ~10% of revenue (high margin, drives store traffic)
- Q4 is everything: Nov-Dec represents ~40% of annual revenue
- Key dates: Black Friday, Cyber Monday, holiday gifting
- Cookware as gifts: High AOV, high consideration purchase
- Williams-Sonoma, Crate & Barrel, Amazon (commoditization)
- Differentiation through expertise, classes, curated assortment
- Web storefront: Next.js headless application (current; needs performance work)
- Key journeys: Home → PLP → PDP → Cart → Checkout
- Performance instrumentation: Core Web Vitals from real-user monitoring (RUM)
- Current capacity: ~400 RPS sustained at SLO latency (load tested in Q3)
- Q4 pre-provisioned: Auto-scaling configured to 1,200 RPS burst; 800 RPS sustained (provisioned; load test validation pending)
- Remaining gap to BFCM (2,625 RPS): CDN optimization, caching tuning, and load test validation required
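The capacity figures above can be turned into a rough offload target. The sketch below assumes that 2,625 RPS is the total request rate arriving at the edge and that cached responses consume no origin capacity; both are simplifying assumptions, and the function name is illustrative.

```python
CURRENT_SUSTAINED_RPS = 400       # load tested in Q3
PROVISIONED_SUSTAINED_RPS = 800   # provisioned, validation pending
PROVISIONED_BURST_RPS = 1200      # provisioned, validation pending
BFCM_PEAK_RPS = 2625


def required_offload(peak_rps: float, origin_capacity_rps: float) -> float:
    """Minimum fraction of peak requests that must not reach the origin."""
    if origin_capacity_rps >= peak_rps:
        return 0.0
    return 1 - origin_capacity_rps / peak_rps


for label, capacity in [
    ("validated today", CURRENT_SUSTAINED_RPS),
    ("provisioned sustained", PROVISIONED_SUSTAINED_RPS),
    ("provisioned burst", PROVISIONED_BURST_RPS),
]:
    share = required_offload(BFCM_PEAK_RPS, capacity)
    print(f"{label}: offload at least {share:.1%} of peak traffic")
```

Read this way, the gap is a cache-hit-ratio target plus a validation task: the provisioned numbers only count once the load test confirms them.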
- BFF/API gateway: request aggregation/composition, auth/session handling, rate limiting/back-pressure, response caching headers
- Core services: cart/pricing/promo, catalog/search integration, loyalty/personalization, payment orchestration
- OMS: order create, inventory reservation/commit, fulfillment orchestration; integrates with ERP and store systems
  - Primary reliability KPI: OMS Success % (see definitions below)
  - Known issues: Timeouts (47% of failures), inventory mismatch (28%), payment gateway (15%)
- POS software: store transactions (sales/returns), payment flows, device interactions
  - Primary reliability KPI: POS Success % during store hours
  - PCI scope includes POS
  - Current state: 99.86% success, a minimal gap to the 99.9% target; low priority unless other initiatives complete early
- Warehouse + ETL/CDC: supports BI, experiment analysis, operational dashboards
  - Primary latency KPI: ops feeds (orders/inventory) available in DWH within target window
  - Known issue: p95 lag is 48 min (target ≤ 15 min); contributes to 0.6% oversell rate
- "OrderBridge" — legacy order/checkout integration layer being strangled
- "Legacy endpoints" = APIs still required for order completion or downstream fulfillment
- You will propose a strangler + parallel run + cutover plan with decommission milestones
- Realistic expectation: Full cutover may extend past the 5-week pre-freeze window. Plan what's achievable pre-freeze vs. Q1.
You do not need to pick vendors. Describe capabilities.
- Platform capable of: randomization, exposure logging, guardrails, sequential decisioning, phased rollouts, kill switch
- Assume something like LaunchDarkly or Statsig
- RUM: CWV p75 metrics (LCP/INP/CLS) and route-level breakdown
- APM/tracing: service latency, dependency maps, error rates (Datadog-like)
- Logs: structured logs with correlation IDs across web/BFF/services/OMS
- Synthetics: checkout probes and key journey monitors
- Automated pipelines for web and services
- Support for feature flags, canaries, and rapid rollback
- Ability to enforce release checklists and block deploys when guardrails fail
- Cloud: AWS
- Compute: ECS for services; considering Lambda@Edge for Next.js edge rendering
- CDN: CloudFront (current); edge caching not fully optimized
- CDN optimization note: Quick wins (cache rules, TTLs) achievable pre-freeze; advanced optimization (Lambda@Edge, custom origins) should be phased post-peak
| Metric | Current | Target |
|---|---|---|
| P1 incidents | 1.2/week | ≤1/week |
| MTTR | 45 min | ≤30 min |
| On-call load | 2 eng/week rotation | Maintain |
| Release cadence | Web 3×/week, services 2×/week, OMS weekly | Maintain |
| Change failure rate | 8% | <5% (stretch) |
| Rollback rate | 6% | <5% |
Note: Rollback plans must be tested in staging before production use.
- Code freeze begins Nov 10
- Exceptions allowed only via Major Release Go/No-Go with:
  - SLO targets satisfied or explicitly risk-accepted
  - Rollback tested and time-bounded (< 30 min)
  - Dashboards/alerts in place for primary metrics + guardrails
  - Named owner on-call during change window
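The Go/No-Go criteria for freeze exceptions lend themselves to an automated check. The sketch below is one possible encoding; the class, field, and function names are hypothetical, and only the four criteria and the 30-minute bound come from this document.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_ROLLBACK_MINUTES = 30  # rollback must be time-bounded (< 30 min)


@dataclass
class FreezeExceptionRequest:
    slo_targets_met: bool
    slo_risk_accepted: bool
    rollback_tested: bool
    rollback_minutes: float
    dashboards_and_alerts_ready: bool
    on_call_owner: Optional[str]


def blocking_reasons(req: FreezeExceptionRequest) -> List[str]:
    """Return every unmet criterion; an empty list means 'go'."""
    reasons = []
    if not (req.slo_targets_met or req.slo_risk_accepted):
        reasons.append("SLO targets neither satisfied nor risk-accepted")
    if not req.rollback_tested:
        reasons.append("rollback not tested")
    if req.rollback_minutes >= MAX_ROLLBACK_MINUTES:
        reasons.append("rollback not bounded under 30 minutes")
    if not req.dashboards_and_alerts_ready:
        reasons.append("dashboards/alerts missing")
    if not req.on_call_owner:
        reasons.append("no named on-call owner")
    return reasons


request = FreezeExceptionRequest(
    slo_targets_met=False,
    slo_risk_accepted=True,
    rollback_tested=True,
    rollback_minutes=45,
    dashboards_and_alerts_ready=True,
    on_call_owner="jdoe",
)
print(blocking_reasons(request))
```

Returning all unmet criteria rather than failing on the first one gives the requester a complete list to fix before resubmitting.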
- Opex remaining pre-peak: $1.2M (for contractors + cloud/infra + tooling through Q4; engineer salaries separate)
- Net new hires frozen; you may use up to 2 contractors
- Avoid engineering becoming a procurement bottleneck
Current infra run rate: ~$180K/month (compute: $110K, CDN: $35K, observability: $25K, misc: $10K)
Peak scaling assumption: 1.5-2× compute cost during Nov-Dec (~$165-220K/month for compute alone)
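To see how much of the $1.2M remains after infrastructure, the run rate and peak multiplier can be combined as below. This sketch assumes October runs at the baseline rate, that only compute scales during Nov-Dec, and that CDN, observability, and misc costs stay flat; those are simplifying assumptions, and CDN cost in particular is likely to rise with traffic.

```python
BASELINE_MONTHLY_USD = {
    "compute": 110_000,
    "cdn": 35_000,
    "observability": 25_000,
    "misc": 10_000,
}
OPEX_REMAINING_USD = 1_200_000


def q4_infra_cost(compute_multiplier: float,
                  baseline_months: int = 1,
                  peak_months: int = 2) -> float:
    """Infra spend for Oct (baseline) plus Nov-Dec (compute scaled)."""
    baseline = sum(BASELINE_MONTHLY_USD.values())
    non_compute = baseline - BASELINE_MONTHLY_USD["compute"]
    peak = BASELINE_MONTHLY_USD["compute"] * compute_multiplier + non_compute
    return baseline * baseline_months + peak * peak_months


for multiplier in (1.5, 2.0):
    cost = q4_infra_cost(multiplier)
    headroom = OPEX_REMAINING_USD - cost
    print(f"{multiplier}x compute: infra ${cost:,.0f}, headroom ${headroom:,.0f}")
```

The headroom figure is what remains for contractors and tooling under these assumptions.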
- Freeze exceptions require VP approval + President notification
- Emergency changes during peak require VP + Sr Dir Eng sign-off
- On-call engineer has authority to rollback without approval
- Today (case context): Early October 2025
- Nov 10: Code freeze begins
- Nov 24-Dec 2: BFCM / peak week
- You have ~5 weeks to execute pre-freeze initiatives
Note on basis points (bps): 1 bp = 0.01 percentage points. Example: CVR moving from 2.00% to 2.10% is a +10 bps improvement.
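The conversion is simple enough to express as a helper. The function name below is illustrative; rounding is applied only to suppress floating-point noise.

```python
def bps_change(before_pct: float, after_pct: float) -> float:
    """Change in basis points between two rates expressed in percent."""
    return round((after_pct - before_pct) * 100, 6)


print(bps_change(2.00, 2.10))   # CVR example from the note
print(bps_change(99.86, 99.9))  # POS success vs. its target
```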
| Metric | Definition |
|---|---|
| CVR (site-wide) | Successful orders / sessions |
| AOV | Revenue / successful orders |
| Checkout p95 latency | p95 end-to-end time from "Place order" click to confirmation render |
| CWV p75 | p75 on mobile from RUM: LCP, INP, CLS |
| OMS Success % | Successful order-creates / order-create attempts. Each automatic retry counts as a separate attempt. Manual customer retries (re-submitting after error page) count as new order attempts. |
| POS Success % | Successful store transactions / attempts during store hours |
| P1 MTTR | Median time from page to mitigation for P1 incidents |
| Error budget burn | % of quarterly error budget consumed |
| Ops → DWH latency | p95 minutes from source commit to warehouse availability |
| Change failure rate | % of deploys causing customer-impacting degradation |
| Rollback rate | % of deploys rolled back |
| Cost per order | All-in cloud/platform/observability cost ÷ orders |
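Because OMS Success % counts every retry as its own attempt, it can read much lower than the share of customers who eventually complete an order. The sketch below illustrates this with a hypothetical attempt log; the per-checkout figure is shown only for contrast and is not a metric defined in this document.

```python
from collections import defaultdict

# Hypothetical attempt log: (checkout_id, succeeded)
attempts = [
    ("c1", True),
    ("c2", False), ("c2", False), ("c2", True),  # two failures, then success
    ("c3", False),                                # failed, never retried
    ("c4", True),
]


def oms_success_pct(log) -> float:
    """Successful order-creates / attempts; every retry is an attempt."""
    if not log:
        return 0.0
    return 100 * sum(1 for _, ok in log if ok) / len(log)


def per_checkout_success_pct(log) -> float:
    """Contrast metric: share of checkouts that eventually succeeded."""
    outcome = defaultdict(bool)
    for checkout_id, ok in log:
        outcome[checkout_id] = outcome[checkout_id] or ok
    if not outcome:
        return 0.0
    return 100 * sum(outcome.values()) / len(outcome)


print(f"OMS Success %: {oms_success_pct(attempts):.1f}")
print(f"Per-checkout success %: {per_checkout_success_pct(attempts):.1f}")
```

One consequence: adding automatic retries can lower OMS Success % as defined here even while more customers complete their orders, so a retry change should be read alongside order volume.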
- PCI-DSS applies to checkout/payment flows and includes POS
- Store-cloud network segmentation required:
  - Store networks segmented from corporate and public
  - Least privilege connectivity to cloud endpoints
  - Logging/monitoring for store-origin traffic
- GDPR/CPRA applies for customer data handling
You may propose changes across:
- Next.js rendering strategy and caching (SSR/SSG/ISR)
- Edge/CDN configuration
- BFF/API gateway composition/back-pressure
- Service reliability improvements (timeouts/retries/circuit breakers, idempotency, queueing)
- OMS integration robustness (inventory reservation timing, contract versioning, failure handling)
- Release governance (guardrails, go/no-go, freeze exceptions)
- Team structure and contractor allocation
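As an illustration of the timeouts/retries/idempotency item above, here is a minimal sketch of a bounded retry that reuses one idempotency key across attempts, so a retried order-create cannot produce a duplicate order. The `send` callable, the `OrderTimeout` exception, and all parameter names are hypothetical; no real OMS client API is implied.

```python
import random
import time
import uuid
from typing import Any, Callable, Dict


class OrderTimeout(Exception):
    """Raised by the (hypothetical) OMS client when a call times out."""


def create_order_with_retry(
    send: Callable[[Dict[str, Any], str], Dict[str, Any]],
    payload: Dict[str, Any],
    max_attempts: int = 3,
    base_delay_s: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    # One key for the whole logical order, shared by every retry.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload, idempotency_key)
        except OrderTimeout:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure
            # Exponential backoff with full jitter to avoid retry storms.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise RuntimeError("unreachable: loop returns or raises")


# Usage with a fake client that times out twice, then succeeds.
seen_keys = []


def flaky_send(payload, key):
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise OrderTimeout()
    return {"status": "created"}


result = create_order_with_retry(flaky_send, {"sku": "x"}, sleep=lambda s: None)
print(result, len(seen_keys))
```

Each automatic retry here counts as a separate attempt under the OMS Success % definition, so the retry budget (`max_attempts`) affects the KPI as well as the load on the OMS.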
We are not providing full system diagrams, vendor names, or code. When information is missing:
- State a minimal assumption
- Propose 1-2 options
- Pick one and explain trade-offs and verification
- Keep plans consistent with stated constraints
With 8 engineers and ~5 weeks to code freeze, you face hard trade-offs:
Note: Variant A (BFF changes) and CWV initiatives (Next.js/BFF optimization) share infrastructure and engineering resources. Consider sequencing when planning.
| Option | Risk |
|---|---|
| Ship Variant A for CVR lift | OMS reliability drops; fulfillment failures during peak |
| Fix OMS first, delay Variant A | Miss CVR opportunity; may not hit conversion targets |
| Do both in parallel | Spread 8 engineers too thin; neither done well |
| Hire contractors for one workstream | Onboarding time; quality risk |
There is no "right" answer — there are defensible trade-offs. We're evaluating your judgment, not your ability to do everything.