A Developer's Guide to Bulletproof Deployments
Most Node.js production failures don't happen because of bad code—they happen because of missed configurations. An unvalidated environment variable, an unhandled promise rejection, or a missing graceful shutdown can bring down your entire service at 3 AM. This checklist distills years of production incidents into actionable items you can verify before every deployment. It's the difference between a smooth launch and a weekend outage.
Organize code for maintainability and scalability
- Vertical slice architecture: Organize code by business feature (`/modules/users/`, `/modules/orders/`), not by technical layer (`/controllers/`, `/services/`). Code that changes together stays together.
- Shared kernel isolation: Keep cross-cutting concerns in `/core/` (database, middleware, errors). Never let business logic leak into shared code.
- TypeScript mandatory: Use TypeScript with strict mode. Set `noUncheckedIndexedAccess: true` to catch unchecked array index access at compile time.
- ESM modules: Use `module: NodeNext` in tsconfig.json. CommonJS is legacy, and a growing number of packages ship as ESM only.
- Feature-scoped tests: Co-locate tests with features (`create-user.test.ts` next to `create-user.service.ts`). Reduces context switching.
- Config single source of truth: Centralize all configuration in `/config/env.ts`. Never access `process.env` directly in business code.
Fail fast with invalid configuration
- Zod validation at startup: Use Zod to validate all environment variables. The app should crash immediately if `DATABASE_URL` is invalid, not when the first query runs.
- Type coercion: Use `z.coerce.number()` for ports and timeouts. Environment variables are always strings, so convert them explicitly.
- Secret strength enforcement: Enforce a minimum length on `JWT_SECRET` with `.min(32)` to prevent weak secrets. Policy should be in code, not documentation.
- Development-only dotenv: Load `.env` files only when `NODE_ENV !== 'production'`. Production should use container environment variables.
- Default values explicit: Set sensible defaults with `.default()`. Missing optional config should work, not crash.
- .env.example maintained: Keep a complete `.env.example` file. New developers should know every required variable.
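The fail-fast rules above can be sketched without any library. What follows is a minimal, dependency-free illustration of the idea, not the Zod schema the checklist recommends. The names `parseEnv` and `AppConfig`, the variables `PORT` and `REQUEST_TIMEOUT_MS`, and their default values are assumptions made for this example. In a real project a Zod schema (`z.coerce.number()`, `.min(32)`, `.default()`) would take the place of the hand-written checks.

```typescript
// Hypothetical config shape for this sketch; adjust to your own variables.
interface AppConfig {
  DATABASE_URL: string;
  JWT_SECRET: string;
  PORT: number;
  REQUEST_TIMEOUT_MS: number;
}

type RawEnv = Record<string, string | undefined>;

// Environment variables are always strings, so numbers are coerced explicitly.
function coerceNumber(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback; // explicit default for optional values
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${name} must be a number, got "${raw}"`);
  }
  return value;
}

// Validates the environment and throws on the first problem it finds.
function parseEnv(env: RawEnv): AppConfig {
  const databaseUrl = env.DATABASE_URL;
  if (!databaseUrl) throw new Error("DATABASE_URL is required");
  try {
    new URL(databaseUrl); // throws if the string is not a valid URL
  } catch {
    throw new Error("DATABASE_URL must be a valid URL");
  }

  const jwtSecret = env.JWT_SECRET ?? "";
  if (jwtSecret.length < 32) {
    throw new Error("JWT_SECRET must be at least 32 characters");
  }

  return {
    DATABASE_URL: databaseUrl,
    JWT_SECRET: jwtSecret,
    PORT: coerceNumber("PORT", env.PORT, 3000), // assumed default
    REQUEST_TIMEOUT_MS: coerceNumber("REQUEST_TIMEOUT_MS", env.REQUEST_TIMEOUT_MS, 30_000), // assumed default
  };
}
```

Called once at startup, for example as `export const config = parseEnv(process.env)` in `/config/env.ts`, an invalid value stops the process before it serves any traffic.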
Lock down your application before exposing it
- Helmet middleware: Use `helmet()` to set security headers (X-Frame-Options, HSTS, CSP). One line, massive protection.
- CORS locked down: Set `cors({ origin: config.CORS_ORIGIN })` with specific domains. Wildcard (`*`) is a security risk in production.
- Rate limiting tiered: Implement global (1000/15min), route-specific (50/min), and auth-sensitive (5/hour) rate limits.
- Redis-backed rate limiter: Use `rate-limit-redis` in multi-instance deployments. In-memory limiters don't sync across pods.
- JWT algorithm fixed: Specify `algorithms: ['HS256']` explicitly. Never let the token choose the algorithm; pinning it prevents "none" algorithm attacks.
- Password hashing with Argon2id: Use Argon2id over bcrypt for new projects. Argon2 won the Password Hashing Competition (PHC) in 2015 and offers better GPU resistance.
- Body size limited: Set `express.json({ limit: '10kb' })`. Large payloads can exhaust memory and block the event loop.
- Dependency scanning: Run `npm audit` or Snyk in CI. Catch vulnerabilities before they reach production.
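To make the tiered rate limits concrete, here is a rough sketch of a fixed-window counter with the three tiers quoted above. It is a teaching example only: `FixedWindowLimiter`, `tiers`, and `allowLogin` are hypothetical names, the state lives in one process, and old entries are never evicted. That per-process state is exactly why the checklist asks for a Redis-backed store once there is more than one instance.

```typescript
interface WindowEntry {
  count: number;
  windowStart: number;
}

// A tiny fixed-window counter. State lives in this process only, so separate
// instances or pods would each keep their own counts.
class FixedWindowLimiter {
  private readonly hits = new Map<string, WindowEntry>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the request is allowed, false when the key is over its limit.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { count: 1, windowStart: now }); // start a new window
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}

// The three tiers from the checklist, using the numbers quoted there.
const tiers = {
  global: new FixedWindowLimiter(1000, 15 * 60 * 1000),
  route: new FixedWindowLimiter(50, 60 * 1000),
  auth: new FixedWindowLimiter(5, 60 * 60 * 1000),
};

// A login attempt has to pass every tier that applies to it.
function allowLogin(clientIp: string, now: number = Date.now()): boolean {
  // Evaluate all tiers first so each counter records the attempt.
  const results = [
    tiers.global.allow(clientIp, now),
    tiers.route.allow(`login:${clientIp}`, now),
    tiers.auth.allow(`auth:${clientIp}`, now),
  ];
  return results.every(Boolean);
}
```

In an Express app the same decision would normally be delegated to rate-limiting middleware, with a 429 response when a tier rejects the request.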
Optimize for scale and responsiveness
- Connection pooling: Size the database pool starting from `(core_count * 2) + effective_spindle_count`. Monitor with ORM metrics.
- Trust proxy configured: Set `app.set('trust proxy', 1)` behind load balancers. Otherwise `req.ip` returns the proxy's internal IP and per-client rate limiting fails.
- Compression enabled: Use `compression()` middleware to reduce response sizes. CPU cost is worth the bandwidth savings.
- Async error handling: Use `express-async-errors` or Express 5+ for automatic async error propagation. No more try-catch in every route.
- Lazy loading avoided: Don't use dynamic imports for core business logic. Catch initialization errors at startup, not first request.
- Event loop monitoring: Use `clinic.js` or similar to detect event loop blocking. A blocked loop kills throughput.
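The pool sizing formula in the connection pooling item is easy to misread, so here it is as a small worked function. `recommendedPoolSize` is a hypothetical helper name, and the result is best treated as a starting point to tune against real metrics. This sketch assumes the core count is that of the database server, which is how the formula is usually applied.

```typescript
// Starting-point pool size from the formula quoted in the checklist:
//   connections = (core_count * 2) + effective_spindle_count
function recommendedPoolSize(coreCount: number, effectiveSpindleCount: number): number {
  if (!Number.isInteger(coreCount) || coreCount < 1) {
    throw new RangeError("coreCount must be a positive integer");
  }
  if (!Number.isInteger(effectiveSpindleCount) || effectiveSpindleCount < 0) {
    throw new RangeError("effectiveSpindleCount must be a non-negative integer");
  }
  return coreCount * 2 + effectiveSpindleCount;
}

// Worked example: 4 cores and 1 effective spindle give (4 * 2) + 1 = 9 connections.
const exampleSize = recommendedPoolSize(4, 1);
```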
Prevent the most common production bottlenecks
- Pool size tuned: Set maximum connections to 10-20 per instance. More connections != better performance.
- Connection timeout set: Configure 30 second timeout. Fail fast when database is unavailable.
- Query timeouts enforced: Set query timeouts to kill runaway queries. One bad query shouldn't block the pool.
- N+1 queries eliminated: Enable query logging in staging. Use eager loading (`include`) for related data.
- ORM transactions scoped: Keep transactions short. Long transactions hold locks longer, which increases contention and the risk of deadlocks.
- Migrations versioned: Use Prisma Migrate or Knex migrations. Never manually alter production schema.
- Indexes verified: Check EXPLAIN plans for sequential scans. Add indexes on foreign keys and WHERE clause columns.
Build systems that recover gracefully
- Operational vs programmer errors: Distinguish expected errors (user not found) from bugs (undefined reference). Handle differently.
- AppError class: Create custom error classes with `statusCode` and `isOperational` flags. Centralize error response formatting.
- Global error middleware: Place the error handler last in the middleware chain. Log the full stack for bugs; send a sanitized message to clients.
- Unhandled rejection handling: Listen to `unhandledRejection` and `uncaughtException`. Log, alert, and gracefully shut down.
- Graceful shutdown: Handle `SIGTERM` and `SIGINT`. Stop accepting requests, finish in-flight work, close connections, then exit.
- Shutdown timeout: Force exit after 10 seconds if graceful shutdown hangs. Better to restart than to linger as a zombie.
- Circuit breakers: Wrap external service calls with circuit breakers. Prevent cascading failures across services.
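The error-class and shutdown items above fit together in a few dozen lines. This is one possible shape under stated assumptions, not a definitive implementation: `toClientResponse`, `installGracefulShutdown`, and the `closeResources` callback are hypothetical names, and the sketch uses only Node's built-in `http` server rather than Express.

```typescript
import type { Server } from "node:http";

// Operational errors are expected failures (bad input, missing record);
// anything else is treated as a bug.
class AppError extends Error {
  readonly statusCode: number;
  readonly isOperational: boolean;

  constructor(message: string, statusCode: number, isOperational = true) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype); // keeps instanceof working on older compile targets
    this.name = "AppError";
    this.statusCode = statusCode;
    this.isOperational = isOperational;
  }
}

// Builds the client-facing response: the real message for operational errors,
// a generic one for bugs so internals are not leaked.
function toClientResponse(err: unknown): { status: number; body: { error: string } } {
  if (err instanceof AppError && err.isOperational) {
    return { status: err.statusCode, body: { error: err.message } };
  }
  return { status: 500, body: { error: "Internal Server Error" } };
}

// Registers SIGTERM/SIGINT handlers that drain the server, close resources,
// and force an exit if that takes longer than timeoutMs.
function installGracefulShutdown(
  server: Server,
  closeResources: () => Promise<void>, // hypothetical hook: close DB pools, queues, etc.
  timeoutMs = 10_000,
): void {
  let shuttingDown = false;

  const shutdown = (signal: string): void => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    console.log(`${signal} received, shutting down`);

    // Force exit if graceful shutdown hangs.
    const timer = setTimeout(() => process.exit(1), timeoutMs);
    timer.unref(); // the timer alone should not keep the process alive

    // Stop accepting new connections; the callback runs once existing ones have ended.
    server.close(() => {
      closeResources()
        .then(() => process.exit(0))
        .catch(() => process.exit(1));
    });
  };

  process.on("SIGTERM", () => shutdown("SIGTERM"));
  process.on("SIGINT", () => shutdown("SIGINT"));
}
```

A global error middleware would call `toClientResponse(err)` for the response and log the full stack separately whenever the error is not operational.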
Know what's happening in production at all times
- Pino over Winston: Use Pino; its documentation reports it as over 5x faster than alternatives in many cases. Logging shouldn't block your event loop.
- Structured JSON logs: Log in JSON format with `level`, `time`, `reqId`, `msg`. This enables log aggregation and querying.
- Correlation IDs: Generate `X-Request-ID` at entry. Thread it through all logs for request tracing.
- AsyncLocalStorage: Use Node.js AsyncLocalStorage to propagate request context without manual passing.
- Sensitive data redacted: Configure Pino `redact` for `password`, `authorization`, `creditCard`. Stay GDPR/PCI compliant.
- Log levels correct: Use `INFO` in production, `DEBUG` only for specific packages. `TRACE` never in production.
- No console.log: Add an ESLint rule to ban `console.log`. Use the logger everywhere for consistent formatting.
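The correlation ID, AsyncLocalStorage, and redaction items above can be seen working together in a small sketch. It is a hand-rolled stand-in built on Node's standard library, not Pino's API: `withRequestContext`, `formatLog`, and `REDACTED_KEYS` are hypothetical names, and the redaction here only looks at top-level keys. In a real service, Pino's own `redact` option would do that job.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

interface RequestContext {
  reqId: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Runs `handler` inside a context that carries a correlation ID.
// Reuses the incoming X-Request-ID value when the caller supplied one.
function withRequestContext<T>(incomingId: string | undefined, handler: () => T): T {
  const reqId = incomingId ?? randomUUID();
  return requestContext.run({ reqId }, handler);
}

const REDACTED_KEYS = new Set(["password", "authorization", "creditCard"]);

// Formats one structured JSON log line. The reqId comes from the current
// context, so callers never have to pass it around by hand.
function formatLog(level: string, msg: string, fields: Record<string, unknown> = {}): string {
  const safeFields: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    safeFields[key] = REDACTED_KEYS.has(key) ? "[Redacted]" : value;
  }
  return JSON.stringify({
    level,
    time: Date.now(),
    reqId: requestContext.getStore()?.reqId, // omitted from the output when no context is active
    msg,
    ...safeFields,
  });
}
```

A request middleware would wrap the rest of the chain in `withRequestContext(req.headers["x-request-id"], next)` or similar, after which every log line written during that request carries the same `reqId`.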
Verify correctness before production
- Integration tests preferred: Test full request→response cycles with Supertest. More value than mocked unit tests for APIs.
- Test database isolation: Use transactions or truncation between tests. Parallel tests shouldn't interfere.
- Coverage thresholds: Enforce 80% coverage gate in CI. Prevent test coverage regression over time.
- Security test cases: Write tests for rate limiting, auth failures, and input validation. Verify your security actually works.
- Environment parity: Run tests against real database (Docker). SQLite vs PostgreSQL differences cause production bugs.
- Flaky test policy: Mark flaky tests as skip with deadline to fix. Don't let them become ignored failures.
Ship confidently with these safeguards
- Multi-stage Docker builds: Separate build and runtime stages. Final image should have only compiled code and production deps.
- Non-root user: Run as `USER node`. Never run containers as root; a non-root user limits the damage from a container escape.
- dumb-init required: Use dumb-init as PID 1. Node.js was not designed to run as PID 1 and may not respond to signals such as SIGTERM as expected there.
- Alpine base image: Use `node:20-alpine` for smaller images. Target under 200MB.
- Production dependencies only: Run `npm ci --omit=dev` in the final stage (`npm ci --only=production` on older npm versions). No TypeScript or test frameworks.
- Health endpoint: Implement `/health` returning component status. Kubernetes liveness/readiness probes need an endpoint like this.
- Smoke tests automated: Run health check and critical API test immediately after deploy before routing traffic.
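For the health endpoint item above, a minimal sketch using only Node's built-in `http` module might look like this. The names `buildHealthReport` and `createHealthServer` are hypothetical, as is the `getComponents` callback. This sketch assumes that callback reads cached local state (for example, the result of the last background database ping) so the probe itself stays fast.

```typescript
import { createServer } from "node:http";

type ComponentStatus = "UP" | "DOWN";

interface HealthReport {
  status: ComponentStatus;
  httpStatus: 200 | 503;
  components: Record<string, ComponentStatus>;
}

// Combines per-component booleans into one report: 200 only when everything is up.
function buildHealthReport(components: Record<string, boolean>): HealthReport {
  const statuses: Record<string, ComponentStatus> = {};
  let allUp = true;
  for (const [name, healthy] of Object.entries(components)) {
    statuses[name] = healthy ? "UP" : "DOWN";
    if (!healthy) allUp = false;
  }
  return {
    status: allUp ? "UP" : "DOWN",
    httpStatus: allUp ? 200 : 503,
    components: statuses,
  };
}

// Creates (but does not start) a server that answers GET /health.
function createHealthServer(getComponents: () => Record<string, boolean>) {
  return createServer((req, res) => {
    if (req.url !== "/health") {
      res.writeHead(404).end();
      return;
    }
    const report = buildHealthReport(getComponents());
    res.writeHead(report.httpStatus, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ status: report.status, components: report.components }));
  });
}
```

Returning 503 when a component is down lets a readiness probe take the pod out of rotation without any extra logic on the Kubernetes side.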
Avoid these production killers
- Floating promises: The ESLint rule `@typescript-eslint/no-floating-promises` catches forgotten awaits. Since Node.js 15, an unhandled rejection terminates the process by default.
- Memory leaks: Monitor heap over time. A heap that keeps growing after GC indicates a leak. Use `--inspect` and heap snapshots.
- Callback hell: Use async/await exclusively. Nested callbacks are unreadable and error-prone.
- process.exit in handlers: Never call `process.exit()` from request handlers. Finish requests first.
- Blocking event loop: No synchronous file I/O and no CPU-heavy computation on the main thread. Offload to worker threads.
- Hardcoded timeouts: Externalize all timeout values. Allow tuning without redeployment.
Final checks before you hit deploy
- Feature flags validated: Ensure toggles are in the correct state. Test both enabled/disabled paths in staging.
- Database migrations tested: Run migrations in staging first. Verify rollback scripts work.
- Load test executed: Simulate production traffic. Identify memory leaks and bottlenecks under load.
- Dependency audit passed: Run `npm audit --omit=dev` (`npm audit --production` on older npm versions). No high or critical vulnerabilities.
- Type check clean: Run `tsc --noEmit`. No TypeScript errors allowed.
- Lint check clean: Run ESLint. No warnings, no errors.
- Config drift eliminated: Compare staging and prod environment variables. Use infrastructure-as-code.
- Rollback plan documented: Know how to revert the deployment. Test rollback in staging.
Verify production is actually healthy
- Health endpoint returns 200: Check `/health` shows all components UP (database, cache, external APIs).
- Metrics flowing: Confirm dashboards show live data. Check request rates, error rates, and latency.
- Error rate within SLO: Monitor 5xx errors. Should be <0.1% of total requests within first 10 minutes.
- Database connections active: Verify connection pool metrics show activity. No stuck connections.
- Cache hit rate normal: Check cache metrics. Sudden drop indicates invalidation or connection issues.
- Memory stable: Monitor heap over 30-60 minutes. Should stabilize after GC, not climb continuously.
- No unhandled rejections: Check logs for uncaught promise rejections. Should be zero.
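The error-rate item above is simple arithmetic, but getting the threshold wrong by a factor of 100 is a common slip: 0.1% is a ratio of 0.001. A small sketch, with `errorRate` and `withinSlo` as hypothetical helper names and the counts invented for the worked example:

```typescript
interface RequestCounts {
  total: number;        // all requests in the observation window
  serverErrors: number; // responses with a 5xx status
}

// Fraction of requests that failed with a 5xx status.
function errorRate(counts: RequestCounts): number {
  return counts.total === 0 ? 0 : counts.serverErrors / counts.total;
}

// True when the 5xx rate is below the SLO threshold (0.1% = 0.001 by default).
function withinSlo(counts: RequestCounts, maxRate = 0.001): boolean {
  return errorRate(counts) < maxRate;
}

// Worked example for the first 10 minutes after a deploy:
// 12 errors in 20,000 requests is 0.06%, which is inside a 0.1% SLO.
const healthyDeploy = withinSlo({ total: 20_000, serverErrors: 12 });
// 30 errors in 20,000 requests is 0.15%, which breaches it.
const badDeploy = withinSlo({ total: 20_000, serverErrors: 30 });
```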
Run efficiently without sacrificing visibility
- Log retention tuned: Keep 7-30 days hot, archive rest to cold storage. Avoid infinite retention.
- Metric cardinality managed: Limit high-cardinality labels (user IDs, URLs). Use sampling for traces.
- Dead code removed: Identify and remove unused endpoints. Every endpoint consumes resources.
- Auto-scaling configured: Use HPA based on CPU/memory or custom metrics. Scale down during off-peak.
- Right-sized containers: Profile actual usage. Start conservative, scale based on data.
- Cost alerts enabled: Set spending alerts at 80% of monthly budget. No surprise bills.
Lessons learned from production incidents
- The AsyncLocalStorage trap: Don't mutate objects held in AsyncLocalStorage. The store is shared by reference, so a change is visible to every asynchronous operation (and every request) that shares that context. Clone objects before modification.
- Event loop blocking discovery: We had 2-second response times until we profiled and found a synchronous XML parser. Keep large parsing jobs off the main thread: use a streaming parser or a worker thread. A faster synchronous parser such as `fast-xml-parser` reduces the cost but still blocks while it runs.
- Connection leak nightmare: Set `pool.idleTimeoutMillis` and monitor active connections. We discovered database connections that were never released when an HTTP request timed out.
- Health check dos and don'ts: Our health check called 5 external APIs and took 8 seconds, so Kubernetes killed healthy pods. Keep checks under 1 second and check local state only.
- The missing await: One forgotten `await` on a database insert caused silent data loss for 3 days. The ESLint `no-floating-promises` rule is mandatory.
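The AsyncLocalStorage lesson is easier to see in code. The sketch below contrasts mutating the store in place with cloning it into a nested context. `Ctx`, `addTagInPlace`, and `withExtraTag` are hypothetical names chosen for the illustration.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface Ctx {
  tags: string[];
}

const als = new AsyncLocalStorage<Ctx>();

// Risky: mutates the stored object in place. Every asynchronous operation
// that shares this store sees the change, whether it expected it or not.
function addTagInPlace(tag: string): void {
  als.getStore()?.tags.push(tag);
}

// Safer: clone the current store, add the tag to the copy, and run the
// dependent work in a nested context. The outer store is left untouched.
function withExtraTag<T>(tag: string, fn: () => T): T {
  const current = als.getStore() ?? { tags: [] };
  return als.run({ tags: [...current.tags, tag] }, fn);
}

// Demonstration: the nested context sees the extra tag, the outer one does not.
const demo = als.run({ tags: ["base"] }, () => {
  const inner = withExtraTag("child", () => [...(als.getStore()?.tags ?? [])]);
  const outer = [...(als.getStore()?.tags ?? [])];
  return { inner, outer };
});
```

Here `demo.inner` holds both tags while `demo.outer` still holds only `"base"`, which is the isolation the in-place version gives up.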
Use this checklist before every deployment. Print it, share it with your team, and pin it to your deployment runbook. The best production incidents are the ones that never happen because you caught them here.
Production readiness isn't a phase—it's a mindset.