---
name: kong-debug
description: Systematic debugging guide for Kong Gateway issues without relying on the Admin API. Use this skill whenever the user mentions Kong Gateway, Kong proxy, Kong plugins, 502/503/504 errors from Kong, Kong routes or services not working, Kong configuration issues, Konnect troubleshooting, or any API gateway problems that involve Kong. Trigger even if the user just says "my API gateway is broken" and Kong is in context. This variant focuses on proxy-side signals (response headers, logs, declarative config, container/network probes) for environments where the Admin API is disabled, locked down, or unreachable (e.g., Konnect data planes, hardened production).
---
# Kong Gateway Debugging (No Admin API)
Systematic playbook for diagnosing Kong Gateway issues using only proxy-side signals: response headers, logs, declarative config, container state, and network probes. This matches the reality of most production Kong deployments, where the Admin API is intentionally not exposed.
Most Kong problems fall into one of five categories: Kong itself is unhealthy, routing is misconfigured, the upstream is unreachable, a plugin is misbehaving, or there's a network/TLS issue.
---
## 0. Pre-Flight — Scope and Endpoint
### 0.1 State the scope disclaimer FIRST
Before asking anything else, before running any commands, open the conversation with a clear one-liner:
> **"Heads up: I can only debug against non-production environments. If the issue is on prod, I can help you investigate read-only signals (metrics, logs, response headers) but I won't run changes or probes against prod. Which environment are we looking at (e.g dev, staging)?"**
This is non-negotiable and goes at the top of the *first* reply to any Kong debug request — even if the user has already named the environment. Restating it sets expectations, prevents surprises mid-session, and gives the user an explicit moment to correct you if they meant prod.
**Why:** Kong debugging commands include active probes (curl to the proxy, `kubectl exec`, plugin toggles via declarative re-sync, restart-to-reset-health-state). Running those against prod without authorization is high-blast-radius. Making the constraint visible up front is cheaper than explaining mid-incident why you won't run a given command.
**What to do when the user says prod:** stop the debug flow and direct them to the **Production Support team**. Do not attempt to triage, do not query Victoria Metrics for prod, do not draft commands, do not analyze prod logs or headers the user pastes. A short, firm redirect:
> "This is a production issue — I can't debug prod. Please contact the Production Support team so they can engage the right on-call and follow the incident process. I'm happy to help once you're back on a non-prod environment (staging, dev, sandbox)."
Do not negotiate around this. If the user insists ("just take a quick look", "it's urgent"), repeat the redirect. Production incidents need the support team's audit trail, paging, and change-control — not an ad-hoc debug session.
### 0.2 Then ask for the endpoint
After the scope disclaimer, ask the user for:
1. **Kong proxy FQDN** — e.g. `api.example.com` (the hostname the client hits)
2. **Path** — e.g. `/v1/orders/create`
3. *(optional, if known)* HTTP method, request body shape, and the full failing response (status code + headers if they have them)
**Why this matters:** Every subsequent check — route matching (§3), upstream probing (§4), plugin isolation (§5), TLS (§7) — is parameterized by the exact FQDN and path the user is hitting. Running a curl against the wrong path, or assuming the wrong host, will produce misleading headers and send the whole diagnosis down the wrong branch. A 30-second clarification up front saves an iteration of wrong answers.
Do not skip this step even if the user seems to have already hinted at the endpoint. Confirm the exact values (quoted verbatim) before continuing. If the user answers vaguely ("the payments API"), push for the literal hostname and path they would type into curl.
Once you have FQDN and path, make them concrete in every command you run — substitute them into `<kong-proxy>` and `<path>` placeholders below. Do not run generic commands against localhost or example.com.
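A minimal sketch of what "make them concrete" looks like in practice; the hostname and path below are placeholders for whatever the user actually confirmed, not real endpoints:
```bash
# Hypothetical values: substitute exactly what the user confirmed in §0.2
KONG_PROXY="api.staging.example.com"   # proxy FQDN the client hits
REQ_PATH="/v1/orders/create"           # literal path, copied verbatim
# Every later check then runs against the real endpoint, never a placeholder:
curl -sv "https://${KONG_PROXY}${REQ_PATH}" -o /dev/null
```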
---
## 1. Triage First — What Are You Seeing?
Nail down the symptom before digging in:
| Symptom | Jump to |
|---------|---------|
| Kong returns 404 | §3 Route/Service config |
| Kong returns 502 / 503 | §4 Upstream connectivity |
| Kong returns 504 Gateway Timeout | §4 Upstream connectivity |
| Kong returns 401 / 403 | §5 Plugin behavior |
| Kong returns 500 | §2 Kong health + §5 Plugins |
| Request never reaches upstream | §3 Route matching |
| Plugin not applying / applying twice | §5 Plugin behavior |
| High latency through Kong | §6 Log analysis |
Always capture the **full response headers**. Kong injects `X-Kong-*` headers that tell you exactly which service, route, and upstream were matched — this is the single most valuable signal when the Admin API is off-limits.
```bash
curl -sv https://<kong-proxy>/<path> 2>&1 | grep -iE "^< |X-Kong|Server:"
```
Key headers to read:
- `X-Kong-Proxy-Latency` — time Kong spent on plugins/routing
- `X-Kong-Upstream-Latency` — time waiting for upstream
- `X-Kong-Route-Id` / `X-Kong-Service-Id` — which config matched (absent on 404 → no route matched at all)
- `X-Kong-Request-Id` — correlates the proxy request with entries in Kong's error/access log
- `Via: kong/<version>` — confirms the response actually came from Kong (not an upstream LB)
**If there is no `X-Kong-*` header on the response, the request never reached Kong.** Check your DNS, load balancer, or ingress in front of Kong before doing anything else.
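When you need a capture that is easier to read or share than full `-v` output, dumping only the response headers gives the same signal (same `<kong-proxy>`/`<path>` placeholders as above):
```bash
# Status line plus the Via / X-Kong-* headers is usually enough to triage
curl -s -o /dev/null -D - "https://<kong-proxy>/<path>" | grep -iE "^(HTTP|via|x-kong-|server)"
```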
---
## 2. Kong Health Check
**Always start by querying Victoria Metrics.** Kong's Prometheus plugin exports liveness, latency, and request-volume metrics that VM scrapes — this is the authoritative source of truth for "is Kong healthy right now" in this platform, and it works without Admin API, without shell access to the pod, and across every DP in the fleet at once.
### Query Victoria Metrics first
Hit the VM HTTP API (`/api/v1/query` — PromQL-compatible) with these checks, in order:
```bash
VM=<your-victoria-metrics-url> # e.g. http://vmselect.monitoring.svc:8481/select/0/prometheus
# 1. Is Kong up? (one series per DP instance — expect value=1 for each)
curl -s "$VM/api/v1/query" --data-urlencode 'query=up{job=~"kong.*"}' | jq '.data.result'
# 2. Is Kong actually serving traffic? (request rate over last 5m)
curl -s "$VM/api/v1/query" --data-urlencode 'query=sum by (instance) (rate(kong_http_requests_total[5m]))' | jq '.data.result'
# 3. Are any DPs returning 5xx? (error rate by service)
curl -s "$VM/api/v1/query" --data-urlencode 'query=sum by (service, code) (rate(kong_http_requests_total{code=~"5.."}[5m]))' | jq '.data.result'
# 4. Upstream latency (p95 over last 5m) — catches slow upstreams before timeouts
curl -s "$VM/api/v1/query" --data-urlencode 'query=histogram_quantile(0.95, sum by (le, service) (rate(kong_upstream_latency_ms_bucket[5m])))' | jq '.data.result'
# 5. Memory pressure (leaks, OOM precursor)
curl -s "$VM/api/v1/query" --data-urlencode 'query=kong_memory_lua_shared_dict_bytes{shared_dict="kong_db_cache"}' | jq '.data.result'
```
Reading the results:
- **`up == 0` for any instance** → that DP is down or unreachable from the scraper. Skip to container/pod checks below for that specific instance only.
- **`up == 1` but request rate is 0** → Kong is running but nothing is reaching it (DNS, ingress, or LB problem in front of Kong).
- **`up == 1` and request rate > 0** → Kong itself is fine. The issue is downstream — jump to §3 (routing) or §4 (upstreams) based on the symptom.
- **5xx rate spiking on a specific `service` label** → go straight to §4 for that service; don't waste time on global health.
### Useful Kong Prometheus metrics reference
| Metric | What it tells you |
|--------|-------------------|
| `up{job="kong"}` | Scrape target alive (1/0) |
| `kong_http_requests_total` | Request count by service/route/code/consumer |
| `kong_latency_ms_bucket` (`type="kong"`) | Time Kong itself spent (plugins + routing) |
| `kong_upstream_latency_ms_bucket` | Time waiting for upstream |
| `kong_bandwidth_bytes` | Ingress/egress bytes by service |
| `kong_nginx_connections_total` | Active/reading/writing/waiting connections |
| `kong_memory_lua_shared_dict_bytes` | Shared dict usage (cache pressure) |
| `kong_datastore_reachable` | DB reachability (DB-backed mode) |
If a metric you expect is missing from VM, either the Prometheus plugin isn't enabled globally or the scrape config isn't matching Kong's metrics endpoint. Fall back to the pod/container checks below — but also fix the monitoring gap, because you'll want this for the next incident.
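To tell whether the gap is on Kong's side or the scraper's, one check (assuming the Status API is enabled on `:8100`, as in the checklist in §8) is to read the metrics endpoint directly from the pod:
```bash
# Sketch: is Kong itself exposing Prometheus metrics? (assumes status_listen on :8100)
kubectl exec -n <ns> <kong-pod> -c proxy -- curl -s http://localhost:8100/metrics | head -20
# Non-empty kong_* output -> Kong side is fine, fix the scrape config.
# Connection refused or empty -> enable the prometheus plugin and/or status_listen.
```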
### Fallback: only if VM is unavailable or an instance is `up == 0`
```bash
# Kubernetes
kubectl get pods -n <namespace> -l app=kong
kubectl describe pod -n <namespace> <kong-pod>
kubectl logs -n <namespace> deploy/kong -c proxy --tail=100
# Proxy ports actually bound? (run on the Kong host, or via kubectl exec / docker exec into the proxy container)
ss -tlnp | grep -E ":8000|:8443"
```
Look for crash loops, OOM kills, readiness probe failures, or `kong: [error]` lines during startup. A Kong that won't start almost always logs the reason in the first ~50 lines of its error log (bad config, DB unreachable at boot, plugin schema error).
---
## 3. Route & Service Configuration (Declarative / GitOps)
Without the Admin API, your source of truth is the declarative config — typically a `kong.yml` (or `.yaml`) file, a decK dump, or a Kong Ingress Controller CRD set.
### Find the active config
```bash
# DB-less Kong — config is mounted into the container
kubectl exec -n <ns> <kong-pod> -c proxy -- cat /kong/declarative/kong.yml
# or wherever declarative_config points in kong.conf
# decK-managed environments — check the repo that's synced to Konnect / Kong
# Look for: kong.yml, deck.yaml, or a GitOps folder like kong/config/
# Kong Ingress Controller (KIC) — routes/services are Kubernetes resources
kubectl get kongingress,kongplugin,kongconsumer -A
kubectl get ingress -A
kubectl get httproute,gateway -A # Gateway API
```
### Verify the route actually matches your request
Kong matches routes by `hosts`, `paths`, `methods`, `headers`, `snis` — all specified conditions must match. Cross-check the request against the declarative entry:
```yaml
# Example entry in kong.yml
services:
  - name: my-service
    url: http://my-upstream.internal:8080
    routes:
      - name: my-route
        hosts: ["api.example.com"]
        paths: ["/v1/orders"]
        strip_path: true
        methods: ["GET", "POST"]
        protocols: ["https"]
```
Things to verify against the live request:
- **`hosts`** — does it match the `Host` header you're sending? (including/excluding port)
- **`paths`** — is `strip_path: true/false` set correctly? If `strip_path: true`, Kong strips the matched path prefix before forwarding.
- **`protocols`** — does it include `http` and/or `https` as needed?
- **`methods`** — is your HTTP method listed?
- **Route priority** — if multiple routes could match, Kong picks the most specific one. Longer paths and explicit hosts win.
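One quick way to exercise the `hosts`/`paths` checks above without touching config: send a request that matches the example route exactly and look for the match headers (swap in the real values you confirmed in §0.2):
```bash
# Match headers present -> routing is fine; 404 with no X-Kong-Route-Id -> no route claimed the request
curl -sv "https://<kong-proxy>/v1/orders" -H "Host: api.example.com" -o /dev/null 2>&1 \
  | grep -iE "^< HTTP|x-kong-route-id|x-kong-service-id"
```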
### Use response headers to confirm matching
On a successful match you'll see `X-Kong-Route-Id` and `X-Kong-Service-Id` in the response (§1). On a 404, these are **absent** — Kong saw the request but no route claimed it. On a match where the upstream fails, both are present plus an upstream error — that tells you routing worked, the upstream didn't.
### Test the upstream target directly from inside Kong's network
```bash
# Kubernetes
kubectl exec -n <ns> <kong-pod> -c proxy -- curl -sv http://<service-host>:<port>/<path>
# Docker
docker exec kong curl -sv http://<service-host>:<port>/<path>
```
If this fails from inside Kong but works from your laptop, it's a network/DNS problem between Kong and upstream, not a Kong config issue.
---
## 4. Upstream Connectivity (502 / 503 / 504)
These errors mean Kong matched a route/service but couldn't get a good response from upstream.
### Read the headers and log line together
```bash
# Header tells you upstream latency — if it's close to read_timeout, you've found the 504
curl -sv https://<proxy>/<path> 2>&1 | grep -i "X-Kong-Upstream-Latency"
# Grab X-Kong-Request-Id from the response and correlate with error log
kubectl logs -n <ns> deploy/kong -c proxy | grep "<request-id>"
```
### Timeout interpretation
| Error | Likely cause |
|-------|-------------|
| 502 immediately | Upstream refused connection / wrong port / pod not ready |
| 502 after delay | Upstream dropped connection mid-response |
| 504 | Upstream took longer than `read_timeout` (default: 60s) |
| 503 | All targets unhealthy or passive health check circuit-breaker open |
To change timeouts without Admin API, edit the declarative config (`connect_timeout`, `read_timeout`, `write_timeout` on the service) and re-sync (decK sync, GitOps push, KIC annotation update, or Konnect publish).
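As a sketch, the three timeout fields live on the service entry and are expressed in milliseconds (the values below are illustrative, not recommendations):
```yaml
services:
  - name: my-service
    url: http://my-upstream.internal:8080
    connect_timeout: 5000    # ms to establish the TCP connection to the upstream
    write_timeout: 60000     # ms between successive write operations
    read_timeout: 60000      # ms between successive read operations; the usual 504 knob
```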
### DNS resolution from Kong's perspective
```bash
# Inside the Kong container/pod
kubectl exec -n <ns> <kong-pod> -c proxy -- nslookup <upstream-hostname>
kubectl exec -n <ns> <kong-pod> -c proxy -- getent hosts <upstream-hostname>
# For Kong's own resolver view, check logs for DNS errors (§6)
```
Common DNS issues on Kong:
- Upstream uses a `.svc.cluster.local` name but Kong's DNS resolver doesn't include the cluster DNS
- Short-TTL record — Kong caches DNS per its `dns_*` settings in `kong.conf`; stale entries after failover
- IPv6/IPv4 mismatch — upstream resolves to AAAA but Kong prefers A (or vice versa)
### Targets / active health checks
If the upstream is a Kong **upstream** with multiple targets, passive health checks may have marked targets unhealthy. Without the Admin API, signals you still have:
- Error log: `no live upstreams while connecting to upstream` → all targets are currently marked down
- Error log: `upstream SERVER temporarily disabled` → passive health check just ejected a target
- Restart Kong (last resort) to reset the in-memory health state — useful if upstreams recovered but Kong hasn't noticed
---
## 5. Plugin Debugging
Without the Admin API, you debug plugins via: declarative config + response headers + logs + surgical edits to the config.
### Find which plugins apply
In your declarative config, plugins can be attached at four scopes (global → service → route → consumer, most specific wins):
```yaml
plugins:                     # global
  - name: rate-limiting
    config: {minute: 100, policy: local}
services:
  - name: my-service
    plugins:                 # service-scoped
      - name: key-auth
    routes:
      - name: my-route
        plugins:             # route-scoped
          - name: cors
```
For KIC: check `KongPlugin` / `KongClusterPlugin` CRDs and `konghq.com/plugins` annotations on Ingresses/Services.
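For KIC environments, a rough way to see which plugins are attached where without opening every manifest (the `konghq.com/plugins` annotation key is standard; the output columns are just a convenience):
```bash
# Plugin CRDs, then which Ingresses reference them via annotation
kubectl get kongplugin,kongclusterplugin -A
kubectl get ingress -A \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,PLUGINS:.metadata.annotations.konghq\.com/plugins'
```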
### Plugin signals in response headers
- `WWW-Authenticate` on 401 → auth plugin rejected the request; the realm/scheme tells you which one
- `X-RateLimit-*` headers → rate-limiting plugin is active; check the `Remaining` values
- `Access-Control-*` headers → CORS plugin is active (or should be, if missing)
- `X-Kong-Proxy-Latency` unusually high → an expensive plugin (e.g., external auth callback) is running
### Common plugin-specific issues
**JWT / Key Auth (401 Unauthorized)**
- Is the credential header name right? (`apikey`, `Authorization`, custom)
- Is the consumer still present in the declarative config? (Credential without matching consumer = silent 401)
- For JWT: is the `iss` claim an exact match for a registered `jwt_secrets.key`?
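For the first check in that list, a quick probe is to send the key in the two places key-auth looks by default (header and query parameter, both named `apikey` unless `key_names` overrides it):
```bash
# Same credential, two delivery methods; compare the status codes
curl -s -o /dev/null -w "%{http_code}\n" "https://<kong-proxy>/<path>" -H "apikey: <key>"
curl -s -o /dev/null -w "%{http_code}\n" "https://<kong-proxy>/<path>?apikey=<key>"
```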
**Rate Limiting (429 Too Many Requests)**
- `X-RateLimit-Limit-<period>` vs `X-RateLimit-Remaining-<period>` in the headers
- `policy` setting: `local` means each Kong node counts independently (common cause of "429 but I only made 1 request" on multi-node clusters)
- `limit_by: consumer` but consumer isn't authenticated yet → limit applies per IP instead
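To read the live counters behind those checks, the rate-limiting headers on any response (or on the 429 itself) show which window is exhausted; the `apikey` header below is only needed if the route also requires key-auth:
```bash
# Limit vs Remaining per window, plus Retry-After on a 429
curl -s -o /dev/null -D - "https://<kong-proxy>/<path>" -H "apikey: <key>" \
  | grep -iE "^(x-ratelimit|ratelimit|retry-after)"
```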
**CORS (browser blocked)**
- Always test the **OPTIONS preflight** explicitly; browsers fail silently on a bad preflight:
```bash
curl -sv -X OPTIONS https://<proxy>/<path> \
  -H "Origin: https://yourfrontend.com" \
  -H "Access-Control-Request-Method: POST"
```
- Verify the declarative `origins`, `methods`, `headers`, `credentials`, `max_age` values
**Request/Response Transformer**
- Read the declarative `add`/`remove`/`replace`/`rename` rules and apply them by hand against a sample request to see what's happening
### Surgical isolation
When you suspect a specific plugin: comment it out of the declarative config, re-sync, retest. If the symptom disappears, the plugin or its config is the culprit. Revert after confirming.
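If the environment is decK-managed (one of the re-sync options in §4), the isolation loop looks roughly like this; the command names assume a recent decK pointed at whatever sync target the repo already uses:
```bash
# 1. Comment out the suspect plugin in kong.yml
# 2. Preview, apply, retest, then revert
deck gateway diff kong.yml    # shows exactly what the edit will change
deck gateway sync kong.yml    # applies it; reproduce the failing request and compare
# 3. Restore the plugin block and sync again once you've confirmed
```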
---
## 6. Log Analysis (Primary Diagnostic Tool)
**Always query logs from the Loki endpoint.** The platform ships Kong's stdout to Loki, and Loki is the authoritative log source — it covers every DP across the fleet, indexes by label, and works without pod/container shell access. `kubectl logs` and `docker logs` are fallbacks only (local dev, or Loki itself is down).
### Query Loki first (LogQL)
Hit Loki's HTTP API (`/loki/api/v1/query_range` for time-bounded, `/loki/api/v1/query` for instant). Use `logcli` or plain curl + jq:
```bash
LOKI=<your-loki-url> # e.g. http://loki.monitoring.svc:3100
SINCE="1h" # adjust to the incident window
# 1. Any Kong errors in the window — broad first pass
logcli --addr="$LOKI" query --since="$SINCE" \
  '{app="kong", container="proxy"} |= "[error]"'
# 2. Narrow to a specific DP / environment / service
logcli --addr="$LOKI" query --since="$SINCE" \
  '{app="kong", env="staging", container="proxy"} |= "[error]"'
# 3. Correlate with a specific request via X-Kong-Request-Id (header arrives lowercase on HTTP/2)
REQ_ID=$(curl -s -o /dev/null -D - https://<proxy>/<path> \
  | awk -F': ' 'tolower($1) ~ /x-kong-request-id/ {print $2}' | tr -d '\r')
logcli --addr="$LOKI" query --since="$SINCE" \
  "{app=\"kong\"} |= \"$REQ_ID\""
# 4. Upstream connection errors (aggregated across the fleet)
logcli --addr="$LOKI" query --since="$SINCE" \
  '{app="kong"} |~ "(connect\\(\\) failed|upstream timed out|no live upstreams|upstream prematurely closed)"'
# 5. Rate of errors per DP pod over time (for a dashboard-style view)
logcli --addr="$LOKI" query-range --since="$SINCE" --step=1m \
  'sum by (pod) (rate({app="kong"} |= "[error]" [1m]))'
```
If `logcli` isn't installed, the equivalent curl:
```bash
curl -sG "$LOKI/loki/api/v1/query_range" \
--data-urlencode 'query={app="kong"} |= "[error]"' \
--data-urlencode "start=$(date -u -v-1H +%s)000000000" \
--data-urlencode "end=$(date -u +%s)000000000" \
--data-urlencode "limit=500" | jq '.data.result'
```
### Useful LogQL patterns for Kong
| What you want | LogQL |
|---|---|
| Errors for one service | `{app="kong"} \|= "service=\"<service-name>\"" \|= "[error]"` |
| One request end-to-end | `{app="kong"} \|= "<X-Kong-Request-Id>"` |
| Upstream failures | `{app="kong"} \|~ "connect\\(\\) failed\|upstream timed out\|no live upstreams"` |
| Plugin errors | `{app="kong"} \|~ "error.*plugin\|plugin.*error\|failed to run plugin"` |
| DNS failures | `{app="kong"} \|~ "dns\|resolver\|could not be resolved"` |
| TLS/SSL issues | `{app="kong"} \|~ "ssl\|tls\|certificate\|handshake"` |
| Lua errors | `{app="kong"} \|~ "lua.*error\|stack traceback\|runtime error"` |
| Access log for one path | `{app="kong", container="proxy"} \|= "\"/v1/orders\""` |
| Error rate per pod | `sum by (pod) (rate({app="kong"} \|= "[error]" [1m]))` |
Label names (`app`, `env`, `container`, `pod`) may differ in this platform — adjust to whatever the scrape config actually attaches. If you're unsure, run `logcli --addr="$LOKI" labels` to list them, and `logcli --addr="$LOKI" labels <name>` to see values.
### Fallback: only if Loki is unreachable or you're on a local dev box
```bash
# Docker / stdout
docker logs kong 2>&1 | tail -200
# Kubernetes
kubectl logs -n <ns> deploy/kong -c proxy --tail=500
kubectl logs -n <ns> deploy/kong -c proxy -f # live tail
# File-based (older setups)
tail -n 200 /usr/local/kong/logs/error.log
tail -n 200 /usr/local/kong/logs/access.log
# Then grep as usual:
# "connect() failed|upstream timed out|no live upstreams|upstream prematurely closed"
# "error.*plugin|plugin.*error|failed to run plugin"
# "dns|resolver|could not be resolved|name resolution"
# "ssl|tls|certificate|handshake|verify"
# "lua.*error|stack traceback|runtime error"
```
### Correlate with the failing request
The `X-Kong-Request-Id` response header appears in Kong's error log lines — that's the single best key to zero in on one request's full trace (use it in LogQL above, or grep through `kubectl logs` output if falling back).
### Analyze the log output and report back to the user
Running the query is step 1; the value is in what you make of the result. After every log query, do this before moving on:
**1. Read the full result, not just the first line.**
Scan for repeated patterns (same error across many pods → systemic; one pod only → isolated instance), error bursts aligned with the incident window, and secondary errors that reveal the root cause (e.g. a DNS failure preceding a stream of `no live upstreams`).
**2. Classify each distinct error.** Map what you see to one of the known buckets so the user gets a named cause, not raw log text:
| Log signal | Likely category | Next step |
|---|---|---|
| `connect() failed`, `connection refused` | Upstream down / wrong port | §4 upstream connectivity |
| `upstream timed out` | Upstream slow (> `read_timeout`) | §4 timeout diagnosis |
| `no live upstreams while connecting to upstream` | All targets marked unhealthy | §4 targets / health checks |
| `upstream prematurely closed connection` | Upstream killed the connection mid-response | Upstream app logs, not Kong |
| `could not be resolved`, `dns server error` | DNS failure from Kong's resolver | §4 DNS section |
| `SSL_do_handshake() failed`, `certificate verify failed` | TLS/mTLS issue | §7 TLS |
| `failed to run '<plugin>' plugin`, `lua entry thread aborted` | Plugin error (often custom or misconfigured) | §5 plugins |
| `attempt to index a nil value`, `stack traceback` | Lua runtime error (custom plugin bug) | §5 plugins, disable suspect plugin |
| `[warn] ... using uninitialized ...` | Usually noise; ignore unless correlated with the incident | — |
| No matching error lines | Kong isn't logging an error — check upstream app logs, client side, or network layer | — |
**3. Count and correlate.**
- How many occurrences in the window? One-off vs. sustained?
- Which pods/instances? Fleet-wide or localized?
- Does the error rate from the log match the error rate you saw in Victoria Metrics (§2)? A mismatch is itself a signal (e.g. VM says 5xx but no log errors → error is happening before Kong gets to log it, or logging is broken).
- For access logs: what are `$upstream_response_time` vs `$request_time` telling you? (See the breakdown below.)
**4. Respond to the user with a structured summary.** Don't paste raw logs back — give them the analyzed picture in this shape:
> **Finding:** <one-sentence root-cause hypothesis>
> **Evidence:** <representative log line(s), trimmed — 1–3 lines max, with count and time range>
> **Scope:** <how many occurrences, which pods/services, start/end time>
> **Correlates with:** <matching VM metrics if you checked them, or "no matching 5xx rate in VM" as a negative finding>
> **Recommended next step:** <exact next playbook section + the specific check to run, e.g. "§4: run upstream health probe from inside kong pod">
If the logs don't support a confident conclusion, say so explicitly — "logs show X but that alone doesn't explain Y; recommend we also check Z" — rather than forcing a hypothesis. A clear "I don't know yet, here's what to check next" is more useful than an overconfident guess.
**5. Ask only if you need more signal.** If the log window is empty or ambiguous, ask for a tighter time window, a specific `X-Kong-Request-Id` from a reproduction, or permission to widen the query — don't dump raw logs on the user and ask them to interpret.
### Access log — latency breakdown
If access logging is enabled (via declarative `log_format` or the `file-log` / `http-log` plugins), the useful fields are:
```
$request_time # total time Kong spent on the request
$upstream_response_time # time spent waiting for upstream
$upstream_connect_time # time spent establishing upstream connection
```
Interpretation:
- `request_time` large, `upstream_response_time` large → upstream is slow
- `request_time` large, `upstream_response_time` small → plugins or Kong internals are slow
- `upstream_connect_time` large → network / DNS / TCP connect to upstream is slow
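A rough way to get that split from a file-based access log, assuming a custom `log_format` that ends with those three variables in that order (adjust the field positions to whatever your format actually emits):
```bash
# total / upstream / connect per request, plus the implied Kong-side overhead (total minus upstream)
awk '{ rt=$(NF-2); urt=$(NF-1); uct=$NF;
       printf "total=%ss upstream=%ss connect=%ss kong=%.3fs\n", rt, urt, uct, rt-urt }' \
    /usr/local/kong/logs/access.log | tail -20
```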
### Raise log verbosity via config (not Admin API)
Edit `kong.conf` (or the Helm values / env vars):
```
log_level = debug
```
Then restart Kong (`kubectl rollout restart deploy/kong` / `docker restart kong` / `systemctl restart kong`). **Revert after debugging** — debug logging is high-volume and will drown log aggregation in production.
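On Kubernetes the same setting usually goes through the Helm values rather than a hand-edited `kong.conf`; the official chart maps keys under `env` to `KONG_*` environment variables, so a sketch of the equivalent change is:
```yaml
# values.yaml fragment: becomes KONG_LOG_LEVEL=debug on the proxy container
env:
  log_level: debug
```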
---
## 7. TLS / Certificate Issues
The Admin API helps with this, but openssl and curl cover 90% of what you need from outside.
```bash
# Check expiry and SANs of the cert Kong serves
echo | openssl s_client -connect <kong-proxy>:443 -servername <hostname> 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer -ext subjectAltName
# Test TLS handshake end-to-end with verbose output
curl -sv https://<kong-proxy>/<path> 2>&1 | grep -iE "SSL|TLS|certificate|verify|subject|issuer"
# Test upstream TLS (if Kong → upstream is HTTPS)
kubectl exec -n <ns> <kong-pod> -c proxy -- \
  openssl s_client -connect <upstream-host>:443 -servername <upstream-host> </dev/null
```
Common TLS problems:
- **SNI mismatch** — Kong has multiple certs bound to SNIs; the `Host` you're sending doesn't match any SNI, so Kong serves the default cert (which the client rejects)
- **Expired cert** — check dates with openssl above; also check any `cert-manager` or secret-rotation log events
- **mTLS to upstream** — Kong needs a client cert; verify `tls_verify`, `tls_verify_depth`, `ca_certificates`, and `client_certificate` in the declarative service config
- **Self-signed upstream** — Kong refuses with `upstream SSL certificate verify failed`; set `tls_verify: false` on the service (temporarily) or add the CA
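For the mTLS and CA cases, the relevant fields sit on the service entry in declarative config. A rough sketch (the certificate references are hypothetical UUIDs that would be defined elsewhere in the same file):
```yaml
services:
  - name: my-service
    protocol: https
    host: my-upstream.internal
    port: 443
    tls_verify: true                                  # false only as a temporary unblock for self-signed upstreams
    tls_verify_depth: 2
    ca_certificates: ["<ca-cert-uuid>"]               # id of a ca_certificates entry
    client_certificate: { id: "<client-cert-uuid>" }  # client cert Kong presents for mTLS to the upstream
```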
---
## 8. Systematic Checklist for "My API Isn't Working"
1. **Does the response have `X-Kong-*` headers / `Via: kong/...`?**
No → request never reached Kong. Check DNS, LB, ingress in front of Kong before anything else.
2. **Is Kong running and not crash-looping?**
`kubectl get pods` / `docker ps` / `systemctl status`. Check logs for startup errors.
3. **Status endpoint healthy?** (`curl http://<kong>:8100/status`) — Kong is live and DB (if any) is reachable.
4. **If 404:** no route matched. Cross-check declarative `hosts`/`paths`/`methods`/`protocols` against the actual request.
5. **If 502/503/504:** route matched, upstream failed. Compare `X-Kong-Upstream-Latency` to `read_timeout`; test the upstream directly from inside the Kong pod.
6. **If 401/403:** identify the auth plugin from the declarative config, verify consumer and credentials exist, check the header/param name you're sending.
7. **Unexpected behavior at any status:** list every plugin at global/service/route scope from the declarative config; disable the suspect one and re-test.
8. **Grep error logs** with the `X-Kong-Request-Id` from the failing response — that request's full trace is in there.
9. **Still stuck?** Set `log_level = debug` in `kong.conf`, restart, reproduce, read logs, revert.