
kubescan.sh - a debug capture script for Kubernetes clusters

Objectives:

  • Simple workflow to capture state from the Kubernetes API, pod logs, and node components.
  • Maximize coverage of fault modes in common cluster environments.
  • Maximize compatibility with user runtime environments (macOS & Linux).
  • Rely on native/upstream Kubernetes capabilities, with limited use of common extensions (e.g. the node-shell kubectl plugin).
  • Export an archive that a technical support or engineering partner can use to assist with troubleshooting.

The intended user for this utility is a Kubernetes user with full access to their cluster and their node environment.

The use of node-shell requires that the user be able to deploy privileged pods in the kube-system namespace.

The user should upload the resulting tar.gz file to a repository the technician can access, and/or transmit it through shared communication channels.

Usage

The authoritative flags/options are defined in the shell script function print_help.
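
Any unrecognized verb (help, for instance) prints the usage text and runs the dependency check:

bash kubescan.sh help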

Examples follow.

Simple capture and archive with defaults.

export KUBECONFIG=~/.kube/config  # you should define this explicitly.

bash kubescan.sh all

This produces a compressed tarball (.tar.gz file) containing the full capture of logs and state from all supported scopes.

Filtering log output to remove sensitive data.

If a user needs to remove sensitive data from the log files or other outputs, a manual step can be injected into the workflow, like so:

export KUBECONFIG=...

bash kubescan.sh dump

filter_logs ./kdiag.*/*.log         # user-defined filter (placeholder)
filter_state ./kdiag.*/*.yaml       # user-defined filter (placeholder)
filter_nodestat ./kdiag.*/*.txt     # user-defined filter (placeholder)

bash kubescan.sh archive
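
The filter_* commands above are user-defined placeholders. A minimal sketch of one such redaction pass, with illustrative (not exhaustive) patterns, might look like:

filter_logs () {
  # Hypothetical example: mask bearer tokens and password assignments in place.
  # sed -i.bak works with both GNU sed (Linux) and BSD sed (macOS).
  for f in "${@}" ; do
    sed -i.bak -E \
      -e 's/[Bb]earer [A-Za-z0-9._~+-]+/Bearer REDACTED/g' \
      -e 's/(password|passwd)=[^[:space:]]*/\1=REDACTED/g' \
      "${f}"
  done
}
filter_logs ./kdiag.*/*.log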

Restricting capture to specific scopes

A user can explicitly capture individual scopes like so:

export KUBECONFIG=...

bash kubescan.sh state      # capture only API state
bash kubescan.sh nodestat   # capture only node component states
bash kubescan.sh archive    # create the tarball

Capture Scope

To facilitate troubleshooting, this script by default captures a wide scope of artifacts for analysis.

  • API state: all resources of all kinds, including CRDs, in all namespaces; but excluding Secrets.
  • Pod logs: all logs of all pods on every node, in all namespaces.
  • Node component status: kernel version & modules; network configuration, routes, interfaces, and firewall rules; containers and pods reported by the CRI; filesystem usage; process tree.

Namespaces can be excluded from pod log capture using the -N <namespace> option; it can be given multiple times to exclude multiple namespaces. This does not restrict API state capture, as troubleshooting of critical systems often requires inspection of components in various namespaces.
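
For example, to exclude two chatty namespaces from log capture (namespace names here are illustrative):

bash kubescan.sh all -N kube-system -N monitoring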

Log Capture Duration

The duration of log capture defaults to 6 hours. This can be overridden using the -d <duration> option, which takes the format "N{h|m|s}" for hours, minutes, or seconds respectively, for example 120m.
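
For example, to capture only the last 30 minutes of logs:

bash kubescan.sh all -d 30m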

Skipping NotReady Nodes

In some failure modes, nodes will be unresponsive to log requests, e.g. if they've failed or crashed and kubelet is dead. This results in very long retry loops polling pods that are unavailable. To work around this, the -R flag restricts pod selection for log polling to only nodes reporting Ready status.
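
For example (as in the test run shown later):

bash kubescan.sh all -d 20m -R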

Including Secrets

If you need to include Secrets in your API state capture, you can activate this using the -S option. Be aware that this may expose sensitive credentials wherever you transmit the archive.
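
For example (treat the resulting archive as sensitive material):

bash kubescan.sh all -S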

Operational Behavior

To minimize this script's running time, several scopes support parallel fetching. To limit load on the API server, serialization is enforced between scopes, and some scopes have batch controls.

The following groups run serially with respect to one another; within each group, work proceeds in parallel.

  • All global Kubernetes resources are fetched in parallel.
  • All namespaced Kubernetes resources are fetched in parallel.
  • All node component dumps are executed in parallel across all nodes.
  • Pod logs are captured in batches of 6 namespaces, streamed in parallel across all pods in each namespace; see the sketch below.
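
A simplified sketch of that batching pattern (illustrative, not the script's exact code; capture_namespace_logs is a hypothetical worker):

counter=0
while read -r ns ; do
  capture_namespace_logs "${ns}" &      # runs in the background
  counter=$((counter + 1))
  test 0 -eq $((counter % 6)) && wait   # block after each batch of 6 namespaces
done < namespaces.txt
wait  # drain the final partial batch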

Output Archive Format

The output archive by default is named kdiag.DATE.CLUSTER_CONTEXT.tar.gz.

Its contents have the following structure:

  • diag.log: the output of the debug capture process itself, enabling any capture problems to be identified.
  • podlog.*.log: for each namespace, logs streamed chronologically from all pods in the namespace.
  • nodestat.*.txt: for each node, the output of several inspection commands.
  • podbinding.list: a list of all pods and their binding to cluster nodes.
  • {resourcekind},{scope}.yaml: all resources of each Kubernetes Kind, where scope is global or namespaced.

Pod logs are streamed chronologically within namespaces, to ease debugging of cross-pod workload coordination errors.

Node component state is annotated with start/end comment lines beginning with #; ending comments are terminated with ==. This is intended to simplify extraction of key data in analysis scripts. Where possible, the output of common system utilities remains human-readable.
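
Those markers make sections easy to slice out with standard tools; for example, to extract the network interface section from a node's capture (nodestat.NODE.txt is a placeholder filename):

sed -n '/^# host network interfaces/,/^# end host network interfaces ==/p' nodestat.NODE.txt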

Testing Script Updates

Run the script against a non-production Kubernetes cluster. It should exit with code 0, and should not print any error lines.

The archive output should contain files as described above. Contents of the output require manual review, as no automated validator currently exists.

  • diag.log will show any errors in command arguments/execution, or their side effects.
  • For a healthy cluster, most files should not be empty.
  • Most errors resulting from bad kubectl commands will include the word "error" in the diagnostic log.
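
A quick scan along those lines (assuming the default kdiag output prefix):

grep -i error ./kdiag.*/diag.log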

Terminal output should look approximately like this:

Test output
% time  bash ./kubescan.sh all -d 20m -R

# Capturing API state. Tue Aug 26 14:11:35 PDT 2025 omniva-corp.teleport.sh-cody-fun-cluster
Reading global type: componentstatuses
Reading global type: namespaces
Reading global type: nodes
Reading global type: persistentvolumes
Reading global type: customresourcedefinitions.apiextensions.k8s.io
Reading global type: apiservices.apiregistration.k8s.io
Reading global type: clusterissuers.cert-manager.io
Reading global type: certificatesigningrequests.certificates.k8s.io
Reading global type: bgpconfigurations.crd.projectcalico.org
Reading global type: bgpfilters.crd.projectcalico.org
Reading global type: ingressclasses.networking.k8s.io
Reading global type: nodefeaturerules.nfd.k8s-sigs.io
Reading global type: runtimeclasses.node.k8s.io
Reading global type: numatopologies.nodeinfo.volcano.sh
Reading global type: clusterpolicies.nvidia.com
Reading global type: nvidiadrivers.nvidia.com
Reading global type: apiservers.operator.tigera.io
Reading global type: imagesets.operator.tigera.io
Reading global type: installations.operator.tigera.io
Reading global type: tigerastatuses.operator.tigera.io
Reading global type: bgpconfigurations.projectcalico.org
Reading global type: bgpfilters.projectcalico.org
Reading global type: bgppeers.projectcalico.org
Reading global type: blockaffinities.projectcalico.org
Reading global type: caliconodestatuses.projectcalico.org
Reading global type: clusterinformations.projectcalico.org
Reading global type: felixconfigurations.projectcalico.org
Reading global type: globalnetworkpolicies.projectcalico.org
Reading global type: globalnetworksets.projectcalico.org
Reading global type: hostendpoints.projectcalico.org
Reading global type: ipamconfigurations.projectcalico.org
Reading global type: ippools.projectcalico.org
Reading global type: ipreservations.projectcalico.org
Reading global type: kubecontrollersconfigurations.projectcalico.org
Reading global type: profiles.projectcalico.org
Reading global type: clusterrolebindings.rbac.authorization.k8s.io
Reading global type: clusterroles.rbac.authorization.k8s.io
Reading global type: priorityclasses.scheduling.k8s.io
Reading global type: csidrivers.storage.k8s.io
Reading global type: csinodes.storage.k8s.io
Reading global type: storageclasses.storage.k8s.io
Reading global type: volumeattachments.storage.k8s.io
Reading global type: clusterpolicyreports.wgpolicyk8s.io
# ... <snip>...
(waiting for global API capture to complete...)
Reading namespaced type: configmaps
Reading namespaced type: endpoints
Reading namespaced type: events
Reading namespaced type: limitranges
Reading namespaced type: persistentvolumeclaims
Reading namespaced type: pods
Reading namespaced type: podtemplates
Reading namespaced type: replicationcontrollers
Reading namespaced type: resourcequotas
Reading namespaced type: secrets
Reading namespaced type: serviceaccounts
Reading namespaced type: services
Reading namespaced type: challenges.acme.cert-manager.io
Reading namespaced type: orders.acme.cert-manager.io
Reading namespaced type: controllerrevisions.apps
Reading namespaced type: daemonsets.apps
Reading namespaced type: deployments.apps
Reading namespaced type: replicasets.apps
Reading namespaced type: statefulsets.apps
Reading namespaced type: horizontalpodautoscalers.autoscaling
Reading namespaced type: cronjobs.batch
Reading namespaced type: jobs.batch
Reading namespaced type: jobs.batch.volcano.sh
Reading namespaced type: commands.bus.volcano.sh
Reading namespaced type: certificaterequests.cert-manager.io
Reading namespaced type: certificates.cert-manager.io
Reading namespaced type: issuers.cert-manager.io
Reading namespaced type: leases.coordination.k8s.io
Reading namespaced type: networkpolicies.crd.projectcalico.org
Reading namespaced type: networksets.crd.projectcalico.org
Reading namespaced type: endpointslices.discovery.k8s.io
Reading namespaced type: events.events.k8s.io
Reading namespaced type: jobflows.flow.volcano.sh
Reading namespaced type: jobtemplates.flow.volcano.sh
Reading namespaced type: network-attachment-definitions.k8s.cni.cncf.io
Reading namespaced type: opentelemetrycollectors.opentelemetry.io
Reading namespaced type: poddisruptionbudgets.policy
Reading namespaced type: networkpolicies.projectcalico.org
Reading namespaced type: networksets.projectcalico.org
Reading namespaced type: rolebindings.rbac.authorization.k8s.io
Reading namespaced type: roles.rbac.authorization.k8s.io
Reading namespaced type: nodeslicepools.whereabouts.cni.cncf.io
Reading namespaced type: overlappingrangeipreservations.whereabouts.cni.cncf.io
# ... <snip>...
(waiting for namespaced API capture to complete...)
# Capturing node-local status. Tue Aug 26 14:11:46 PDT 2025 omniva-corp.teleport.sh-cody-fun-cluster
(waiting for node-local status capture...)
# Capturing pod logs. Tue Aug 26 14:11:47 PDT 2025 omniva-corp.teleport.sh-cody-fun-cluster
Fetching logs from namespace auth (current)
Fetching logs from namespace auth (previous)
Fetching logs from namespace default (current)
Fetching logs from namespace default (previous)
Fetching logs from namespace calico-apiserver (current)
Fetching logs from namespace calico-apiserver (previous)
Fetching logs from namespace calico-system (current)
Fetching logs from namespace calico-system (previous)
Fetching logs from namespace cert-manager (current)
Fetching logs from namespace cert-manager (previous)
Fetching logs from namespace clusterclass (current)
Fetching logs from namespace clusterclass (previous)
Fetching logs from namespace csi-lvm (current)
Fetching logs from namespace csi-lvm (previous)
(waiting for namespace log streams [6/30] ...)
# ... <snip>...
(waiting for namespace log streams [30/30] ...)
(waiting for log capture to complete...)
/tmp/klogs.ns.QXN0N2
tar: Removing leading '/' from member names
bash ./kubescan.sh all  28.94s user 19.98s system 117% cpu 41.802 total

Debugging

It's often useful to add -x to the bash invocation when tracing errors in this script.

bash -x kubescan.sh ...

Use the available verbs (logs, state, nodestat, ...) to target specific code paths.
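
For example, to trace only the node status path, redirecting stderr (which carries the trace) to a file:

bash -x kubescan.sh nodestat 2> kubescan.trace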

Compatibility

Currently tested on macOS with bash 3.2.57, kubectl 1.33, and Kubernetes API 1.31.

Shell utility arguments are validated against the GNU/Linux options of the equivalent commands. Shell utilities provided by the coreutils Homebrew package (version 9.7) are expected to be Linux-compatible.

#!/usr/bin/env bash
# A quick script to capture the state of a Kubernetes cluster, and logs of all pods, for analysis.
# The function print_help shows correct usage.
set -euf -o pipefail
# A random string label used to *exclude* an empty set -- implicitly including all real pods
export POD_LABEL_EXCLUDE="diagnostic/exclude!=$(dd if=/dev/urandom bs=1024 | base64 | head -c 32 | tr -d '/+=\r')"
# List of external programs required...
export DEPEND_PROGS="kubectl grep date base64 tar gzip dd head find xargs tee tr awk wc cut sort uniq"
# Limit the simultaneous number of namespace log streams.
# Each namespace counts as 1.
# NB: total streams is actually multiplied by 2, for fetching current and previous pod iterations.
export LOGSTREAM_LIMIT=${LOGSTREAM_LIMIT:-6}
export ERR_BAD_OPTIONS=2
export ERR_MISSING_DEPEND=3
export ERR_CAPTURE_FAIL=4
# Dependencies are expected to provide GNU/POSIX compatibility.
# BSD-only options are avoided to ensure usability on Linux;
# GNU-only options are avoided to ensure usability on macOS.
check_deps () {
local count=0
for k in ${DEPEND_PROGS} ; do
command -v "${k}" >/dev/null && continue
echo "Missing dependency: ${k}" >&2
count=$((count + 1))
done
command -v md5sum > /dev/null || command -v md5 > /dev/null || { echo "Missing dependency: md5sum (or md5)" >&2 ; count=$((count + 1)) ; }
kubectl node-shell -v > /dev/null 2>&1 || { echo "Missing dependency: kubectl plugin node-shell" >&2 ; count=$((count + 1)) ; }
return ${count}
}
fail () {
echo "$2" >&2
exit "$1"
}
get_kcontext () {
kubectl --kubeconfig="${KUBECONFIG:-}" config current-context
}
kgyaml () { # place -o yaml after the subcommand arguments, per conventional kubectl flag ordering
kubectl --kubeconfig="${KUBECONFIG}" "${@}" -o yaml
}
yaml_doc () {
echo '%YAML 1.1'
echo '--- #' "${1}"
cat
echo '...'
}
kgapireso () { # Fetch all resource kinds
kubectl --kubeconfig="${KUBECONFIG}" api-resources --verbs=list --namespaced="${1:-true}" -o name
}
kgnodereadiness () { # List nodes with ready status
kubectl --kubeconfig="${KUBECONFIG}" get nodes --output jsonpath="{range .items[*]}{.metadata.name} {range .status.conditions[?(@.type=='Ready')]}{'\t'}{@.status}{end}{'\n'}{end}"
}
kgnodenotready () { # List only NotReady nodes
kgnodereadiness | awk '{ if($2 != "True"){ print $1 } }'
}
kgnodeready () { # List only Ready nodes
kgnodereadiness | awk '{ if($2 == "True"){ print $1 } }'
}
kgnodelist_all () { # List all nodes regardless of Ready status
kgnodereadiness | awk '{ print $1 }'
}
klist_podnodebinding () { # List pods in every namespace and their bound node (or <none>)
kubectl --kubeconfig="${KUBECONFIG}" get pods --all-namespaces -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName' --no-headers
}
klistns () { # List all namespaces, one per line, applying any grep filter args given (e.g. -v -e '^kube-system$')
kubectl --kubeconfig="${KUBECONFIG}" get namespaces -o custom-columns='NS:.metadata.name' --no-headers | grep "${@:-.}"
}
# func localdebugscript is executed in a node-shell on each node host namespace.
# As this function is piped to a remote shell, lines should be terminated explicitly,
# to avoid misinterpretation of any whitespace compression.
# Output from this function is delimited by headers to make manual inspection easier,
# and output is adjusted in a few cases to simplify parsing in scripts.
# NB: this assumes an Ubuntu/Debian-based host OS using containerd and systemd networking.
localdebugscript () {
echo "# host name" ; hostname ; echo "# end host name ==" ;
echo "# host kernel" ; uname -a ; echo "# end host kernel ==" ;
echo "# host kernel modules"; lsmod ; echo "# end host kernel modules ==" ;
echo "# host network interfaces" ; ip addr; echo "# end host network interfaces ==" ;
echo "# host filesystems" ; df -hT | grep -e xfs -e ext4 -e vfat ; echo "# end host filesystems ==" ;
echo "# host process tree" ; ps auxf ; echo "# end host process tree ==" ;
echo "# host memory" ; free -m ; echo "# end host memory ==" ;
echo "# host routing table" ; route -v ; echo "# end host routing table == " ;
echo "# host network config" ; networkctl status -a ; echo "# end host network config ==" ;
echo "# host iptables" ; iptables-save ; echo "# end host iptables ==" ;
echo "# host nftables" ; nft list tables ; echo "# end host nftables ==";
echo "# host nftables ruleset" ; nft list ruleset ; echo "# end host nftables ruleset ==" ;
echo "# host pod list" ;
crictl pods ;
echo "# end host pod list ==" ;
echo "# host pod details" ;
crictl pods | tail -n +2 | awk '{ print $1 }' | xargs crictl inspectp | sed 's@}{@};\n{@g';
echo "# end host pod details ==" ;
echo "# host container list" ;
crictl ps;
echo "# end host container list ==" ;
echo "# host container details ==" ;
crictl ps | tail -n +2 | awk '{ print $1 }' | xargs crictl inspect | sed 's@}{@};\n{@g' ;
echo "# end host container details ==" ;
echo "# host package list ==" ;
dpkg -l ;
echo "# end host package list==" ;
}
pass_remote_script_call () {
# for input to a remote shell, kinda like eval()
declare -f localdebugscript && echo -e "\nlocaldebugscript"
}
knode_dumplocal() { # Run the local debug script on a given node
kubectl node-shell -v || return $?
if ! pass_remote_script_call | kubectl node-shell -n kube-system "${1}" -- bash; then
echo -e "\nERROR: Node \"${1}\" failed to respond to diagnostic scrape. (err:${ERR_CAPTURE_FAIL}, node=${1})\n" >&2
fi
}
label_excluded_pods () { # Apply or remove the exclusion label to/from all pods on nodes which are in NotReady status.
local do_label=NONE
case "${1}" in
add) do_label="$(echo "${POD_LABEL_EXCLUDE}" | tr -d '!')" ;;
remove) do_label="$(echo "${POD_LABEL_EXCLUDE}" | grep -oE '^([^=]*)=' | tr -d '!=')-" ;;
esac
while read -r node ; do
echo "# Excluding pods from node: ${node}"
kubectl --kubeconfig="${KUBECONFIG}" label pods --overwrite=true --all-namespaces --field-selector=spec.nodeName="${node}" "${do_label}"
done
}
capture_node_state_justone () {
# If the pod cannot be started, a timeout error will be logged;
# in that case, the node name is later extracted into the list of faulty nodes.
echo "# Starting node-local scrape: ${1}" > "${2}"
knode_dumplocal "${1}" >> "${2}" 2>>"${2}"
}
capture_node_state () { # For each node name on <stdin>, capture local state into a node-specific file
echo "# Capturing node-local status. $(date) $(get_kcontext)" | tee -a "${1}/diag.log" >&2
while read -r node; do
capture_node_state_justone "${node}" "${1}/nodestat.${node}.txt" &
done
echo "(waiting for node-local status capture...)" >&2
wait
}
extract_dead_node_list () {
# Extract the list of nodes that failed the local diagnostic scrape.
echo "# Capturing node list of nodes failing to run local diagnostic capture." >&2
find "${1}" -type f -name 'nodestat.*.txt' | \
(xargs grep '^ERROR: Node' || true) | \
(grep -oE '\(err:[0-9]+, node=(.*)\)$' || true) | \
tr -d ')' | cut -d = -f 2
}
filter_resource_types () { # Exclude Secrets by default.
declare -a exclude_types
test "${INCLUDE_SECRETS:-false}" = "true" || exclude_types+=('-e secret')
# No excludes -- just pass through
test "${#exclude_types[@]}" -lt 1 && cat && return
# filter out excluded types
grep -v -i "${exclude_types[@]}"
}
dump_node_state () { # Capture node state from ready/not-ready nodes per exclusion option
# args: <outputdir> <skip-notready> <node-list>
case "${2}" in # NB: true = skip not-ready nodes, use only ready nodes
true) kgnodeready | capture_node_state "${1}";;
false) kgnodelist_all | capture_node_state "${1}";;
esac
}
dump_all_k8s_resources () { # Capture all API state from available resource kinds (native & CRD)
echo "# Capturing API state. $(date) $(get_kcontext)" | tee -a "${1}/diag.log" >&2
klist_podnodebinding > "${1}/podbinding.list"
while read -r t; do
echo "Reading global type: ${t}" | tee -a "${1}/diag.log" >&2
kgyaml get "${t}" 2>>"${1}/diag.log" | yaml_doc "${t}" > "${1}/${t},global.yaml" &
done < <(kgapireso false)
echo "(waiting for global API capture to complete...)" >&2
wait
while read -r t; do
echo "Reading namespaced type: ${t}" | tee -a "${1}/diag.log" >&2
kgyaml get "${t}" --all-namespaces 2>>"${1}/diag.log" | yaml_doc "${t}" > "${1}/${t},namespaced.yaml" &
done < <(kgapireso true | filter_resource_types)
echo "(waiting for namespaced API capture to complete...)" >&2
wait
}
dump_all_podlogs () { # Capture current and previous logs from all pods in all namespaces (allowing exclusions)
local outpath="${1}" ;
local duration="${2:-2h}" ;
shift 2;
local counter=0
local maxpending=${LOGSTREAM_LIMIT}
local nsfile="$(mktemp '/tmp/klogs.ns.XXXXXX')"
echo "# Capturing pod logs. $(date) $(get_kcontext)" | tee -a "${outpath}/diag.log" >&2
klistns "${@}" > "${nsfile}" # List namespaces w/ exclusion filter
totalns=$(wc -l "${nsfile}" | awk '{ print $1 }' )
while read -r ns; do # For each namespace
for p in false true; do # Current & previous pod instances
suffix=0
test "${p}" = "false" && suffix="current"
test "${p}" = "true" && suffix="previous"
echo "Fetching logs from namespace ${ns} (${suffix})" | tee -a "${outpath}/diag.log" >&2
kubectl logs -n "${ns}" --kubeconfig="${KUBECONFIG}" \
--max-log-requests=100 \
--selector="${POD_LABEL_EXCLUDE}" \
--since="${duration}" \
--all-containers=true \
--all-pods=true \
--prefix=true \
--timestamps=true \
--pod-running-timeout=30s \
--ignore-errors=true \
--request-timeout=30s \
--previous=${p} 2>>"${outpath}/diag.log" >"${outpath}/podlog.${ns}.${suffix}.log" &
done
counter=$((counter+1))
test 0 -eq $((counter % maxpending)) && echo "(waiting for namespace log streams [${counter}/${totalns}] ...)" >&2 && wait
done < "${nsfile}"
echo "(waiting for log capture to complete...)" >&2
wait
# clean temp file
rm -rvf "${nsfile}" | tee -a "${outpath}/diag.log" >&2
}
_md5sum () { # prints a command name/format for compatibility, producing similar MD5 checksum output on macOS and Linux.
command -v md5sum > /dev/null && echo md5sum "${@:-}" && return
command -v md5 > /dev/null && echo md5 -r "${@:-}" && return
}
make_archive () { # Checksum the files in a directory, then create a tarball, and checksum the tarball.
test -d "${1}" || fail 31 "Not a directory: ${1}"
find "${1}" -type f -name '*.log' -o -name '*.yaml' -print0 | xargs -0 $(_md5sum) > "${1}/MD5SUMS"
tar -czf "${1}.tar.gz" "${1}"
$(_md5sum) "${1}.tar.gz" > "${1}.tar.gz.md5"
}
print_help () {
cat <<EOF
Usage: ${0} <verb> [options] [flags]
Verbs
all - does everything below in the right order
dump - dumps everything (logs & state) to a directory, without archiving
state - dumps API state to directory, without archiving
nodestat - dumps node status to a directory, without archiving
logs - dumps pod logs to a directory, without archiving
archive - makes an archive of a directory
Options
-o <directory> Dump output to this path. By default this will be auto-generated, with a prefix "kdiag".
-i <directory> Only for archiving; the directory of inputs, produced by a previous dump action.
-d <duration> Extract a specific time duration from logs, e.g. "12h", "120m". Units are h, m, or s.
-N <namespace> Exclude a namespace from log collection; can be used multiple times.
Default log capture duration is 6 hours.
Flags
-S Include secrets in the API state output. This is disabled by default.
-R Skip pods which are scheduled on nodes which are NotReady. By default, all nodes will be included.
This script requires a variety of common shell utilities to be present:
${DEPEND_PROGS}
For macOS/BSD, also "md5"; and for Linux, "md5sum".
Node-local diagnostics also require the use of the kubectl plugin "node-shell";
https://github.com/kvaps/kubectl-node-shell
Most users will be well-served by simply running:
${0} all
This produces an archive (tar.gz) that can be shared with a technician for analysis of your Kubernetes cluster.
This script REQUIRES that the environment variable KUBECONFIG be set explicitly, to ensure that diagnostic dumps
are captured from the desired cluster.
The current context of your KUBECONFIG is:
KUBECONFIG=${KUBECONFIG:-}
context: $(get_kcontext)
EOF
}
main () {
local DIRECTORY_OUT=NONE DIRECTORY_IN=NONE DURATION=6h INCLUDE_SECRETS=false
local SKIP_BAD_NODES=false
local verb="${1:-NONE}" ; shift ;
local -a exclude_namespaces=(-e NONE) # sentinel pattern; matches no real namespace
while getopts ':o:i:d:N:SR' OPT ; do
case $OPT in
o) DIRECTORY_OUT=${OPTARG} ;;
i) DIRECTORY_IN=${OPTARG} ;;
d) DURATION=${OPTARG} ;;
S) INCLUDE_SECRETS=true ;;
N) exclude_namespaces+=(-e "^${OPTARG}\$") ;; # separate words, so grep receives -e and the anchored pattern distinctly
R) SKIP_BAD_NODES=true ;;
*) print_help && check_deps ; exit $? ;;
esac
done
case ${verb} in
all|dump|state|logs|nodestat|archive) : ;;
*) print_help && check_deps ; exit $? ;;
esac
test -z "${KUBECONFIG}" && fail "${ERR_BAD_OPTIONS}" "Environment var KUBECONFIG is not defined; this must refer to your current kubectl configuration file"
test -z "$(get_kcontext)" && fail "${ERR_BAD_OPTIONS}" "No kubectl context is currently defined; do you have clusters in your kubeconfig?"
# Fail verbosely and early when dependencies are missing
command -v kubectl >/dev/null || fail "${ERR_MISSING_DEPEND}" "Missing critical dependency: kubectl"
kubectl node-shell -v >/dev/null || fail "${ERR_MISSING_DEPEND}" "Missing critical dependency: kubectl plugin node-shell"
check_deps || fail "${ERR_MISSING_DEPEND}" "Missing dependencies; required: ${DEPEND_PROGS}."
test "NONE" = "${DIRECTORY_OUT}" && DIRECTORY_OUT="${PWD}/${DIRPREFIX:-kdiag}.$(date +%Y%m%d).$(get_kcontext)"
test "NONE" = "${DIRECTORY_IN}" && DIRECTORY_IN="${DIRECTORY_OUT}"
case "${DURATION:${#DURATION}-1:1}" in
h|m|s) : ;;
*) fail "${ERR_BAD_OPTIONS}" "Invalid duration; must be <integer>{h|m|s}." ;;
esac
case "${verb}" in
dump|logs|state|nodestat|all|archive) # archive included so the diag.log writes below cannot fail
mkdir -p "${DIRECTORY_OUT}" ;;
esac
echo "# Diagnostic script started. $(date) $(hostname)" >> "${DIRECTORY_OUT}/diag.log"
echo "# Options: action=${verb} duration=${DURATION} skip_notready=${SKIP_BAD_NODES}" \
"secrets=${INCLUDE_SECRETS} namespace=(${exclude_namespaces[@]})" >> "${DIRECTORY_OUT}/diag.log"
case "${verb}" in
dump|state|all)
dump_all_k8s_resources "${DIRECTORY_OUT}" ;;
esac
export FAIL_NODE_LIST="${DIRECTORY_OUT}/failnodes.txt"
touch "${FAIL_NODE_LIST}" # ensure the file exists even when the nodestat scope was skipped (e.g. verb "logs")
case "${verb}" in
dump|all|nodestat)
dump_node_state "${DIRECTORY_OUT}" "${SKIP_BAD_NODES}" 2>&1 | tee -a "${DIRECTORY_OUT}/diag.log";
extract_dead_node_list "${DIRECTORY_OUT}" > "${FAIL_NODE_LIST}"
;;
esac
case "${verb}" in
dump|logs|all)
# Apply exclusion label to pods on Not-Ready nodes if the option is not set to skip
test true = "${SKIP_BAD_NODES}" && kgnodenotready >> "${FAIL_NODE_LIST}"
sort -u "${FAIL_NODE_LIST}" | label_excluded_pods add 2>&1 | tee -a "${DIRECTORY_OUT}/diag.log"
# Capture pod logs for all not-excluded pods
dump_all_podlogs "${DIRECTORY_OUT}" "${DURATION}" -v "${exclude_namespaces[@]}"
# Remove exclusion label from pods on Not-Ready nodes
sort -u "${FAIL_NODE_LIST}" | label_excluded_pods remove 2>&1 | tee -a "${DIRECTORY_OUT}/diag.log"
;;
esac
wait
case "${verb}" in
archive|all)
make_archive "${DIRECTORY_IN}" ;;
esac
}
main "${@:-}"