Objectives:
- Simple workflow to capture state from the Kubernetes API, pod logs, and node components.
- Maximize coverage of fault modes in common cluster environments.
- Maximize compatibility with user runtime environments (macOS & Linux).
- Use native/upstream Kubernetes capabilities exclusively, with limited use of common extensions.
- Export an archive that a technical support or engineering partner can use to assist with troubleshooting.
The intended user for this utility is a Kubernetes user with full access to their cluster and their node environment.
The use of node-shell requires that the user be able to deploy privileged pods in the kube-system namespace.
The user should upload the resulting tar.gz file to a repository the technician can see, and/or transmit it through shared communication channels.
The authoritative flags/options are defined in the shell script function print_help.
Examples follow.
export KUBECONFIG=~/.kube/config # you should define this explicitly.
bash kubescan.sh all
This produces a compressed tarball (.tar.gz file) containing the full capture of logs and state from all supported scopes.
If a user needs to remove sensitive data from the log files or other outputs, the workflow can have a manual step injected, like so:
export KUBECONFIG=...
bash kubescan.sh dump
filter_logs ./kdiag.*/*.log # user defined TODO
filter_state ./kdiag.*/*.yaml # user defined TODO
filter_nodestat ./kdiag.*/*.txt # user defined TODO
bash kubescan.sh archive
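The filter_* helpers above are left to the user. As a minimal sketch of what one might look like, the following defines a hypothetical filter_logs that masks a few credential-like patterns in place. The patterns (password=, token=, Bearer) are illustrative assumptions, not a complete list of what may be sensitive in your cluster.

```shell
# Hypothetical user-defined filter: redact credential-like strings in place.
# The patterns below are assumptions for illustration only.
filter_logs() {
  local f tmp
  for f in "$@"; do
    [ -f "$f" ] || continue
    tmp="$f.tmp.$$"
    # Write to a temp file and move it back, because sed -i differs
    # between BSD (macOS) and GNU (Linux) sed.
    sed -E \
      -e 's/((password|token)=)[^[:space:]&"]+/\1REDACTED/g' \
      -e 's#(Bearer )[A-Za-z0-9._~+/=-]+#\1REDACTED#g' \
      "$f" > "$tmp" && mv "$tmp" "$f"
  done
}
```

Review the filtered files by hand before running the archive step; pattern-based redaction can miss secrets that appear in unexpected formats.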
A user can explicitly capture individual scopes like so:
export KUBECONFIG=...
bash kubescan.sh state # capture only API state
bash kubescan.sh nodestat # capture only node component states
bash kubescan.sh archive # create the tarball
To facilitate troubleshooting, this script by default captures a wide scope of artifacts for analysis.
- API state: all resources of all kinds, including CRDs, in all namespaces; but excluding Secrets.
- Pod logs: all logs of all pods on every node, in all namespaces.
- Node component status: kernel version & modules; network configuration, routes, interfaces, and firewall rules; containers and pods reported by the CRI; filesystem usage; process tree.
Namespaces can be excluded from pod log capture using the -N <namespace> option; this can be used multiple times to exclude multiple namespaces.
This does not restrict API state capture, as troubleshooting of critical systems often requires inspection of components in various namespaces.
The duration of log capture defaults to 6 hours. This can be overridden using the -d <duration> option.
This option takes a format of "N{h|m|s}" for hours, minutes, and seconds respectively, for example 120m.
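To make the duration format concrete, here is a small sketch that validates an N{h|m|s} value and converts it to seconds. The function name duration_to_seconds is hypothetical and is not part of kubescan.sh; the script's own parsing may differ.

```shell
# Hypothetical helper: validate "N{h|m|s}" and print the value in seconds.
# Returns 1 (and prints nothing) when the format does not match.
duration_to_seconds() {
  local d="$1" n unit
  n="${d%[hms]}"        # strip one trailing unit letter, if present
  unit="${d#"$n"}"      # whatever was stripped (empty if no unit)
  case "$n" in
    ''|*[!0-9]*) return 1 ;;   # N must be one or more digits
  esac
  # Drop leading zeros so the shell does not read N as octal.
  while [ "${n#0}" != "$n" ] && [ "${#n}" -gt 1 ]; do n="${n#0}"; done
  case "$unit" in
    h) echo $(( n * 3600 )) ;;
    m) echo $(( n * 60 )) ;;
    s) echo "$n" ;;
    *) return 1 ;;
  esac
}
```

For example, duration_to_seconds 120m prints 7200, while a value without a unit, such as 10, is rejected.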
In some failure modes, nodes will be unresponsive to log requests, e.g. if they've failed or crashed and kubelet is dead.
This results in very long retry loops to poll pods that are unavailable.
To work around this, the -R flag restricts the pod selection for log polling to only nodes reporting Ready status.
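The same kind of selection can be approximated by hand. The sketch below filters the default table output of kubectl get nodes down to nodes whose STATUS column reports Ready. The function name ready_nodes is hypothetical, and this is not necessarily how the script implements -R.

```shell
# Hypothetical helper: read "kubectl get nodes --no-headers" output on stdin
# and print the names of nodes whose STATUS column reports Ready.
# A cordoned node shows as "Ready,SchedulingDisabled" and is still included.
ready_nodes() {
  awk '$2 == "Ready" || $2 ~ /^Ready,/ { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get nodes --no-headers | ready_nodes
```

Pods on a given node can then be listed with kubectl get pods -A --field-selector spec.nodeName=<node>.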
If you need to include Secrets in your API state capture, you can activate this using the -S option.
Be aware that this may expose sensitive credentials wherever you transmit the archive.
To optimize running duration of this script, several scopes support parallel fetching. To limit the load on the API server, serialization is enforced between scopes, and some scopes have batch controls.
Each group below runs in isolation from the others, one after another; work within a group runs in parallel.
- All global Kubernetes resources are fetched in parallel.
- All namespaced Kubernetes resources are fetched in parallel.
- All node component dumps are executed in parallel across all nodes.
- Pod logs are captured in batches of 6 namespaces, streamed in parallel across all pods in each namespace.
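The batching pattern for pod logs can be sketched with background jobs and wait, which works on bash 3.2. The names fetch_in_batches and BATCH_SIZE are hypothetical, and the per-namespace command is passed in by the caller; the script's actual implementation may differ.

```shell
# Hypothetical sketch: run a per-namespace command over a list of namespaces,
# BATCH_SIZE at a time. Each batch runs in parallel; batches run serially.
BATCH_SIZE=6

fetch_in_batches() {
  local fetch_cmd="$1"; shift
  local count=0 ns
  for ns in "$@"; do
    "$fetch_cmd" "$ns" &          # start this namespace in the background
    count=$(( count + 1 ))
    if [ $(( count % BATCH_SIZE )) -eq 0 ]; then
      wait                        # block until the whole batch has finished
    fi
  done
  wait                            # wait for the final, possibly partial batch
}
```

Waiting on the whole batch is simpler than keeping exactly six jobs in flight, at the cost of idling while the slowest namespace in a batch finishes.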
The output archive by default is named kdiag.DATE.CLUSTER_CONTEXT.tar.gz.
Its content has the following structure:
- diag.log: the output of the debug capture process itself, enabling any capture problems to be identified.
- podlog.*.log: for all namespaces, logs streamed chronologically for all pods in the namespace.
- nodestat.*.txt: for all nodes, the output of several inspection commands.
- podbinding.list: list of all pods and their binding to cluster nodes.
- {resourcekind},{scope}.yaml: all resources of each Kubernetes Kind, where scope is global or namespaced.
Pod logs are streamed chronologically within namespaces, to ease debugging of cross-pod workload coordination errors.
Node component state is annotated with start/end comments, lines starting with "#"; ending comments are terminated with "==".
This is intended to simplify extraction of key data in analysis scripts.
Where possible, output from common system utilities is kept in a form that remains human-readable.
Run the script against a non-production Kubernetes cluster. It should exit with code 0, and should not print any error lines.
The archive output should contain files as above. Contents of the output require manual review, as no automated validator currently exists.
- diag.log will show any errors in command arguments/execution, or their side effects.
- For a healthy cluster, most files shouldn't be empty.
- Most errors resulting from bad kubectl commands will include the word "error" in the diagnostic log.
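Until a validator exists, part of the manual review can be scripted. The following sketch checks a capture directory for the two signals described above: the word "error" in diag.log, and empty files. The function name check_capture is hypothetical, and treating any empty file as a failure is an assumption that is stricter than "most files shouldn't be empty".

```shell
# Hypothetical review helper: flag error lines in diag.log and empty files.
# Returns 0 when nothing suspicious is found, 1 otherwise.
check_capture() {
  local dir="$1" status=0 empty
  if [ ! -s "$dir/diag.log" ]; then
    echo "diag.log is missing or empty"
    status=1
  elif grep -qi 'error' "$dir/diag.log"; then
    echo "diag.log mentions errors:"
    grep -i 'error' "$dir/diag.log"
    status=1
  fi
  # Assumption: any zero-byte file is worth a look on a healthy cluster.
  empty=$(find "$dir" -type f -size 0)
  if [ -n "$empty" ]; then
    echo "empty files:"
    echo "$empty"
    status=1
  fi
  return "$status"
}
```

Point it at the capture directory before running the archive step; a non-zero exit only means something deserves a closer look.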
Terminal output should look roughly like this.
Test output
% time bash ./kubescan.sh all -d 20m -R
# Capturing API state. Tue Aug 26 14:11:35 PDT 2025 omniva-corp.teleport.sh-cody-fun-cluster
Reading global type: componentstatuses
Reading global type: namespaces
Reading global type: nodes
Reading global type: persistentvolumes
Reading global type: customresourcedefinitions.apiextensions.k8s.io
Reading global type: apiservices.apiregistration.k8s.io
Reading global type: clusterissuers.cert-manager.io
Reading global type: certificatesigningrequests.certificates.k8s.io
Reading global type: bgpconfigurations.crd.projectcalico.org
Reading global type: bgpfilters.crd.projectcalico.org
Reading global type: ingressclasses.networking.k8s.io
Reading global type: nodefeaturerules.nfd.k8s-sigs.io
Reading global type: runtimeclasses.node.k8s.io
Reading global type: numatopologies.nodeinfo.volcano.sh
Reading global type: clusterpolicies.nvidia.com
Reading global type: nvidiadrivers.nvidia.com
Reading global type: apiservers.operator.tigera.io
Reading global type: imagesets.operator.tigera.io
Reading global type: installations.operator.tigera.io
Reading global type: tigerastatuses.operator.tigera.io
Reading global type: bgpconfigurations.projectcalico.org
Reading global type: bgpfilters.projectcalico.org
Reading global type: bgppeers.projectcalico.org
Reading global type: blockaffinities.projectcalico.org
Reading global type: caliconodestatuses.projectcalico.org
Reading global type: clusterinformations.projectcalico.org
Reading global type: felixconfigurations.projectcalico.org
Reading global type: globalnetworkpolicies.projectcalico.org
Reading global type: globalnetworksets.projectcalico.org
Reading global type: hostendpoints.projectcalico.org
Reading global type: ipamconfigurations.projectcalico.org
Reading global type: ippools.projectcalico.org
Reading global type: ipreservations.projectcalico.org
Reading global type: kubecontrollersconfigurations.projectcalico.org
Reading global type: profiles.projectcalico.org
Reading global type: clusterrolebindings.rbac.authorization.k8s.io
Reading global type: clusterroles.rbac.authorization.k8s.io
Reading global type: priorityclasses.scheduling.k8s.io
Reading global type: csidrivers.storage.k8s.io
Reading global type: csinodes.storage.k8s.io
Reading global type: storageclasses.storage.k8s.io
Reading global type: volumeattachments.storage.k8s.io
Reading global type: clusterpolicyreports.wgpolicyk8s.io
# ... <snip>...
(waiting for global API capture to complete...)
Reading namespaced type: configmaps
Reading namespaced type: endpoints
Reading namespaced type: events
Reading namespaced type: limitranges
Reading namespaced type: persistentvolumeclaims
Reading namespaced type: pods
Reading namespaced type: podtemplates
Reading namespaced type: replicationcontrollers
Reading namespaced type: resourcequotas
Reading namespaced type: secrets
Reading namespaced type: serviceaccounts
Reading namespaced type: services
Reading namespaced type: challenges.acme.cert-manager.io
Reading namespaced type: orders.acme.cert-manager.io
Reading namespaced type: controllerrevisions.apps
Reading namespaced type: daemonsets.apps
Reading namespaced type: deployments.apps
Reading namespaced type: replicasets.apps
Reading namespaced type: statefulsets.apps
Reading namespaced type: horizontalpodautoscalers.autoscaling
Reading namespaced type: cronjobs.batch
Reading namespaced type: jobs.batch
Reading namespaced type: jobs.batch.volcano.sh
Reading namespaced type: commands.bus.volcano.sh
Reading namespaced type: certificaterequests.cert-manager.io
Reading namespaced type: certificates.cert-manager.io
Reading namespaced type: issuers.cert-manager.io
Reading namespaced type: leases.coordination.k8s.io
Reading namespaced type: networkpolicies.crd.projectcalico.org
Reading namespaced type: networksets.crd.projectcalico.org
Reading namespaced type: endpointslices.discovery.k8s.io
Reading namespaced type: events.events.k8s.io
Reading namespaced type: jobflows.flow.volcano.sh
Reading namespaced type: jobtemplates.flow.volcano.sh
Reading namespaced type: network-attachment-definitions.k8s.cni.cncf.io
Reading namespaced type: opentelemetrycollectors.opentelemetry.io
Reading namespaced type: poddisruptionbudgets.policy
Reading namespaced type: networkpolicies.projectcalico.org
Reading namespaced type: networksets.projectcalico.org
Reading namespaced type: rolebindings.rbac.authorization.k8s.io
Reading namespaced type: roles.rbac.authorization.k8s.io
Reading namespaced type: nodeslicepools.whereabouts.cni.cncf.io
Reading namespaced type: overlappingrangeipreservations.whereabouts.cni.cncf.io
# ... <snip>...
(waiting for namespaced API capture to complete...)
# Capturing node-local status. Tue Aug 26 14:11:46 PDT 2025 omniva-corp.teleport.sh-cody-fun-cluster
(waiting for node-local status capture...)
# Capturing pod logs. Tue Aug 26 14:11:47 PDT 2025 omniva-corp.teleport.sh-cody-fun-cluster
Fetching logs from namespace auth (current)
Fetching logs from namespace auth (previous)
Fetching logs from namespace default (current)
Fetching logs from namespace default (previous)
Fetching logs from namespace calico-apiserver (current)
Fetching logs from namespace calico-apiserver (previous)
Fetching logs from namespace calico-system (current)
Fetching logs from namespace calico-system (previous)
Fetching logs from namespace cert-manager (current)
Fetching logs from namespace cert-manager (previous)
Fetching logs from namespace clusterclass (current)
Fetching logs from namespace clusterclass (previous)
Fetching logs from namespace csi-lvm (current)
Fetching logs from namespace csi-lvm (previous)
(waiting for namespace log streams [6/30] ...)
# ... <snip>...
(waiting for namespace log streams [30/30] ...)
(waiting for log capture to complete...)
/tmp/klogs.ns.QXN0N2
tar: Removing leading '/' from member names
bash ./kubescan.sh all 28.94s user 19.98s system 117% cpu 41.802 total
It's often useful to add -x to the bash command arguments when tracing errors in this script.
bash -x kubescan.sh ...
Use the available verbs to target specific code paths (logs, state, nodestat, ...).
Currently tested on macOS with bash 3.2.57, kubectl 1.33, Kubernetes API 1.31.
Shell utility command arguments are validated against GNU/Linux options on the equivalent commands.
Shell utilities provided by the coreutils Homebrew package version 9.7 are expected to be Linux-compatible.