I encountered an issue with a Redis StatefulSet in our staging environment where the pods were stuck in a crashloop. The root cause was a Persistent Volume Claim (PVC) that had run out of space. Since this was cache data in a staging environment, I needed to clear the /data folder to resolve the issue.
The challenge was that I couldn't exec into the pod directly because the Redis container was continuously crashlooping, making standard troubleshooting approaches impossible.
Here's how I resolved the issue by accessing the PVC data directly through the Kubernetes node filesystem.
First, I checked which PVC the crashing Redis pod was using:
$ kubectl get po redis-replicas-0 -o yaml | grep persistentVolume -A 1
persistentVolumeClaim:
claimName: redis-data-redis-replicas-0
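If you'd rather skip the YAML grep, a jsonpath query should return the claim name directly:
$ kubectl get po redis-replicas-0 -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'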
Then I got the volume name from the PVC:
$ kubectl get pvc redis-data-redis-replicas-0
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
redis-data-redis-replicas-0 Bound pvc-d7cfd03c-d389-4887-958c-0be96bbda095 8Gi RWO gp3 <unset> 66m
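The same value can be pulled without the table output; this jsonpath variant should print only the bound volume name:
$ kubectl get pvc redis-data-redis-replicas-0 -o jsonpath='{.spec.volumeName}'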
Next, I determined which Kubernetes node was hosting the problematic pod:
$ kubectl get po -o wide | grep replicas-0
redis-replicas-0 2/2 Running 14 (24m ago) 53m 10.15.114.123 ip-10-15-113-176.ec2.internal <none> <none>
The pod was running on node ip-10-15-113-176.ec2.internal.
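If the wide output is hard to scan, the node name can also be queried directly:
$ kubectl get po redis-replicas-0 -o jsonpath='{.spec.nodeName}'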
I created a debug pod on the specific node to access its filesystem:
$ kubectl debug node/ip-10-15-113-176.ec2.internal -it --image=ubuntu
Creating debugging pod node-debugger-ip-10-15-113-176.ec2.internal-6k9rt with container debugger on node ip-10-15-113-176.ec2.internal.
Using findmnt, I located the mounted volume path for our specific PVC (the node's root filesystem is mounted at /host inside the debug container, which is why the paths below carry that prefix):
root@ip-10-15-113-176:/# findmnt | grep pvc-d7cfd03c-d389-4887-958c-0be96bbda095
| | |-/host/var/lib/kubelet/pods/3fa8e408-753a-493a-8df1-f77962535f3d/volumes/kubernetes.io~csi/pvc-d7cfd03c-d389-4887-958c-0be96bbda095/mount /dev/nvme4n1 ext4 rw,relatime,context=system_u:object_r:local_t:s0
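Before deleting anything, a quick df -h against the discovered mount path can confirm the volume really is out of space (output omitted here):
root@ip-10-15-113-176:/# df -h /host/var/lib/kubelet/pods/3fa8e408-753a-493a-8df1-f77962535f3d/volumes/kubernetes.io~csi/pvc-d7cfd03c-d389-4887-958c-0be96bbda095/mount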
Finally, I navigated to the mount point and removed the Redis data files:
root@ip-10-15-113-176:/# cd /host/var/lib/kubelet/pods/3fa8e408-753a-493a-8df1-f77962535f3d/volumes/kubernetes.io~csi/pvc-d7cfd03c-d389-4887-958c-0be96bbda095/mount
root@ip-10-15-113-176:/host/var/lib/kubelet/pods/.../mount# ls
appendonly.aof dump.rdb lost+found
root@ip-10-15-113-176:/host/var/lib/kubelet/pods/.../mount# rm appendonly.aof dump.rdb
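With the data files removed, deleting the crashing pod lets the StatefulSet controller recreate it immediately against the now-empty volume instead of waiting out the crashloop backoff, and the throwaway debugger pod can be cleaned up as well (pod names taken from the outputs above):
$ kubectl delete pod redis-replicas-0
$ kubectl delete pod node-debugger-ip-10-15-113-176.ec2.internal-6k9rt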
This approach is particularly useful when:
- Pods are crashlooping and you can't exec into them
- You need to access PVC data directly for troubleshooting
- Standard kubectl commands aren't sufficient for the task
- You're certain the data can be safely deleted (like cache data)
- You're working in a non-production environment
- You understand the implications of data loss
Alternative Approaches: In production environments, consider:
- Scaling up the PVC if possible (see the sketch after this list)
- Creating data backups before cleanup
- Using Redis-specific commands to clear cache when the pod is accessible (also sketched below)
- Implementing monitoring to prevent PVC space issues
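As a sketch of the first alternative, the PVC can be resized in place, assuming the gp3 StorageClass has allowVolumeExpansion enabled (16Gi is just an example target size):
$ kubectl patch pvc redis-data-redis-replicas-0 -p '{"spec":{"resources":{"requests":{"storage":"16Gi"}}}}'
And when the pod is still reachable, Redis itself can drop the cached keys (the container name redis is an assumption here):
$ kubectl exec -it redis-replicas-0 -c redis -- redis-cli FLUSHALL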
When standard Kubernetes troubleshooting methods fail due to crashlooping pods, the node debug approach provides a powerful way to access and resolve PVC-related issues. This method saved significant time by allowing direct access to the Redis data files, ultimately resolving the crashloop and restoring service functionality.