Troubleshooting Redis StatefulSet: Clearing Data from a PVC When Pods Are Crashlooping

The Problem

I encountered an issue with a Redis StatefulSet in our staging environment where the pods were stuck in a crashloop. The root cause was a Persistent Volume Claim (PVC) that had run out of space. Since the volume only held cache data in a staging environment, I needed to clear the /data folder to resolve the issue.

The challenge was that I couldn't exec into the pod directly because the Redis container was continuously crashlooping, making standard troubleshooting approaches impossible.
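
If you hit something similar, the crash reason is usually visible in the logs of the previous container instance and in the pod's events. For example (assuming the Redis container is named redis, as in the Bitnami chart; adjust to your setup):

$ kubectl logs redis-replicas-0 -c redis --previous
$ kubectl describe po redis-replicas-0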

The Solution

Here's how I resolved the issue by accessing the PVC data directly through the Kubernetes node filesystem.

Step 1: Identify the Problematic PVC

First, I looked up the PVC name for my crashing Redis pod:

kubectl get po redis-replicas-0 -o yaml | grep persistentVolume -A 1
    persistentVolumeClaim:
      claimName: redis-data-redis-replicas-0
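
The same information can be pulled out with a jsonpath query instead of grep (assuming the pod mounts a single PVC):

$ kubectl get po redis-replicas-0 -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'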

Then I got the volume name from the PVC:

$ kubectl get pvc redis-data-redis-replicas-0
NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
redis-data-redis-replicas-0   Bound    pvc-d7cfd03c-d389-4887-958c-0be96bbda095   8Gi        RWO            gp3            <unset>                 66m
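
If you only need the volume name, for example in a script, jsonpath works here too:

$ kubectl get pvc redis-data-redis-replicas-0 -o jsonpath='{.spec.volumeName}'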

Step 2: Locate the Node Running the Pod

Next, I determined which Kubernetes node was hosting the problematic pod:

$ kubectl get po -o wide | grep replicas-0
redis-replicas-0   2/2   Running   14 (24m ago)   53m   10.15.114.123   ip-10-15-113-176.ec2.internal   <none>   <none>

The pod was running on node ip-10-15-113-176.ec2.internal.
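
An equivalent one-liner, if you'd rather not grep the wide output:

$ kubectl get po redis-replicas-0 -o jsonpath='{.spec.nodeName}'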

Step 3: Debug the Node to Access the PVC

I created a debug pod on the specific node to access its filesystem:

$ kubectl debug node/ip-10-15-113-176.ec2.internal -it --image=ubuntu
Creating debugging pod node-debugger-ip-10-15-113-176.ec2.internal-6k9rt with container debugger on node ip-10-15-113-176.ec2.internal.
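
kubectl debug node mounts the node's root filesystem at /host inside the debug container, which is why the paths below are prefixed with /host. If you'd rather work with the node's real paths, you can chroot into it (assuming the node image ships a shell at /bin/bash); I kept the /host prefix in the commands below instead:

root@ip-10-15-113-176:/# chroot /host /bin/bash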

Step 4: Find the Mounted Volume and Delete the Redis Data

Using findmnt, I located the mounted volume path for our specific PVC:

root@ip-10-15-113-176:/# findmnt | grep pvc-d7cfd03c-d389-4887-958c-0be96bbda095
| | |-/host/var/lib/kubelet/pods/3fa8e408-753a-493a-8df1-f77962535f3d/volumes/kubernetes.io~csi/pvc-d7cfd03c-d389-4887-958c-0be96bbda095/mount   /dev/nvme4n1   ext4   rw,relatime,context=system_u:object_r:local_t:s0
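
Before deleting anything, it's worth confirming that this volume is really the one that's out of space:

root@ip-10-15-113-176:/# df -h /host/var/lib/kubelet/pods/3fa8e408-753a-493a-8df1-f77962535f3d/volumes/kubernetes.io~csi/pvc-d7cfd03c-d389-4887-958c-0be96bbda095/mount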

Finally, I navigated to the mount point and removed the Redis data files:

root@ip-10-15-113-176:/# cd /host/var/lib/kubelet/pods/3fa8e408-753a-493a-8df1-f77962535f3d/volumes/kubernetes.io~csi/pvc-d7cfd03c-d389-4887-958c-0be96bbda095/mount

root@ip-10-15-113-176:/host/var/lib/kubelet/pods/.../mount# ls
appendonly.aof  dump.rdb  lost+found

root@ip-10-15-113-176:/host/var/lib/kubelet/pods/.../mount# rm appendonly.aof dump.rdb
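
After clearing the data files, remember to clean up the debug pod. If the Redis pod doesn't recover from its crashloop on its own, deleting it forces a fresh start (the StatefulSet controller recreates it):

$ kubectl delete pod node-debugger-ip-10-15-113-176.ec2.internal-6k9rt
$ kubectl delete pod redis-replicas-0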

Key Takeaways

This approach is particularly useful when:

  • Pods are crashlooping and you can't exec into them
  • You need to access PVC data directly for troubleshooting
  • Standard kubectl commands aren't sufficient for the task

Important Considerations

⚠️ Warning: This method directly accesses and modifies persistent storage data. Only use this approach when:

  • You're certain the data can be safely deleted (like cache data)
  • You're working in a non-production environment
  • You understand the implications of data loss

Alternative Approaches: In production environments, consider:

  • Scaling up the PVC if possible (see the sketch after this list)
  • Creating data backups before cleanup
  • Using Redis-specific commands to clear the cache when the pod is accessible (also sketched below)
  • Implementing monitoring to prevent PVC space issues
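
For reference, rough sketches of the first and third alternatives (the 16Gi target size is illustrative, the container name redis is an assumption, and PVC expansion only works if the StorageClass has allowVolumeExpansion: true):

$ kubectl patch pvc redis-data-redis-replicas-0 -p '{"spec":{"resources":{"requests":{"storage":"16Gi"}}}}'
$ kubectl exec redis-replicas-0 -c redis -- redis-cli FLUSHALL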

Conclusion

When standard Kubernetes troubleshooting methods fail due to crashlooping pods, the node debug approach provides a powerful way to access and resolve PVC-related issues. This method saved significant time by allowing direct access to the Redis data files, ultimately resolving the crashloop and restoring service functionality.
