- Log in to AWS Cost Explorer, review the previous day's cost, and look for abnormalities, i.e. any service that is costing more than its usual daily average.
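The same check can be scripted with the AWS CLI; a minimal sketch, assuming the credentials in use have ce:GetCostAndUsage permission and GNU date is available:
# Yesterday's unblended cost per service, largest first
aws ce get-cost-and-usage \
  --time-period Start=$(date -d "yesterday" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[0].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
  --output text | sort -k2 -gr | head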
- Nodegroup audits:
# Check at regular intervals if the builds nodegroup is scaling down to zero or not
kubectl get nodes -l=eks.amazonaws.com/nodegroup=builds-v2
# Check at regular intervals if the review apps addon nodegroup is scaling down to zero
kubectl get nodes -l=eks.amazonaws.com/nodegroup=neetodeploy-addons-node-group-ap-south-1b
If a nodegroup never scales down to zero, something is wrong.
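A scripted version of the same audit (a sketch using the nodegroup labels above; a non-zero count is only a problem if it persists across several checks):
# Count nodes per nodegroup and warn if they have not scaled down to zero
for ng in builds-v2 neetodeploy-addons-node-group-ap-south-1b; do
  count=$(kubectl get nodes -l eks.amazonaws.com/nodegroup=$ng --no-headers 2>/dev/null | wc -l)
  echo "$ng: $count node(s)"
  [ "$count" -gt 0 ] && echo "WARN: $ng has not scaled down to zero"
done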
- Monitor docker registry resource usage. We can check this quickly with:
kubectl top pods | grep docker-registry
It should stay below 10Gi of memory and 3 vCPU.
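The threshold can be checked mechanically as well; a sketch that parses kubectl top output (assuming CPU is reported in millicores and memory in Mi), and the same pattern works for the kpack and traefik checks below with their own thresholds:
# Flag docker-registry pods above ~10Gi memory or ~3 vCPU
kubectl top pods --no-headers | grep docker-registry | \
  awk '{cpu=$2; mem=$3; sub(/m$/,"",cpu); sub(/Mi$/,"",mem); if (cpu+0 > 3000 || mem+0 > 10240) print "WARN:", $0}'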
- Monitor kpack (controller and webhook) resource usage:
kubectl top pods -n kpack | grep kpack
It should stay below 4Gi of memory and 1 vCPU.
- Monitor traefik resource usage:
kubectl top pods -n traefik | grep traefik
It should stay below 8Gi of memory and 3 vCPU.
- Monitor pod-idling and downtime service:
https://neeto-engineering.neetodeploy.com/apps/pod-idling-service/metrics
https://neeto-engineering.neetodeploy.com/apps/neeto-deploy-downtime-service/metrics
There should be enough memory and CPU available; if they are running low, we can increase the plan.
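If we want the hourly cron run to at least confirm these dashboards are reachable, a curl probe works (a sketch; the pages may require an authenticated session, so a redirect or 401 here is not necessarily a problem):
# Print the HTTP status code of each metrics page
for url in \
  https://neeto-engineering.neetodeploy.com/apps/pod-idling-service/metrics \
  https://neeto-engineering.neetodeploy.com/apps/neeto-deploy-downtime-service/metrics; do
  echo "$(curl -s -o /dev/null -w '%{http_code}' "$url") $url"
done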
- Monitor EFS usage:
kubectl exec -it deployments/docker-registry-deployment-v2 -- df
The usage reported by df should stay under 1000 GB at all times; ideally it hovers around 800 GB. If it goes above 1000 GB, something is wrong.
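A scripted version of the same check; a sketch that flags any filesystem inside the pod above the 1000 GB mark (assumes GNU df in the container; narrow the filter if only the EFS mount matters):
# Warn if any filesystem inside the registry pod uses more than ~1000 GB
kubectl exec deployments/docker-registry-deployment-v2 -- df -BG | \
  awk 'NR>1 {used=$3; sub(/G$/,"",used); if (used+0 > 1000) print "WARN:", $0}'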
- Check for ImagePullBackOff and ErrImagePull:
kubectl get pods --all-namespaces | grep "ImagePullBackOff\|ErrImagePull"
If any pod other than “neeto-og-generator” is in ImagePullBackOff, something is wrong with our image registry.
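The neeto-og-generator exception can be baked into the check itself (a sketch):
# List pods stuck pulling images, ignoring the known neeto-og-generator case
kubectl get pods --all-namespaces --no-headers | \
  grep -E "ImagePullBackOff|ErrImagePull" | grep -v neeto-og-generator | \
  awk '{print "WARN:", $1"/"$2, $4}'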
- We also need to check the restart count of the following pods; if it has increased, we need to investigate why:
- kpack, traefik, cluster-autoscaler, prometheus, grafana, docker-registry, fluent-bit, pod-idling, downtime-service
kubectl get pods --all-namespaces | grep "kpack\|traefik\|docker\|cluster-autoscaler\|prometheus\|grafana\|fluent-bit\|pod-idling\|downtime-service"
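To make the restart counts easy to eyeball and to diff between hourly runs, the same pods can be listed with an explicit restart column; a sketch:
# Namespace, pod and per-container restart counts for the infra pods listed above
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount' | \
  grep -E "kpack|traefik|docker|cluster-autoscaler|prometheus|grafana|fluent-bit|pod-idling|downtime-service"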
- Check for CrashLoopBackOff and Error:
kubectl get pods --all-namespaces | grep "CrashLoopBackOff\|Error"
If any pod is in CrashLoopBackOff, we need to see if it is because of postgres/redis addon. If it is because of application code then we can ignore it.
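To tell quickly whether a crash comes from a postgres/redis addon or from application code, the previous container's logs usually give it away; a sketch:
# Show the last log lines of the previous (crashed) container for each CrashLoopBackOff pod
# (for multi-container pods, pass -c <container> as well)
kubectl get pods --all-namespaces --no-headers | grep CrashLoopBackOff | \
  while read -r ns pod _; do
    echo "--- $ns/$pod"
    kubectl logs -n "$ns" "$pod" --previous --tail=20 2>/dev/null
  done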
- Check if console deployments are in Error or Completed states:
kubectl get pods --all-namespaces | grep "Error\|Completed\|console"
I'm running these commands every hour via crontab and appending the output to separate date-stamped log files. At the end of the day, I review the folder. Here's the crontab (a sketch of one of the wrapper scripts follows it):
# Nodegroup audits
0 * * * * kubectl_get_nodes_builds.sh >> /path/to/nodegroup_audit_builds_$(date +\%Y-\%m-\%d).log
0 * * * * kubectl_get_nodes_review_apps.sh >> /path/to/nodegroup_audit_review_apps_$(date +\%Y-\%m-\%d).log
# Monitor docker registry
0 * * * * kubectl_top_pods_docker_registry.sh >> /path/to/docker_registry_monitor_$(date +\%Y-\%m-\%d).log
# Monitor Kpack resource usage
0 * * * * kubectl_top_pods_kpack.sh >> /path/to/kpack_resource_monitor_$(date +\%Y-\%m-\%d).log
# Monitor Traefik resource usage
0 * * * * kubectl_top_pods_traefik.sh >> /path/to/traefik_resource_monitor_$(date +\%Y-\%m-\%d).log
# Monitor pod-idling and downtime service
0 * * * * monitor_pod_idling_downtime.sh >> /path/to/pod_idling_downtime_monitor_$(date +\%Y-\%m-\%d).log
# Monitor EFS usage
0 * * * * kubectl_exec_df.sh >> /path/to/efs_usage_monitor_$(date +\%Y-\%m-\%d).log
# Check for ImagePullBackOff and ErrImagePull
0 * * * * kubectl_get_pods_image_pull_issues.sh >> /path/to/image_pull_issues_$(date +\%Y-\%m-\%d).log
# Check restart count of specific pods
0 * * * * kubectl_get_pods_restart_count.sh >> /path/to/pods_restart_count_$(date +\%Y-\%m-\%d).log
# Check for CrashLoopBackOff and Error
0 * * * * kubectl_get_pods_crash_loop_issues.sh >> /path/to/crash_loop_issues_$(date +\%Y-\%m-\%d).log
# Check console deployments states
0 * * * * kubectl_get_pods_console_states.sh >> /path/to/console_deployments_states_$(date +\%Y-\%m-\%d).log
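Each wrapper script above is just the corresponding command with a timestamp so the daily log stays readable. A sketch of what kubectl_get_nodes_builds.sh could look like (the exact contents are an assumption):
#!/usr/bin/env bash
# kubectl_get_nodes_builds.sh (sketch): timestamp each hourly sample
# cron runs with a minimal PATH, so kubectl may need an absolute path or an exported KUBECONFIG
echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
kubectl get nodes -l=eks.amazonaws.com/nodegroup=builds-v2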