Things to look out for at regular intervals

Daily system & cost audit

  1. Log in to AWS Cost Explorer, review the previous day's cost, and look for abnormalities, i.e. any service that is costing noticeably more than its average.
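If we ever want to script this instead of eyeballing the console, a minimal sketch using the AWS CLI (assuming the CLI is configured with Cost Explorer access; the query and sorting are illustrative) could look like:

# Sketch: print yesterday's cost per AWS service, highest first, so spikes stand out.
# Assumes GNU date; on macOS use `date -v-1d +%Y-%m-%d` for START.
START=$(date -d "yesterday" +%Y-%m-%d)
END=$(date +%Y-%m-%d)
aws ce get-cost-and-usage \
  --time-period Start=$START,End=$END \
  --granularity DAILY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[0].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
  --output text | sort -t$'\t' -k2 -rn | head -20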

  2. Nodegroup audits:
# Check at regular intervals if the builds nodegroup is scaling down to zero or not
kubectl get nodes -l=eks.amazonaws.com/nodegroup=builds-v2
# Check at regular intervals if the review apps addon nodegroup is scaling down to zero
kubectl get nodes -l=eks.amazonaws.com/nodegroup=neetodeploy-addons-node-group-ap-south-1b

If a nodegroup never scales down to zero, something is wrong.
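A small wrapper (sketch only; the nodegroup labels are the ones used above) can turn this into an explicit warning:

# Sketch: warn if either scale-to-zero nodegroup still has nodes.
for ng in builds-v2 neetodeploy-addons-node-group-ap-south-1b; do
  count=$(kubectl get nodes -l=eks.amazonaws.com/nodegroup=$ng --no-headers 2>/dev/null | wc -l)
  if [ "$count" -gt 0 ]; then
    echo "WARNING: nodegroup $ng still has $count node(s)"
  fi
done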


  3. Monitor docker registry:

https://metrics.neetodeployapp.com/graph?g0.expr=avg(container_memory_usage_bytes{container%3D"docker-registry-container-v2"}) by (pod) OR on() vector(0)&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=12h

Alternatively, we can just run:

kubectl top pods | grep docker-registry

Memory should stay below 10 GB and CPU below 3 vCPUs.
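To turn the kubectl top check into a pass/fail signal, here is a rough sketch; the 10 GB / 3 vCPU limits are the thresholds above, and the parsing assumes the default NAME / CPU(cores) / MEMORY(bytes) columns with m / Mi units. The same pattern works for the kpack and traefik checks below with their own thresholds.

# Sketch: flag the docker-registry pod if it exceeds ~10 GB memory or ~3 vCPUs.
kubectl top pods --no-headers | grep docker-registry | while read name cpu mem; do
  cpu_m=${cpu%m}     # strip trailing "m" (millicores)
  mem_mi=${mem%Mi}   # strip trailing "Mi"; adjust if kubectl reports Gi here
  if [ "$mem_mi" -gt 10240 ] || [ "$cpu_m" -gt 3000 ]; then
    echo "WARNING: $name is at ${mem} / ${cpu} (expected under 10 GB / 3 vCPUs)"
  fi
done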


  4. Monitor Kpack (webhook, controller) resource usage:

kpack-controller:

https://metrics.neetodeployapp.com/graph?g0.expr=container_memory_usage_bytes{namespace%3D"kpack"%2C pod%3D~"kpack-controller.*"}&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h

kpack-webhook:

https://metrics.neetodeployapp.com/graph?g0.expr=container_memory_usage_bytes{namespace%3D"kpack"%2C pod%3D~"kpack-webhook.*"}&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h

kubectl top pods -n kpack | grep kpack

Memory should stay below 4 GB and CPU below 1 vCPU.
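Instead of opening the graphs, the same data can be pulled from the Prometheus HTTP API (a sketch, assuming the API is exposed on the same metrics host without extra auth; requires curl and jq):

# Sketch: print current kpack controller/webhook memory usage in Mi via the Prometheus API.
curl -s 'https://metrics.neetodeployapp.com/api/v1/query' \
  --data-urlencode 'query=container_memory_usage_bytes{namespace="kpack", pod=~"kpack-(controller|webhook).*"}' \
  | jq -r '.data.result[] | "\(.metric.pod): \((.value[1] | tonumber / 1024 / 1024 | floor)) Mi"'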


  5. Monitor traefik resource usage (8 Gi, 3 vCPUs):

https://metrics.neetodeployapp.com/graph?g0.expr=container_memory_usage_bytes{namespace%3D"traefik"%2C container%3D"traefik"}&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=12h

kubectl top pods -n traefik | grep traefik

Memory should stay below 8 GB and CPU below 3 vCPUs.


  6. Monitor pod-idling and downtime service:

https://neeto-engineering.neetodeploy.com/apps/pod-idling-service/metrics

https://neeto-engineering.neetodeploy.com/apps/neeto-deploy-downtime-service/metrics

There should be enough memory and CPU available; if not, we can increase the plan.


  7. Monitor EFS usage:
kubectl exec -it deployments/docker-registry-deployment-v2 -- df

The usage reported by df should stay under 1000 GB at all times; ideally it hovers around 800 GB. If it goes above 1000 GB, something is wrong.
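A sketch that extracts just the volume usage from df and compares it against the 1000 GB ceiling; the mount path is an assumption (check the real df output for the actual one), and it assumes GNU df inside the container:

# Sketch: warn if the registry's EFS volume crosses ~1000 GB.
# /var/lib/registry is an assumed mount path; busybox df would need different parsing.
used_gb=$(kubectl exec deployments/docker-registry-deployment-v2 -- \
  df -BG --output=used /var/lib/registry 2>/dev/null | tail -1 | tr -dc '0-9')
if [ -n "$used_gb" ] && [ "$used_gb" -gt 1000 ]; then
  echo "WARNING: EFS usage is ${used_gb} GB (expected ~800 GB, limit 1000 GB)"
fi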


  8. Check for ImagePullBackOff, ErrImagePull
kubectl get pods --all-namespaces | grep -E "ImagePullBackOff|ErrImagePull"

If any pod other than “neeto-og-generator” is in ImagePullBackOff, something is wrong with our image registry.
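Since neeto-og-generator is expected noise, a sketch that reports only the unexpected cases:

# Sketch: list image pull failures, ignoring the known neeto-og-generator case.
kubectl get pods --all-namespaces --no-headers \
  | grep -E "ImagePullBackOff|ErrImagePull" \
  | grep -v "neeto-og-generator" \
  && echo "WARNING: unexpected image pull failures above - check the registry" \
  || echo "OK: no unexpected image pull failures"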


  9. We also need to check the restart count of the following pods; if it has increased, we need to investigate why.
  • kpack, traefik, cluster-autoscaler, prometheus, grafana, docker-registry, fluent-bit, pod-idling, downtime-service
kubectl get pods --all-namespaces | grep "kpack\|traefik\|docker\|cluster-autoscaler\|prometheus\|grafana\|fluent-bit\|pod-idling\|downtime-service"
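Grep alone doesn't show whether a restart count has gone up; sorting by restart count makes increases easier to spot between runs (a sketch that only looks at each pod's first container):

# Sketch: watched infra pods sorted by the restart count of their first container.
kubectl get pods --all-namespaces \
  --sort-by='.status.containerStatuses[0].restartCount' \
  -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount' \
  | grep -E "NAMESPACE|kpack|traefik|docker|cluster-autoscaler|prometheus|grafana|fluent-bit|pod-idling|downtime-service"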

  10. Check for CrashLoopBackOff, Error
kubectl get pods --all-namespaces | grep -E "CrashLoopBackOff|Error"

If any pod is in CrashLoopBackOff, we need to check whether it is caused by a postgres/redis addon. If it is caused by application code, we can ignore it.
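To separate addon crashes from application crashes quickly, a rough sketch (it assumes addon pod names contain "postgres" or "redis", which may need adjusting to the actual naming scheme):

# Sketch: split crashing pods into addon (postgres/redis) vs application pods.
crashing=$(kubectl get pods --all-namespaces --no-headers | grep -E "CrashLoopBackOff|Error")
echo "Addon pods (investigate):"
echo "$crashing" | grep -E "postgres|redis"
echo "Other pods (likely application code, can usually be ignored):"
echo "$crashing" | grep -vE "postgres|redis"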


  11. Check if console deployments are in Error or Completed states
kubectl get pods --all-namespaces | grep console | grep -E "Error|Completed"

I'm running these commands every hour via crontab and appending the output to separate, date-stamped log files. At the end of the day, I review the folder. Here's the crontab:

# Nodegroup audits
0 * * * * kubectl_get_nodes_builds.sh >> /path/to/nodegroup_audit_builds_$(date +\%Y-\%m-\%d).log
0 * * * * kubectl_get_nodes_review_apps.sh >> /path/to/nodegroup_audit_review_apps_$(date +\%Y-\%m-\%d).log

# Monitor docker registry
0 * * * * kubectl_top_pods_docker_registry.sh >> /path/to/docker_registry_monitor_$(date +\%Y-\%m-\%d).log

# Monitor Kpack resource usage
0 * * * * kubectl_top_pods_kpack.sh >> /path/to/kpack_resource_monitor_$(date +\%Y-\%m-\%d).log

# Monitor Traefik resource usage
0 * * * * kubectl_top_pods_traefik.sh >> /path/to/traefik_resource_monitor_$(date +\%Y-\%m-\%d).log

# Monitor pod-idling and downtime service
0 * * * * monitor_pod_idling_downtime.sh >> /path/to/pod_idling_downtime_monitor_$(date +\%Y-\%m-\%d).log

# Monitor EFS usage
0 * * * * kubectl_exec_df.sh >> /path/to/efs_usage_monitor_$(date +\%Y-\%m-\%d).log

# Check for ImagePullBackOff and ErrImagePull
0 * * * * kubectl_get_pods_image_pull_issues.sh >> /path/to/image_pull_issues_$(date +\%Y-\%m-\%d).log

# Check restart count of specific pods
0 * * * * kubectl_get_pods_restart_count.sh >> /path/to/pods_restart_count_$(date +\%Y-\%m-\%d).log

# Check for CrashLoopBackOff and Error
0 * * * * kubectl_get_pods_crash_loop_issues.sh >> /path/to/crash_loop_issues_$(date +\%Y-\%m-\%d).log

# Check console deployments states
0 * * * * kubectl_get_pods_console_states.sh >> /path/to/console_deployments_states_$(date +\%Y-\%m-\%d).log
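The crontab calls small wrapper scripts that aren't shown here; one of them, kubectl_get_nodes_builds.sh, could be as simple as the sketch below. The key details under cron are a timestamp on every entry and an explicit KUBECONFIG, because cron runs with a minimal environment (both paths below are placeholders):

#!/bin/bash
# Sketch of kubectl_get_nodes_builds.sh: timestamped nodegroup check suitable for cron.
# KUBECONFIG and the kubectl path are placeholders - cron does not load the usual shell environment.
export KUBECONFIG=/path/to/kubeconfig
echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
/usr/local/bin/kubectl get nodes -l=eks.amazonaws.com/nodegroup=builds-v2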
