- Log in to AWS Cost Explorer, review the previous day's cost, and look for abnormalities, i.e. any service that is costing more than its usual daily average.
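The same check can be scripted with the AWS CLI; a minimal sketch, assuming the credentials in use have ce:GetCostAndUsage permission and GNU date is available:
# Yesterday's unblended cost per service, largest first
aws ce get-cost-and-usage \
  --time-period Start=$(date -d "yesterday" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[0].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
  --output text | sort -k2 -gr | head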
- Nodegroup audits:
# Check at regular intervals if the builds nodegroup is scaling down to zero or not
kubectl get nodes -l=eks.amazonaws.com/nodegroup=builds-v2
# Check at regular intervals if the review apps addon nodegroup is scaling down to zero
kubectl get nodes -l=eks.amazonaws.com/nodegroup=neetodeploy-addons-node-group-ap-south-1b
If a nodegroup never scales down to zero, something is wrong.
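A scripted version of the same audit (a sketch using the nodegroup labels above; a non-zero count is only a problem if it persists across several checks):
# Count nodes per nodegroup and warn if they have not scaled down to zero
for ng in builds-v2 neetodeploy-addons-node-group-ap-south-1b; do
  count=$(kubectl get nodes -l eks.amazonaws.com/nodegroup=$ng --no-headers 2>/dev/null | wc -l)
  echo "$ng: $count node(s)"
  [ "$count" -gt 0 ] && echo "WARN: $ng has not scaled down to zero"
done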
- Monitor docker registry resource usage. We can check this quickly with:
kubectl top pods | grep docker-registry
It should stay below 10Gi of memory and 3 vCPU.
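The threshold can be checked mechanically as well; a sketch that parses kubectl top output (assuming CPU is reported in millicores and memory in Mi), and the same pattern works for the kpack and traefik checks below with their own thresholds:
# Flag docker-registry pods above ~10Gi memory or ~3 vCPU
kubectl top pods --no-headers | grep docker-registry | \
  awk '{cpu=$2; mem=$3; sub(/m$/,"",cpu); sub(/Mi$/,"",mem); if (cpu+0 > 3000 || mem+0 > 10240) print "WARN:", $0}'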
- Monitor kpack (controller and webhook) resource usage:
kubectl top pods -n kpack | grep kpack
It should stay below 4Gi of memory and 1 vCPU.
- Monitor traefik resource usage:
kubectl top pods -n traefik | grep traefik
It should stay below 8Gi of memory and 3 vCPU.
- Monitor pod-idling and downtime service:
https://neeto-engineering.neetodeploy.com/apps/pod-idling-service/metrics
https://neeto-engineering.neetodeploy.com/apps/neeto-deploy-downtime-service/metrics
There should be enough memory and CPU available; if they are running low, we can increase the plan.
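If we want the hourly cron run to at least confirm these dashboards are reachable, a curl probe works (a sketch; the pages may require an authenticated session, so a redirect or 401 here is not necessarily a problem):
# Print the HTTP status code of each metrics page
for url in \
  https://neeto-engineering.neetodeploy.com/apps/pod-idling-service/metrics \
  https://neeto-engineering.neetodeploy.com/apps/neeto-deploy-downtime-service/metrics; do
  echo "$(curl -s -o /dev/null -w '%{http_code}' "$url") $url"
done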
- Monitor EFS usage:
kubectl exec -it deployments/docker-registry-deployment-v2 -- df
The usage reported by df should stay under 1000 GB at all times; ideally it hovers around 800 GB. If it goes above 1000 GB, something is wrong.
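A scripted version of the same check; a sketch that flags any filesystem inside the pod above the 1000 GB mark (assumes GNU df in the container; narrow the filter if only the EFS mount matters):
# Warn if any filesystem inside the registry pod uses more than ~1000 GB
kubectl exec deployments/docker-registry-deployment-v2 -- df -BG | \
  awk 'NR>1 {used=$3; sub(/G$/,"",used); if (used+0 > 1000) print "WARN:", $0}'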
- Check for ImagePullBackOff and ErrImagePull:
kubectl get pods --all-namespaces | grep "ImagePullBackOff\|ErrImagePull"
If any pod other than “neeto-og-generator” is in ImagePullBackOff, something is wrong with our image registry.
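The neeto-og-generator exception can be baked into the check itself (a sketch):
# List pods stuck pulling images, ignoring the known neeto-og-generator case
kubectl get pods --all-namespaces --no-headers | \
  grep -E "ImagePullBackOff|ErrImagePull" | grep -v neeto-og-generator | \
  awk '{print "WARN:", $1"/"$2, $4}'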
- We also need to check the restart count of the following pods; if it has increased, we need to investigate why:
- kpack, traefik, cluster-autoscaler, prometheus, grafana, docker-registry, fluent-bit, pod-idling, downtime-service
kubectl get pods --all-namespaces | grep "kpack\|traefik\|docker\|cluster-autoscaler\|prometheus\|grafana\|fluent-bit\|pod-idling\|downtime-service"
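To make the restart counts easy to eyeball and to diff between hourly runs, the same pods can be listed with an explicit restart column; a sketch:
# Namespace, pod and per-container restart counts for the infra pods listed above
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount' | \
  grep -E "kpack|traefik|docker|cluster-autoscaler|prometheus|grafana|fluent-bit|pod-idling|downtime-service"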
- Check for CrashLoopBackOff and Error:
kubectl get pods --all-namespaces | grep "CrashLoopBackOff\|Error"
If any pod is in CrashLoopBackOff, we need to see if it is because of postgres/redis addon. If it is because of application code then we can ignore it.
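To tell quickly whether a crash comes from a postgres/redis addon or from application code, the previous container's logs usually give it away; a sketch:
# Show the last log lines of the previous (crashed) container for each CrashLoopBackOff pod
# (for multi-container pods, pass -c <container> as well)
kubectl get pods --all-namespaces --no-headers | grep CrashLoopBackOff | \
  while read -r ns pod _; do
    echo "--- $ns/$pod"
    kubectl logs -n "$ns" "$pod" --previous --tail=20 2>/dev/null
  done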
- Check if console deployments are in Error or Completed states:
kubectl get pods --all-namespaces | grep "Error\|Completed\|console"
I'm running these commands every hour via crontab and appending the output to separate date-stamped log files. At the end of the day, I review the folder. Here's the crontab (a sketch of one of the wrapper scripts follows it):
# Nodegroup audits
0 * * * * kubectl_get_nodes_builds.sh >> /path/to/nodegroup_audit_builds_$(date +\%Y-\%m-\%d).log
0 * * * * kubectl_get_nodes_review_apps.sh >> /path/to/nodegroup_audit_review_apps_$(date +\%Y-\%m-\%d).log
# Monitor docker registry
0 * * * * kubectl_top_pods_docker_registry.sh >> /path/to/docker_registry_monitor_$(date +\%Y-\%m-\%d).log
# Monitor Kpack resource usage
0 * * * * kubectl_top_pods_kpack.sh >> /path/to/kpack_resource_monitor_$(date +\%Y-\%m-\%d).log
# Monitor Traefik resource usage
0 * * * * kubectl_top_pods_traefik.sh >> /path/to/traefik_resource_monitor_$(date +\%Y-\%m-\%d).log
# Monitor pod-idling and downtime service
0 * * * * monitor_pod_idling_downtime.sh >> /path/to/pod_idling_downtime_monitor_$(date +\%Y-\%m-\%d).log
# Monitor EFS usage
0 * * * * kubectl_exec_df.sh >> /path/to/efs_usage_monitor_$(date +\%Y-\%m-\%d).log
# Check for ImagePullBackOff and ErrImagePull
0 * * * * kubectl_get_pods_image_pull_issues.sh >> /path/to/image_pull_issues_$(date +\%Y-\%m-\%d).log
# Check restart count of specific pods
0 * * * * kubectl_get_pods_restart_count.sh >> /path/to/pods_restart_count_$(date +\%Y-\%m-\%d).log
# Check for CrashLoopBackOff and Error
0 * * * * kubectl_get_pods_crash_loop_issues.sh >> /path/to/crash_loop_issues_$(date +\%Y-\%m-\%d).log
# Check console deployments states
0 * * * * kubectl_get_pods_console_states.sh >> /path/to/console_deployments_states_$(date +\%Y-\%m-\%d).log
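Each wrapper script above is just the corresponding command with a timestamp so the daily log stays readable. A sketch of what kubectl_get_nodes_builds.sh could look like (the exact contents are an assumption):
#!/usr/bin/env bash
# kubectl_get_nodes_builds.sh (sketch): timestamp each hourly sample
# cron runs with a minimal PATH, so kubectl may need an absolute path or an exported KUBECONFIG
echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
kubectl get nodes -l=eks.amazonaws.com/nodegroup=builds-v2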