The following is based on two GCP / GKE how-to guides:
- https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed
- https://cloud.google.com/stackdriver/docs/managed-prometheus/query
What this adds is extra checks / verification commands that you can run to help troubleshoot.
- Provision a GKE Autopilot cluster (v1.26.5 here) with all defaults.
- Get kubectl access to it. The commands below use this alias:
alias k=kubectl
- Make sure kubectl get node shows at least 1 node. (Note: a fresh GKE Autopilot cluster waits for you to deploy a pod before it provisions nodes, so if you don't see at least 1 node, run the following to force provisioning.)
k run nginx --image=nginx
export PROJECT=chrism-playground-369416
export NAMESPACE=test
export SA_SHORT_NAME=gmp-test-sa
export SA_NAME=$SA_SHORT_NAME@$PROJECT.iam.gserviceaccount.com
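Since every later command depends on these four variables, it may be worth confirming they are set in the current shell before continuing. The following is a minimal sketch; the helper name check_gmp_env is hypothetical and not part of the Google docs.

```shell
# Hypothetical helper: fails if any variable used in this guide is empty,
# or if SA_NAME is not built from SA_SHORT_NAME and PROJECT as shown above.
check_gmp_env() {
  for name in PROJECT NAMESPACE SA_SHORT_NAME SA_NAME; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      return 1
    fi
  done
  case "$SA_NAME" in
    "$SA_SHORT_NAME@$PROJECT.iam.gserviceaccount.com") echo "env ok" ;;
    *) echo "unexpected SA_NAME: $SA_NAME" >&2; return 1 ;;
  esac
}
```

Running check_gmp_env in a new terminal tab is a quick way to notice that the exports were made in a different shell session.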
Step 2: GKE Autopilot enables GMP (Google Managed Prometheus) and workload identity by default
# Verify you see the 4 workloads associated with GMP enabled
k get po -n=gke-gmp-system
# from --^, you should see alertmanager, gmp-operator, rule-evaluator, and collector.
# If you don't see collector, check that you have at least 1 node running:
# collector is a daemonset, so it only shows up once at least 1 node exists.
# Verify nodes are labeled gke-metadata-server-enabled=true (means workload identity is enabled)
k get node -L=iam.gke.io/gke-metadata-server-enabled
# NAME                                           STATUS   ROLES    AGE     VERSION            GKE-METADATA-SERVER-ENABLED
# gk3-autopilot-cluster-1-pool-1-16322d28-zqw4   Ready    <none>   175m    v1.26.5-gke.1200   true
# gk3-autopilot-cluster-1-pool-1-4be8ad63-jlbw   Ready    <none>   6m31s   v1.26.5-gke.1200   true
# gk3-autopilot-cluster-1-pool-1-dbcf5863-76pf   Ready    <none>   168m    v1.26.5-gke.1200   true
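The "do I see all four workloads" check above can also be scripted. This is a rough sketch, assuming pod names contain the workload name as a substring; the helper name missing_gmp_workloads is hypothetical.

```shell
# Hypothetical helper: reads pod names on stdin (one per line) and prints
# each of the four expected GMP workloads that has no matching pod.
missing_gmp_workloads() {
  pods=$(cat)
  for workload in alertmanager gmp-operator rule-evaluator collector; do
    if ! printf '%s\n' "$pods" | grep -q "$workload"; then
      echo "$workload"
    fi
  done
}

# Usage against a live cluster (prints nothing when all four are present):
#   kubectl get po -n gke-gmp-system -o name | missing_gmp_workloads
```

If the only name printed is collector, the likely cause is the one described above: no node exists yet for the daemonset to schedule onto.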
# Following comes from the docs:
###################################################################
# Deploy the demo app & podmonitor custom resource
k create ns $NAMESPACE
kubectl -n $NAMESPACE apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/pod-monitoring.yaml
kubectl -n $NAMESPACE apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/example-app.yaml
k scale deploy prom-example --replicas=1 -n=$NAMESPACE
####################################################################
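To make it clearer what the PodMonitoring resource tells the collectors, here is a sketch that prints a minimal manifest of that kind. The helper name render_podmonitoring and its arguments are hypothetical, and the label you pass in is an assumption to be checked against the labels on your pods; the pod-monitoring.yaml applied above remains the authoritative example.

```shell
# Hypothetical helper: prints a minimal PodMonitoring manifest.
# $1 = resource name
# $2 = pod label to select, written as "key: value" (assumption: must match your pods)
# $3 = name of the container port that serves metrics
render_podmonitoring() {
  cat <<EOF
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: $1
spec:
  selector:
    matchLabels:
      $2
  endpoints:
  - port: $3
    interval: 30s
EOF
}

# Usage (client-side dry run only, nothing is created):
#   render_podmonitoring prom-example 'app: prom-example' metrics |
#     kubectl apply -n "$NAMESPACE" --dry-run=client -f -
```

The port is referenced by name, which is why checking the deployment for a container port named metrics is a useful verification step.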
# Verification Commands:
k get po -o wide -n=$NAMESPACE
# (pod IP of prom-example-5cd7b77867-plwh4 is 10.8.0.72)
k get deploy prom-example -n=$NAMESPACE -o yaml | grep "name: metrics" -B 2
# ports:
# - containerPort: 1234
# name: metrics
# (^-- says metrics available on port 1234)
k run -it curl --image=curlimages/curl -n=$NAMESPACE -- sh
# ^-- run this from laptop shell, to gain pod shell
exit
# ^-- exit from pod shell to laptop shell
k exec -it pod/curl -n=$NAMESPACE -- curl 10.8.0.72:1234/metrics
# ^-- run this from laptop shell, runs curl from within pod named curl without switching shell context
# ^-- shows prometheus metrics
#
# ...
# example_random_numbers_bucket{le="+Inf"} 4.4757322148e+10
# example_random_numbers_sum 3.571108803002773e+10
# example_random_numbers_count 4.4757322148e+10
# # HELP example_requests_total Total number of HTTP requests by status code and method.
# # TYPE example_requests_total counter
# example_requests_total{code="200",method="get"} 8611
# ...
#####################################################################
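The /metrics output can be long, so it may help to reduce it to just the metric names before picking one to query later. A minimal sketch, with a hypothetical helper name metric_names:

```shell
# Hypothetical helper: reads Prometheus text exposition format on stdin and
# prints the unique metric names, skipping "# HELP" / "# TYPE" lines and
# blank lines. Everything from the first "{" or space onward is dropped.
metric_names() {
  grep -v -e '^#' -e '^$' | sed 's/[{ ].*//' | sort -u
}

# Usage (pod IP and port taken from the checks above):
#   kubectl exec pod/curl -n "$NAMESPACE" -- curl -s 10.8.0.72:1234/metrics | metric_names
```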
- Go to the following URL
  https://console.cloud.google.com/monitoring/metrics-explorer
- If necessary, update the URL to contain your project
  https://console.cloud.google.com/monitoring/metrics-explorer?project=chrism-playground-369416
- Click through the following to access PromQL query mode:
  - v-- Switch Mode from Builder to Code: (Query Language)
  - v-- Switch Query Language to PromQL
  - v-- Update the timeframe to something like the last 3 hours, and reference a Prometheus metric that you saw when curling the Prometheus metric endpoint. (example_requests_total came from curling the Prometheus metric endpoint)
- The following link offers a good architecture diagram of how the Prometheus GUI works
  https://cloud.google.com/stackdriver/docs/managed-prometheus#gmp-system-overview
- To summarize it:
  - Google has a metrics database called Monarch; it has a Prometheus API compatibility layer that's ~95% compatible with Prometheus.
  - The gmp-operator pod in the gke-gmp-system namespace configures the collector pods in the same namespace, based on PodMonitoring custom resources.
  - The collector pod is basically a Prometheus metric shipping agent: it ships to the Monarch Prometheus API endpoint, and the metrics are stored in Monarch.
  - In step 5, we'll deploy a Prometheus GUI frontend that reads from Monarch and presents the data more like a traditional Prometheus install, which something like Grafana can read from.
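The Prometheus API compatibility layer mentioned above can also be queried directly, without the console or a frontend. This is a sketch only: the helper name gmp_query_url is hypothetical, and the URL shape is an assumption based on the GMP query guide linked at the top, so verify it there before relying on it.

```shell
# Hypothetical helper: prints the Prometheus-compatible instant-query URL
# for a GCP project ($1). URL shape assumed from the GMP query docs.
gmp_query_url() {
  printf 'https://monitoring.googleapis.com/v1/projects/%s/location/global/prometheus/api/v1/query\n' "$1"
}

# Usage (needs gcloud credentials that can read monitoring data):
#   curl -s "$(gmp_query_url "$PROJECT")" \
#     -d "query=example_requests_total" \
#     -H "Authorization: Bearer $(gcloud auth print-access-token)"
```

A JSON response containing the metric here, combined with an empty frontend later, would point at the frontend's configuration or permissions rather than at collection.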
# v-- set gcloud context
gcloud config set project $PROJECT
# v-- create GCP SA
gcloud iam service-accounts create $SA_SHORT_NAME
# v-- annotate the default Kube SA in the namespace, with a reference to the GCP SA,
# to establish a link between them as needed by GKE workload identity.
kubectl annotate serviceaccount \
  default \
  --namespace $NAMESPACE \
  iam.gke.io/gcp-service-account=$SA_NAME
# v-- grant the workloadIdentityUser GCP IAM role on the GCP SA to the
# default Kubernetes service account in the kube namespace
gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$PROJECT.svc.id.goog[$NAMESPACE/default]" \
  $SA_NAME
# v-- add monitoring.viewer GCP IAM role to the GCP SA, which the kube SA
# is now linked to
gcloud projects add-iam-policy-binding $PROJECT \
  --member=serviceAccount:$SA_NAME \
  --role=roles/monitoring.viewer
# Note: we won't specify a service account for the Prometheus GUI, so it'll
# use the default service account in that namespace. The commands above gave
# rights to that default service account.
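The --member value used in the workload identity binding is easy to mistype, so a small sketch of how it is composed may help. The helper name wi_member is hypothetical; the string format is the one used in the command above.

```shell
# Hypothetical helper: prints the IAM member string that identifies a
# Kubernetes service account under GKE workload identity.
# $1 = GCP project, $2 = kube namespace, $3 = kube service account name
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Usage:
#   gcloud iam service-accounts add-iam-policy-binding \
#     --role roles/iam.workloadIdentityUser \
#     --member "$(wi_member "$PROJECT" "$NAMESPACE" default)" \
#     "$SA_NAME"
```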
#############################################################################
# Verification Commands to validate what was just done
k get sa default -n=$NAMESPACE -o yaml | grep annotation -A 1
# annotations:
# iam.gke.io/gcp-service-account: gmp-test-sa@chrism-playground-369416.iam.gserviceaccount.com
gcloud projects get-iam-policy $PROJECT \
  --flatten="bindings[].members" \
  --format='table(bindings.role)' \
  --filter="bindings.members:$SA_NAME"
# ROLE
# roles/monitoring.viewer
gcloud asset search-all-iam-policies --scope=projects/$PROJECT --query="$SA_NAME"
# ...
# policy:
# bindings:
# - members:
# - serviceAccount:chrism-playground-369416.svc.id.goog[test/default]
# role: roles/iam.workloadIdentityUser
# ...
# policy:
# bindings:
# - members:
# - serviceAccount:gmp-test-sa@chrism-playground-369416.iam.gserviceaccount.com
# role: roles/monitoring.viewer
# ...
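Comparing the annotation by eye works, but a scripted comparison catches subtle typos in the service account email. A minimal sketch, with a hypothetical helper name annotation_matches:

```shell
# Hypothetical helper: reads a Kubernetes service account manifest (YAML)
# on stdin and succeeds only if its iam.gke.io/gcp-service-account
# annotation equals the expected GCP service account email ($1).
annotation_matches() {
  actual=$(sed -n 's|^ *iam\.gke\.io/gcp-service-account: *||p' | head -n 1)
  [ "$actual" = "$1" ]
}

# Usage:
#   kubectl get sa default -n "$NAMESPACE" -o yaml |
#     annotation_matches "$SA_NAME" && echo "annotation ok"
```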
- These instructions are mostly based on the docs, but contain a purposeful mistake that makes for useful debugging practice.
# v-- The sed is where the mistake happens
curl https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/frontend.yaml |
  sed 's/\$PROJECT_ID/$PROJECT/' |
  kubectl apply -n $NAMESPACE -f -
k scale deploy frontend -n=$NAMESPACE --replicas=1
# ^-- it defaults to 2 replicas which is unnecessary
k get po -n=$NAMESPACE
kubectl -n $NAMESPACE port-forward svc/frontend 9090
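When a templated manifest misbehaves, one generic check is whether any placeholder survived into what was actually applied. The following is a sketch under that assumption; the helper name find_unexpanded is hypothetical.

```shell
# Hypothetical helper: reads a rendered manifest on stdin and prints (with
# line numbers) every line still containing a literal "$PROJECT", which
# also covers "$PROJECT_ID". Returns non-zero when any such line is found.
find_unexpanded() {
  # Single quotes are intentional: we are searching for the literal text.
  if grep -n -F '$PROJECT'; then
    return 1
  fi
  return 0
}

# Usage:
#   kubectl get deploy frontend -n "$NAMESPACE" -o yaml | find_unexpanded
```

No output and a zero exit status suggest the placeholders were substituted with a real value.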