Overview

The following is based on two GCP / GKE how-to guides.

On top of those guides, it adds extra checks and verification commands that can be run to help troubleshoot.

Step 0: Prereqs

- Provision a GKE Autopilot cluster (v1.26.5 was used here) with all defaults.
- Set alias k=kubectl
- Get kubectl access to the cluster.
- Make sure kubectl get node shows at least 1 node. (Note: a fresh GKE Autopilot cluster waits for you to deploy a pod before it provisions nodes, so if you don't see at least 1 node, run k run nginx --image=nginx to force provisioning.)
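The node check above can be wrapped in a small helper. This is a minimal sketch, not part of the original guides: the function name ensure_node is hypothetical, and the kubectl wait step and its 300s timeout are assumptions added for convenience.

```shell
# Hypothetical helper: make sure the cluster has at least one node.
# If none exist yet, start a throwaway nginx pod so Autopilot provisions one.
ensure_node() {
  local count
  # Count node lines; stderr is dropped because an empty cluster prints
  # "No resources found" there.
  count=$(kubectl get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')
  if [ "$count" -lt 1 ]; then
    echo "No nodes yet; running an nginx pod to trigger provisioning"
    kubectl run nginx --image=nginx
    # Assumption: 300s is enough for Autopilot to bring up a node.
    kubectl wait --for=condition=Ready pod/nginx --timeout=300s
  fi
  kubectl get nodes
}

# Usage: ensure_node
```

The helper uses kubectl rather than the k alias because aliases are not expanded inside non-interactive scripts.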

Step 1: Set bash env vars

export PROJECT=chrism-playground-369416
export NAMESPACE=test
export SA_SHORT_NAME=gmp-test-sa
export SA_NAME=$SA_SHORT_NAME@$PROJECT.iam.gserviceaccount.com
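Later steps fail in confusing ways if one of these variables is empty, so a quick sanity check can help. The sketch below repeats the exports so that it is self-contained; the values are the ones used in this walkthrough and should be replaced with your own.

```shell
export PROJECT=chrism-playground-369416
export NAMESPACE=test
export SA_SHORT_NAME=gmp-test-sa
export SA_NAME="$SA_SHORT_NAME@$PROJECT.iam.gserviceaccount.com"

# Abort with a clear message if any variable is unset or empty.
: "${PROJECT:?PROJECT is not set}"
: "${NAMESPACE:?NAMESPACE is not set}"
: "${SA_SHORT_NAME:?SA_SHORT_NAME is not set}"
: "${SA_NAME:?SA_NAME is not set}"

echo "$SA_NAME"
# prints: gmp-test-sa@chrism-playground-369416.iam.gserviceaccount.com
```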

Step 2: GKE Autopilot defaults to GMP (google managed prometheus) enabled & workload identity enabled

# Verify you see the 4 workloads associated with GMP enabled
k get po -n=gke-gmp-system 
# from --^, you should see alertmanager, gmp-operator, rule-evaluator, and collector,
# if you don't see collector check that you have at least 1 node running,
# since collector is a daemonset it'll wait for 1 node to exist before showing up.

# Verify nodes labeled with metadata-server-enabled (means workload identity enabled)
k get node -L=iam.gke.io/gke-metadata-server-enabled
# NAME                                           STATUS   ROLES    AGE     VERSION            GKE-METADATA-SERVER-ENABLED
# gk3-autopilot-cluster-1-pool-1-16322d28-zqw4   Ready    <none>   175m    v1.26.5-gke.1200   true
# gk3-autopilot-cluster-1-pool-1-4be8ad63-jlbw   Ready    <none>   6m31s   v1.26.5-gke.1200   true
# gk3-autopilot-cluster-1-pool-1-dbcf5863-76pf   Ready    <none>   168m    v1.26.5-gke.1200   true
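On clusters with many nodes, reading the label column by eye gets tedious. As a rough sketch, the check can be inverted to list only the nodes that lack the label. The function name check_workload_identity_nodes is hypothetical; the label key is the one shown above.

```shell
# Hypothetical helper: report nodes that are NOT labeled
# iam.gke.io/gke-metadata-server-enabled=true.
# The != selector also matches nodes that do not carry the label at all.
check_workload_identity_nodes() {
  local missing
  missing=$(kubectl get nodes \
    -l 'iam.gke.io/gke-metadata-server-enabled!=true' \
    -o name) || return 2
  if [ -n "$missing" ]; then
    echo "Nodes without gke-metadata-server-enabled=true:"
    echo "$missing"
    return 1
  fi
  echo "All nodes report gke-metadata-server-enabled=true"
}

# Usage: check_workload_identity_nodes
```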

Step 3: Deploy test workload and see prom metrics via curl

# Following comes from the docs:
###################################################################
# Deploy the demo app & podmonitor custom resource

k create ns $NAMESPACE

kubectl -n $NAMESPACE apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/pod-monitoring.yaml

kubectl -n $NAMESPACE apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/example-app.yaml

k scale deploy prom-example --replicas=1 -n=$NAMESPACE
####################################################################
# Verification Commands:

k get po -o wide -n=$NAMESPACE
# (pod IP of prom-example-5cd7b77867-plwh4 is 10.8.0.72)

k get deploy prom-example -n=$NAMESPACE -o yaml | grep "name: metrics" -B 2
#        ports:
#        - containerPort: 1234
#          name: metrics
# (^-- says metrics available on port 1234)

k run -it curl --image=curlimages/curl -n=$NAMESPACE -- sh
# ^-- run this from laptop shell, to gain pod shell

exit
# ^-- exit from pod shell to laptop shell

k exec -it pod/curl -n=$NAMESPACE -- curl 10.8.0.72:1234/metrics
# ^-- run this from laptop shell, runs curl from within pod named curl without switching shell context
# ^-- shows prometheus metrics
# 
# ...
# example_random_numbers_bucket{le="+Inf"} 4.4757322148e+10
# example_random_numbers_sum 3.571108803002773e+10
# example_random_numbers_count 4.4757322148e+10
# # HELP example_requests_total Total number of HTTP requests by status code and method.
# # TYPE example_requests_total counter
# example_requests_total{code="200",method="get"} 8611
# ...
#####################################################################
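The pod IP above (10.8.0.72) was copied by hand and will differ on every cluster. A small sketch of looking it up instead, assuming the pod name still starts with prom-example- as in the output above; the function name prom_example_ip is hypothetical.

```shell
# Hypothetical helper: print the IP of the first prom-example pod in $NAMESPACE.
prom_example_ip() {
  # Emit "name ip" per pod, then keep the first pod whose name matches.
  kubectl get pods -n "$NAMESPACE" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' |
    awk '/^prom-example-/ { print $2; exit }'
}

# Usage (from the laptop shell, with the curl pod from above still running):
#   kubectl exec -it pod/curl -n "$NAMESPACE" -- curl "$(prom_example_ip):1234/metrics"
```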

Step 4: Let's look at the metrics from the GCP GUI

Go to the following URL:
https://console.cloud.google.com/monitoring/metrics-explorer
If necessary, update the URL to contain your project:
https://console.cloud.google.com/monitoring/metrics-explorer?project=chrism-playground-369416
Click through the following to get to PromQL query mode:

v-- Switch the mode from Builder to Code (Query Language)
v-- Switch the Query Language to PromQL
v-- Update the time frame to something like the last 3 hours, and reference a Prometheus metric that you saw when curling the Prometheus metrics endpoint (for example, example_requests_total).

Step 5: Deploy the prometheus GUI

Step 5A: Overview

The following link has a good architecture diagram of how the Prometheus GUI fits in:
https://cloud.google.com/stackdriver/docs/managed-prometheus#gmp-system-overview
To summarize:
- Google has a metrics database called Monarch. It has a Prometheus API compatibility layer that's ~95% compatible with Prometheus.
- The gmp-operator pod in the gke-gmp-system namespace configures the collector pods in the same namespace, based on PodMonitoring custom resources.
- A collector pod is basically a Prometheus metric shipping agent: it ships metrics to the Monarch Prometheus API endpoint, and the metrics are stored in Monarch.
- In step 5, we'll deploy a Prometheus GUI frontend that reads from Monarch and presents the data like a traditional Prometheus install, so that something like Grafana can read from it.

Step 5B: Make a service account for the Prometheus Frontend GUI & Verify it has the correct Rights

# v-- set gcloud context
gcloud config set project $PROJECT

# v-- create GCP SA
gcloud iam service-accounts create $SA_SHORT_NAME

# v-- annotate the default Kube SA in the namespace, with a reference to the GCP SA, 
#     to establish a link between them as needed by GKE workload identity.
kubectl annotate serviceaccount \
  default \
  --namespace $NAMESPACE \
  iam.gke.io/gcp-service-account=$SA_NAME

# v-- grant the workloadIdentityUser GCP IAM role on the GCP SA to the default
#     Kubernetes service account in the kube namespace, so it can act as the GCP SA
gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$PROJECT.svc.id.goog[$NAMESPACE/default]" \
  $SA_NAME

# v-- add monitoring.viewer GCP IAM role to the GCP SA, which the kube SA
#     is now linked to
gcloud projects add-iam-policy-binding $PROJECT \
  --member=serviceAccount:$SA_NAME \
  --role=roles/monitoring.viewer

# Note: we won't specify a service account for the Prometheus GUI, so it'll 
#       use the default service account in that namespace, the above gave 
#       rights to the default service account in the namespace.

#############################################################################
# Verification Commands to validate what was just done

k get sa default -n=$NAMESPACE -o yaml | grep annotation -A 1
#  annotations:
#    iam.gke.io/gcp-service-account: gmp-test-sa@chrism-playground-369416.iam.gserviceaccount.com

gcloud projects get-iam-policy $PROJECT \
 --flatten="bindings[].members" \
 --format='table(bindings.role)' \
 --filter="bindings.members:$SA_NAME" 
# ROLE
# roles/monitoring.viewer

gcloud asset search-all-iam-policies --scope=projects/$PROJECT --query="$SA_NAME"
# ...
# policy:
#   bindings:
#   - members:
#     - serviceAccount:chrism-playground-369416.svc.id.goog[test/default]
#     role: roles/iam.workloadIdentityUser
# ...
# policy:
#   bindings:
#   - members:
#     - serviceAccount:gmp-test-sa@chrism-playground-369416.iam.gserviceaccount.com
#     role: roles/monitoring.viewer
# ...
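The checks above confirm the configuration on the Kubernetes and IAM side. One more check that may help is to ask the GKE metadata server, from inside a pod, which identity it actually hands out. This is a sketch under a few assumptions: the curl pod from Step 3 is still running in the namespace, it uses the default Kubernetes service account, and the function name check_pod_identity is hypothetical.

```shell
# Hypothetical helper: compare the identity seen from inside a pod with $SA_NAME.
check_pod_identity() {
  local actual
  # The metadata server only answers requests that carry this header.
  actual=$(kubectl exec pod/curl -n "$NAMESPACE" -- \
    curl -s -H 'Metadata-Flavor: Google' \
    'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email')
  if [ "$actual" = "$SA_NAME" ]; then
    echo "OK: pods in $NAMESPACE run as $SA_NAME"
  else
    echo "MISMATCH: expected $SA_NAME, metadata server returned: $actual"
    return 1
  fi
}

# Usage: check_pod_identity
```

If the annotation or the workloadIdentityUser binding is wrong, the returned value will not match the GCP service account email.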

Step 5C: Deploy Prometheus Frontend GUI

These instructions are mostly based on the docs, with one deliberate mistake that makes for a useful debugging exercise.

# v-- The sed is where the mistake happens
curl https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/v0.7.0/examples/frontend.yaml |
sed 's/\$PROJECT_ID/$PROJECT/' |
kubectl apply -n $NAMESPACE -f -

k scale deploy frontend -n=$NAMESPACE --replicas=1
#  ^-- it defaults to 2 replicas which is unnecessary

k get po -n=$NAMESPACE

kubectl -n $NAMESPACE port-forward svc/frontend 9090

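If the frontend.yaml is incorrect, the GUI at http://localhost:9090 will likely show errors instead of metrics.
Here's the problem: the sed expression is in single quotes, so the shell does not expand $PROJECT, and the --query.project-id flag ends up set to the literal string $PROJECT instead of your project ID.
Once the --query.project-id flag in the YAML is fixed to reference the project where the IAM is configured, the frontend starts to work.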
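The quoting mistake in the sed step, and one way to correct it, can be reproduced locally without a cluster. In the sketch below the sample line is an illustrative stand-in for the flag in frontend.yaml, not a verbatim copy of it.

```shell
PROJECT=chrism-playground-369416

# Illustrative stand-in for the flag line in frontend.yaml.
line='        - "--query.project-id=$PROJECT_ID"'

# Mistake: everything is in single quotes, so the shell never expands $PROJECT
# and sed inserts the literal text "$PROJECT".
broken=$(printf '%s\n' "$line" | sed 's/\$PROJECT_ID/$PROJECT/')

# Fix: keep the pattern in single quotes, but step out of them for "$PROJECT"
# so the shell substitutes the real project ID.
fixed=$(printf '%s\n' "$line" | sed 's/\$PROJECT_ID/'"$PROJECT"'/')

printf '%s\n%s\n' "$broken" "$fixed"
# prints:
#         - "--query.project-id=$PROJECT"
#         - "--query.project-id=chrism-playground-369416"
```

On a live cluster, k get deploy frontend -n=$NAMESPACE -o yaml | grep query.project-id shows which of the two values was actually applied.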

neoakris/prom-gui_and_GMP_on_GKE_autopilot.md
