- Create a new project, for example `slo-generator-demo`, and set it as the current project, for example with:

  ```shell
  gcloud projects create cloud-operations-sandbox-a5r3 --set-as-default
  ```
- Open Cloud Shell and save the project ID and project number to environment variables with:

  ```shell
  export PROJECT_ID=$(gcloud config get-value project)
  export PROJECT_NUMBER=$(gcloud projects list --filter="$(gcloud config get-value project)" --format="value(PROJECT_NUMBER)")
  ```
- If provisioning in Argolis, add the expected default network and override any organization policy that could prevent the creation of resources like the K8s cluster or the public Cloud SQL instance required by the Cloud Ops Sandbox, for example with:

  ```shell
  gcloud services enable compute.googleapis.com
  gcloud compute networks create default
  for boolean_policy_id in sql.restrictPublicIp compute.requireShieldedVm compute.requireOsLogin
  do
    gcloud resource-manager org-policies \
      disable-enforce constraints/${boolean_policy_id} \
      --project=${PROJECT_ID}
  done
  cat <<EOF | gcloud resource-manager org-policies set-policy --project=${PROJECT_ID} /dev/stdin
  constraint: constraints/compute.vmExternalIpAccess
  listPolicy:
    allValues: ALLOW
  EOF
  cat <<EOF | gcloud resource-manager org-policies set-policy --project=${PROJECT_ID} /dev/stdin
  constraint: constraints/compute.vmCanIpForward
  listPolicy:
    allValues: ALLOW
  EOF
  ```
- Provision the Cloud Ops Sandbox resources in this existing project with:

  ```shell
  pip3 install google-cloud-pubsub
  git clone https://github.com/GoogleCloudPlatform/cloud-ops-sandbox
  cd cloud-ops-sandbox/provisioning
  ./sandboxctl create -p ${PROJECT_ID}
  ```
- Wait a bit (20 minutes or so for a full provisioning!) for the success message:

  ```
  module.monitoring.google_monitoring_alert_policy.availability_slo_burn_alert[0]: Creation complete after 26s [id=projects/united-concord-398009/alertPolicies/8248738219467070757]
  module.monitoring.google_monitoring_alert_policy.availability_slo_burn_alert[8]: Creation complete after 26s [id=projects/united-concord-398009/alertPolicies/452382904543783588]
  module.monitoring.google_monitoring_alert_policy.availability_slo_burn_alert[1]: Creation complete after 25s [id=projects/united-concord-398009/alertPolicies/16173015502557114609]

  Apply complete! Resources: 83 added, 0 changed, 0 destroyed.

  Outputs:

  frontend_external_ip = "34.31.187.72"

  Explore Cloud Ops Sandbox features by browsing

  GKE Dashboard: https://console.cloud.google.com/kubernetes/workload?project=united-concord-398009
  Monitoring Workspace: https://console.cloud.google.com/monitoring/?project=united-concord-398009

  Try Online Boutique at http://34.31.187.72/
  ```
- Click every URL and confirm everything works as expected.
- Install the SLO Generator from PyPI, export the required environment variables, and run a simple example from the documentation:

  ```shell
  pip3 install slo-generator[cloud-monitoring]
  mkdir slo-generator-demo
  cd slo-generator-demo
  export GAE_PROJECT_ID=${PROJECT_ID}
  export CLOUD_OPS_PROJECT_ID=${PROJECT_ID}
  export COLORED_OUTPUT=1
  cat <<EOF > slo_gae_app_availability.yaml
  apiVersion: sre.google.com/v2
  kind: ServiceLevelObjective
  metadata:
    name: gae-app-availability
    labels:
      service_name: gae
      feature_name: app
      slo_name: availability
  spec:
    description: Availability of App Engine app
    backend: cloud_monitoring
    method: good_bad_ratio
    exporters:
    - cloud_monitoring
    service_level_indicator:
      filter_good: >
        project=${GAE_PROJECT_ID}
        metric.type="appengine.googleapis.com/http/server/response_count"
        resource.type="gae_app"
        ( metric.labels.response_code = 429 OR
          metric.labels.response_code = 200 OR
          metric.labels.response_code = 201 OR
          metric.labels.response_code = 202 OR
          metric.labels.response_code = 203 OR
          metric.labels.response_code = 204 OR
          metric.labels.response_code = 205 OR
          metric.labels.response_code = 206 OR
          metric.labels.response_code = 207 OR
          metric.labels.response_code = 208 OR
          metric.labels.response_code = 226 OR
          metric.labels.response_code = 304 )
      filter_valid: >
        project=${GAE_PROJECT_ID}
        metric.type="appengine.googleapis.com/http/server/response_count"
    goal: 0.95
  EOF
  cat <<EOF > shared_config.yaml
  backends:
    cloud_monitoring:
      project_id: ${CLOUD_OPS_PROJECT_ID}
  exporters:
    cloud_monitoring:
      project_id: ${CLOUD_OPS_PROJECT_ID}
  error_budget_policies:
    default:
      steps:
      - name: 1 hour
        burn_rate_threshold: 9
        alert: true
        message_alert: Page to defend the SLO
        message_ok: Last hour on track
        window: 3600
      - name: 12 hours
        burn_rate_threshold: 3
        alert: true
        message_alert: Page to defend the SLO
        message_ok: Last 12 hours on track
        window: 43200
      - name: 7 days
        burn_rate_threshold: 1.5
        alert: false
        message_alert: Dev team dedicates 25% of engineers to the reliability backlog
        message_ok: Last week on track
        window: 604800
      - name: 28 days
        burn_rate_threshold: 1
        alert: false
        message_alert: Freeze release, unless related to reliability or security
        message_ok: Unfreeze release, per the agreed roll-out policy
        window: 2419200
  EOF
  slo-generator compute -f slo_gae_app_availability.yaml -c shared_config.yaml
  ```
- Confirm all four SLOs are computed and displayed correctly, with an output like:

  ```
  INFO - gae-app-availability | 1 hour | SLI: 100.0 % | SLO: 95.0 % | Gap: +5.0 % | BR: 0.0 / 9.0 | Alert: 0 | Good: 1085 | Bad: 0
  INFO - gae-app-availability | 12 hours | SLI: 99.7078 % | SLO: 95.0 % | Gap: +4.71 % | BR: 0.1 / 3.0 | Alert: 0 | Good: 13647 | Bad: 40
  INFO - gae-app-availability | 7 days | SLI: 99.5062 % | SLO: 95.0 % | Gap: +4.51 % | BR: 0.1 / 1.5 | Alert: 0 | Good: 50382 | Bad: 250
  INFO - gae-app-availability | 28 days | SLI: 99.5062 % | SLO: 95.0 % | Gap: +4.51 % | BR: 0.1 / 1.0 | Alert: 0 | Good: 50382 | Bad: 250
  INFO - Run finished successfully in 3.0s.
  INFO - Run summary | SLO Configs: 1 | Duration: 3.0s
  ```
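The burn-rate numbers in this output can be cross-checked by hand: the burn rate is the observed error rate divided by the error budget (1 minus the SLO target). A minimal Python sketch, using the good/bad counts from the 12-hour window above:

```python
def burn_rate(good: int, bad: int, slo: float) -> float:
    """Burn rate = observed error rate / error budget (1 - SLO)."""
    error_rate = bad / (good + bad)
    error_budget = 1.0 - slo
    return error_rate / error_budget

# 12-hour window from the slo-generator output above: Good: 13647, Bad: 40
br = burn_rate(good=13647, bad=40, slo=0.95)
print(round(br, 1))  # → 0.1, matching the reported "BR: 0.1 / 3.0"
```

At a burn rate of 1.0, the service consumes exactly its error budget over the window; the 0.1 computed here is well below the 3.0 threshold, so no alert fires.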
- Go to Monitoring > Metrics explorer and show that the following MQL query yields the same results for the good events:

  ```
  fetch gae_app
  | metric 'appengine.googleapis.com/http/server/response_count'
  | { filter
        (metric.response_code == 200 || metric.response_code == 201 ||
         metric.response_code == 202 || metric.response_code == 203 ||
         metric.response_code == 204 || metric.response_code == 205 ||
         metric.response_code == 206 || metric.response_code == 207 ||
         metric.response_code == 208 || metric.response_code == 226 ||
         metric.response_code == 304 || metric.response_code == 429)
    ; ident }
  | ratio
  ```
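For an offline cross-check, the same good/valid ratio the MQL `ratio` operation produces can be computed from per-status-code counts. A minimal Python sketch (the counts in the usage example are made up for illustration):

```python
# HTTP statuses treated as "good" in the SLO filter above
GOOD_CODES = {200, 201, 202, 203, 204, 205, 206, 207, 208, 226, 304, 429}

def good_ratio(counts_by_code: dict) -> float:
    """Ratio of good responses to all responses, mirroring the MQL ratio."""
    total = sum(counts_by_code.values())
    good = sum(n for code, n in counts_by_code.items() if code in GOOD_CODES)
    return good / total

# Hypothetical per-status-code counts for illustration
print(good_ratio({200: 95, 500: 5}))  # → 0.95
```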
- Provision a Cloud Run service that randomly fails and returns a 500 code. The Cloud Code sample Python project (with Flask, not Django) can easily be modified to offer this “feature”, for example by returning an error 25% of the time based on a random number generator:

  ```python
  import os
  from random import randint

  [...]

  @app.route('/')
  def hello():
      """Return either a friendly HTTP greeting or a 5xx error."""
      return_an_error = (randint(1, 4) == 1)
      if return_an_error:
          abort(500)
      message = "It's running!"

  [...]
  ```
  Note that you might have to override another Org Policy in Argolis to allow unauthenticated users to connect to this Cloud Run service. See Argolis Troubleshooting Tips for more details. Or configure the service to require authentication and call it from the command line as an authenticated user:

  ```shell
  $ curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" https://cloud-run-randomly-fails-4zbr2zmcxq-od.a.run.app
  <!doctype html>
  <html lang=en>
  <title>500 Internal Server Error</title>
  <h1>Internal Server Error</h1>
  <p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
  ```
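The failure logic above is easy to sanity-check in isolation. A minimal sketch using only the standard library (the function name `should_fail` is illustrative, not from the Cloud Code sample):

```python
from random import randint

FAILURE_ODDS = 4  # randint(1, 4) == 1 hits 1 time in 4, i.e. ~25% of requests

def should_fail() -> bool:
    """Return True for roughly 25% of calls, as in the handler above."""
    return randint(1, FAILURE_ODDS) == 1

# Sanity check: over many trials the observed failure rate approaches 0.25
trials = 100_000
rate = sum(should_fail() for _ in range(trials)) / trials
print(f"observed failure rate: {rate:.3f}")
```

This failure rate is what the SLO computed later should roughly reflect once traffic flows through the service.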
- Retrieve the service URL with:

  ```shell
  export CLOUD_RUN_SERVICE_URL=$(gcloud run services list --filter="cloud-run-randomly-fails" --format="value(URL)")
  ```
- Generate traffic manually, for example with:

  ```shell
  TOKEN=$(gcloud auth print-access-token)
  for i in {1..10}; do
    curl -H "Authorization: Bearer ${TOKEN}" ${CLOUD_RUN_SERVICE_URL}
  done
  ```
- Query the API using MQL with:

  ```shell
  cat > query.json << EOF
  {
    "query": "fetch cloud_run_revision | metric 'run.googleapis.com/request_count' | { filter metric.response_code_class == '2xx' ; ident } | ratio | group_by [] | within 3600s | every 3600s"
  }
  EOF
  curl -d @query.json \
    -H "Authorization: Bearer ${TOKEN}" \
    --header "Content-Type: application/json" \
    -X POST \
    https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries:query
  ```
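The `timeSeries:query` response is JSON, so the ratio values are easy to extract programmatically. A sketch using only the standard library (the sample payload in the usage example is fabricated to illustrate the response shape, with the `timeSeriesData` / `pointData` / `values` nesting the API returns):

```python
import json

def extract_ratios(response_body: str) -> list:
    """Pull the doubleValue of each point from a timeSeries:query response."""
    response = json.loads(response_body)
    return [
        value["doubleValue"]
        for series in response.get("timeSeriesData", [])
        for point in series.get("pointData", [])
        for value in point.get("values", [])
    ]

# Fabricated response for illustration
sample = '{"timeSeriesData": [{"pointData": [{"values": [{"doubleValue": 0.75}]}]}]}'
print(extract_ratios(sample))  # → [0.75]
```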
- Generate traffic and historical data every minute for later reuse with a Cloud Scheduler job:

  ```shell
  gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member serviceAccount:${PROJECT_NUMBER}[email protected] \
    --role roles/run.invoker
  gcloud scheduler jobs create http cloud-run-randomly-fails-load-tester \
    --schedule "* * * * *" \
    --uri "${CLOUD_RUN_SERVICE_URL}" \
    --http-method GET \
    --oidc-service-account-email ${PROJECT_NUMBER}[email protected] \
    --location europe-west1
  ```
- Use a dedicated project for the Cloud Operations Sandbox, so it is easier to provision and tear down? Or let the Cloud Operations Sandbox CLI create one under the [email protected] identity to avoid billing account issues? Then keep `slo-generator-demo` pretty lean, with just a Cloud Run service that randomly fails and a load generator (with a Cloud Scheduler job). Here I had to destroy the sandbox with:

  ```shell
  laurent@cloudshell:~ (slo-generator-demo)$ cd cloud-ops-sandbox/terraform/
  laurent@cloudshell:~ (slo-generator-demo)$ project_id=$(gcloud config get-value project)
  laurent@cloudshell:~ (slo-generator-demo)$ bucket_name="${project_id}-bucket"
  laurent@cloudshell:~ (slo-generator-demo)$ terraform init -backend-config "bucket=${bucket_name}"
  laurent@cloudshell:~ (slo-generator-demo)$ terraform destroy -var="project_id=${project_id}" -var="bucket_name=${bucket_name}"
  ```
  And no file from the original repo was modified:

  ```shell
  laurent@cloudshell:~/cloud-ops-sandbox (slo-generator-demo)$ git status
  On branch main
  Your branch is up to date with 'origin/main'.

  nothing to commit, working tree clean
  ```