Nice meeting you! I'm Jim, a Product Manager for the Cloud Operations team that operates Pivotal Web Services. This email is a work in progress, but here are some things we do as a Cloud Operations team that might be useful patterns for you as a customer or as a new CloudOps member.
Pivotal Web Services is the reference deployment of Cloud Foundry. The Cloud Operations team is based out of San Francisco, colocated with a number of Pivotal's Cloud Foundry Research and Development teams.
As a reference deployment, Pivotal Web Services acts as a first customer to OSS Cloud Foundry and services. Our mission is to be first to deploy the latest version of Cloud Foundry CF-Release. Once a new release is cut, we typically deploy it in under 7 days. We think that's pretty good, and we're shooting for even better.
As a production deployment, Pivotal Web Services needs to be online, all the time. Our mission is to keep production alive and healthy, as well as up to date.
I've put up the beginnings of a centralized place where I'll be collecting examples of the various tools we use to monitor. Watch this space:
https://github.com/pivotal-cloudops/monitoring-pcf
We monitor the platform using datadog.
Reasons we like datadog:
- SaaS, so no upkeep.
- They just released a BOSH-release for their agent.
Reasons we don't like datadog:
- Constructing dashboards is a little weird.
- Full window of data for evaluation
How we use datadog:
- The code we use to manage it is specifically designed for use cases where multiple environments share dashboards. As a result, the code is clunky.
- This wrapper repo (private) is how we consume the OSS repo. It contains the config file for our various environments, which specifies what tags to apply. It then symlinks the structure of the underlying submodule.
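To make the "shared dashboards, per-environment tags" idea concrete, here is a minimal sketch. It is not the code in our repos: the environment names, tag values, and the `render_queries` helper are all hypothetical, and it only builds query strings rather than calling any datadog API.

```python
from string import Template

# Hypothetical per-environment config: each environment maps to the tags
# that scope a shared dashboard's queries. Names and values are made up.
ENVIRONMENTS = {
    "production": {"deployment": "cf-prod", "region": "us-east-1"},
    "staging": {"deployment": "cf-staging", "region": "us-east-1"},
}

# One shared query template; $scope is replaced with an environment's tags.
QUERY_TEMPLATE = Template("avg:system.cpu.user{$scope}")


def tag_scope(tags):
    """Join tags into a comma-separated key:value scope, sorted for stable output."""
    return ",".join(f"{key}:{value}" for key, value in sorted(tags.items()))


def render_queries(environments, template=QUERY_TEMPLATE):
    """Return one rendered query per environment from the shared template."""
    return {
        name: template.substitute(scope=tag_scope(tags))
        for name, tags in environments.items()
    }


if __name__ == "__main__":
    for name, query in render_queries(ENVIRONMENTS).items():
        print(f"{name}: {query}")
```

The point of the sketch is that the dashboard definition lives in one place and only the tag scope changes per environment, which is roughly why a config file of tags is enough to drive several environments.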
We use smoke tests kind of like integration tests. When the smoke test passes, we know that key functionality in the system is working.
Reasons we like smoke tests:
- It's definitive when it passes. When it passes, the system is healthy.
Reasons we don't like smoke tests:
- If an app fails to delete within 30 seconds, the smoke test fails. A CF delete is an asynchronous activity, though, so this isn't a fair measure.
- Likewise, we consider failing smoke tests to be an indicator of problems, but not necessarily a critical one.*
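The delete-timeout complaint can be illustrated with a small simulation. This is a sketch under stated assumptions, not how cf-smoke-tests is implemented: `wait_until_gone` and `FakeClock` are hypothetical helpers, and the 45-second delete is an invented number chosen to show a delete that succeeds but misses a 30-second deadline.

```python
import time


def wait_until_gone(still_exists, timeout=30.0, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until still_exists() returns False or `timeout` seconds elapse.

    Returns (gone, elapsed). `gone` is False only when the deadline passed
    while the resource was still reported as existing.
    """
    start = clock()
    while True:
        if not still_exists():
            return True, clock() - start
        if clock() - start >= timeout:
            return False, clock() - start
        sleep(interval)


class FakeClock:
    """Simulated clock so the example runs instantly (illustration only)."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


if __name__ == "__main__":
    for timeout in (30, 60):
        fake = FakeClock()
        # Pretend the asynchronous delete finishes 45 simulated seconds in.
        gone, elapsed = wait_until_gone(
            lambda: fake.now < 45, timeout=timeout, interval=5,
            clock=fake.time, sleep=fake.sleep,
        )
        print(f"timeout={timeout}s gone={gone} elapsed={elapsed}s")
```

In the simulation the same delete is reported as a failure with a 30-second deadline and as a success with a 60-second one, even though the platform behaved identically both times. That is the sense in which a fixed timeout on an asynchronous operation is not a fair health measure.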
How we use smoke tests:
- We run cf-smoke-tests on a regular basis in Concourse.
- Here is an (unsupported) Dockerfile version to run using Docker.
- Here is an (unsupported) CF version to deploy as an app.
- Here is a repo with reference pipelines. These are taken directly from production pipelines.
We've identified Deploy Cadence as our "Bottleneck Constraint".
How often we hit bosh deploy determines how quickly we deploy ∆.
We typically deploy to Pivotal Web Services 1-3 times a day. Examples of daily deployments include:
- Update Cloud Foundry CF-Release
- Update Cloud Foundry BOSH-release
- Deploy a new feature like Cf-Networking for the first time in public
We maintain this deploy cadence by relying on tools that allow us to build trust in our system.