
@jimbaker
Created September 15, 2016 15:51
Lightning talk for Craton - need to add slides; also create elevator pitch

It's Friday. You have an embargoed fix you need to apply this weekend. Your cloud has 50, or maybe 500, or maybe 50,000 physical hosts or more. Can you deploy this patch across your cloud without causing customer outages? And without introducing new problems?

Fleet management is the answer, and the Craton project implements this approach.

Think of the hardware in your data center as a fleet, much like an admiral with his or her ships. This admiral knows ship positions, status, and more; he or she is tasked with deploying the fleet systematically.

For the cloud, and this includes OpenStack, fleet management is the combination of knowing your fleet inventory and systematically managing that hardware. To be systematic, this management needs to tolerate the errors seen when running at scale, and it needs to be data center aware.

First, you need to know what exactly you have in your fleet - and it's not just the raw assets. It's desired system configuration; it's how the fleet is being used; it's the specific hardware layout. This is your inventory.

Many organizations capture this inventory in text files. But text file inventories are hard to manage, even if you use GitHub or other version control. Craton instead starts with a relational database that captures physical, network, and logical topology, so it can track your cabinets, their power setup, the specific hosts in each cabinet, how each cabinet's top-of-rack switch connects into your networks, and how you use OpenStack to manage your regions and availability zones, or cells if you have implemented that feature. And because there are many sources of truth, maybe your asset database, certainly Nova, Cinder, or Neutron, Craton is building out an inventory fabric that combines information from multiple places, not just its own database.
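To make the topology idea concrete, here is a minimal sketch of what such a relational inventory could look like. The table and column names here are hypothetical illustrations, not Craton's actual schema:

```python
import sqlite3

# Hypothetical, heavily simplified inventory schema; Craton's real
# schema differs and covers much more.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cabinets (
    id INTEGER PRIMARY KEY,
    label TEXT,            -- e.g. 'C12'
    power_feed TEXT,       -- the cabinet's power setup
    tor_switch TEXT        -- top-of-rack switch uplink
);
CREATE TABLE hosts (
    id INTEGER PRIMARY KEY,
    cabinet_id INTEGER REFERENCES cabinets(id),
    hostname TEXT,
    region TEXT,           -- OpenStack region / availability zone
    cell TEXT              -- Nova cell, if that feature is in use
);
""")
conn.execute("INSERT INTO cabinets VALUES (1, 'C12', 'A+B feeds', 'tor-c12')")
conn.execute("INSERT INTO hosts VALUES (1, 1, 'compute-0001', 'us-east', 'cell-1')")

# A physical-topology question text files answer poorly:
# which switch carries a given host's management traffic?
row = conn.execute("""
    SELECT h.hostname, c.tor_switch
    FROM hosts h JOIN cabinets c ON h.cabinet_id = c.id
""").fetchone()
print(row)  # ('compute-0001', 'tor-c12')
```

The point of the relational approach is exactly this kind of join: physical, network, and logical facts about a host live in one queryable place.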

But inventory only tells you so much. It's not always consistent: a cabinet may not even be there any more, because it has been decommissioned and physically removed. Or we need more detail: what is the specific motherboard version, and how does it impact the physical memory available for virtual machines? So we need to audit our inventory. The sequence of tasks we perform on each host becomes an audit workflow. Craton supports running arbitrary tasks in its task graph, but for now we focus on playbooks using OpenStack Ansible. Such audits collect information into Craton's extensible variables model.
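The shape of an audit step can be sketched as follows. The function name, the fact it gathers, and the sample value are all hypothetical stand-ins for a real playbook run:

```python
# Hypothetical audit task: gather one hardware fact per host and record
# it in a per-host variables dict. Craton's variables model is richer,
# and the real task would run an Ansible playbook against the host.
inventory = {"compute-0001": {"region": "us-east"}}

def audit_motherboard(host):
    # stand-in for fact gathering against `host`; returns sample data
    return {"motherboard_version": "X10DRi rev 1.02"}

for host, variables in inventory.items():
    variables.update(audit_motherboard(host))

print(inventory["compute-0001"]["motherboard_version"])
```

Each audit pass enriches the same per-host variables that later workflows will read, so findings are not lost in one-off report files.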

Once you have analyzed the problem at hand, it's time to remediate the impacted hosts. The sequence of tasks we perform on each host becomes a remediation workflow. By using Ansible or similar tooling to implement these tasks, you can converge on the desired configuration. That configuration is also stored in Craton's database using its extensible variables model: you define settings at the level that makes the most sense. Want a security setting for an entire region, or for a grouping of compute hosts? Set it there once.

Let's look at the specific remediation workflow we started with: you need to migrate workloads from unpatched hosts to patched hosts, with no-to-minimal service disruption. And because you don't have a "blue cloud" and a "green cloud" (think how much that would cost!), you need to do this in place with the hardware you have. Your workflows must also be data center aware, using the inventory information we have collected: you cannot simply live migrate everything off all the hosts in a cabinet at once and saturate your switches with management traffic. Instead you must sweep them systematically.
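A data-center-aware sweep can be planned directly from the inventory. This illustrative sketch (hostnames and cabinet labels are made up) takes at most one host per cabinet per round, so no single top-of-rack switch carries all the migration traffic:

```python
from collections import defaultdict
from itertools import zip_longest

# (hostname, cabinet) pairs, as they might come out of inventory.
hosts = [("compute-1", "cab-a"), ("compute-2", "cab-a"),
         ("compute-3", "cab-b"), ("compute-4", "cab-b"),
         ("compute-5", "cab-c")]

by_cabinet = defaultdict(list)
for host, cabinet in hosts:
    by_cabinet[cabinet].append(host)

# Round 1 takes the first host from each cabinet, round 2 the second,
# and so on; zip_longest pads shorter cabinets with None, which we drop.
rounds = [[h for h in round_ if h is not None]
          for round_ in zip_longest(*by_cabinet.values())]
print(rounds)
# [['compute-1', 'compute-3', 'compute-5'], ['compute-2', 'compute-4']]
```

Each round's hosts can then be patched in parallel, since their migration traffic spreads across different switches.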

Workflows must also cope with scale, especially reachability and other failures seen when operating many hosts. By using a map-reduce architecture, backed by ZooKeeper and the TaskFlow job board, Craton separates the "what" of the task graph from the "how" of actually running it.
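The shape of that idea can be sketched in plain Python: a map step fanned out across hosts with per-host retries for transient failures, then a reduce step summarizing the outcome. This is illustrative only; Craton builds the real thing on TaskFlow and ZooKeeper rather than a thread pool:

```python
import concurrent.futures
import random

def patch_host(host, attempts=3):
    """Map step: patch one host, retrying transient failures."""
    for attempt in range(1, attempts + 1):
        try:
            if random.random() < 0.3:  # simulate a reachability failure
                raise ConnectionError(host)
            return (host, "patched")
        except ConnectionError:
            if attempt == attempts:
                return (host, "failed")  # give up after the last attempt

hosts = [f"compute-{i:04d}" for i in range(8)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(patch_host, hosts))  # the "map"

summary = {}                                     # the "reduce"
for _, status in results:
    summary[status] = summary.get(status, 0) + 1
print(summary)
```

The task graph (patch every host, then summarize) is declared independently of how it runs; swapping the thread pool for a distributed job board changes the "how" without touching the "what".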

This is at the core of how we operate the Rackspace public cloud today, the largest-scale OpenStack cloud out there. And it has informed how we have designed and built the next generation of this software, Craton, which we are developing to become an OpenStack "big tent" project.
