I’ve been working my way up the Kubernetes learning curve.
These notes are about `yaml`.
- Using Kubernetes (K8s) involves many `yaml` "manifests"
- `yaml` is a thoroughly modern way to do "infrastructure as code" (IaC). But with K8s there's a lot of it
- So, I find myself confronted with "yaml inflation"
- `yaml` is data
- `yaml` is "declarative" - there's nothing imperative about data
- declarative is good, right? Everyone loves declarative.
- While it is true that `yaml` handles the simple case well (tutorials and all that), we also want to judge it by how well it tames the complexity of more difficult problems - and, by this measure, the pure `yaml` approach sucks
- These notes explore "why?" and what to do about it
- Yes, it sucks to be a developer confronted with too much yaml but, just to be clear, there are good reasons why "just data" is a good decision at a certain level of these technical stacks
- Infrastructure systems like K8s and terraform work by comparing the "desired state" of a system with the "currently observed state" of that system, and their job is to then alter the "observed state" to match the "desired state"
- And the "desired state" has to be represented as data and be put in a datastore of some kind, etcd or an S3 bucket, etc.
- So, at a certain level of that technical stack, "declarative, simple data" is nice
- But what works well at one level does not work at another level, and lots of `yaml` is quickly bad for us at the developer level.
`yaml` is data at rest. Lovely, pure data.
If this data conforms to a given structure, it specifies a Kubernetes cluster. This means there's a DSL and an interpreter: the pure data format is a DSL which allows you to encode "instructions" to be "interpreted" by the Kubernetes cluster. So, as you write this conforming `yaml`, you are writing "a program" for the Kubernetes control plane to execute.
In that sense, manifests are like assembler instructions - and writing assembler directly is tedious, which is why most people choose to write in a higher-level language.
That's one perspective. But there's another ...
Many related `yaml` manifest files sitting in a directory collectively form a denormalised database with no referential integrity. This is where things start to suck.
Referential Integrity:
- sibling `yaml` files are likely to share common symbols - for example, the label for a Pod in a `deployment.yaml` might also be used for "selection" in an associated `service.yaml`
- but the integrity of these references is not enforced, not at the level of `yaml`, not the way a relational database would enforce it
- one of the files could be edited and the symbol changed, but not the other, so now one contains the wrong "reference". It's busted but we don't know.
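Nothing at the `yaml` level will catch this, but a little computation can. A minimal sketch - the dicts below are hypothetical stand-ins for a parsed `deployment.yaml` and `service.yaml` - checking that a Service's selector actually matches the Deployment's Pod labels:

```python
def selector_matches(service: dict, deployment: dict) -> bool:
    """True if every key/value in the Service's selector also appears
    in the Deployment's Pod template labels (equality-based selection)."""
    selector = service["spec"]["selector"]
    labels = deployment["spec"]["template"]["metadata"]["labels"]
    return all(labels.get(key) == value for key, value in selector.items())

# stand-ins for parsed deployment.yaml and service.yaml
deployment = {"spec": {"template": {"metadata": {"labels": {"app": "web"}}}}}
service = {"spec": {"selector": {"app": "web"}}}

print(selector_matches(service, deployment))  # True
```

Edit the label in one file but not the other and the check fails - exactly the breakage a directory of disconnected `yaml` files lets through silently.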
Denormalised:
- database normalisation is the equivalent of programming's DRY
- imagine two pieces of `yaml` which are almost the same, except for subtle variations. In staging, one of the Pods uses a smaller machine, but in production a larger one. Other than that, the two are the same.
- again, imagine two directories containing `yaml` which is similar - one for "prod" and one for "staging"
- if this were a relational database, we'd have ways to normalise/factor out the commonality and make the differences clear
Imagine two sibling directories of `yaml`, one for "production" and one for "staging", where the two are largely the same, but with small differences. Your database just doubled in size, and got even more denormalised.
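In code, the relational fix is easy to sketch: keep one normalised "base" and express each environment as a small delta. (All the names below are made up for illustration.)

```python
# one shared base definition...
BASE = {"image": "web:1.0", "replicas": 2, "machine": "small"}

# ...and each environment is only its differences from the base
OVERRIDES = {
    "staging": {},
    "production": {"replicas": 6, "machine": "large"},
}

def spec_for(env: str) -> dict:
    # later values win, so the override shadows the base
    return {**BASE, **OVERRIDES[env]}

print(spec_for("production"))  # {'image': 'web:1.0', 'replicas': 6, 'machine': 'large'}
```

The commonality lives in one place, and the per-environment diffs are small and obvious - which is the whole point of normalisation.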
The problem is really one of data integrity and terseness.
See also: yaml probably not so great
`yaml` won't be changing. It represents structured data well enough.
What we need is a method of generating the data.
But we need a DSL for generating the "instructions layer", and that DSL needs to involve "computation". We're going to need some computation to go with our data.
If we can't have, say, referential integrity in our `yaml` "database", perhaps instead we could have some "higher level" layer of computation which generates the `yaml`, and have that computation be clever enough to generate correct references in the `yaml`?
Perhaps your brain and fingers supply this computation via an editor? Is that reliable computation? Sadly, probably not.
Maybe the computation involves `python` driving a `Jinja2` template and spewing forth a directory-full of nicely generated `yaml`? That might work. Armed with `python` we can achieve anything - it is brutishly Turing Complete. But is it too big a weapon? Might I shoot my foot off?
So, just to be clear: computation must be added. What we don't know is "how" and "how much"?
Any sort of computation is going to involve "a language" - a DSL. And then you'll need an interpreter/compiler to process instructions in your new language to produce the final `yaml`.
So, for example, we could invent `yaml++`, which is almost identical to `yaml` but has the ability to have, say, `vars` and `interpolation`. That's not much computation. But our new `yaml++` will be kinda isomorphic to `yaml` and so pretty easy to learn. If you understand `yaml` you'd already understand 90% of our new `yaml++`.
You'd run your `compiler` across `yaml++` files to produce the `yaml` files you want (and then, later, feed it to `kubectl`).
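A `vars`-and-interpolation `yaml++` really is a tiny amount of computation. A toy sketch of such a compiler - the `${name}` syntax is invented here for illustration:

```python
import re

def compile_yamlpp(source: str, variables: dict) -> str:
    """Expand ${name} references in a (hypothetical) yaml++ source
    string, producing plain yaml text."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined var: {name}")
        return str(variables[name])
    return re.sub(r"\$\{(\w+)\}", lookup, source)

source = "metadata:\n  name: ${app}\n  labels:\n    app: ${app}\n"
print(compile_yamlpp(source, {"app": "web"}))
```

Note the side benefit: because one variable feeds both the name and the label, the generated references can't drift apart.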
But, sooner or later, you'll end up unhappy about the lack of power in `yaml++` and decide to add an `if` construct so you can conditionally include/exclude `yaml` fragments, etc.
Still not enough computation? Maybe add loops.
And so we go. More powerful computation. More features in our language, more power. More complexity. More to learn. Less isomorphic to `yaml` itself.
It takes a lot of insight and quite a few iterations to get a DSL right. What elegant mix of operators delivers enough expressivity to get the job done, but no more? After all, none of us want to be learning another `regex` or `SQL` just to get some yaml produced.
In fact, stop. Why are we even doing this? Why even invent another language? Just use whatever language you normally use to produce the yaml, right? Sure, we can do that, but a fully Turing-complete language is the caveman, brute-force approach: it works every time, but it brings its own complexity.
I love the smell of tradeoffs in the morning.
Let's look at a few specific solutions.
- the `docker-compose` CLI allows you to compose multiple `yaml` files
- you supply a list of yaml files on the command line (via `-f`) in a certain order
- it does something of an "ordered, deep merge" (a union) of the `yaml` files given, but it applies the rule that values in "later" yaml files override those provided earlier on the command line
- so the CLI tooling itself supplies the "computation"
- and that computation is just "merging" (there are no `ifs` or `loops`)
- in theory this approach allows you to compose "a base" (common) `yaml` with a specific `yaml` for, say, development, or CI
- so that's pretty limited, but it is something
- except, the implementation sucks because certain things, like `ports`, can't be "overridden", only ever accreted. Uggh.
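The "ordered, deep merge" rule itself is tiny to express. A sketch of the idea (not `docker-compose`'s actual implementation):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into base; nested dicts merge recursively,
    everything else in override simply wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"services": {"web": {"image": "web:1.0", "ports": ["80:80"]}}}
dev = {"services": {"web": {"image": "web:dev"}}}
print(deep_merge(base, dev))
```

Chaining this left-to-right over the `-f` file list gives you the "later files win" behaviour.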
So now we get to real-world `yaml++` solutions.
In this category, a language (DSL) is provided together with "a compiler" which can process the language into yaml.
The DSL is "just powerful enough" to get the job done, but no more powerful than necessary. The compiler is a pure function, and effects other than outputting `yaml` are eliminated, which leads to safety and simplicity.
Run the processor once with certain args to create all your development yaml. Then run it again with slightly different args to produce your production yaml. Then maybe check that generated yaml into your repo. Or make generating it part of your CI/CD process.
Example DSLs:
- Dhall: https://dhall-lang.org
- Jsonnet: https://jsonnet.org/ (JSON oriented, rather than yaml)
There are a number of options here. I chose two. Google is your friend.
`Helm` is currently the de facto DSL for templating yaml in K8s. You express source yaml as "parameterised" templates.
It is similar in intent to `Dhall` or `Jsonnet`, but its DSL is less powerful. It is much more `yaml`-isomorphic, so it's easier in that respect. You'll be using `yaml` to template `yaml` (mostly).
That means you kinda create `yaml` files with "holes" where values should be, and then you supply those values from other yaml files.
Except, you can do a bit more than that - you can do `if`/`then`/`else` inclusion of fragments, etc.
Helm is much more than yaml templating. It is "a package manager" too. Plus there's an agent part called Tiller which ... uggh. A messy mishmash of concerns? Personally, I don't like it.
But it is the de facto standard. So what do I know? Some people obviously like it, and it has enjoyed momentum.
They are doing a rewrite of Helm ATM, and they plan to introduce Lua for the computational/templating part. That will make it more powerful. But will it still be clumsy? The jury is still out for me.
Kustomize. The new 600-pound gorilla on the K8s block? The Helm replacement?
A new templating feature was recently built into `kubectl` (v1.14 and later). I think you can use it like this:
`kubectl kustomize <dir> | kubectl apply -f -`
I'm confused about the status of this project: https://gravitational.com/blog/kubernetes-kustomize-kep-kerfuffle/
I can see plenty of activity here: https://github.com/kubernetes-sigs/kustomize
How to do it: https://learnk8s.io/templating-yaml-with-code
The Turing-complete approach is more "code as configuration".
This is the `webpack` approach. Or Django's `settings.py`. Very powerful. Roll your own. Do anything. You'll be in danger of doing too much. But you have the advantage of using your existing, familiar toolset (language), which means no extra language/tooling overhead.
For example, we could use `Python` with `Jinja2` templating to produce our `yaml`.
Again, you would run a process (python) and it would produce a complete set of yaml files for production or development or CI. And that process might also produce an nginx conf file. You can do anything.
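A sketch of that pipeline, using the stdlib's `string.Template` as a stand-in for `Jinja2` so it's self-contained (Jinja2's `{{ var }}` syntax differs, but the shape of the process is the same; all names below are made up):

```python
from string import Template

# a deployment "template" with holes for the per-environment values
DEPLOYMENT = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $app
spec:
  replicas: $replicas
""")

# per-environment parameters: only the differences live here
PARAMS = {
    "staging": {"app": "web", "replicas": 1},
    "production": {"app": "web", "replicas": 3},
}

def render(env: str) -> str:
    return DEPLOYMENT.substitute(PARAMS[env])

print(render("production"))
```

Point this at a directory of templates and a couple of params files and you have the whole "generate all the yaml for an environment in one run" workflow.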
Don't even yaml.
After all, `yaml` is only created in order to feed `kubectl`, like this:
`kubectl apply -f some_generated.yaml`
and `kubectl` just turns the yaml into JSON anyway, and then it talks with the Kubernetes Master API Server with a POST. So maybe just cut out `kubectl` completely and instead use a more direct tool to talk with the API Server, via JSON, and forget all the yaml-generating palaver.
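To make that concrete, here's a sketch of the request `kubectl` effectively builds for a core/v1 object. (This is a simplification: real clients also handle auth, TLS, and non-core API groups, and the server URL here is made up.)

```python
import json

def build_create_request(manifest: dict, server: str, namespace: str = "default"):
    """Return the URL and JSON body for a POST that creates a core/v1
    object, e.g. POST {server}/api/v1/namespaces/default/pods."""
    resource = manifest["kind"].lower() + "s"  # naive pluralisation: Pod -> pods
    url = f"{server}/api/v1/namespaces/{namespace}/{resource}"
    return url, json.dumps(manifest)

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {"containers": [{"name": "web", "image": "nginx"}]},
}
url, body = build_create_request(pod, "https://k8s.example:6443")
print(url)  # https://k8s.example:6443/api/v1/namespaces/default/pods
```

Feed `url` and `body` to any HTTP client and you've skipped the yaml layer entirely.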
In which case, perhaps use python with `pulumi`:
https://pulumi.io/quickstart/kubernetes/index.html
This is a Turing Complete path.
This path competes with terraform.
Intro Video: https://www.youtube.com/watch?v=QfJTJs24-JM&feature=emb_logo
Other:
- https://news.ycombinator.com/item?id=19108787
- https://blog.argoproj.io/the-state-of-kubernetes-configuration-management-d8b06c1205
- https://gravitational.com/blog/kubernetes-kustomize-kep-kerfuffle/
- https://habr.com/en/post/437682/
- https://kubernetes.io/blog/2018/05/29/introducing-kustomize-template-free-configuration-customization-for-kubernetes/
Note: this problem is bigger than Kubernetes. It's a general problem for "Infrastructure as Code" everywhere. All sorts of tradeoffs.