I’ve been working my way up the Kubernetes learning curve.
These notes are about `yaml`.
- Using Kubernetes (K8s) involves many `yaml` "manifests"
- `yaml` is a thoroughly modern way to do "infrastructure as code" (IaC). But with K8s there's a lot of it
- So, I find myself confronted with "yaml inflation"
- `yaml` is data
- `yaml` is "declarative" - there's nothing imperative about data
- declarative is good, right? Everyone loves declarative.
- While it is true that `yaml` handles the simple case well (tutorials and all that), we also want to judge it by how well it tames the complexity of more difficult problems - and, by this measure, the pure `yaml` approach sucks
- These notes explore "why?" and what to do about it
- Yes, it sucks to be a developer confronted with too much yaml but, just to be clear, there are good reasons why "just data" is a good decision at a certain level of these technical stacks
- Infrastructure systems like K8s and terraform work by comparing the "desired state" of a system with the "currently observed state" of that system, and their job is to then alter the "observed state" to match the "desired state"
- And the "desired state" has to be represented as data and be put in a datastore of some kind, etcd or an S3 bucket, etc.
- So, at a certain level of that technical stack, "declarative, simple data" is nice
- But what works well at one level does not work at another level, and lots of `yaml` is quickly bad for us at the developer level.
`yaml` is data at rest. Lovely, pure data.
If this data conforms to a given structure, it specifies a Kubernetes cluster. This means there's a DSL and an interpreter: the pure data format is a DSL which allows you to encode "instructions" to be "interpreted" by the Kubernetes cluster. So, as you write this conforming `yaml`, you are writing "a program" for the Kubernetes control plane to execute.
In that sense, manifests are like assembler instructions - and writing assembler directly is tedious, which is why most people choose to write in a higher-level language.
That's one perspective. But there's another ...
Many related `yaml` manifest files sitting in a directory collectively form a denormalised database with no referential integrity. This is where things start to suck.
Referential Integrity:
- sibling `yaml` files are likely to share common symbols - for example, the label for a Pod in a `deployment.yaml` might also be used for "selection" in an associated `service.yaml`
- but the integrity of these references is not enforced, not at the level of `yaml`, not the way a relational database would enforce it
- one of the files could be edited and the symbol changed, but not the other, so now one contains the wrong "reference". It's busted but we don't know.
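Nothing at the `yaml` level will catch this, but a little computation can. A minimal sketch - the dicts below are hypothetical stand-ins for a parsed `deployment.yaml` and `service.yaml` - checking that a Service's selector actually matches the Deployment's Pod labels:

```python
def selector_matches(service: dict, deployment: dict) -> bool:
    """True if every key/value in the Service's selector also appears
    in the Deployment's Pod template labels (equality-based selection)."""
    selector = service["spec"]["selector"]
    labels = deployment["spec"]["template"]["metadata"]["labels"]
    return all(labels.get(key) == value for key, value in selector.items())

# stand-ins for parsed deployment.yaml and service.yaml
deployment = {"spec": {"template": {"metadata": {"labels": {"app": "web"}}}}}
service = {"spec": {"selector": {"app": "web"}}}

print(selector_matches(service, deployment))  # True
```

Edit the label in one file but not the other and the check fails - exactly the breakage a directory of disconnected `yaml` files lets through silently.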
Denormalised:
- database normalisation is the equivalent of programming's DRY
- imagine two pieces of `yaml` which are almost the same, except for subtle variations. In staging, one of the Pods uses a smaller machine, but in production a larger one. Other than that, the two are the same.
- again, imagine two directories containing `yaml` which is similar - one for "prod" and one for "staging"
- if this were a relational database, we'd have ways to normalise/factor out the commonality and make the differences clear
Imagine two sibling directories of `yaml`, one for "production" and one for "staging", where the two are largely the same, but with small differences. Your database just doubled in size, and got even more denormalised.
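In code, the relational fix is easy to sketch: keep one normalised "base" and express each environment as a small delta. (All the names below are made up for illustration.)

```python
# one shared base definition...
BASE = {"image": "web:1.0", "replicas": 2, "machine": "small"}

# ...and each environment is only its differences from the base
OVERRIDES = {
    "staging": {},
    "production": {"replicas": 6, "machine": "large"},
}

def spec_for(env: str) -> dict:
    # later values win, so the override shadows the base
    return {**BASE, **OVERRIDES[env]}

print(spec_for("production"))  # {'image': 'web:1.0', 'replicas': 6, 'machine': 'large'}
```

The commonality lives in one place, and the per-environment diffs are small and obvious - which is the whole point of normalisation.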
The problem is really one of data integrity and terseness.
See also: yaml probably not so great
`yaml` won't be changing. It represents structured data well enough.
What we need is a method of generating the data.
But we need a DSL for generating the "instructions layer", and that DSL needs to involve "computation". We're going to need some computation to go with our data.
If we can't have, say, referential integrity in our `yaml` "database", perhaps instead we could have some "higher level" layer of computation which generates the `yaml`, and have that computation be clever enough to generate correct references in the `yaml`?
Perhaps your brain and fingers supply this computation via an editor? Is that reliable computation? Sadly, probably not.
Maybe the computation involves `python` driving a `Jinja2` template and spewing forth a directory-full of nicely generated `yaml`? That might work. Armed with `python` we can achieve anything - it is brutishly Turing Complete. But is it too big a weapon? Might I shoot my foot off?
So, just to be clear: computation must be added. What we don't know is "how" and "how much"?
Any sort of computation is going to involve "a language" - a DSL. And then you'll need an interpreter/compiler to process instructions in your new language to produce the final `yaml`.
So, for example, we could invent `yaml++`, which is almost identical to `yaml` but has the ability to have, say, `vars` and `interpolation`. That's not much computation. But our new `yaml++` will be kinda isomorphic to `yaml` and so pretty easy to learn. If you understand `yaml` you'd already understand 90% of our new `yaml++`.
You'd run your `compiler` across `yaml++` files to produce the `yaml` files you want (and then, later, feed it to `kubectl`).
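A `vars`-and-interpolation `yaml++` really is a tiny amount of computation. A toy sketch of such a compiler - the `${name}` syntax is invented here for illustration:

```python
import re

def compile_yamlpp(source: str, variables: dict) -> str:
    """Expand ${name} references in a (hypothetical) yaml++ source
    string, producing plain yaml text."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined var: {name}")
        return str(variables[name])
    return re.sub(r"\$\{(\w+)\}", lookup, source)

source = "metadata:\n  name: ${app}\n  labels:\n    app: ${app}\n"
print(compile_yamlpp(source, {"app": "web"}))
```

Note the side benefit: because one variable feeds both the name and the label, the generated references can't drift apart.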
But, sooner or later, you'll end up unhappy about the lack of power in `yaml++` and decide to add an `if` construct so you can conditionally include/exclude `yaml` fragments, etc.
Still not enough computation? Maybe add loops.
And so we go. More powerful computation. More features in our language, more power. More complexity. More to learn. Less isomorphic to `yaml` itself.
It takes a lot of insight and quite a few iterations to get a DSL right. What elegant mix of operators delivers enough expressivity to get the job done, but no more? After all, none of us want to be learning another `regex` or `SQL` just to get some yaml produced.
In fact, stop. Why are we even doing this? Why even invent another language? Just use whatever language you normally use to produce the yaml, right? Sure, we can do that, but a fully Turing-complete language is the caveman, brute-force approach: it works every time, but it brings its own complexity.
I love the smell of tradeoffs in the morning.
Let's look at a few specific solutions.
- the `docker-compose` CLI allows you to compose multiple `yaml` files
- you supply a list of yaml files on the command line (via `-f`) in a certain order
- it does something of an "ordered, deep merge" (a union) of the `yaml` files given, but it applies the rule that values in "later" yaml files override those provided earlier on the command line
- so the CLI tooling itself supplies the "computation"
- and that computation is just "merging" (there are no `ifs` or `loops`)
- in theory this approach allows you to compose "a base" (common) `yaml` with a specific `yaml` for, say, development, or CI
- so that's pretty limited, but it is something
- except, the implementation sucks because certain things, like `ports`, can't be "overridden", only ever accreted. Uggh.
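The "ordered, deep merge" rule itself is tiny to express. A sketch of the idea (not `docker-compose`'s actual implementation):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge override into base; nested dicts merge recursively,
    everything else in override simply wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"services": {"web": {"image": "web:1.0", "ports": ["80:80"]}}}
dev = {"services": {"web": {"image": "web:dev"}}}
print(deep_merge(base, dev))
```

Chaining this left-to-right over the `-f` file list gives you the "later files win" behaviour.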
So now we get to real-world `yaml++` solutions.
In this category, a language (DSL) is provided together with "a compiler" which can process the language into yaml.
The DSL is "just powerful enough" to get the job done, but no more powerful than necessary. The compiler is a pure function, and effects other than outputting `yaml` are eliminated, which leads to safety and simplicity.
Run the processor once with certain args to create all your development yaml. Then run it again with slightly different args to produce your production yaml. Then maybe check that generated yaml into your repo. Or make generating it part of your CI/CD process.
Example DSLs:
- Dhall: https://dhall-lang.org
- Jsonnet: https://jsonnet.org/ (JSON oriented, rather than yaml)
There are a number of options here. I chose two. Google is your friend.
`Helm` is currently the de facto DSL for templating yaml in K8s. You express source yaml as "parameterised" templates.
It is similar in intent to `Dhall` or `Jsonnet`, but its DSL is less powerful. It is much more `yaml`-isomorphic, so it's easier in that respect. You'll be using `yaml` to template `yaml` (mostly).
That means you kinda create `yaml` files with "holes" where values should be, and then you supply those values from other yaml files.
Except, you can do a bit more than that - you can do `if`/`then`/`else` inclusion of fragments, etc.
Helm is much more than yaml templating. It is "a package manager" too. Plus there's an agent part called Tiller which ... uggh. A messy mishmash of concerns? Personally, I don't like it.
But it is the de facto standard. So what do I know? Some people obviously like it, and it has enjoyed momentum.
They are doing a rewrite of Helm ATM, and they plan to introduce Lua for the computational/templating part. That will make it more powerful. But will it still be clumsy? The jury is still out for me.
Kustomize. The new 600-pound gorilla on the K8s block? The Helm replacement?
A new templating feature was recently built into `kubectl` (v1.14 and later). I think you can use it like this:
`kubectl kustomize <dir> | kubectl apply -f -`
I'm confused about the status of this project: https://gravitational.com/blog/kubernetes-kustomize-kep-kerfuffle/
I can see plenty of activity here: https://github.com/kubernetes-sigs/kustomize
How to do it: https://learnk8s.io/templating-yaml-with-code
The Turing-complete approach is more "code as configuration".
This is the `webpack` approach. Or Django's `settings.py`. Very powerful. Roll your own. Do anything. You'll be in danger of doing too much. But you have the advantage of using your existing, familiar toolset (language), which means no extra language/tooling overhead.
For example, we could use `Python` with `Jinja2` templating to produce our `yaml`.
Again, you would run a process (python) and it would produce a complete set of yaml files for production or development or CI. And that process might also produce an nginx conf file. You can do anything.
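A sketch of that pipeline, using the stdlib's `string.Template` as a stand-in for `Jinja2` so it's self-contained (Jinja2's `{{ var }}` syntax differs, but the shape of the process is the same; all names below are made up):

```python
from string import Template

# a deployment "template" with holes for the per-environment values
DEPLOYMENT = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $app
spec:
  replicas: $replicas
""")

# per-environment parameters: only the differences live here
PARAMS = {
    "staging": {"app": "web", "replicas": 1},
    "production": {"app": "web", "replicas": 3},
}

def render(env: str) -> str:
    return DEPLOYMENT.substitute(PARAMS[env])

print(render("production"))
```

Point this at a directory of templates and a couple of params files and you have the whole "generate all the yaml for an environment in one run" workflow.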
Don't even yaml.
After all, `yaml` is only created in order to feed `kubectl`, like this:
`kubectl apply -f some_generated.yaml`
and `kubectl` just turns the yaml into JSON anyway, and then it talks with the Kubernetes Master API Server with a POST. So maybe just cut out `kubectl` completely and instead use a more direct tool to talk with the API Server, via JSON, and forget all the yaml-generating palaver.
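To make that concrete, here's a sketch of the request `kubectl` effectively builds for a core/v1 object. (This is a simplification: real clients also handle auth, TLS, and non-core API groups, and the server URL here is made up.)

```python
import json

def build_create_request(manifest: dict, server: str, namespace: str = "default"):
    """Return the URL and JSON body for a POST that creates a core/v1
    object, e.g. POST {server}/api/v1/namespaces/default/pods."""
    resource = manifest["kind"].lower() + "s"  # naive pluralisation: Pod -> pods
    url = f"{server}/api/v1/namespaces/{namespace}/{resource}"
    return url, json.dumps(manifest)

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {"containers": [{"name": "web", "image": "nginx"}]},
}
url, body = build_create_request(pod, "https://k8s.example:6443")
print(url)  # https://k8s.example:6443/api/v1/namespaces/default/pods
```

Feed `url` and `body` to any HTTP client and you've skipped the yaml layer entirely.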
In which case, perhaps use python with `pulumi`:
https://pulumi.io/quickstart/kubernetes/index.html
This is a Turing Complete path.
This path competes with terraform.
Intro Video: https://www.youtube.com/watch?v=QfJTJs24-JM&feature=emb_logo
Other:
- https://news.ycombinator.com/item?id=19108787
- https://blog.argoproj.io/the-state-of-kubernetes-configuration-management-d8b06c1205
- https://gravitational.com/blog/kubernetes-kustomize-kep-kerfuffle/
- https://habr.com/en/post/437682/
- https://kubernetes.io/blog/2018/05/29/introducing-kustomize-template-free-configuration-customization-for-kubernetes/
Note: this problem is bigger than Kubernetes. It's a general problem for "Infrastructure as Code" everywhere. All sorts of tradeoffs.