
Kubecon 2018 Diary

[email protected] / unfiltered / limited distribution

2018-12-11

Keynote

Nothing unexpected. That zoo story was weird.

Getting the Most Out of Kubernetes with Resource Limits and Load Testing

Limits and Requests covered

QoS levels: Guaranteed vs Burstable vs Best Effort
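
A quick sketch of how that maps to a pod spec (names and image made up; requests == limits on every container gives Guaranteed, requests below limits gives Burstable, nothing set gives Best Effort):

# Hypothetical pod demonstrating the Guaranteed QoS class
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: app
    image: nginx:1.15
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 100m       # equal to requests on every container -> Guaranteed
        memory: 128Mi
EOF
kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'   # prints the assigned QoS class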

Optimization

  • Underallocation: kubelet kills pods due to exceeding limits. Easy to detect.
  • Overallocation: Wasted resources = Wasted spend. Trickier to detect (@dlswense). Especially trickier on deployments with many replicas- small amounts of overallocation are multiplied by the number of replicas. 500m overallocation times 40 pods = 20 CPUs!

Process for setting limits intelligently:

  • Start with very conservative limits
  • Load test with limits, watching for killed pods
  • Change one thing at a time between each test
  • Use short duration ramp up load tests to quickly find bottlenecks and coarsely tune
  • Use long duration load tests to find memory leaks, saturation issues etc

Terminal tool for live metrics: Kubescope

Load test tools: loader.io, locust.io

Autoscaler input: Azure Adapter

Figure out how your stuff breaks before you go into production (Shared load test infrastructure and tooling?)

Horizontal Pod Autoscaler - Adjusts replica count

Vertical Pod Autoscaler - Changes limit dynamically (restarts pods)

!!! Possible to combine HPA and VPA- hook up VPA to CPU/mem and hook up HPA to other metrics such as network, rps and disk
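
Roughly what that combination could look like (deployment name and metric are hypothetical; API versions assumed to match what shipped around k8s 1.12 / VPA v1beta2, so check the installed CRDs):

# VPA owns cpu/memory requests, HPA scales replicas on a custom pods metric
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1beta2    # assumed VPA CRD version
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second    # served by a metrics adapter, e.g. prometheus-adapter
      target:
        type: AverageValue
        averageValue: "100"
EOF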

Behind your PR: Running k8s CI on k8s

k8s is already fault tolerant and provides CI features

k8s-ci-robot runs on k8s and automates ci commands via GitHub comments

Containerized build and test (k8s infrastructure taken to its logical end)

Tests are defined by CRDs

No master/agent, no job queue- just create CRDs and let k8s distribute them

Controllers handle triggers (git hooks, nightly crons)

CRDs for resource allocation and cleanup (@crowther @fuller)

Look man it's CRDs all the way down

BELOW THIS POINT LIES MADNESS

There's stuff about managing configmaps via controllers too but that's not a problem at our scale

Oh god now they're running k8s in k8s with docker on systemd in docker... with docker in docker for pods

this is insane

Autoscaling on Latency

entered late due to dumplings. totally worth it, thanks @bcook and @matthsmi

Seems straightforward. Prometheus metrics as input for HPA. In this case latency metrics from linkerd.

We should strongly consider making latency a default input for autoscaling on Ethos. RPS can vary by pod but we have good general guidelines for default latencies. (Unless tenants do silly stuff like a request that takes over 9000ms to return.) (Or if tenants have a database latency issue and then we exacerbate the problem by creating more pods that create more connections and continue to DOS the DB.) (This isn't sounding so easy anymore.) (But we should still do it.) (I haven't put this many parens on the same line since Discrete Structures in college.) (https://xkcd.com/297/)

Predictive scaling! Linear regression specifically. Can lower scaling reaction times for highly bursty workloads that need a faster response. E.g. for AEM, the high-end start time for new nodes can exceed 10 minutes.

Debugging etcd

etcd is backed by BoltDB and exposed via gRPC and HTTP

keys are lexically ordered which means that RANGE queries are very efficient. E.g. "All replicasets in a namespace" is efficient
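
For example, k8s lays its keys out hierarchically under /registry, so that query is a single prefix read (namespace name is a placeholder; add endpoint/TLS flags for a real cluster):

ETCDCTL_API=3 etcdctl get /registry/replicasets/my-namespace --prefix --keys-only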

Quirk: Nodes are called "minions" in the etcd data

Linearizable consistency - as long as you wait for responses you are guaranteed a consistent logical view

Streaming (WATCH) - gRPC streams for events. Watches are eventually consistent since you are reading from a particular member.

Each kv has a global revision number, and the history of each kv is kept (copy on write). Object history is compacted to remove old history. Everything is kept by default in etcd; k8s requests compaction of data older than 5 minutes.

Defragmentation is a "stop the world" operation to shrink the boltdb content. Must be manually triggered (etcd cull)

etcd data dir

  • member/snap/db - where reads come from
  • member/wal - write ahead log. Writes land here first, and this is where the fsync happens. Data is then also written to the boltdb for reads.

regular snapshots - Used for recovery of wal when required

etcdctl is your best friend when everything is on fire!

Note that individual objects are binary protobufs- use auger to deserialize k8s protobufs. Can be used for single object restores if Ark is broken.

etcd-dump-logs to see WAL history

etcdctl endpoint status - weak but convenient health check. /health in API is a stronger check since it actually performs a write.

etcd member list - another weak but convenient check.

when you see 'timestamp C | logline' in etcd logs, the C is for crash and it is very bad

Troubleshooting workflow

Disk quota issues: Detect with etcdctl endpoint status, or check size of member/snap/db, or etcdctl alarm list shows NOSPACE alarm, or database space exceeded error from api. Resolution: Compact, then defrag if needed. Disarm the alarm when done.
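
Roughly what that looks like with etcdctl (endpoint/TLS flags omitted; the jq path for the current revision is an assumption about the JSON output shape):

# Detect
ETCDCTL_API=3 etcdctl endpoint status --write-out=table
ETCDCTL_API=3 etcdctl alarm list
# Compact away old revisions at the current revision
rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
ETCDCTL_API=3 etcdctl compact "$rev"
# Defrag to actually reclaim the space, then clear the NOSPACE alarm
ETCDCTL_API=3 etcdctl defrag
ETCDCTL_API=3 etcdctl alarm disarm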

Timeout/latency issues: 5s timeout by default. Check request size, cpu, memory. Use tracing tools. Check disk performance- wal_fsync_duration_seconds, backend_commit_duration_seconds in Prometheus. Check network. Check leader election frequency. (@kirharri @colsen)

Tracing apiserver requests: docker logs <apiserver id> | grep "total time" -C 5

Three dimensions for inspecting data: object count, object size, and object revisions. Revisions is a subtle one. There may be a lag time for old revisions to be compacted, so when addressing space issues make sure you wait for old revisions to compact (or initiate compaction!)

Workload scales data volume due to both new revisions and increases in DB fragmentation

More objects -> higher latency on range reads

Death spiral as large requests cause latency of other requests to increase

Need metrics on object counts?

etcd downgrades are only supported when restoring a backup. WIP to support going back 1 minor version. (@vsethi @rgarg @fuller this impacts rollback strategies for k8s upgrades) (@khehl @cmason we'll need a new CMR template for major upgrades that is higher risk)

need to snapshot before upgrades and rollbacks! also hourly
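
Minimal snapshot sketch (backup path is made up; add endpoint/TLS flags for the real cluster):

backup=/var/backups/etcd-$(date +%Y%m%d-%H%M).db
ETCDCTL_API=3 etcdctl snapshot save "$backup"
ETCDCTL_API=3 etcdctl snapshot status "$backup" --write-out=table   # sanity check the snapshot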

improvements coming to resolve issues when too many pods send too many heartbeats

Large compactions can be expensive.... let k8s handle it for the most part. Ideally with recent etcd versions defragging is not required

Rightsizing Pods with VPA

Started with a review of requests and limits and the overview of VPA

VPA CRD

  • update policy - off, initial, auto (experimental)
    • off: recommend only
    • initial: Change new pods but never delete running pods
    • auto: Delete running pods to force resized pods
  • resource policy - in ethos set max resources to node size * 0.5

Let's start deploying this in "Off" mode to gather data. We can then take the VPA recommendations vs actual usage for reporting even for workloads where it is disabled. (@dlswense)
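
A sketch of what an "Off"-mode VPA object might look like for a tenant deployment (names are placeholders, API version assumed; recommendations land in the object's status):

kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1beta2    # assumed; match the installed VPA CRD version
kind: VerticalPodAutoscaler
metadata:
  name: tenant-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tenant-app
  updatePolicy:
    updateMode: "Off"    # recommend only, never evict or resize
EOF
kubectl get vpa tenant-app-vpa -o jsonpath='{.status.recommendation.containerRecommendations}'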

Three binaries in VPA:

  • VPA Recommender - determines new requests
  • VPA Admission Plugin - mutates requests
  • VPA Updater - Restarts pods to trigger resizing

Recommendation model is based on Borg - aggregates utilization into a histogram and takes the max usage in the last 8 days.

Is there a rate limit to frequency of restarts?

Upstream efforts to change resource requests without restarting pods (nondisruptive on shrink)

Doesn't support setting limits yet, only requests

Doesn't work with HPA with cpu and memory, but can work alongside HPA with custom metrics

Vendor Booths

Heptio

Talked about sticky sessions support in Contour, and more generally, Contour support for Envoy features. IngressRoutes are trying to address more of that without the huge block of annotations. Talked about possibly contributing the feature ourselves.

Talked about encryption support in Ark. Encryption is loosely on the roadmap but not yet defined or committed to a date/release. Invited to Heptio design meetings.

Kata

Q: How do you score a Kata sweatshirt?

A: Walk up to a booth and say "Hi, we're from $ENTERPRISE_CORP. We have $SPECIFIC_USE_CASE for your product, so we compiled it and got it running as a PoC. You're on the roadmap for evaluation next quarter."

They asked Jacob to post his CoreOS notes in their Slack, and also suggested some easier ways to install Kata based on some stuff they experimented with on their GitHub.

Azure

Talked to AKS PMs about process isolation. They're not using user namespace remapping at all, since they don't support hostile tenant environments. Their process isolation model is currently non-existent; their suggestion was to use Kata/ACI for isolation.

2018-12-12

Woken up at 5:33 by a call... Nginx must DIE

Keynote

Aqua demo: We should totally add an admission controller to prevent creating cluster admins, namespaces, etc. outside of git or the onboarding API

Airbnb demo: FD/Moonbeam/DxE teams, this is a must watch, airbnb is doing the same stuff ethos wants to do

Intro to SPIFFE

"SPIFEE delivers trusted identities to software systems"

Doesn't implement TLS/tokens, but provides the inputs for those systems to perform authentication and authorization

Goals:

  • Move beyond network/VPC/IP identities and shared secrets
  • Support heterogeneous environments (IAM/MSI incompatibility)

SVID: issued identity document. Currently issued by SPIRE (the SPIFFE Runtime Environment), in future can be issued by other implementations like Consul and Istio. Identified by a URL-like string, e.g. spiffe://trust.domain/workload/identifier. Encoded via x509 bundle or JWT (mostly x509 in practice)

Federation allows different SVID issuers to trust each other's documents

Workload API for retrieving SVID documents. Pods call the workload api to get their SVID. SPIRE provides various plugins to handle the workload API auth depending on the available infrastructure.

How to integrate SPIFFE/SPIRE?

  • Libraries available for direct integration in code
  • Use Envoy integration to support without code changes
  • spiffe-helper tool for simple testing/saving SVID to disk
  • Service Mesh integrations

Isovalent (formerly Covalent) (Cilium)

ethos deployment status discussion

adobe.io service mesh exploration discussion

new features in 1.4 - dns proxy, transparent encryption (ipsec)

multicluster transparent failover for service 2 service

discussion of layering Cilium CNI over EKS

Predictive Scaling with Prometheus and Machine Learning

Make scaling decisions on metrics

Don't want to burden SREs with manually finding scale-down opportunities

Also want to scale up in advance (10-15 minutes ahead) for responsiveness for data science workloads

Solution: time-series forecasting for anticipating scaling needs (higher redundancy during peak, scale down off-peak)

Reached out to data scientists to evaluate the data; model of choice: Long Short-Term Memory (LSTM)

Stages of implementation

  • Collect metrics (in our case via Prometheus)
  • Build model (Data science voodoo goes here)
  • Make predictions (More voodoo here)
  • Make decisions (in our case, add and remove pods for spark/tensorflow)

Collect data from Prometheus - large but not too large data range, batch/paginate queries as necessary. Want a representative sample of the current app for training. Data from old app versions less valuable.
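
Pulling that data out is just the Prometheus HTTP API; something like this (host and metric are placeholders, GNU date syntax), batched into smaller windows if the range gets big:

# One week of per-second request rates at 60s resolution
curl -sG 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=sum(rate(http_requests_total[5m]))' \
  --data-urlencode "start=$(date -d '-7 days' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'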

Example Metrics:

  • Replica counts
  • Requests per second
  • Latency
  • Queue length

Decisions should have lower and upper bounds. Don't wanna blow past scaling limits and hand MS a bunch of moneiz. Also need to handle capacity issues :)

Future work: Defining a CRD for Predictive Auto Scaler, so you can run a PPA just like a VPA or HPA

Surviving PCI/HIPAA

Segregate data classification via taints and tolerations/affinity. (In Ethos we just make the whole cluster compliant....)
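
Their taint/toleration approach, sketched (keys and node names are made up), in case we ever want per-node-pool classification instead:

# Taint and label the nodes that may run PCI workloads
kubectl taint nodes pci-node-1 data-class=pci:NoSchedule
kubectl label nodes pci-node-1 data-class=pci
# PCI pods tolerate the taint and pin to those nodes
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pci-workload
spec:
  tolerations:
  - key: data-class
    operator: Equal
    value: pci
    effect: NoSchedule
  nodeSelector:
    data-class: pci
  containers:
  - name: app
    image: nginx:1.15
EOF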

Network policies - speaker didn't know you can do cilium policies without cilium CNI...

Traditional antivirus sucks in cloud native

Image scanning is important during both build and runtime

OpenID for authentication

Auditing, logging, monitoring

This talk is mostly stuff we already do lol

Troubleshooting On-Prem Kubernetes Networks

Awww, this talk is about iptables setups, don't think they'll talk about bpf :(

CNI failure modes:

  • plugin/config failure (container cannot start)
  • config valid but network failed (container started but could not reach network)

identification of container interface

Use nsenter to enter the container namespace

nsenter -t <pid> -n -- <command>

how to get container PID: kubernetes pod status -> container status -> container id -> docker inspect -> pid

ip -d link: the @if<number> suffix in the interface name gives the index of the peer interface on the other side
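
End to end, that lookup is roughly (pod and namespace names are placeholders, Docker runtime assumed):

# container ID from the pod status, minus the docker:// prefix
cid=$(kubectl -n my-namespace get pod my-pod \
  -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's|docker://||')
# host PID of the container's main process
pid=$(docker inspect --format '{{.State.Pid}}' "$cid")
# run network commands inside the pod's network namespace
nsenter -t "$pid" -n -- ip -d link
nsenter -t "$pid" -n -- ip addr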

buncha stuff about iptables debugging that doesn't apply to us

wiretap tool for packet capture: https://github.com/redhat-nfvpe/kokotap like tcpdump/tshark but targets a pod

Two Sigma - Operating Multi-Tenant and Secure Clusters

workaround for user namespaces: unix account per namespace, admission controller to set runAsUser

most of this is crazy workarounds for on-prem clusters using Kerberos

e.g. distributing large container images by syncing via Ceph instead of pulling on demand

chaos engineering by default

2018-12-13

Keynote

Distributed Tracing

Metrics and Logging are from the perspective of one application, tracing is from the perspective of multiple apps

Request IDs work okay for smaller service counts but are difficult for tracing through 100s of services

Distributed tracing provides consistent tracing mechanisms for all moving parts, including ingress and routing

Challenges:

  • Polyglot support
  • Don't want to burden product engineers
  • Hard to instrument legacy code
  • Consistent correct implementation

Solution: OpenTracing CNCF project

Jaeger - Have to wrap routes in code with a tracing context :(

Envoy sidecar - No code changes required, injects tracing id header and collects trace data

Integrate trace links into grafana dashboards and pagerduty alerts

Kubernetes Scaling Multi-Dimensional Analysis

WATCH THIS TALK

Math-heavy analysis of scaling limits targeted for 1000s of nodes

Need to remain within a safety envelope based on many dimensions (n-dimensional space hypercube!)

The dimensions are not always independent, and they are not linear.

Dimensional limits A = 5000 and B = 5000 does not mean (A, B) = (5000, 5000) will work

!!! Dimensions "taper" at the extremes. Don't push too many dimensions at the same time! A 5 node cluster can have scaling problems if other dimensions are stretched.

The main bound on the envelope is etcd, scaling VERY roughly with etcd disk quota. See also the etcd debugging talk notes.

Dependent dimensions of the envelope make up lower-dimensional simpler envelopes

The dimensions and scaling limits of the platform are not exhaustively identified, and these limits may be lower for particular implementations

Generally there are configs that are probably safe versus pushable versus not known to be possible, but you can even fail within the "safe" envelope in certain ways

  • Number of nodes (5k) - apiserver fails
  • Number of pods (150k) - higher not tested or supported
  • Pods per node (pod density) (110/node) - kubelet fails. May occur higher with multi-container pods.

However, the (nodes, density) relationship is a concave curve. At 5k nodes you can only have around 30 pods per node. You're only gonna hit a density of 110 in a smaller cluster.

  • Number of Services (10k) - iptables degradation (BPF should help here)
  • Number of backends (50k)
  • Backends per service (250) - quadratic traffic overload

Once again, the relationship between backends and services is a curve. Backends per service starts to drop after around 200 services.

  • Services per namespace (5k) - pod crashes due to too many injected environment variables.

Curved relationship between number of namespaces and services. As you add more namespaces you can have fewer and fewer services in each, down to single digits.

Pod churn (CRUD ops on pods): 20/s hard limit in controller manager, plus garbage collection. Control plane failure above this limit

Number of nodes vs (Secrets + ConfigMaps) per node, not an issue in 1.12+ where kubelet config reads were moved from GET to WATCH

Max 3k pods per namespace, decreases as number of namespaces increases, down to 15 @ 10k namespaces.

Service Meshes Production Readiness

Talk targeting "small, innovative teams" at SMBs

It's ok not to use a service mesh when there isn't a specific problem to solve

added complexity for both ops teams and for apps. Proxy in front of apps can cause weird behaviors.

"Don't be Equifax"

Open Source Litmus Test:

  • how long has it existed
  • popularity of contribution
  • popularity of usage
  • sponsorship
  • change activity

Linkerd2 - simple, basic mesh

Istio - "literally everything else"

They did the hardest service first (their entrypoint monolith). "Edge case factory"

Dev buy-in is easiest when workflow is not interrupted

Focus on the components you need

Instrument metrics early

Open Tracing is essential (Jaeger in this case)

Amazon CNI TLS CA will drive you nuts with VPC peering

Constant load testing

Certmanager for TLS

Talk was well-presented! Though a bit basic for us
