Cloud Native Rejekts NA 24 | Flex Room | Day 1

Kubernetes Volume Populators

  • Kubernetes volumes are often empty when provisioned, but there are cases where they can be created with data using another PVC or a volume snapshot as the source (00:08:48).
  • Volume populators are custom resources and controllers that allow for the dynamic provisioning of Kubernetes volumes with pre-existing data from any source (00:10:43).
  • Use cases for volume populators include running virtual machines on Kubernetes, data management for stateful applications, and generic HTTP population of PVCs (PersistentVolumeClaims) (00:11:13).
  • Volume populators can be registered on a Kubernetes cluster by creating a custom resource definition (CRD) and an instance of the VolumePopulator resource; a registration sketch follows after this list (00:15:34).
  • The generic HTTP populator is a type of volume populator that can populate a PVC with data from a remote endpoint (00:14:24).
  • A custom controller, called the populator controller, is created to handle PVC requests when the external provisioner or CSI driver does not know what to do with the PVC and it gets stuck in a pending state (00:18:21).
  • The populator controller listens to create, update, or delete events of PVCs and creates a temporary pod, called the populator pod, to download or populate data into the PVC (00:18:48).
  • The populator pod is responsible for downloading data from an HTTP endpoint and writing it to the PVC, and once the data is written, the populator controller updates the reference in the persistent volume to point to the application PVC (00:20:41).
  • Everything apart from what runs in the populator pod is handled by an upstream-maintained library (lib-volume-populator), so only the code for the populator pod needs to be written (00:21:42).
  • The PVC API has two fields, dataSource and dataSourceRef, which must always point to the same resource when both are set, and these fields are immutable; see the PVC sketch after this list (00:23:23).
  • The dataSource field initially supported only two types of resources (PVCs and volume snapshots), so a new field, dataSourceRef, was introduced to allow referencing any resource (00:23:26).
  • To install the volume populator, a custom resource definition (CRD) needs to be created to introduce the volume populator resource in the cluster (00:26:29).
  • After creating the CRD, a volume populator resource needs to be created to register the new CRD as a volume populator (00:28:40).
  • A populator controller and a populator pod are used to dynamically provision a PVC with data from a generic HTTP endpoint, allowing for flexible data management in Kubernetes environments (00:32:11).
  • The populator controller creates a temporary pod and PVC, mounts them, and then deletes them after the data is populated, while the populator pod writes the data into the mounted PVC (00:32:57).
  • The solution is useful for applications with dynamic data sources, such as virtual machines or database management software, where the data needs to be restored from a dynamic source (00:36:51).
  • Limitations of the solution include the inability to create the application PVC in the same namespace as the controller, and the need to configure image pull secrets for the populator pod (00:34:16).
  • The populator pod can generate events for the application PVC to communicate any issues with population, errors, or delays (00:37:40).
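
Below is a minimal sketch of the registration step described in this list, assuming the upstream VolumePopulator API from the Kubernetes volume-data-source-validator project; the hello.example.com group and Hello kind are hypothetical placeholders for a populator's source CRD:

```yaml
# Register a custom resource kind as a volume populator, so the cluster
# knows that PVCs referencing this kind are filled by an external controller.
apiVersion: populator.storage.k8s.io/v1beta1
kind: VolumePopulator
metadata:
  name: hello-populator
sourceKind:
  group: hello.example.com   # hypothetical API group of the source CRD
  kind: Hello                # hypothetical source kind
```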
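And a hedged sketch of the application side: the PVC's dataSourceRef points at an instance of that custom resource, which leaves the PVC pending until the populator controller writes the data; all names are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  dataSourceRef:                  # immutable; must match dataSource if both are set
    apiGroup: hello.example.com   # hypothetical group registered above
    kind: Hello
    name: example-http-source    # hypothetical CR carrying the remote HTTP endpoint
```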

Orchestrator Comparisons and Insights

  • The speaker's experience working at Walt Disney Animation Studios led to an interest in learning from other orchestrators and reading white papers to understand the intentions and implementations of different systems (00:43:48).
  • Apache Mesos was built for HPC workloads in non-HPC environments, allowing for high-performance computing in environments without high-speed interconnects (00:46:27).
  • Mesos has a minimal layer that gathers information from nodes and allows frameworks to scale independently, with a two-level scheduler that enables frameworks to be opinionated about resource allocation (00:47:50).
  • Mesos was able to scale to 50,000 nodes in a single cluster, but was limited by the throughput of Amazon EC2 at the time (00:48:30).
  • Borg is a platform that breaks up clusters into cells and allows for mixing of workloads, including HPC and long-running jobs, in the same cluster (00:51:52).
  • Borg has a tiering system for workloads, but also allows for over-subscription of job priority, with a focus on providing infinite capacity at the lowest tier of priority (00:52:21).
  • Borg's master runs an in-memory database, allowing for faster scaling, but also has a snapshot system for replication in case of failure, which can result in some downtime (00:53:21).
  • In Kubernetes, networking complexity arises when workloads change, leading to difficulties in managing IP addresses and ports, which is why flat networking is preferred (00:54:27).
  • Google's Borg system used a command-line tool called Borg SSH to connect to workloads, which was an improvement over traditional SSH methods (00:54:50).
  • Twine, an orchestrator developed by Meta, uses sharding to scale its API server and database, allowing it to manage large clusters of up to a million machines (00:55:52).
  • Twine's host profiles optimize node OS and kernel settings for specific workloads, resulting in a 17% reduction in total cost of ownership and an 11% increase in throughput (00:58:13).
  • Twine has two types of schedulers: an allocator for fast job scheduling and a ReBalancer for long-term optimization, which can override the allocator's decisions to reduce cluster usage by 10-20% (00:59:57).
  • Twine's batch job scheduling allows it to allocate a handful of jobs and reuse the same job section, eliminating the need for rescheduling and reallocation, resulting in increased throughput (01:01:29).
  • Facebook's Twine is not Kubernetes and does not do federation in the usual sense; instead it has a sharded in-region, federated cross-region model, allowing for global services and scalability (01:02:35).
  • Docker Swarm took a single-binary approach, allowing for easy setup and management of a cluster, but lacked a "hard way" to learn and understand its inner workings (01:03:42).
  • Nomad is a single-binary orchestrator that allows for multi-region clusters, gossip-based discovery, and transparent workload discovery across regions (01:05:29).
  • ECS has a versionless API, allowing for backwards compatibility and flexibility in upgrading and changing clusters (01:07:22).
  • The ideal orchestrator does not exist, as all solutions involve trade-offs and design decisions, and the best approach depends on specific needs and requirements (01:10:03).
  • Kubernetes' flexibility can be a double-edged sword: the sheer number of options makes it hard to use, and getting started has always been challenging precisely because it is flexible enough to do everything (01:10:46).
  • Nomad and Docker Swarm are examples of systems that restrict certain things to make it easier to run over time, and Kubernetes could potentially move away from being container-focused in the future (01:11:36).
  • The problem with Kubernetes schedulers is that they don't have enough opinions and are too generic; it would be beneficial to have more small, opinionated schedulers rather than generic ones (01:16:18).
  • Having some opinions about what a platform can and cannot do is beneficial, as seen in the case of ECS, where telling people "no" and having a more focused platform led to 70-80% of Disney+ running on their ECS clusters (01:15:27).

eBPF: A Deep Dive

  • Mato, a Cloud Native Computing Foundation Ambassador and Kubernetes contributor, is giving a talk about eBPF, starting with some fun facts and introducing himself as a metal singer in a band called Scapegoat (02:57:52).
  • Mato mentions that he lived with his mom until he was 18 and that his grandma brought him up, and he's going to explain eBPF in a way that his grandma could understand (02:58:21).
  • Mato explains that he has experience with eBPF tools like Falco, Cilium, and Calico, but didn't understand what was going on under the hood until he had to debug issues (03:00:52).
  • Mato asks the audience how many use eBPF or eBPF-powered tools, and says that his talk is a 101 eBPF class (03:02:20).
  • Mato explains that the kernel is a bridge between user processes and resources, and uses an analogy of a kitchen to describe how the kernel works, with eBPF being a game-changer for managing the kitchen (03:03:48).
  • Mato explains that the kernel handles tasks like managing memory, orchestrating time slices for CPU processes, and handling system calls, which is what eBPF focuses on (03:05:02).
  • eBPF (extended Berkeley Packet Filter) is a technology that allows running sandboxed programs in kernel space, loaded from user space, without writing kernel modules or altering the kernel; it extends BPF, which was born in 1992 (03:06:40).
  • eBPF allows supervising kernel operations, monitoring processes, and ensuring that the right packets are received; it lies inside the kernel with entry points from outside, such as system calls and the scheduler (03:10:09).
  • eBPF has hooks that can be attached to various points, including processes, sockets, virtual file systems, and network devices, allowing for complex programs to be created (03:12:59).
  • eBPF has Maps, which can be used to store program state, configurations, and share data between programs, and it also has helpers, which are functions that can help with tasks such as generating random numbers and manipulating packets (03:14:04).
  • eBPF programs can be chained and reused, allowing for readable code and reduced overhead, thanks to function calls (03:16:29).
  • eBPF is used by tools such as Hubble and Cilium, which use eBPF to capture traffic from pods in a cluster (03:14:41).
  • eBPF can be used for observability, security, and networking, allowing for the manipulation of packets, application of network policies, load balancing, and bandwidth management (03:16:55).
  • Falco uses an eBPF probe to ingest syscall data and make it flow into two different libraries for event parsing, filtering, and output formatting (03:17:56).
  • eBPF is considered a better approach than writing a kernel module or using sidecars, as it has less overhead and doesn't require maintaining additional components (03:19:35).
  • eBPF can be used for runtime security and networking, providing complete access at the lowest level and better performance compared to sidecars (03:25:22).
  • Cilium and Falco are examples of products that use eBPF for security and observability, with Falco being slightly better for security (03:26:46).

Distroless Images and Debugging

  • Distroless images are a type of container image that only includes the bare minimum to run an application, without a full Linux distribution, to reduce vulnerabilities and improve performance (03:33:49).
  • The Google Distroless project provides images for various applications, such as Node.js, with significantly reduced sizes and vulnerabilities compared to traditional images (03:36:25).
  • Chainguard's Wolfi-based images aim to provide a balance between minimalism and usability, with a package manager and a focus on security, and have been shown to have zero known vulnerabilities (03:38:00).
  • Debugging and troubleshooting distroless images can be challenging due to their minimal nature, but using a debug image with a shell can help (03:36:41).
  • The Chainguard Images product is an evolution of the distroless concept, providing a more developer-friendly and usable experience while maintaining security and minimalism (03:37:50).
  • Wolfi is a project that rebuilds everything from source, including npm and Node.js projects, to stay on top of updates and maintain zero known CVEs (03:39:03).
  • Alpine is another option for achieving zero CVEs, but some companies may not use it due to compatibility issues with musl libc instead of glibc (03:39:35).
  • Debugging in a distroless image can be challenging due to the lack of a shell, and Docker Debug or cdebug can be used as alternatives (03:42:18).
  • Docker Debug is a feature available in the Docker Desktop Pro or Business tiers, which allows for debugging via a Nix-based sidecar container (03:43:42).
  • cdebug is a fully open-source project that provides similar functionality to Docker Debug without requiring Docker (03:45:44).
  • Kubernetes provides a debug feature through kubectl debug, which allows for debugging with a busybox-based debug image; a sketch of the resulting ephemeral container follows after this list (03:47:24).
  • Debugging in a Kubernetes environment can be challenging due to file system permissions, but using a debug container with the correct user ID can help, as demonstrated with the Chainguard image (03:49:28).
  • cdebug supports user and privileged modes, allowing for remote connectivity into a cluster, but this relies on the cluster allowing cdebug to run (03:52:10).
  • The dive tool allows for inspecting the layers of an image and seeing the file system itself without starting a container, and grype is used for vulnerability scanning (03:53:28).
  • Chainguard's approach to vulnerabilities is to not include them in images in the first place, by building images from source using the Wolfi project and rebuilding them when vulnerabilities are fixed (03:55:41).
  • Debugging tools can be restricted in some shops due to security considerations, such as access to files and secrets, but allowing them in a sandbox cluster can be beneficial for developers (03:57:31).
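
As referenced in the kubectl debug bullet above, here is a rough sketch of what the pod looks like after an ephemeral debug container is attached; ephemeral containers are added through the pod's ephemeralcontainers subresource (kubectl debug does this for you) rather than by editing the spec directly, and the container names and UID below are assumptions:

```yaml
# Approximate effect of:
#   kubectl debug -it mypod --image=busybox --target=app
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  ephemeralContainers:
    - name: debugger
      image: busybox:1.36
      stdin: true
      tty: true
      targetContainerName: app   # share the app container's process namespace
      securityContext:
        runAsUser: 65532         # assumption: match the non-root UID the image runs as
```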

Cluster Autoscaler vs. Karpenter

  • The discussion is about the cluster autoscaler and Karpenter, with the speakers being David Morrison, a research scientist at Applied Computing Research Labs, and Michael McHan, a software engineer at Red Hat (04:07:55).
  • The talk covers an OpenShift cloud cost analysis, where three different configurations of OpenShift were created to compare the costs of running a cluster with the cluster autoscaler versus Karpenter (04:10:01).
  • The testing methodology used the kube-burner project to run repeatable workloads in Kubernetes, with two phases of testing: homogeneous instance types and heterogeneous instance types (04:12:31).
  • The results of the data are presented, showing the costs of different runs with on-demand instances and Karpenter (04:15:09).
  • The cluster autoscaler performed well against Karpenter in cost savings when using heterogeneous instances, especially with spot instances (04:15:36).
  • The cost difference between the cluster autoscaler and Karpenter decreased as the instance types became more heterogeneous (04:15:52).
  • The unblended view showed similar data, with the cluster autoscaler giving better performance in the pricing range, especially with homogeneous on-demand instances (04:16:37).
  • David's research used a simulated environment called SimKube to test the cluster autoscaler and Karpenter in small and large-scale experiments (04:17:52).
  • The experiments used a workload generated by DeathStarBench, a microservice application, and simulated a large-scale production-sized cluster locally on a laptop (04:19:54).
  • The results showed that the cluster autoscaler and Karpenter had similar scaling behavior in small-scale experiments, but the cluster autoscaler's instance type selection was more unpredictable (04:22:25).
  • The cluster autoscaler's default behavior is to randomly pick an instance type that supports the workload, whereas Karpenter selects the same instance types every time (04:24:53).
  • The large-scale experiment involved multiplying a small-scale trace by 100, resulting in around 3,400 nodes, and showed significant pending pods for the cluster autoscaler, with around 1,000 to 2,000 pending pods at times (04:25:17).
  • Karpenter, an auto-provisioner, performed better, with only around 250 pending pods at a time, due to its two-phase model with a fast provisioning loop and a longer compaction loop (04:27:33).
  • The cluster autoscaler's main loop can take up to a couple of minutes to run, with outliers, whereas Karpenter's provisioner takes around 12 seconds at most (04:30:07).
  • Karpenter's disruption controller, responsible for bin packing the cluster, takes around 30 to 40 seconds to run its control loop, compared to the cluster autoscaler's two-minute main loop (04:32:55).
  • The experiments used the AWS provider for the cluster autoscaler, and both the cluster autoscaler and Karpenter have a KWOK provider, which may impact performance (04:30:50).
  • The cluster autoscaler is well-suited for homogeneous environments and can produce great cost savings, while Karpenter works well in heterogeneous environments and gets better as the environment becomes more heterogeneous (04:36:12).
  • Karpenter doesn't work on Google Cloud Platform and some other cloud providers, but there are new emergent alpha Karpenter providers, including an Alibaba Cloud provider and a GCP provider in development (04:37:19).
  • When choosing between Karpenter and the cluster autoscaler, it's essential to consider the type of environment and how workloads fit into it, as some workloads may not behave well in certain consolidation environments (04:38:06).
  • Karpenter can be configured to reduce overshoot and provision nodes more conservatively, but this requires getting into the configuration and understanding how workloads are being expressed in Kubernetes; a hedged NodePool sketch follows after this list (04:40:53).
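
As mentioned in the last bullet, here is a hedged sketch of what conservative provisioning might look like with Karpenter's v1 NodePool API on the AWS provider; the names, instance types, and limits are illustrative assumptions, not values from the talk:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: conservative
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                  # assumes an EC2NodeClass named "default" exists
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]        # avoid spot for disruption-sensitive workloads
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge"]  # pin a narrow, near-homogeneous set
  limits:
    cpu: "200"                         # cap total provisioned CPU to bound overshoot
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 10m              # slow the compaction loop down
```

Narrowing the requirements and capping limits trades Karpenter's bin-packing flexibility for predictability, which fits the talk's point that consolidation behavior should match how workloads are expressed.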

Zero Downtime Migrations for Stateful Workloads

  • To achieve zero downtime migrations for stateful workloads on Kubernetes, the concept of container state replication is being introduced, which aims to make seamless migrations possible (04:57:42).
  • Stateless services are easier to work with due to their ability to scale instances up and down freely without breaking anything, making infrastructure operations easier (04:58:45).
  • Stateful workloads, such as databases and game servers, are more difficult to deal with and often result in downtime during migrations, making it important to find ways to migrate them easily (04:59:48).
  • Zero downtime migration means that the application can't lose any state, the application itself shouldn't notice the migration, and the client shouldn't notice the migration, with the goal of preserving network connectivity (05:03:30).
  • Loophole Labs allows users to take advantage of spot instances for any workload on any cloud provider, saving 90% on compute costs, and can migrate stateful workloads like databases and game servers without downtime using live migration technology (05:04:34).
  • Stateful application migration generally comes in two categories: application-layer replication, such as PostgreSQL log replication, and lower-level live migration approaches (05:05:25).
  • Application layer replication involves replicating application data to a different node, requiring the application to be designed to support this, with the goal of achieving stateful migrations with zero downtime (05:06:02).
  • There are different solutions for live workload migration, including VMware's vMotion, KVM, and CRIU, but none of them fulfill the requirements of being hypervisor-agnostic, having zero downtime, and migrating network connections (05:07:25).
  • A new solution is needed, which builds on general live migration solutions and is hypervisor-agnostic, has zero downtime, and can migrate network connections, with the goal of achieving zero downtime under 500 milliseconds (05:08:57).
  • Container state replication is key to achieving zero downtime, by continuously serializing the state of the container in real time and moving it to a different host before stopping it, with the goal of reducing the amount of time the workload is stopped (05:11:40).
  • Live migration can be split into three distinct stages: pre-copy, stop-and-copy, and resume, with the pre-copy phase involving copying the memory and disk for the VM to a new node without interrupting it (05:12:20).
  • The stop-and-copy phase in migration is latency-sensitive and can cause significant performance hits if implemented badly, with the goal of minimizing migration time through amortization (05:14:08).
  • Silo, an open-source library, is used to replicate container state without interrupting the application, and it can optimize moving and synchronizing memory based on the application's usage patterns (05:15:50).
  • Silo's post-copy phase allows the application to access data on demand, either from a persistent store or peer-to-peer from the original host, reducing the amount of data that needs to be moved during the stop-and-copy phase (05:14:34).
  • A live demo is shown, migrating a Docker container running a PostgreSQL instance between nodes without downtime, using Silo's techniques to track memory in the kernel and hijack the Docker run command (05:19:39).
  • A live migration demo was performed, moving a Postgres instance from one host to another on Amazon Web Services, keeping the network connection alive, and completing an existing transaction without downtime (05:21:03).
  • The demo used a control plane and a kernel module to replicate the state and move the process, and it was done without using virtual machines (05:21:15).
  • The implementation is completely open-sourced, and a project called Drafter allows for zero-downtime migrations for stateful workloads on Kubernetes (05:23:27).
  • A layer called Mason was built on top of CRIU to track memory usage and control network connections, making it possible to migrate containers without downtime (05:24:48).
  • To migrate pods inside Kubernetes, a containerd shim is used to hijack the deletion request, allowing for live migration without breaking cloud-native conventions (05:26:35).
  • The migration process can handle init containers, but a flag is required to ensure the init container runs and completes as expected (05:27:49).

SPIFFE and SPIRE for Workload Identity

  • SPIFFE (Secure Production Identity Framework For Everyone) is a project that originated from Google's need for a zero-trust network after the Edward Snowden leaks, and it's used for securely identifying workloads in a unique way (05:33:07).
  • SPIFFE is based on an internal Google system, and it's also related to Kubernetes, as Joe Beda, one of the co-founders of Kubernetes, is also involved in SPIFFE (05:33:44).
  • SPIRE is the production implementation of SPIFFE, and it's an open-source, graduated project in the CNCF (Cloud Native Computing Foundation) (05:34:37).
  • SPIRE uses a central server to hold cryptographic keys, and it verifies the identity of workloads through a process called node attestation, which can use various sources of information, such as cloud provider metadata or Kubernetes clusters (05:34:53).
  • Once a workload is verified, SPIRE issues a SPIFFE identity, encoded in an X.509 certificate or a JWT, and this identity is used for mutual TLS (mTLS) authentication between workloads (05:34:17).
  • The SPIFFE identity can also be used to authenticate to external services, such as AWS, and to communicate with databases, such as PostgreSQL, even if they don't natively support SPIFFE (05:39:28).
  • A local setup is demonstrated with a SPIRE server and agent running on a laptop, along with a customer and backend, verifying nodes with each other using a one-time join token (05:42:33).
  • The workload needs to be registered with SPIRE and attached to a specific node, which is done by defining an entry with the parent ID as the laptop and a selector matching the user ID (05:45:29).
  • The SPIRE server is then able to provide a certificate to the workload, allowing it to communicate with other applications locally, and this setup can be used to solve issues when running Envoy in front of the application (05:48:18).
  • However, this setup is not suitable for cloud access, which requires a centrally managed SPIRE server with an OIDC endpoint that can be federated with cloud providers (05:48:58).
  • A second demo is set up, where the SPIRE server is running in a Kubernetes cluster on Google Cloud and the agent is running on the local system, connecting to the central SPIRE server and bringing up the same workloads with identities served by a CA in the central SPIRE server (05:49:33).
  • The OIDC endpoint is hosted in the Kubernetes cluster in Google Cloud, and the SPIRE agent is started without any tokens, using a certificate signed by a specific CA to establish trust (05:50:54).
  • SPIRE can be used to authenticate and authorize workloads in a Kubernetes cluster, allowing for secure communication between workloads and external services like AWS, by using a workload's SPIFFE ID and a JWT fetched from SPIRE (05:53:28).
  • The SPIFFE ID is used to authenticate the workload, and the JWT is used to authorize the workload to access specific resources, such as an S3 bucket (05:55:06).
  • The process of authenticating and authorizing workloads with SPIRE can be complex and requires gluing together multiple components, but it provides a secure way to manage identities and secrets (05:57:55).
  • There are plans to improve the end-user experience and make it easier to get started with SPIRE on a local system, possibly by creating a command-line tool called Crush CTL that can spin up a SPIRE server and agent with a single command (06:00:55).
  • SPIRE does not provide a built-in solution for managing trust and authorization between workloads, leaving it to third-party tools like OPA (Open Policy Agent) to provide this functionality (06:02:29).