- tl;dr (short version)
- Kubernetes storage options
- PersistentVolumes in details
- Local PersistentVolumes example
It depends! 😉
ECK does not come with its own storage mechanism for Elasticsearch data. Instead, it is compatible with any Kubernetes storage option. We recommend using PersistentVolumes: they can be network-attached or local. Both are handled the same way by ECK, and can be combined for different node types in a single cluster. There is a performance/price/operations trade-off to consider:
- Network-attached PersistentVolumes
  - Provide good performance when using the fastest storage class of most big cloud providers (e.g. io1/gp2 volumes on AWS).
  - Provide bad performance when the underlying storage performance is bad (e.g. NFS shared volumes).
  - Very easy to operate, since a Pod can be automatically rescheduled on a different host while reattaching the same volume.
- Local PersistentVolumes
  - Provide the best performance when the underlying storage is very fast (e.g. NVMe SSD disks).
  - Much harder to operate: if a host dies, manual intervention is likely required (with data loss); otherwise the Pod will stay Pending.
  - Need to be either manually provisioned, or automatically provisioned through an additional provisioner.
One way to tackle this trade-off:
- If a network-attached PersistentVolume provider is available in the k8s environment:
  - Consider using it: Elastic has historically discouraged network-attached storage, but it is now a viable option, especially for k8s environments running on major cloud providers (e.g. AWS, GCP, Azure), which provide acceptable performance.
  - Study its pricing options.
  - Benchmark it against the expected target cluster usage.
  - If price and performance are OK: go for it.
- Otherwise, consider using local PersistentVolumes:
  - Choose a local volume provisioner (the Kubernetes Local Volume Static Provisioner is a good start).
  - Be aware of the operational concerns of local PersistentVolumes (described below).
An emptyDir is an ephemeral volume that exists as long as a Pod is running on a given node. If the Pod is scheduled to another node for any reason (e.g. k8s scheduling priorities, host maintenance, node restart), the emptyDir data is deleted forever.
A sizeLimit can be specified (e.g. 1Gi): once reached, the Pod is automatically evicted, which means its data is lost forever.
The actual underlying storage depends on the Kubernetes environment: it may be a tmpfs in RAM, HDD/SSD disks, etc.
emptyDirs are useful for ephemeral data that can be lost at any time. They are not recommended for storing Elasticsearch data. They may still be useful for coordinating-only or ingest-only nodes, where losing data does not impact the cluster much. The Volume claim templates section explains how to set up emptyDir volumes with ECK.
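As an illustration, here is a minimal sketch of a plain Pod (not an ECK resource) using an emptyDir volume with a size limit; the Pod name is hypothetical, and the image and mount path follow Elasticsearch defaults:

apiVersion: v1
kind: Pod
metadata:
  name: ingest-only-example # hypothetical Pod name
spec:
  containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    volumeMounts:
    - name: elasticsearch-data
      mountPath: /usr/share/elasticsearch/data
  volumes:
  - name: elasticsearch-data
    emptyDir:
      sizeLimit: 1Gi # once reached, the Pod is evicted and the data is gone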
hostPath volumes allow mounting a filesystem path from the host a Pod is running on (e.g. /mnt/data). No size limit can be provided when using hostPath: the Pod simply accesses the filesystem as it is. Since ECK expects that a Pod can be recreated to reuse the same volume at any time, it is important to ensure that a given Pod can only be scheduled on the same Kubernetes node. This means every single Pod must have affinity constraints that force it to run on a statically predefined Kubernetes node.
hostPath volumes do not fit StatefulSets very well: when using hostPath, we likely want a different statically defined affinity constraint for each Pod, which means a single Pod per StatefulSet.
It is definitely possible to use hostPath volumes for Elasticsearch, but similar things can be achieved with the local PersistentVolumes approach. The latter provides much more flexibility, since the affinity constraints are handled automatically at the volume level (as opposed to the user handling them at the Pod specification level). It also better fits the Kubernetes StatefulSet concept. We recommend using local PersistentVolumes instead of hostPath.
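For illustration only, here is a sketch of a Pod pinned to a single node with a hostPath volume; the Pod name, node name and path are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: es-hostpath-example # hypothetical Pod name
spec:
  # force the Pod to always land on the same statically defined node
  nodeSelector:
    kubernetes.io/hostname: my-node
  containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    volumeMounts:
    - name: elasticsearch-data
      mountPath: /usr/share/elasticsearch/data
  volumes:
  - name: elasticsearch-data
    hostPath:
      path: /mnt/data # host filesystem path, no size limit enforced
      type: DirectoryOrCreate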
PersistentVolumes are best used with StatefulSets (a logical group of Pods sharing the same specification), which is the resource ECK relies on to manage Elasticsearch. When a Pod is bound to a PersistentVolume, the binding between the two resources remains until manually removed. ECK relies on this binding mechanism during Elasticsearch rolling upgrades: we know a Pod can be safely deleted, then recreated, because it will reuse the same PersistentVolume, hence the same data.
This binding is enforced at Pod scheduling time: if an existing PersistentVolume is already bound to the Pod about to be created (by name), that volume will be used automatically. If the volume is not available (say, it is only available on a single host and that host is down), then the Pod will not be able to start successfully and will stay in a Pending state.
PersistentVolumes are associated with a storageClass representing the underlying volume implementation (e.g. local, gp2, gke-pd, etc.). Different storage classes offer different performance characteristics. Most cloud providers provide their own storage classes, but the user is free to create any additional storageClass, or set up any additional PersistentVolume provider. This page in the Kubernetes documentation references various storage classes.
A size limit can be provided in the volume specification. The actual enforcement of that limit depends on the provider implementation.
PersistentVolumes are the recommended way of deploying Elasticsearch with ECK.
Let's take an example to understand the Elasticsearch -> StatefulSet -> PersistentVolumeClaim -> PersistentVolume -> Pod relationship.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: elasticsearch-sample
spec:
version: 7.8.0
nodeSets:
- name: default
count: 3
config:
node.store.allow_mmap: false
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: standard
Each item in spec.nodeSets maps 1:1 to a StatefulSet. The Elasticsearch specification above is transformed by ECK into a single StatefulSet (since there is a single nodeSet), handled by the StatefulSet controller:
apiVersion: apps/v1
kind: StatefulSet
metadata:
creationTimestamp: "2020-07-03T13:19:22Z"
generation: 1
name: elasticsearch-sample-es-default
namespace: default
spec:
replicas: 3
template: (...)
updateStrategy:
type: OnDelete
volumeClaimTemplates:
- metadata:
creationTimestamp: null
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: standard
volumeMode: Filesystem
status:
phase: Pending
In turn, the StatefulSet controller (part of the core Kubernetes controllers) translates the StatefulSet into 3 Pods:
kubectl get pods
elasticsearch-sample-es-default-0
elasticsearch-sample-es-default-1
elasticsearch-sample-es-default-2
And 3 PersistentVolumeClaims:
kubectl get pvc
elasticsearch-data-elasticsearch-sample-es-default-0
elasticsearch-data-elasticsearch-sample-es-default-1
elasticsearch-data-elasticsearch-sample-es-default-2
Each claim represents a desire for a Pod to use a PersistentVolume. The claim is referenced in the Pod resource:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: elasticsearch-data-elasticsearch-sample-es-default-0
namespace: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: standard
volumeMode: Filesystem
status:
phase: Pending
apiVersion: v1
kind: Pod
metadata:
name: elasticsearch-sample-es-default-0
namespace: default
spec:
containers: (...)
volumes:
- name: elasticsearch-data
persistentVolumeClaim:
claimName: elasticsearch-data-elasticsearch-sample-es-default-0
status:
phase: Pending
Both the Pod and the PersistentVolumeClaim stay Pending until a PersistentVolume is bound to the PersistentVolumeClaim. Dynamic PersistentVolume provisioners can create the PersistentVolume resource just in time. Once that happens, the Pod can start with the volume referenced by the claim. See how the Pod, PersistentVolumeClaim and PersistentVolume reference each other:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-sample-es-default
  name: elasticsearch-data-elasticsearch-sample-es-default-0
  namespace: default
  uid: 88c7b921-713b-4235-b05a-3a504e51d930
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: standard
volumeMode: Filesystem
volumeName: pvc-88c7b921-713b-4235-b05a-3a504e51d930 # reference to the PersistentVolume bound to that PersistentVolumeClaim
status:
accessModes:
- ReadWriteOnce
capacity:
storage: 5Gi
phase: Bound
apiVersion: v1
kind: PersistentVolume
metadata:
name: pvc-88c7b921-713b-4235-b05a-3a504e51d930
spec:
accessModes:
- ReadWriteOnce
capacity:
storage: 5Gi
claimRef:
apiVersion: v1
kind: PersistentVolumeClaim
name: elasticsearch-data-elasticsearch-sample-es-default-0
namespace: default
resourceVersion: "505313"
uid: 88c7b921-713b-4235-b05a-3a504e51d930
gcePersistentDisk:
fsType: ext4
pdName: local-njv5q-dynamic-pvc-88c7b921-713b-4235-b05a-3a504e51d930
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
operator: In
values:
- europe-west2-c
- key: failure-domain.beta.kubernetes.io/region
operator: In
values:
- europe-west2
persistentVolumeReclaimPolicy: Delete
storageClassName: standard
volumeMode: Filesystem
status:
phase: Bound
apiVersion: v1
kind: Pod
metadata:
name: elasticsearch-sample-es-default-0
namespace: default
spec:
containers: (...)
volumes:
- name: elasticsearch-data
persistentVolumeClaim:
claimName: elasticsearch-data-elasticsearch-sample-es-default-0
status:
phase: Running
There is a strong deterministic relationship between a Pod and a PersistentVolumeClaim. In the example above, the Pod elasticsearch-sample-es-default-0 relies on a PersistentVolumeClaim named elasticsearch-data-elasticsearch-sample-es-default-0 (<volume name>-<pod name>). On Pod creation, if a PersistentVolumeClaim with that name already exists, it is reused; if not, it is created automatically.
As long as the claim exists, and is bound to a volume, the Pod can be deleted and recreated: it will reuse the same volume.
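For example, using the resources above, you can verify this behaviour by deleting the Pod and checking that the recreated Pod reuses the same claim (a sketch; output omitted):

kubectl delete pod elasticsearch-sample-es-default-0
# the StatefulSet controller recreates a Pod with the same name,
# which binds to the existing PersistentVolumeClaim and its volume
kubectl get pvc elasticsearch-data-elasticsearch-sample-es-default-0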
It is very important for the PersistentVolume StorageClass to have volumeBindingMode: WaitForFirstConsumer set. Otherwise, a Pod may be scheduled, because of affinity settings, on a host where the bound PersistentVolume is not available.
Surprisingly, this setting is not applied by default to the default StorageClass of many cloud providers. Fortunately, the user can create (or patch) their own StorageClass with the same underlying provider, and set volumeBindingMode: WaitForFirstConsumer.
The reclaim policy of a storageClass specifies whether a PersistentVolume should be automatically deleted once its corresponding PersistentVolumeClaim is deleted. It can be set to Delete or Retain.
ECK deletes PersistentVolumeClaims automatically once they are no longer needed, following a cluster downscale or deletion; however, ECK does not delete PersistentVolumes. The user must be careful not to reuse an existing PersistentVolume belonging to a different Elasticsearch cluster for a new claim: Elasticsearch will refuse to start, since it detects that the data belongs to a different cluster. For that reason, we generally recommend using the Delete reclaim policy.
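The reclaim policy is set through the reclaimPolicy field of the StorageClass; an existing PersistentVolume can also be patched directly, for example (reusing the volume name from the example above):

kubectl patch pv pvc-88c7b921-713b-4235-b05a-3a504e51d930 \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'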
We can distinguish two ways to provision PersistentVolume resources:
- Static provisioning: pre-creates all the desired PersistentVolumes before creating the StatefulSet. Upon StatefulSet creation, Kubernetes attempts to bind a pending PersistentVolumeClaim to one of the available PersistentVolumes.
- Dynamic provisioning: does not create any PersistentVolume in advance. Instead, a provisioner notices that some PersistentVolumeClaims are pending, and automatically creates the corresponding PersistentVolumes. Kubernetes then attempts to bind the PersistentVolumeClaims to the newly created PersistentVolumes.
Most network-attached PersistentVolumes rely on dynamic provisioning: you only want a volume to be provisioned, and paid for, once a Pod requires it. You also likely want the volume to have exactly the size you expect (not more). Cloud providers ship their own dynamic provisioners by default, but it is also possible to deploy your own.
Static provisioning is more often associated with local volumes: you want one volume per device, on each node, with a fixed size (corresponding to the physical or logical partition size). Volumes stay around, pending a Pod request. A Pod may request less than an available volume's size: it can still be bound to a volume with a larger size. Static provisioning can be achieved manually (create the PersistentVolume resource "by hand"), or through a static provisioner (a process on each host detects existing disks automatically and creates the corresponding PersistentVolumes).
PersistentVolumes come in two flavours, with a different performance/price/operations trade-off. From ECK's perspective there is no difference: both are handled the same way.
Network-attached PersistentVolumes can generally be attached to a Pod regardless of the host it is scheduled on. This gives a huge operational benefit: if that host goes down, or needs to be replaced, the Pod can simply be removed from it. It will be rescheduled automatically on a different node (generally in the same zone), and reattached to the same volume. This can take only a few seconds, and does not require any human intervention. The downside is performance: IOPS and latency are generally not as good as with locally attached disks. The fastest volumes from major cloud providers (GKE/AKS/EKS) still give good performance (see the Performance section below). Most cloud providers charge per GB-month and IOPS-month.
Local PersistentVolumes are bound to a particular host, and generally map to a directory on the host filesystem. The Pod can only be scheduled on that host in order to reuse the same volume. This is handled by affinity settings in the PersistentVolume itself. If that host goes down, or needs to be replaced, the Pod will not be scheduled on another host: it remains in a Pending state until the host comes back, or until the PersistentVolumeClaim is deleted (a manual operation). Performance is as good as the underlying device and filesystem performance. Cloud providers generally charge per mounted local disk.
Benchmarks are complicated, and depend a lot on the actual cluster usage. Generally speaking, locally attached SSDs give better performance per dollar, but the fastest network-attached volumes (e.g. AWS io1) seem to give decent performance.
See the benchmarks page for numbers and a more detailed comparison between network-attached and local storage options. This other benchmark page compares the Rally http_logs track with what is published on elasticsearch-benchmarks.
We recommend benchmarking the available storage options on your environment before drawing any conclusion.
To temporarily remove a Kubernetes node for maintenance purposes, we usually cordon it (mark it as unschedulable), then drain it (kubectl drain <node>) to force all Pods to be rescheduled elsewhere. With local PersistentVolumes, the evicted Pod can only be scheduled on the same Kubernetes node. If the node is unschedulable (or no longer available), the Pod stays Pending. Once the host is back in the cluster, the Pod is automatically started and reuses the same PersistentVolume.
ECK sets up a default PodDisruptionBudget to control how many Pods per Elasticsearch cluster can be safely evicted during node drains. When the Elasticsearch cluster health is green, the PDB allows one Pod disruption. With a single Elasticsearch node per Kubernetes host, this means hosts can be upgraded one by one, as long as Elasticsearch stays healthy. It may take some time for the cluster to become green again between host upgrades.
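You can inspect the budget ECK created (a sketch; exact output depends on the cluster):

kubectl get poddisruptionbudgets
# with a green cluster, MAX UNAVAILABLE should be 1 for the Elasticsearch PDB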
Some cloud providers' Kubernetes offerings only respect the PodDisruptionBudget for a certain amount of time before killing all Pods on the node. For example, GKE automated version upgrades rotate all nodes (without preserving local volumes), and only respect the PodDisruptionBudget for 1 hour. In such cases it is preferable to handle the version upgrade manually: drain the node pool and reschedule the Pods on a different (up-to-date) node pool.
If a Kubernetes node experiences a failure, or is permanently removed from the cluster, the local data is likely lost. The corresponding Pod stays Pending because it can no longer be attached to its unrecoverable PersistentVolume.
When we know that the host will not come back alive, we likely want the Pod to be scheduled again on a new host with an empty data volume, and rely on Elasticsearch shard replication to recover the missing data. This can be done by manually removing both the PersistentVolumeClaim and the Pod resources. A new Pod with the same name is created with a new PersistentVolumeClaim (empty data). That Pod can be scheduled on any of the other available Kubernetes nodes.
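Sticking with the sample resource names used earlier, the manual recovery is a two-step deletion (a sketch):

# acknowledge the data loss: delete the claim first, then the Pod
kubectl delete pvc elasticsearch-data-elasticsearch-sample-es-default-0
kubectl delete pod elasticsearch-sample-es-default-0
# the recreated Pod gets a fresh empty claim and can land on any schedulable node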
This "manual" process could be automated by deleting the PVC before removing the node, or by setting up a process that automatically deletes PVCs when the Node resource is deleted from Kubernetes. See these GitHub issues: kubernetes-sigs/sig-storage-local-static-provisioner#201, kubernetes-sigs/sig-storage-local-static-provisioner#181, kubernetes/enhancements#1116.
For cases where node removal is planned in advance, we could eventually introduce a way to automatically migrate data away from the Elasticsearch node before deleting it.
As part of rolling upgrades (configuration change, version upgrade, etc.), ECK deletes existing Pods one by one. Those Pods are automatically recreated with the newer specification, and reuse the same PersistentVolumes. In the small time window between Pod deletion and re-creation, the scheduler may allocate a different Pod on the host. At that point, the host may no longer have enough spare resources (CPU, RAM) to schedule the replacing Elasticsearch Pod, which then stays Pending. Although rare, this can still happen on production systems with concurrent Pod creations.
This can be worked around by setting a high priority class on Pods that rely on local volumes. If the Kubernetes node with the existing local volume does not have spare resources, Kubernetes starts evicting lower-priority Pods on that node in order to schedule the higher-priority ones.
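A minimal sketch of this approach, with a hypothetical class name:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: elasticsearch-critical # hypothetical name
value: 1000000 # higher than the default priority (0) of other workloads
globalDefault: false
description: "High priority for Elasticsearch Pods relying on local volumes."

It can then be referenced from the nodeSet podTemplate in the Elasticsearch spec:

  podTemplate:
    spec:
      priorityClassName: elasticsearch-critical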
Another way to deal with this is to use taints and tolerations, which allow dedicating Kubernetes nodes to Elasticsearch workloads.
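For example (the taint key and value are hypothetical), taint the dedicated nodes:

kubectl taint nodes my-node dedicated=elasticsearch:NoSchedule

and add the matching toleration to the nodeSet podTemplate:

  podTemplate:
    spec:
      tolerations:
      - key: dedicated
        operator: Equal
        value: elasticsearch
        effect: NoSchedule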
It may also help to use fixed RAM/storage ratios, aiming for the storage to be fully used when the other node resources are fully used. For example, if we always associate 8GB RAM with 1TB storage, a 32GB host (4TB storage) can hold 2x8GB Pods (2x1TB) + 1x16GB Pod (1x2TB). When an 8GB Pod is temporarily deleted for recreation, no other Pod can be scheduled concurrently, since there is no volume available on the host.
Imagine a situation where a user wants to increase an existing Elasticsearch Pod's RAM requirement from 8GB to 32GB. Since the Pod exists with a local PersistentVolume bound to a particular Kubernetes node, the replacing 32GB Pod must be scheduled on the same Kubernetes node. However, nothing guarantees that the Kubernetes node has 32GB of RAM available. ECK relies entirely on the Kubernetes scheduler, and does not inspect the node specifications in any way. To perform the RAM increase, ECK deletes the existing Pod, which is recreated automatically and reuses the same PersistentVolume. However, if the host does not have 32GB of RAM available, the Pod stays Pending forever. There are several ways out of this situation:
- restore the previous RAM requirements so the Pod can be scheduled again on the same host
- manually delete PVC and Pod (acknowledging data loss - which should be fine since ECK ensured the cluster is green before deleting the Pod), so they can be scheduled elsewhere
- ensure Pods with the newer spec are created on a different Kubernetes node, with data migrated over, by renaming the nodeSet
When using local volumes, make sure Elasticsearch indices have at least one replica to guarantee shard availability.
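For example, you can check or set this through the Elasticsearch index settings API (my-index is a placeholder):

PUT my-index/_settings
{
  "index.number_of_replicas": 1
}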
There are multiple ways to configure Kubernetes scheduling priorities to optimize resource allocation. However, the default Kubernetes scheduler is not aware of any storage capacity constraint. It cannot favor hosts with the most (or least) remaining storage capacity.
- Manual provisioning
PersistentVolumes can be manually created to rely on the local volume mechanism. The following example sets up a local PersistentVolume bound to the node my-node, where data is mounted on the host at /mnt/mydata:
apiVersion: v1
kind: PersistentVolume
metadata:
name: example-local-pv
spec:
capacity:
storage: 5Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
storageClassName: local-storage
local:
path: /mnt/mydata
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- my-node
It can be bound to a claim that specifies the same storageClassName:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-claim
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: local-storage
Maintained by the Kubernetes community as part of sig-storage, this provisioner creates PersistentVolumes by auto-discovering each disk on the host (for example, one PersistentVolume per directory in /mnt/disks/*). It can also handle partition formatting (for example, ext4). Once a volume has been released, it is cleaned up automatically and a fresh replacement PersistentVolume is created. The provisioner is a good fit for use cases where an entire disk or partition should be dedicated to a single volume.
This provisioner handles dynamic PersistentVolume provisioning, based on a list of filesystem directories where hostPath volumes can be created (one sub-directory per volume). It can be configured with the provisioning path(s) for each host, and handles cleaning up data on volume removal. Even though a PersistentVolumeClaim can specify a storage size (e.g. 10GB), the provisioner does not enforce any capacity check on the volume itself (the underlying filesystem usage can grow larger than 10GB). This provisioner is a good fit for use cases where multiple volumes need to be dynamically provisioned and disk usage enforcement is not a concern.
OpenEBS is a storage solution for Kubernetes that supports several volume types, including hostPath local volumes and block device local volumes.
From the OpenShift docs, it looks like this operator allows users to define their volumes using a LocalVolume custom resource. I could not find many implementation details.
TopoLVM is a CSI plugin for LVM volumes. It is able to dynamically provision PersistentVolumes of the desired size, with the benefits of using LVM (multiple disks combined into a single logical volume, volume expansion, thin provisioning, etc.). It extends the Kubernetes scheduler to be capacity-aware, in order to prioritize nodes with the largest remaining capacity. It requires lvmd to be installed on all hosts. The user must be able to configure the extended scheduler, which is unfortunately not possible on most cloud providers' Kubernetes offerings.
In this example, we'll deploy a production-grade Elasticsearch cluster on GKE, using the Kubernetes Local Volume Static Provisioner, with:
- 3 dedicated master nodes (4Gi RAM, 1 CPU, 10GB storage)
- 6 data nodes (50Gi RAM, 15 CPU, 3TB storage)
All nodes rely on local SSD PersistentVolumes. Since GKE local SSDs are limited to 375GB each, we set up a RAID 0 array of 8 local SSDs for the data nodes.
First, let's create a GKE Kubernetes cluster with 9 nodes in 2 different node pools: one for master nodes, another one for data nodes.
gcloud beta container --project "elastic-cloud-dev" clusters create "seb-localpv-cluster" \
  --region "us-central1" --no-enable-basic-auth --cluster-version "1.16.9-gke.6" \
  --machine-type "n1-standard-2" --image-type "UBUNTU" --disk-type "pd-standard" --disk-size "100" \
  --local-ssd-count "1" --metadata disable-legacy-endpoints=true \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias \
  --network "projects/elastic-cloud-dev/global/networks/default" \
  --subnetwork "projects/elastic-cloud-dev/regions/us-central1/subnetworks/default" \
  --default-max-pods-per-node "110" --no-enable-master-authorized-networks \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --no-enable-autoupgrade --no-enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 &&
gcloud beta container --project "elastic-cloud-dev" node-pools create "data-pool" \
  --cluster "seb-localpv-cluster" --region "us-central1" --node-version "1.16.9-gke.6" \
  --machine-type "n1-standard-16" --image-type "UBUNTU" --disk-type "pd-standard" --disk-size "100" \
  --local-ssd-count "8" --metadata disable-legacy-endpoints=true \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "2" --no-enable-autoupgrade --no-enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0
Then, deploy the static provisioner with a local-storage storageClass. The provisioner runs as a DaemonSet on each Kubernetes node, and auto-discovers any disk in /mnt/disks.
As part of an initContainer, we run an extra bash script to assemble a RAID 0 array from all available SSDs, formatted as ext4. Simpler setups with no RAID can remove or tweak the local-ssd-startup init container. The bash script comes from https://github.com/pingcap/tidb-operator/blob/master/manifests/gke/local-ssd-provision/local-ssd-provision.yaml.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: ConfigMap
metadata:
name: local-provisioner-config
namespace: kube-system
data:
setPVOwnerRef: "true"
nodeLabelsForPV: |
- kubernetes.io/hostname
storageClassMap: |
local-storage:
hostDir: /mnt/disks
mountDir: /mnt/disks
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: local-volume-provisioner
namespace: kube-system
labels:
app: local-volume-provisioner
spec:
selector:
matchLabels:
app: local-volume-provisioner
template:
metadata:
labels:
app: local-volume-provisioner
spec:
hostPID: true
nodeSelector:
cloud.google.com/gke-local-ssd: "true"
serviceAccountName: local-storage-admin
initContainers:
- name: local-ssd-startup
image: alpine
command: ['/bin/sh', '-c', 'nsenter -t 1 -m -u -i -n -p -- bash -c "${STARTUP_SCRIPT}"']
securityContext:
privileged: true
volumeMounts:
- mountPath: /mnt/disks
name: local-disks
mountPropagation: Bidirectional
env:
- name: STARTUP_SCRIPT
value: |
#!/usr/bin/env bash
set -euo pipefail
set -x
# discard,nobarrier are required to optimize local SSD
# performance in GCP, see
# https://cloud.google.com/compute/docs/disks/performance#optimize_local_ssd
mnt_opts="defaults,nodelalloc,noatime,discard,nobarrier"
# use /var because it is writeable on COS
if ! findmnt -n -a -l | grep /mnt/disks/ssd ; then
if test -f /var/ssd_mounts ; then
ssd_mounts=$(cat /var/ssd_mounts)
else
echo "no ssds mounted yet"
exit 1
fi
else
ssd_mounts=$(findmnt -n -a -l --nofsroot | grep /mnt/disks/ssd)
echo "$ssd_mounts" > /var/ssd_mounts
fi
# Re-mount all disks as a single logical volume with a UUID
if old_mounts=$(findmnt -n -a -l --nofsroot | grep /mnt/disks/ssd) ; then
echo "$old_mounts" | awk '{print $1}' | while read -r ssd ; do
umount "$ssd"
new_fstab=$(grep -v "$ssd" /etc/fstab) || echo "fstab is now empty"
echo "$new_fstab" > /etc/fstab
done
fi
echo "$ssd_mounts" | awk '{print $1}' | while read -r ssd ; do
if test -d "$ssd"; then
rm -r "$ssd"
fi
done
devs=$(echo "$ssd_mounts" | awk '{print $2}')
raid_dev=/dev/md0
# If RAID or LVM is already in use, this may have been re-deployed.
# Don't try to change the disks.
pvs=$((test -x /sbin/pvs && /sbin/pvs) || echo "")
if ! test -e $raid_dev && ! echo "$pvs" | grep volume_all_ssds ; then
# wipe all devices
echo "$devs" | while read -r dev ; do
dev_basename=$(basename "$dev")
mkdir -p /var/dev_wiped/
if ! test -f /var/dev_wiped/$dev_basename ; then
if findmnt -n -a -l --nofsroot | grep "$dev" ; then
echo "$dev" already individually mounted
exit 1
fi
/sbin/wipefs --all "$dev"
touch /var/dev_wiped/$dev_basename
fi
done
# Don't combine if there is 1 disk or the environment variable is set.
# lvm and mdadm do have overhead, so don't use them if there is just 1 disk
# remount with uuid, set mount options (nobarrier), and exit
NO_COMBINE_LOCAL_SSD="${NO_COMBINE_LOCAL_SSD:-""}"
if ! test -z "$NO_COMBINE_LOCAL_SSD" || [ "$(echo "$devs" | wc -l)" -eq 1 ] ; then
echo "$devs" | while read -r dev ; do
if ! findmnt -n -a -l --nofsroot | grep "$dev" ; then
if ! uuid=$(blkid -s UUID -o value "$dev") ; then
mkfs.ext4 "$dev"
uuid=$(blkid -s UUID -o value "$dev")
fi
mnt_dir="/mnt/disks/$uuid"
mkdir -p "$mnt_dir"
if ! grep "$uuid" /etc/fstab ; then
echo "UUID=$uuid $mnt_dir ext4 $mnt_opts" >> /etc/fstab
fi
mount -U "$uuid" -t ext4 --target "$mnt_dir" --options "$mnt_opts"
chmod a+w "$mnt_dir"
fi
done
exit 0
fi
fi
new_dev=
USE_LVM="${USE_LVM:-""}"
# If RAID is available use it because it performs better than LVM
if test -e $raid_dev || (test -x /sbin/mdadm && test -z "$USE_LVM") ; then
if ! test -e $raid_dev ; then
echo "$devs" | xargs /sbin/mdadm --create $raid_dev --level=0 --raid-devices=$(echo "$devs" | wc -l)
sudo mkfs.ext4 -F $raid_dev
new_dev=$raid_dev
fi
else
if ! echo "$pvs" | grep volume_all_ssds ; then
echo "$devs" | xargs /sbin/pvcreate
fi
/sbin/pvdisplay
if ! /sbin/vgs | grep volume_all_ssds ; then
echo "$devs" | xargs /sbin/vgcreate volume_all_ssds
fi
/sbin/vgdisplay
if ! /sbin/lvs | grep logical_all_ssds ; then
/sbin/lvcreate -l 100%FREE -n logical_all_ssds volume_all_ssds
fi
/sbin/lvdisplay
new_dev=/dev/volume_all_ssds/logical_all_ssds
fi
if ! uuid=$(blkid -s UUID -o value $new_dev) ; then
mkfs.ext4 $new_dev
uuid=$(blkid -s UUID -o value $new_dev)
fi
mnt_dir="/mnt/disks/$uuid"
mkdir -p "$mnt_dir"
if ! grep "$uuid" /etc/fstab ; then
echo "UUID=$uuid $mnt_dir ext4 $mnt_opts" >> /etc/fstab
fi
mount -U "$uuid" -t ext4 --target "$mnt_dir" --options "$mnt_opts"
chmod a+w "$mnt_dir"
containers:
- image: "quay.io/external_storage/local-volume-provisioner:v2.3.4"
name: provisioner
securityContext:
privileged: true
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 100m
memory: 100Mi
env:
- name: MY_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: MY_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: JOB_CONTAINER_IMAGE
value: "quay.io/external_storage/local-volume-provisioner:v2.3.2"
volumeMounts:
- mountPath: /etc/provisioner/config
name: provisioner-config
readOnly: true
- mountPath: /mnt/disks
name: local-disks
mountPropagation: "HostToContainer"
tolerations:
- effect: NoSchedule
operator: Exists
volumes:
- name: provisioner-config
configMap:
name: local-provisioner-config
- name: local-disks
hostPath:
path: /mnt/disks
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: local-storage-admin
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: local-storage-provisioner-pv-binding
namespace: kube-system
subjects:
- kind: ServiceAccount
name: local-storage-admin
namespace: kube-system
roleRef:
kind: ClusterRole
name: system:persistent-volume-provisioner
apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: local-storage-provisioner-node-clusterrole
namespace: kube-system
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: local-storage-provisioner-node-binding
namespace: kube-system
subjects:
- kind: ServiceAccount
name: local-storage-admin
namespace: kube-system
roleRef:
kind: ClusterRole
name: local-storage-provisioner-node-clusterrole
apiGroup: rbac.authorization.k8s.io
Check that one PersistentVolume with a Delete reclaim policy has been created for each host:
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
local-pv-36f2b019 2950Gi RWO Delete Available local-storage 7m56s
local-pv-3be19c8f 2950Gi RWO Delete Available local-storage 7m37s
local-pv-5543ea00 2950Gi RWO Delete Available local-storage 7m38s
local-pv-bd1aac10 2950Gi RWO Delete Available local-storage 7m38s
local-pv-c3f4f33 368Gi RWO Delete Available local-storage 8m22s
local-pv-d870e432 368Gi RWO Delete Available local-storage 8m21s
local-pv-e5615568 368Gi RWO Delete Available local-storage 8m21s
local-pv-e693f7e8 2950Gi RWO Delete Available local-storage 7m50s
local-pv-ed30daed 2950Gi RWO Delete Available local-storage 7m38s
Create an Elasticsearch cluster with 3 dedicated master nodes and 6 data nodes. Both node sets rely on the local-storage storageClass we just created.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: mycluster
spec:
version: 7.8.0
nodeSets:
- name: master-nodes
count: 3
config:
node.master: true
node.data: false
node.ingest: false
node.ml: false
node.store.allow_mmap: false
podTemplate:
spec:
affinity:
# schedule master nodes on the default-pool
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-nodepool
operator: In
values: ["default-pool"]
# don't put two master nodes on the same host
# (note this is already set by ECK, by default)
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
elasticsearch.k8s.elastic.co/cluster-name: mycluster
topologyKey: kubernetes.io/hostname
containers:
- name: elasticsearch
env:
- name: ES_JAVA_OPTS
value: -Xms2g -Xmx2g
resources:
requests:
memory: 4Gi
cpu: 1
limits:
memory: 4Gi
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: local-storage
- name: data-nodes
count: 6
config:
node.master: false
node.data: true
node.ingest: true
node.ml: true
node.store.allow_mmap: false
podTemplate:
spec:
affinity:
# schedule data nodes on the data-pool
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-nodepool
operator: In
values: ["data-pool"]
# don't put two data nodes on the same host
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
elasticsearch.k8s.elastic.co/cluster-name: mycluster
topologyKey: kubernetes.io/hostname
containers:
- name: elasticsearch
env:
- name: ES_JAVA_OPTS
value: -Xms25g -Xmx25g
resources:
requests:
memory: 50Gi
cpu: 15
limits:
memory: 50Gi
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 2950Gi
storageClassName: local-storage
The 9 Pods have been created and their volumes are bound:
kubectl get pods
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-0 1/1 Running 0 55s
mycluster-es-data-nodes-1 1/1 Running 0 55s
mycluster-es-data-nodes-2 1/1 Running 0 55s
mycluster-es-data-nodes-3 1/1 Running 0 55s
mycluster-es-data-nodes-4 1/1 Running 0 55s
mycluster-es-data-nodes-5 1/1 Running 0 55s
mycluster-es-master-nodes-0 1/1 Running 0 56s
mycluster-es-master-nodes-1 1/1 Running 0 56s
mycluster-es-master-nodes-2 1/1 Running 0 56s
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
elasticsearch-data-mycluster-es-data-nodes-0 Bound local-pv-bd1aac10 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-1 Bound local-pv-e693f7e8 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-2 Bound local-pv-36f2b019 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-3 Bound local-pv-5543ea00 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-4 Bound local-pv-ed30daed 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-5 Bound local-pv-3be19c8f 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-master-nodes-0 Bound local-pv-e5615568 368Gi RWO local-storage 74s
elasticsearch-data-mycluster-es-master-nodes-1 Bound local-pv-d870e432 368Gi RWO local-storage 74s
elasticsearch-data-mycluster-es-master-nodes-2 Bound local-pv-c3f4f33 368Gi RWO local-storage 74s
Kill a Pod; it should be recreated automatically with the same PVC and PV:
kubectl delete pod mycluster-es-data-nodes-0
Decrease the number of data nodes by editing the count of the second nodeSet. There are now 5 Elasticsearch data Pods left. The PersistentVolume of the deleted Pod should be automatically released, cleaned up, then recreated and marked available:
kubectl get pods
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-0 1/1 Running 0 2m38s
mycluster-es-data-nodes-1 1/1 Running 0 4m52s
mycluster-es-data-nodes-2 1/1 Running 0 4m52s
mycluster-es-data-nodes-3 1/1 Running 0 4m52s
mycluster-es-data-nodes-4 1/1 Running 0 4m52s
mycluster-es-master-nodes-0 1/1 Running 0 4m53s
mycluster-es-master-nodes-1 1/1 Running 0 4m53s
mycluster-es-master-nodes-2 1/1 Running 0 4m53s
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
local-pv-36f2b019 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-2 local-storage 5m32s
local-pv-3be19c8f 2950Gi RWO Delete Available local-storage 3s
local-pv-5543ea00 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-3 local-storage 38m
local-pv-bd1aac10 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-0 local-storage 5m34s
local-pv-c3f4f33 368Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-master-nodes-2 local-storage 5m27s
local-pv-d870e432 368Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-master-nodes-1 local-storage 5m27s
local-pv-e5615568 368Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-master-nodes-0 local-storage 5m27s
local-pv-e693f7e8 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-1 local-storage 5m26s
local-pv-ed30daed 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-4 local-storage 38m
Drain a Kubernetes node holding one of the data nodes:
kubectl drain gke-seb-localpv-cluster-data-pool-58b0702c-2754 --ignore-daemonsets --delete-local-data
Notice how the Pod mycluster-es-data-nodes-1 got terminated, and is now Pending because it cannot be scheduled on the Kubernetes node holding its PersistentVolume:
kubectl get pods
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-0 1/1 Running 0 4m52s
mycluster-es-data-nodes-1 0/1 Pending 0 6s
mycluster-es-data-nodes-2 1/1 Running 0 7m6s
mycluster-es-data-nodes-3 1/1 Running 0 7m6s
mycluster-es-data-nodes-4 1/1 Running 0 7m6s
mycluster-es-master-nodes-0 1/1 Running 0 7m7s
mycluster-es-master-nodes-1 1/1 Running 0 7m7s
mycluster-es-master-nodes-2 1/1 Running 0 7m7s
kubectl describe pod mycluster-es-data-nodes-1
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 58s default-scheduler 0/9 nodes are available: 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 3 node(s) didn't match node selector, 7 Insufficient cpu, 7 Insufficient memory
Simulate the Kubernetes node coming back online:
kubectl uncordon gke-seb-localpv-cluster-data-pool-58b0702c-2754
The Pod should automatically start on the node, and reattach to the existing volume:
kubectl get pod mycluster-es-data-nodes-1
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-1 1/1 Running 0 2m10s
Now let's simulate a complete Kubernetes node removal (or failure). Drain the node again:
kubectl drain gke-seb-localpv-cluster-data-pool-58b0702c-2754 --ignore-daemonsets --delete-local-data
The Pod is Pending:
kubectl get pod mycluster-es-data-nodes-1
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-1 0/1 Pending 0 9s
Let's consider the data lost and unrecoverable. We'd like the Pending Pod to be scheduled on another available host, and to start with an empty data volume. To do so, we need to remove both the PersistentVolumeClaim and the Pod:
kubectl delete pvc elasticsearch-data-mycluster-es-data-nodes-1
kubectl delete pod mycluster-es-data-nodes-1
The Pod is now automatically started on the available Kubernetes node, with a new empty local volume:
kubectl get pod mycluster-es-data-nodes-1
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-1 1/1 Running 0 33s