- tl;dr (short version)
- Kubernetes storage options
- PersistentVolumes in details
- Local PersistentVolumes example
It depends! 😉
ECK does not come with its own storage mechanism for Elasticsearch data. Instead, it is compatible with any Kubernetes storage option. We recommend using PersistentVolumes: they can be network-attached or local. Both are handled the same way by ECK, and can be combined for different node types in a single cluster. There is a performance/price/operations trade-off to consider:
- Network-attached PersistentVolumes
  - Provide good performance when using the fastest storage class of most big cloud providers (e.g. io1/gp2 volumes on AWS).
  - Provide bad performance when the underlying storage performance is bad (e.g. NFS shared volumes).
  - Very easy to operate, since a Pod can be automatically rescheduled on a different host while reattaching the same volume.
- Local PersistentVolumes
  - Provide the best performance when the underlying storage is very fast (e.g. NVMe SSD disks).
  - Much harder to operate: if a host dies, manual intervention is likely required (with data loss); otherwise the Pod will stay Pending.
  - Need to be either manually provisioned, or automatically provisioned through an additional provisioner.
One way to tackle this trade-off:
- If a network-attached PersistentVolume provider is available in the k8s environment:
  - Consider using it: Elastic has historically discouraged network-attached storage, but it is now a viable option, especially for k8s environments running on major cloud providers (e.g. AWS, GCP, Azure), which provide acceptable performance.
  - Study its pricing options.
  - Benchmark it against the expected target cluster usage.
  - If price and performance are OK: go for it.
- Otherwise, consider using local PersistentVolumes:
  - Choose a local volume provisioner (the Kubernetes Local Volume Static Provisioner is a good start).
  - Be aware of the operational concerns of local PersistentVolumes (described below).
An emptyDir is an ephemeral volume that exists as long as a Pod is running on a given node. If the Pod is scheduled to another node for any reason (e.g. k8s scheduling priorities, host maintenance, node restart), the emptyDir data is deleted forever.
A sizeLimit can be specified (e.g. 1Gi): once reached, the Pod is automatically evicted, which means its data is lost forever.
The actual underlying storage depends on the Kubernetes environment: it may be a tmpfs in RAM, HDD/SSD disks, etc.
emptyDirs are useful for ephemeral data that can be lost at any time. They are not recommended for storing Elasticsearch data. They may still be useful for coordinating-only or ingest-only nodes, where losing data does not impact the cluster much. The Volume claim templates section explains how to set up emptyDir volumes with ECK.
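As an illustration, here is a minimal sketch of a plain Pod (not an ECK resource) using an emptyDir volume with a size limit; the Pod name is hypothetical, and the image and mount path follow Elasticsearch defaults:

apiVersion: v1
kind: Pod
metadata:
  name: ingest-only-example # hypothetical Pod name
spec:
  containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    volumeMounts:
    - name: elasticsearch-data
      mountPath: /usr/share/elasticsearch/data
  volumes:
  - name: elasticsearch-data
    emptyDir:
      sizeLimit: 1Gi # once reached, the Pod is evicted and the data is gone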
hostPath volumes allow mounting a filesystem path from the host a Pod is running on (e.g. /mnt/data). No size limit can be provided when using hostPath: the Pod simply accesses the filesystem as it is. Since ECK expects that a Pod can be recreated to reuse the same volume at any time, it is important to ensure that a given Pod can only be scheduled on the same Kubernetes node. This means every single Pod must have affinity constraints that force it to run on a statically predefined Kubernetes node.
hostPath volumes do not fit StatefulSets very well: when using hostPath, we likely want a different statically defined affinity constraint for each Pod, which means a single Pod per StatefulSet.
It is definitely possible to use hostPath volumes for Elasticsearch, but similar things can be achieved with the local PersistentVolumes approach. The latter provides much more flexibility, since the affinity constraints are handled automatically at the volume level (as opposed to the user handling them at the Pod specification level). It also better fits the Kubernetes StatefulSet concept. We recommend using local PersistentVolumes instead of hostPath.
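For illustration only, here is a sketch of a Pod pinned to a single node with a hostPath volume; the Pod name, node name and path are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: es-hostpath-example # hypothetical Pod name
spec:
  # force the Pod to always land on the same statically defined node
  nodeSelector:
    kubernetes.io/hostname: my-node
  containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:7.8.0
    volumeMounts:
    - name: elasticsearch-data
      mountPath: /usr/share/elasticsearch/data
  volumes:
  - name: elasticsearch-data
    hostPath:
      path: /mnt/data # host filesystem path, no size limit enforced
      type: DirectoryOrCreate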
PersistentVolumes are best used with StatefulSets (a logical group of Pods sharing the same specification), which is the resource ECK relies on to manage Elasticsearch. When a Pod is bound to a PersistentVolume, the binding between the two resources remains until manually removed. ECK relies on this binding mechanism during Elasticsearch rolling upgrades: we know a Pod can be safely deleted, then recreated, because it will reuse the same PersistentVolume, hence the same data.
This binding is enforced at Pod scheduling time: if an existing PersistentVolume is already bound to the Pod about to be created (by name), that volume will be used automatically. If the volume is not available (say, it is only available on a single host and that host is down), then the Pod will not be able to start successfully and will stay in a Pending state.
PersistentVolumes are associated with a storageClass representing the underlying volume implementation (e.g. local, gp2, gke-pd, etc.). Different storage classes offer different performance characteristics. Most cloud providers provide their own storage classes, but the user is free to create any additional storageClass, or set up any additional PersistentVolume provider. This page in the Kubernetes documentation references various storage classes.
A size limit can be provided in the volume specification. The actual enforcement of that limit depends on the provider implementation.
PersistentVolumes are the recommended way of deploying Elasticsearch with ECK.
Let's take an example to understand the Elasticsearch -> StatefulSet -> PersistentVolumeClaim -> PersistentVolume -> Pod relationship.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: elasticsearch-sample
spec:
version: 7.8.0
nodeSets:
- name: default
count: 3
config:
node.store.allow_mmap: false
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: standard
Each item in spec.nodeSets maps 1:1 to a StatefulSet. The Elasticsearch specification above is transformed by ECK into a single StatefulSet (since there is a single nodeSet), handled by the StatefulSet controller:
apiVersion: apps/v1
kind: StatefulSet
metadata:
creationTimestamp: "2020-07-03T13:19:22Z"
generation: 1
name: elasticsearch-sample-es-default
namespace: default
spec:
replicas: 3
template: (...)
updateStrategy:
type: OnDelete
volumeClaimTemplates:
- metadata:
creationTimestamp: null
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: standard
volumeMode: Filesystem
status:
phase: Pending
In turn, the StatefulSet controller (part of the core Kubernetes controllers) translates the StatefulSet into 3 Pods:
kubectl get pods
elasticsearch-sample-es-default-0
elasticsearch-sample-es-default-1
elasticsearch-sample-es-default-2
And 3 PersistentVolumeClaims:
kubectl get pvc
elasticsearch-data-elasticsearch-sample-es-default-0
elasticsearch-data-elasticsearch-sample-es-default-1
elasticsearch-data-elasticsearch-sample-es-default-2
Each claim represents a desire for a Pod to use a PersistentVolume. The claim is referenced in the Pod resource:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: elasticsearch-data-elasticsearch-sample-es-default-0
namespace: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: standard
volumeMode: Filesystem
status:
phase: Pending
apiVersion: v1
kind: Pod
metadata:
name: elasticsearch-sample-es-default-0
namespace: default
spec:
containers: (...)
volumes:
- name: elasticsearch-data
persistentVolumeClaim:
claimName: elasticsearch-data-elasticsearch-sample-es-default-0
status:
phase: Pending
Both the Pod and the PersistentVolumeClaim stay Pending until a PersistentVolume is bound to the PersistentVolumeClaim. Dynamic PersistentVolume provisioners can create the PersistentVolume resource just in time. Once that happens, the Pod can start with the volume referenced by the claim. See how the Pod, PersistentVolumeClaim and PersistentVolume reference each other:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-sample-es-default
  name: elasticsearch-data-elasticsearch-sample-es-default-0
  namespace: default
  uid: 88c7b921-713b-4235-b05a-3a504e51d930
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: standard
volumeMode: Filesystem
volumeName: pvc-88c7b921-713b-4235-b05a-3a504e51d930 # reference to the PersistentVolume bound to that PersistentVolumeClaim
status:
accessModes:
- ReadWriteOnce
capacity:
storage: 5Gi
phase: Bound
apiVersion: v1
kind: PersistentVolume
metadata:
name: pvc-88c7b921-713b-4235-b05a-3a504e51d930
spec:
accessModes:
- ReadWriteOnce
capacity:
storage: 5Gi
claimRef:
apiVersion: v1
kind: PersistentVolumeClaim
name: elasticsearch-data-elasticsearch-sample-es-default-0
namespace: default
resourceVersion: "505313"
uid: 88c7b921-713b-4235-b05a-3a504e51d930
gcePersistentDisk:
fsType: ext4
pdName: local-njv5q-dynamic-pvc-88c7b921-713b-4235-b05a-3a504e51d930
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
operator: In
values:
- europe-west2-c
- key: failure-domain.beta.kubernetes.io/region
operator: In
values:
- europe-west2
persistentVolumeReclaimPolicy: Delete
storageClassName: standard
volumeMode: Filesystem
status:
phase: Bound
apiVersion: v1
kind: Pod
metadata:
name: elasticsearch-sample-es-default-0
namespace: default
spec:
containers: (...)
volumes:
- name: elasticsearch-data
persistentVolumeClaim:
claimName: elasticsearch-data-elasticsearch-sample-es-default-0
status:
phase: Running
There is a strong deterministic relationship between a Pod and a PersistentVolumeClaim. In the example above, the Pod elasticsearch-sample-es-default-0 relies on a PersistentVolumeClaim named elasticsearch-data-elasticsearch-sample-es-default-0 (<volume name>-<pod name>). On Pod creation, if a PersistentVolumeClaim with that name already exists, it is reused; if not, it is created automatically.
As long as the claim exists, and is bound to a volume, the Pod can be deleted and recreated: it will reuse the same volume.
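For example, using the resources above, you can verify this behaviour by deleting the Pod and checking that the recreated Pod reuses the same claim (a sketch; output omitted):

kubectl delete pod elasticsearch-sample-es-default-0
# the StatefulSet controller recreates a Pod with the same name,
# which binds to the existing PersistentVolumeClaim and its volume
kubectl get pvc elasticsearch-data-elasticsearch-sample-es-default-0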
It is very important for the PersistentVolume StorageClass to have volumeBindingMode: WaitForFirstConsumer set. Otherwise, a Pod may be scheduled, because of affinity settings, on a host where the bound PersistentVolume is not available.
Surprisingly, this setting is not applied by default to the default StorageClass of many cloud providers. Fortunately, the user can create (or patch) their own StorageClass with the same underlying provider, and set volumeBindingMode: WaitForFirstConsumer.
The reclaim policy of a storageClass specifies whether a PersistentVolume should be automatically deleted once its corresponding PersistentVolumeClaim is deleted. It can be set to Delete or Retain.
ECK deletes PersistentVolumeClaims automatically once they are no longer needed, following a cluster downscale or deletion; however, ECK does not delete PersistentVolumes. The user must be careful not to reuse an existing PersistentVolume belonging to a different Elasticsearch cluster for a new claim: Elasticsearch will refuse to start, since it detects that the data belongs to a different cluster. For that reason, we generally recommend using the Delete reclaim policy.
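The reclaim policy is set through the reclaimPolicy field of the StorageClass; an existing PersistentVolume can also be patched directly, for example (reusing the volume name from the example above):

kubectl patch pv pvc-88c7b921-713b-4235-b05a-3a504e51d930 \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'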
We can distinguish two ways to provision PersistentVolume resources:
- Static provisioning: pre-creates all the desired PersistentVolumes before creating the StatefulSet. Upon StatefulSet creation, Kubernetes attempts to bind a pending PersistentVolumeClaim to one of the available PersistentVolumes.
- Dynamic provisioning: does not create any PersistentVolume in advance. Instead, a provisioner notices that some PersistentVolumeClaims are pending, and automatically creates the corresponding PersistentVolumes. Kubernetes then attempts to bind the PersistentVolumeClaims to the newly created PersistentVolumes.
Most network-attached PersistentVolumes rely on dynamic provisioning: you only want a volume to be provisioned, and paid for, once a Pod requires it. You also likely want the volume to have exactly the size you expect (not more). Cloud providers ship their own dynamic provisioners by default, but it is also possible to deploy your own.
Static provisioning is more often associated with local volumes: you want one volume per device, on each node, with a fixed size (corresponding to the physical or logical partition size). Volumes stay around, pending a Pod request. A Pod may request less than an available volume's size: it can still be bound to a volume with a larger size. Static provisioning can be achieved manually (create the PersistentVolume resource "by hand"), or through a static provisioner (a process on each host detects existing disks automatically and creates the corresponding PersistentVolumes).
PersistentVolumes come in two flavours, with a different performance/price/operations trade-off. From ECK's perspective there is no difference: both are handled the same way.
Network-attached PersistentVolumes can generally be attached to a Pod regardless of the host it is scheduled on. This gives a huge operational benefit: if that host goes down, or needs to be replaced, the Pod can simply be removed from it. It will be rescheduled automatically on a different node (generally in the same zone), and reattached to the same volume. This can take only a few seconds, and does not require any human intervention. The downside is performance: IOPS and latency are generally not as good as with locally attached disks. The fastest volumes from major cloud providers (GKE/AKS/EKS) still give good performance (see the Performance section below). Most cloud providers charge per GB-month and IOPS-month.
Local PersistentVolumes are bound to a particular host, and generally map to a directory on the host filesystem. The Pod can only be scheduled on that host in order to reuse the same volume. This is handled by affinity settings in the PersistentVolume itself. If that host goes down, or needs to be replaced, the Pod will not be scheduled on another host: it remains in a Pending state until the host comes back, or until the PersistentVolumeClaim is deleted (a manual operation). Performance is as good as the underlying device and filesystem performance. Cloud providers generally charge per mounted local disk.
Benchmarks are complicated, and depend a lot on the actual cluster usage. Generally speaking, locally attached SSDs give better performance per dollar, but the fastest network-attached volumes (e.g. AWS io1) seem to give decent performance.
See the benchmarks page for numbers and a more detailed comparison between network-attached and local storage options. This other benchmark page compares the Rally http_logs track with what is published on elasticsearch-benchmarks.
We recommend benchmarking the available storage options on your environment before drawing any conclusion.
To temporarily remove a Kubernetes node for maintenance purposes, we usually cordon it (mark it as unschedulable), then drain it (kubectl drain <node>) to force all Pods to be rescheduled elsewhere. With local PersistentVolumes, the evicted Pod can only be scheduled on the same Kubernetes node. If the node is unschedulable (or no longer available), the Pod stays Pending. Once the host is back in the cluster, the Pod is automatically started and reuses the same PersistentVolume.
ECK sets up a default PodDisruptionBudget to control how many Pods per Elasticsearch cluster can be safely evicted during node drains. When the Elasticsearch cluster health is green, the PDB allows one Pod disruption. With a single Elasticsearch node per Kubernetes host, this means hosts can be upgraded one by one, as long as Elasticsearch stays healthy. It may take some time for the cluster to become green again between host upgrades.
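You can inspect the budget ECK created (a sketch; exact output depends on the cluster):

kubectl get poddisruptionbudgets
# with a green cluster, MAX UNAVAILABLE should be 1 for the Elasticsearch PDB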
Some cloud providers' Kubernetes offerings only respect the PodDisruptionBudget for a certain amount of time before killing all Pods on the node. For example, GKE automated version upgrades rotate all nodes (without preserving local volumes), and only respect the PodDisruptionBudget for 1 hour. In such cases it is preferable to handle the version upgrade manually: drain the node pool and reschedule the Pods on a different (up-to-date) node pool.
If a Kubernetes node experiences a failure, or is permanently removed from the cluster, the local data is likely lost. The corresponding Pod stays Pending because it can no longer be attached to its unrecoverable PersistentVolume.
When we know that the host will not come back alive, we likely want the Pod to be scheduled again on a new host with an empty data volume, and rely on Elasticsearch shard replication to recover the missing data. This can be done by manually removing both the PersistentVolumeClaim and the Pod resources. A new Pod with the same name is created with a new PersistentVolumeClaim (empty data). That Pod can be scheduled on any of the other available Kubernetes nodes.
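Sticking with the sample resource names used earlier, the manual recovery is a two-step deletion (a sketch):

# acknowledge the data loss: delete the claim first, then the Pod
kubectl delete pvc elasticsearch-data-elasticsearch-sample-es-default-0
kubectl delete pod elasticsearch-sample-es-default-0
# the recreated Pod gets a fresh empty claim and can land on any schedulable node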
This "manual" process could be automated by deleting the PVC before removing the node, or by setting up a process that automatically deletes PVCs when the Node resource is deleted from Kubernetes. See these GitHub issues: kubernetes-sigs/sig-storage-local-static-provisioner#201, kubernetes-sigs/sig-storage-local-static-provisioner#181, kubernetes/enhancements#1116.
For cases where node removal is planned in advance, we could eventually introduce a way to automatically migrate data away from the Elasticsearch node before deleting it.
As part of rolling upgrades (configuration change, version upgrade, etc.), ECK deletes existing Pods one by one. Those Pods are automatically recreated with the newer specification, and reuse the same PersistentVolumes. In the small time window between Pod deletion and re-creation, the scheduler may allocate a different Pod on the host. At that point, the host may no longer have enough spare resources (CPU, RAM) to schedule the replacing Elasticsearch Pod, which then stays Pending. Although rare, this can still happen on production systems with concurrent Pod creations.
This can be worked around by setting a high priority class on Pods that rely on local volumes. If the Kubernetes node with the existing local volume does not have spare resources, Kubernetes starts evicting lower-priority Pods on that node in order to schedule the higher-priority ones.
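A minimal sketch of this approach, with a hypothetical class name:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: elasticsearch-critical # hypothetical name
value: 1000000 # higher than the default priority (0) of other workloads
globalDefault: false
description: "High priority for Elasticsearch Pods relying on local volumes."

It can then be referenced from the nodeSet podTemplate in the Elasticsearch spec:

  podTemplate:
    spec:
      priorityClassName: elasticsearch-critical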
Another way to deal with this is to use taints and tolerations, which allow dedicating Kubernetes nodes to Elasticsearch workloads.
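For example (the taint key and value are hypothetical), taint the dedicated nodes:

kubectl taint nodes my-node dedicated=elasticsearch:NoSchedule

and add the matching toleration to the nodeSet podTemplate:

  podTemplate:
    spec:
      tolerations:
      - key: dedicated
        operator: Equal
        value: elasticsearch
        effect: NoSchedule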
It may also help to use fixed RAM/storage ratios, aiming for the storage to be fully used when the other node resources are fully used. For example, if we always associate 8GB RAM with 1TB storage, a 32GB host (4TB storage) can hold 2x8GB Pods (2x1TB) + 1x16GB Pod (1x2TB). When an 8GB Pod is temporarily deleted for recreation, no other Pod can be scheduled concurrently, since there is no volume available on the host.
Imagine a situation where a user wants to increase an existing Elasticsearch Pod's RAM requirement from 8GB to 32GB. Since the Pod exists with a local PersistentVolume bound to a particular Kubernetes node, the replacing 32GB Pod must be scheduled on the same Kubernetes node. However, nothing guarantees that the Kubernetes node has 32GB of RAM available. ECK relies entirely on the Kubernetes scheduler, and does not inspect the node specifications in any way. To perform the RAM increase, ECK deletes the existing Pod, which is recreated automatically and reuses the same PersistentVolume. However, if the host does not have 32GB of RAM available, the Pod stays Pending forever. There are several ways out of this situation:
- restore the previous RAM requirements so the Pod can be scheduled again on the same host
- manually delete PVC and Pod (acknowledging data loss - which should be fine since ECK ensured the cluster is green before deleting the Pod), so they can be scheduled elsewhere
- ensure Pods with the newer spec are created on a different Kubernetes node, with data migrated over, by renaming the nodeSet
When using local volumes, make sure Elasticsearch indices have at least one replica to guarantee shard availability.
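For example, you can check or set this through the Elasticsearch index settings API (my-index is a placeholder):

PUT my-index/_settings
{
  "index.number_of_replicas": 1
}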
There are multiple ways to configure Kubernetes scheduling priorities to optimize resource allocation. However, the default Kubernetes scheduler is not aware of any storage capacity constraint. It cannot favor hosts with the most (or least) remaining storage capacity.
- Manual provisioning
PersistentVolumes can be manually created to rely on the local volume mechanism. The following example sets up a local PersistentVolume bound to the node my-node, where data is mounted on the host at /mnt/mydata:
apiVersion: v1
kind: PersistentVolume
metadata:
name: example-local-pv
spec:
capacity:
storage: 5Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
storageClassName: local-storage
local:
path: /mnt/mydata
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- my-node
It can be bound to a claim that specifies the same storageClassName:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-claim
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: local-storage
Maintained by the Kubernetes community as part of sig-storage, this provisioner creates PersistentVolumes by auto-discovering each disk on the host (for example, one PersistentVolume per directory in /mnt/disks/*). It can also handle partition formatting (for example, ext4). Once a volume has been released, it is cleaned up automatically and a fresh replacement PersistentVolume is created. The provisioner is a good fit for use cases where an entire disk or partition should be dedicated to a single volume.
This provisioner handles dynamic PersistentVolume provisioning, based on a list of filesystem directories where hostPath volumes can be created (one sub-directory per volume). It can be configured with the provisioning path(s) for each host, and handles cleaning up data on volume removal. Even though a PersistentVolumeClaim can specify a storage size (e.g. 10GB), the provisioner does not enforce any capacity check on the volume itself (the underlying filesystem usage can grow larger than 10GB). This provisioner is a good fit for use cases where multiple volumes need to be dynamically provisioned and disk usage enforcement is not a concern.
OpenEBS is a storage solution for Kubernetes that supports several volume types, including hostPath local volumes and block device local volumes.
From the OpenShift docs, it looks like this operator allows users to define their volumes using a LocalVolume custom resource. I could not find many implementation details.
TopoLVM is a CSI plugin for LVM volumes. It is able to dynamically provision PersistentVolumes of the desired size, with the benefits of using LVM (multiple disks combined into a single logical volume, volume expansion, thin provisioning, etc.). It extends the Kubernetes scheduler to be capacity-aware, in order to prioritize nodes with the largest remaining capacity. It requires lvmd to be installed on all hosts. The user must be able to configure the extended scheduler, which is unfortunately not possible on most cloud providers' Kubernetes offerings.
In this example, we'll deploy a production-grade Elasticsearch cluster on GKE, using the Kubernetes Local Volume Static Provisioner, with:
- 3 dedicated master nodes (4Gi RAM, 1 CPU, 10GB storage)
- 6 data nodes (50Gi RAM, 15 CPU, 3TB storage)
All nodes rely on local SSD PersistentVolumes. Since GKE local SSDs are limited to 375GB each, we set up a RAID 0 array of 8 local SSDs for the data nodes.
First, let's create a GKE Kubernetes cluster with 9 nodes in 2 different node pools: one for master nodes, another one for data nodes.
gcloud beta container --project "elastic-cloud-dev" clusters create "seb-localpv-cluster" \
  --region "us-central1" --no-enable-basic-auth --cluster-version "1.16.9-gke.6" \
  --machine-type "n1-standard-2" --image-type "UBUNTU" --disk-type "pd-standard" --disk-size "100" \
  --local-ssd-count "1" --metadata disable-legacy-endpoints=true \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "1" --enable-stackdriver-kubernetes --enable-ip-alias \
  --network "projects/elastic-cloud-dev/global/networks/default" \
  --subnetwork "projects/elastic-cloud-dev/regions/us-central1/subnetworks/default" \
  --default-max-pods-per-node "110" --no-enable-master-authorized-networks \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing \
  --no-enable-autoupgrade --no-enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 &&
gcloud beta container --project "elastic-cloud-dev" node-pools create "data-pool" \
  --cluster "seb-localpv-cluster" --region "us-central1" --node-version "1.16.9-gke.6" \
  --machine-type "n1-standard-16" --image-type "UBUNTU" --disk-type "pd-standard" --disk-size "100" \
  --local-ssd-count "8" --metadata disable-legacy-endpoints=true \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "2" --no-enable-autoupgrade --no-enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0
Then, deploy the static provisioner with a local-storage storageClass. The provisioner runs as a DaemonSet on each Kubernetes node, and auto-discovers any disk in /mnt/disks.
As part of an initContainer, we run an extra bash script to assemble a RAID 0 array from all available SSDs, formatted as ext4. Simpler setups with no RAID can remove or tweak the local-ssd-startup init container. The bash script comes from https://github.com/pingcap/tidb-operator/blob/master/manifests/gke/local-ssd-provision/local-ssd-provision.yaml.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: ConfigMap
metadata:
name: local-provisioner-config
namespace: kube-system
data:
setPVOwnerRef: "true"
nodeLabelsForPV: |
- kubernetes.io/hostname
storageClassMap: |
local-storage:
hostDir: /mnt/disks
mountDir: /mnt/disks
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: local-volume-provisioner
namespace: kube-system
labels:
app: local-volume-provisioner
spec:
selector:
matchLabels:
app: local-volume-provisioner
template:
metadata:
labels:
app: local-volume-provisioner
spec:
hostPID: true
nodeSelector:
cloud.google.com/gke-local-ssd: "true"
serviceAccountName: local-storage-admin
initContainers:
- name: local-ssd-startup
image: alpine
command: ['/bin/sh', '-c', 'nsenter -t 1 -m -u -i -n -p -- bash -c "${STARTUP_SCRIPT}"']
securityContext:
privileged: true
volumeMounts:
- mountPath: /mnt/disks
name: local-disks
mountPropagation: Bidirectional
env:
- name: STARTUP_SCRIPT
value: |
#!/usr/bin/env bash
set -euo pipefail
set -x
# discard,nobarrier are required to optimize local SSD
# performance in GCP, see
# https://cloud.google.com/compute/docs/disks/performance#optimize_local_ssd
mnt_opts="defaults,nodelalloc,noatime,discard,nobarrier"
# use /var because it is writeable on COS
if ! findmnt -n -a -l | grep /mnt/disks/ssd ; then
if test -f /var/ssd_mounts ; then
ssd_mounts=$(cat /var/ssd_mounts)
else
echo "no ssds mounted yet"
exit 1
fi
else
ssd_mounts=$(findmnt -n -a -l --nofsroot | grep /mnt/disks/ssd)
echo "$ssd_mounts" > /var/ssd_mounts
fi
# Re-mount all disks as a single logical volume with a UUID
if old_mounts=$(findmnt -n -a -l --nofsroot | grep /mnt/disks/ssd) ; then
echo "$old_mounts" | awk '{print $1}' | while read -r ssd ; do
umount "$ssd"
new_fstab=$(grep -v "$ssd" /etc/fstab) || echo "fstab is now empty"
echo "$new_fstab" > /etc/fstab
done
fi
echo "$ssd_mounts" | awk '{print $1}' | while read -r ssd ; do
if test -d "$ssd"; then
rm -r "$ssd"
fi
done
devs=$(echo "$ssd_mounts" | awk '{print $2}')
raid_dev=/dev/md0
# If RAID or LVM is already in use, this may have been re-deployed.
# Don't try to change the disks.
pvs=$((test -x /sbin/pvs && /sbin/pvs) || echo "")
if ! test -e $raid_dev && ! echo "$pvs" | grep volume_all_ssds ; then
# wipe all devices
echo "$devs" | while read -r dev ; do
dev_basename=$(basename "$dev")
mkdir -p /var/dev_wiped/
if ! test -f /var/dev_wiped/$dev_basename ; then
if findmnt -n -a -l --nofsroot | grep "$dev" ; then
echo "$dev" already individually mounted
exit 1
fi
/sbin/wipefs --all "$dev"
touch /var/dev_wiped/$dev_basename
fi
done
# Don't combine if there is 1 disk or the environment variable is set.
# lvm and mdadm do have overhead, so don't use them if there is just 1 disk
# remount with uuid, set mount options (nobarrier), and exit
NO_COMBINE_LOCAL_SSD="${NO_COMBINE_LOCAL_SSD:-""}"
if ! test -z "$NO_COMBINE_LOCAL_SSD" || [ "$(echo "$devs" | wc -l)" -eq 1 ] ; then
echo "$devs" | while read -r dev ; do
if ! findmnt -n -a -l --nofsroot | grep "$dev" ; then
if ! uuid=$(blkid -s UUID -o value "$dev") ; then
mkfs.ext4 "$dev"
uuid=$(blkid -s UUID -o value "$dev")
fi
mnt_dir="/mnt/disks/$uuid"
mkdir -p "$mnt_dir"
if ! grep "$uuid" /etc/fstab ; then
echo "UUID=$uuid $mnt_dir ext4 $mnt_opts" >> /etc/fstab
fi
mount -U "$uuid" -t ext4 --target "$mnt_dir" --options "$mnt_opts"
chmod a+w "$mnt_dir"
fi
done
exit 0
fi
fi
new_dev=
USE_LVM="${USE_LVM:-""}"
# If RAID is available use it because it performs better than LVM
if test -e $raid_dev || (test -x /sbin/mdadm && test -z "$USE_LVM") ; then
if ! test -e $raid_dev ; then
echo "$devs" | xargs /sbin/mdadm --create $raid_dev --level=0 --raid-devices=$(echo "$devs" | wc -l)
sudo mkfs.ext4 -F $raid_dev
new_dev=$raid_dev
fi
else
if ! echo "$pvs" | grep volume_all_ssds ; then
echo "$devs" | xargs /sbin/pvcreate
fi
/sbin/pvdisplay
if ! /sbin/vgs | grep volume_all_ssds ; then
echo "$devs" | xargs /sbin/vgcreate volume_all_ssds
fi
/sbin/vgdisplay
if ! /sbin/lvs | grep logical_all_ssds ; then
/sbin/lvcreate -l 100%FREE -n logical_all_ssds volume_all_ssds
fi
/sbin/lvdisplay
new_dev=/dev/volume_all_ssds/logical_all_ssds
fi
if ! uuid=$(blkid -s UUID -o value $new_dev) ; then
mkfs.ext4 $new_dev
uuid=$(blkid -s UUID -o value $new_dev)
fi
mnt_dir="/mnt/disks/$uuid"
mkdir -p "$mnt_dir"
if ! grep "$uuid" /etc/fstab ; then
echo "UUID=$uuid $mnt_dir ext4 $mnt_opts" >> /etc/fstab
fi
mount -U "$uuid" -t ext4 --target "$mnt_dir" --options "$mnt_opts"
chmod a+w "$mnt_dir"
containers:
- image: "quay.io/external_storage/local-volume-provisioner:v2.3.4"
name: provisioner
securityContext:
privileged: true
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 100m
memory: 100Mi
env:
- name: MY_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: MY_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: JOB_CONTAINER_IMAGE
value: "quay.io/external_storage/local-volume-provisioner:v2.3.2"
volumeMounts:
- mountPath: /etc/provisioner/config
name: provisioner-config
readOnly: true
- mountPath: /mnt/disks
name: local-disks
mountPropagation: "HostToContainer"
tolerations:
- effect: NoSchedule
operator: Exists
volumes:
- name: provisioner-config
configMap:
name: local-provisioner-config
- name: local-disks
hostPath:
path: /mnt/disks
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: local-storage-admin
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: local-storage-provisioner-pv-binding
namespace: kube-system
subjects:
- kind: ServiceAccount
name: local-storage-admin
namespace: kube-system
roleRef:
kind: ClusterRole
name: system:persistent-volume-provisioner
apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: local-storage-provisioner-node-clusterrole
namespace: kube-system
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: local-storage-provisioner-node-binding
namespace: kube-system
subjects:
- kind: ServiceAccount
name: local-storage-admin
namespace: kube-system
roleRef:
kind: ClusterRole
name: local-storage-provisioner-node-clusterrole
apiGroup: rbac.authorization.k8s.io
Check that one PersistentVolume with a Delete reclaim policy has been created for each host:
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
local-pv-36f2b019 2950Gi RWO Delete Available local-storage 7m56s
local-pv-3be19c8f 2950Gi RWO Delete Available local-storage 7m37s
local-pv-5543ea00 2950Gi RWO Delete Available local-storage 7m38s
local-pv-bd1aac10 2950Gi RWO Delete Available local-storage 7m38s
local-pv-c3f4f33 368Gi RWO Delete Available local-storage 8m22s
local-pv-d870e432 368Gi RWO Delete Available local-storage 8m21s
local-pv-e5615568 368Gi RWO Delete Available local-storage 8m21s
local-pv-e693f7e8 2950Gi RWO Delete Available local-storage 7m50s
local-pv-ed30daed 2950Gi RWO Delete Available local-storage 7m38s
Create an Elasticsearch cluster with 3 dedicated master nodes and 6 data nodes. Both node sets rely on the local-storage storageClass we just created.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
name: mycluster
spec:
version: 7.8.0
nodeSets:
- name: master-nodes
count: 3
config:
node.master: true
node.data: false
node.ingest: false
node.ml: false
node.store.allow_mmap: false
podTemplate:
spec:
affinity:
# schedule master nodes on the default-pool
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-nodepool
operator: In
values: ["default-pool"]
# don't put two master nodes on the same host
# (note this is already set by ECK, by default)
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
elasticsearch.k8s.elastic.co/cluster-name: mycluster
topologyKey: kubernetes.io/hostname
containers:
- name: elasticsearch
env:
- name: ES_JAVA_OPTS
value: -Xms2g -Xmx2g
resources:
requests:
memory: 4Gi
cpu: 1
limits:
memory: 4Gi
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: local-storage
- name: data-nodes
count: 6
config:
node.master: false
node.data: true
node.ingest: true
node.ml: true
node.store.allow_mmap: false
podTemplate:
spec:
affinity:
# schedule data nodes on the data-pool
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-nodepool
operator: In
values: ["data-pool"]
# don't put two data nodes on the same host
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
elasticsearch.k8s.elastic.co/cluster-name: mycluster
topologyKey: kubernetes.io/hostname
containers:
- name: elasticsearch
env:
- name: ES_JAVA_OPTS
value: -Xms25g -Xmx25g
resources:
requests:
memory: 50Gi
cpu: 15
limits:
memory: 50Gi
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 2950Gi
storageClassName: local-storage
The 9 Pods have been created and their volumes are bound:
kubectl get pods
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-0 1/1 Running 0 55s
mycluster-es-data-nodes-1 1/1 Running 0 55s
mycluster-es-data-nodes-2 1/1 Running 0 55s
mycluster-es-data-nodes-3 1/1 Running 0 55s
mycluster-es-data-nodes-4 1/1 Running 0 55s
mycluster-es-data-nodes-5 1/1 Running 0 55s
mycluster-es-master-nodes-0 1/1 Running 0 56s
mycluster-es-master-nodes-1 1/1 Running 0 56s
mycluster-es-master-nodes-2 1/1 Running 0 56s
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
elasticsearch-data-mycluster-es-data-nodes-0 Bound local-pv-bd1aac10 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-1 Bound local-pv-e693f7e8 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-2 Bound local-pv-36f2b019 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-3 Bound local-pv-5543ea00 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-4 Bound local-pv-ed30daed 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-data-nodes-5 Bound local-pv-3be19c8f 2950Gi RWO local-storage 73s
elasticsearch-data-mycluster-es-master-nodes-0 Bound local-pv-e5615568 368Gi RWO local-storage 74s
elasticsearch-data-mycluster-es-master-nodes-1 Bound local-pv-d870e432 368Gi RWO local-storage 74s
elasticsearch-data-mycluster-es-master-nodes-2 Bound local-pv-c3f4f33 368Gi RWO local-storage 74s
Kill a Pod; it should be recreated automatically with the same PVC and PV:
kubectl delete pod mycluster-es-data-nodes-0
Decrease the number of data nodes by editing the count of the second nodeSet. There are now 5 Elasticsearch data Pods left. The PersistentVolume of the deleted Pod should be automatically released, cleaned up, then recreated and marked available:
kubectl get pods
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-0 1/1 Running 0 2m38s
mycluster-es-data-nodes-1 1/1 Running 0 4m52s
mycluster-es-data-nodes-2 1/1 Running 0 4m52s
mycluster-es-data-nodes-3 1/1 Running 0 4m52s
mycluster-es-data-nodes-4 1/1 Running 0 4m52s
mycluster-es-master-nodes-0 1/1 Running 0 4m53s
mycluster-es-master-nodes-1 1/1 Running 0 4m53s
mycluster-es-master-nodes-2 1/1 Running 0 4m53s
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
local-pv-36f2b019 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-2 local-storage 5m32s
local-pv-3be19c8f 2950Gi RWO Delete Available local-storage 3s
local-pv-5543ea00 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-3 local-storage 38m
local-pv-bd1aac10 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-0 local-storage 5m34s
local-pv-c3f4f33 368Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-master-nodes-2 local-storage 5m27s
local-pv-d870e432 368Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-master-nodes-1 local-storage 5m27s
local-pv-e5615568 368Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-master-nodes-0 local-storage 5m27s
local-pv-e693f7e8 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-1 local-storage 5m26s
local-pv-ed30daed 2950Gi RWO Delete Bound default/elasticsearch-data-mycluster-es-data-nodes-4 local-storage 38m
Drain a Kubernetes node holding one of the data nodes:
kubectl drain gke-seb-localpv-cluster-data-pool-58b0702c-2754 --ignore-daemonsets --delete-local-data
Notice how the Pod mycluster-es-data-nodes-1 got terminated, and is now Pending because it cannot be scheduled on the Kubernetes node holding its PersistentVolume:
kubectl get pods
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-0 1/1 Running 0 4m52s
mycluster-es-data-nodes-1 0/1 Pending 0 6s
mycluster-es-data-nodes-2 1/1 Running 0 7m6s
mycluster-es-data-nodes-3 1/1 Running 0 7m6s
mycluster-es-data-nodes-4 1/1 Running 0 7m6s
mycluster-es-master-nodes-0 1/1 Running 0 7m7s
mycluster-es-master-nodes-1 1/1 Running 0 7m7s
mycluster-es-master-nodes-2 1/1 Running 0 7m7s
kubectl describe pod mycluster-es-data-nodes-1
(...)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 58s default-scheduler 0/9 nodes are available: 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 3 node(s) didn't match node selector, 7 Insufficient cpu, 7 Insufficient memory
Simulate the Kubernetes node coming back online:
kubectl uncordon gke-seb-localpv-cluster-data-pool-58b0702c-2754
The Pod should automatically start on the node, and reattach to the existing volume:
kubectl get pod mycluster-es-data-nodes-1
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-1 1/1 Running 0 2m10s
Now let's simulate a complete Kubernetes node removal (or failure). Drain the node again:
kubectl drain gke-seb-localpv-cluster-data-pool-58b0702c-2754 --ignore-daemonsets --delete-local-data
The Pod is Pending:
kubectl get pod mycluster-es-data-nodes-1
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-1 0/1 Pending 0 9s
Let's consider the data lost and unrecoverable. We'd like the Pending Pod to be scheduled on another available host, and to start with an empty data volume. To do so, we need to remove both the PersistentVolumeClaim and the Pod:
kubectl delete pvc elasticsearch-data-mycluster-es-data-nodes-1
kubectl delete pod mycluster-es-data-nodes-1
The Pod is now automatically started on the available Kubernetes node, with a new empty local volume:
kubectl get pod mycluster-es-data-nodes-1
NAME READY STATUS RESTARTS AGE
mycluster-es-data-nodes-1 1/1 Running 0 33s