Things on my radar but not yet explored (OUTDATED)

Technology and Society

http://appropriatingtechnology.org/?q=node/296

Kubernetes autoscaling

Kubernetes management

Linux process scheduling

cgroups

X in cgroups/containers

Java: https://engineering.linkedin.com/blog/2016/11/application-pauses-when-running-jvm-inside-linux-control-groups

Network?

https://gist.github.com/CMCDragonkai/6bfade6431e9ffb7fe88
HTTP keep-alive
HTTP/2
Head of Line blocking

distributed

performance tools

https://accelazh.github.io/linux/Understand-System-Performance-Commands

tools

https://www.aosabook.org/en/nginx.html
HAProxy architecture

monitoring

envoy (in practice)

Cloud Native

Concurrency control

Multi-zone

K8s supports running a single cluster across multiple failure zones (zones in GCP, availability zones in AWS). A single k8s cluster is limited to a single region (and a single cloud provider). Multi-cloud and multi-region setups require multiple clusters.

Pods in a replication controller or service are automatically spread across zones.
- What is SelectorSpreadPriority?
  - https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#scoring
- What is the effect of podAntiAffinity with topologyKey: "failure-domain.beta.kubernetes.io/zone" then?
  - Example: https://blog.verygoodsecurity.com/posts/kubernetes-multi-az-deployments-using-pod-anti-affinity/
- How does it interact with topologySpreadConstraints (1.16)?
  - https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
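A minimal Python sketch of the skew check behind topologySpreadConstraints, as I understand it from the docs linked above: a zone's skew is its number of matching pods minus the minimum across zones, and a new pod may only land where the resulting skew stays within maxSkew. The function names and sample zones are hypothetical. The real scheduler also considers node selectors, affinity, and whenUnsatisfiable, which this ignores.

```python
from collections import Counter


def zone_skew(pod_zones, all_zones):
    """Return per-zone skew: pods in the zone minus the minimum across zones."""
    counts = Counter(pod_zones)
    per_zone = {zone: counts.get(zone, 0) for zone in all_zones}
    minimum = min(per_zone.values())
    return {zone: n - minimum for zone, n in per_zone.items()}


def allowed_zones(pod_zones, all_zones, max_skew=1):
    """Return the zones where adding one more pod keeps skew <= max_skew."""
    counts = Counter(pod_zones)
    per_zone = {zone: counts.get(zone, 0) for zone in all_zones}
    minimum = min(per_zone.values())
    return sorted(
        zone for zone, n in per_zone.items() if (n + 1) - minimum <= max_skew
    )


if __name__ == "__main__":
    zones = ["a", "b", "c"]
    pods = ["a", "a", "b"]  # two pods in zone a, one in zone b, none in zone c
    print(zone_skew(pods, zones))
    print(allowed_zones(pods, zones, max_skew=1))
```

With two pods in zone a, one in b, and none in c, only zone c can take the next pod when maxSkew is 1.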
There is no zone-aware (service) routing as of 1.16
- traffic that goes via services might cross zones
- assumes different zones are located close to each other in the network
Topology-aware service routing is planned for 1.17 (kubernetes/kubernetes#72046)
- https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20181024-service-topology.md
ingress-nginx also does not have zone-aware routing (https://github.com/kubernetes/ingress-nginx/blob/master/docs/enhancements/20190815-zone-aware-routing.md)

However, Istio enables locality load balancing by default.

Region / zone / sub-zone are automatically configured from the k8s well-known node labels.
A Service must be associated with the caller for Istio to determine its locality.
Outlier detection must be configured in a DestinationRule for each service so that Istio can determine endpoint health.
- istio/istio#4702 (comment)
- https://istio.io/docs/ops/configuration/traffic-management/locality-load-balancing/
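A rough Python sketch of how locality-prioritized load balancing ranks endpoints, based on my reading of the Istio doc above: the more of region / zone / sub-zone an endpoint shares with the caller, the higher its priority (0 is best). `Locality`, `locality_priority`, and `pick_endpoints` are hypothetical names. The all-or-nothing failover here is a simplification, since Envoy shifts traffic gradually as endpoints in a priority level become unhealthy.

```python
from typing import NamedTuple


class Locality(NamedTuple):
    region: str
    zone: str
    sub_zone: str = ""


def locality_priority(caller, endpoint):
    """Lower is better: 0 = same sub-zone, 1 = same zone, 2 = same region, 3 = none."""
    if caller.region != endpoint.region:
        return 3
    if caller.zone != endpoint.zone:
        return 2
    if caller.sub_zone != endpoint.sub_zone:
        return 1
    return 0


def pick_endpoints(caller, endpoints):
    """endpoints maps name -> (Locality, healthy).

    Returns the healthy endpoint names in the best (lowest) priority level.
    Health would come from outlier detection in a real mesh.
    """
    healthy = {
        name: locality_priority(caller, loc)
        for name, (loc, is_healthy) in endpoints.items()
        if is_healthy
    }
    if not healthy:
        return []
    best = min(healthy.values())
    return sorted(name for name, prio in healthy.items() if prio == best)


if __name__ == "__main__":
    caller = Locality("us-east-1", "us-east-1a")
    endpoints = {
        "pod-a": (Locality("us-east-1", "us-east-1a"), True),
        "pod-b": (Locality("us-east-1", "us-east-1b"), True),
        "pod-c": (Locality("eu-west-1", "eu-west-1a"), True),
    }
    print(pick_endpoints(caller, endpoints))
```

This also shows why outlier detection matters: without a health signal, traffic never leaves the caller's own zone, even when the local endpoints are failing.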

Constraints of having an EBS-backed PV in a multi-zone cluster:

kubernetes/kops#6267 (comment)

Docs:

Kubernetes Security

https://github.com/trailofbits/audit-kubernetes Security audit. Look at the threat model and issues raised (https://github.com/trailofbits/audit-kubernetes/issues?q=is%3Aissue+is%3Aclosed).
https://kubernetes.io/docs/concepts/security/overview/

QoS and oversubscription

CPU Limit (and throttling)

Good video explaining the problem: https://www.youtube.com/watch?v=UE7QX98-kO0
k8s issues:
- kubernetes/kubernetes#67577
- kubernetes/kubernetes#51135
EKS issue: aws/containers-roadmap#175
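Back-of-the-envelope arithmetic for why CPU limits cause throttling, as a Python sketch. A CPU limit becomes a CFS quota per scheduling period (100 ms by default), and a multi-threaded process can burn its whole quota early in the period and then sit throttled until the next one. The function names are hypothetical. The model assumes every thread is CPU-bound for the whole period and that there are enough cores to run them all in parallel.

```python
def cfs_quota_us(cpu_limit_cores, period_us=100_000):
    """CPU time (in microseconds) the cgroup may use per CFS period."""
    return round(cpu_limit_cores * period_us)


def throttled_us_per_period(cpu_limit_cores, busy_threads, period_us=100_000):
    """Wall-clock time per period during which the cgroup is throttled.

    Simplified model: busy_threads run in parallel and never block, so the
    quota is consumed busy_threads times faster than wall-clock time.
    """
    quota = cfs_quota_us(cpu_limit_cores, period_us)
    wall_until_exhausted = quota / busy_threads
    return max(0.0, period_us - wall_until_exhausted)


if __name__ == "__main__":
    # A 500m limit with 4 busy threads: 50 ms of quota is gone after 12.5 ms.
    print(cfs_quota_us(0.5))
    print(throttled_us_per_period(0.5, busy_threads=4))
```

With a 500m limit and 4 busy threads, the container runs for 12.5 ms and is then frozen for the remaining 87.5 ms of each period, which shows up as latency spikes even when average CPU usage looks low.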

Prometheus

This is a very good intro to the 4 types of metrics:

How do other metric collection systems integrate with Prometheus metrics?

https://docs.datadoghq.com/integrations/prometheus/#metrics <histogram>.count with upper_bound tag.
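One thing to keep in mind when mapping histograms to another system: Prometheus histogram buckets (the `le` label) are cumulative, so each bucket counts every observation at or below its upper bound. A small Python sketch of converting cumulative buckets into per-bucket counts follows. The function name and sample data are hypothetical, and whether a given integration reports cumulative or per-bucket values is something to check in its docs.

```python
import math


def per_bucket_counts(cumulative):
    """Convert cumulative (upper_bound, count) pairs into per-bucket counts.

    `cumulative` mirrors Prometheus `_bucket` samples: each count includes
    all observations <= upper_bound, and the last bound is +Inf.
    """
    result = []
    previous = 0
    for upper_bound, count in sorted(cumulative, key=lambda bucket: bucket[0]):
        result.append((upper_bound, count - previous))
        previous = count
    return result


if __name__ == "__main__":
    buckets = [(0.1, 5), (0.5, 8), (math.inf, 10)]
    for upper_bound, count in per_bucket_counts(buckets):
        print(f"upper_bound={upper_bound} count={count}")
```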

There is also this free course: https://training.robustperception.io/p/introduction-to-prometheus

alertmanager
thanos
cortex

InfluxDB

Ingress-nginx

ingress-nginx 0.26.0+ can take up to 300s (5 minutes) to terminate, because it waits for incoming connections to finish. See the release notes: https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.26.0

How long does it take for a pod scheduled for deletion to be removed from the list of backends across all ingress controller instances? Is it configurable?
1. How long does it take to propagate the removal of a pod to its Endpoints?
2. How long does it take to propagate an Endpoints change to the "Lua handler" (https://kubernetes.github.io/ingress-nginx/how-it-works/#avoiding-reloads-on-endpoints-changes)?
3. Can these things be measured? Is there any metric for this?