- kubernetes/kubernetes#1629
- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md
- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/vertical-pod-autoscaler.md
- Scale to/from zero
- https://medium.com/condenastengineering/k8s-federation-v2-a-guide-on-how-to-get-started-ec9cc26b1fa7
- https://medium.com/condenastengineering/clusterapi-a-guide-on-how-to-get-started-ff9a81262945
- https://www.nickaws.net/aws/elixir/2019/09/02/Federation-and-EKS.html
- https://www.infoq.com/podcasts/kubernetes-self-service-cluster-api/
- https://engineering.squarespace.com/blog/2017/understanding-linux-container-scheduling
- Completely Fair Scheduler (CFS):
- https://docs.docker.com/engine/reference/run/#cpu-share-constraint
- https://engineering.indeedblog.com/blog/2019/12/unthrottled-fixing-cpu-limits-in-the-cloud/
- https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/
- http://man7.org/linux/man-pages/man7/namespaces.7.html
- https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cgroups.html
- http://man7.org/linux/man-pages/man7/cgroups.7.html
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01
- https://research.google/pubs/pub36669/
- https://gist.github.com/CMCDragonkai/6bfade6431e9ffb7fe88
- HTTP keep-alive
- HTTP/2
- Head of Line blocking
- https://474.cmpt.sfu.ca/design-space.html
- https://474.cmpt.sfu.ca/resources.html
- https://474.cmpt.sfu.ca/schedule.html
- https://accelazh.github.io/storage/Tail-Latency-Study
- https://jepsen.io/consistency
- https://accelazh.github.io/cloud/A-Summary-of-Cloud-Scheduling
- https://474.cmpt.sfu.ca/Week3-Mon.html
- https://web.archive.org/web/20180227095215/http://474.cmpt.sfu.ca/public/Week4-Fri.html
- https://accelazh.github.io/storage/Build-My-Academic-Paper-Feedback-Network
- http://home.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/TheTailAtScale.pdf
- https://www.aosabook.org/en/nginx.html
- HAProxy architecture
- https://www.infoq.com/articles/monitoring-SRE-golden-signals/
- https://web.archive.org/web/20171023173225/https://www.vividcortex.com/blog/monitoring-and-observability-with-use-and-red
- https://medium.com/faun/how-to-monitor-the-sre-golden-signals-1391cadc7524
- https://accelazh.github.io/failure/Summarizing-Production-Server-Failure-Modes
- https://accelazh.github.io/storage/Storage-Reliability-Calculations
- https://www.envoyproxy.io/docs/envoy/latest/
- https://medium.com/@copyconstruct/envoy-953c340c2dca
- https://blog.christianposta.com/microservices/01-microservices-patterns-with-envoy-proxy-part-i-circuit-breaking/
- https://dzone.com/articles/istio-circuit-breaker-with-outlier-detection
- https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cluster/outlier_detection.proto
- https://blog.turbinelabs.io/a-guide-to-envoys-backpressure-22eec025ef04
- https://www.javacodegeeks.com/2018/01/comparing-envoy-istio-circuit-breaking-netflix-oss-hystrix.html
- https://unofficialism.info/posts/envoy-proxy-demos/
- https://developers.redhat.com/blog/2017/05/31/microservices-patterns-with-envoy-sidecar-proxy-part-i-circuit-breaking/
- https://developers.redhat.com/blog/2017/06/01/microservices-patterns-with-envoy-proxy-part-ii-timeouts-and-retries/
- https://developers.redhat.com/blog/2017/06/08/microservices-patterns-with-envoy-proxy-part-iii-distributed-tracing/
- https://blog.christianposta.com/microservices/advanced-traffic-shadowing-patterns-for-microservices-with-istio-service-mesh/
- https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/zone_aware
- https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/zone_aware_routing
- https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/locality_weight
- https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts.html?highlight=timeout
- https://thenewstack.io/lyfts-envoy-provides-move-monolith-soa/
- https://learning.oreilly.com/library/view/introducing-istio-service/9781491988770/ch04.html
- https://www.eightypercent.net/post/layers-in-the-stack.html
- https://www.eightypercent.net/post/new-container-image-format.html
- https://github.com/brendandburns/designing-distributed-systems-labs
- https://www.infoq.com/articles/oam-alibaba/
- https://azure.microsoft.com/en-us/resources/designing-distributed-systems/
- https://docs.google.com/presentation/d/1P7lg13Rw21NQ59ts22PTzI1q81nnYVK-cOVhYTzg9tg/edit#slide=id.g2876e98c14_1_3
- Azure/AKS#1373
- envoyproxy/envoy#7789
- https://github.com/Netflix/concurrency-limits
- https://github.com/envoyproxy/nighthawk
- https://github.com/tonya11en/bufferbloater
K8s supports running a single cluster in multiple failure zones (zones in GCP, availability zones in AWS).
A single k8s cluster is limited to a single region (and cloud provider). Multi-cloud and multi-region setups require multiple clusters.
- Pods in a replication controller or service are automatically spread across zones.
  - What is `SelectorSpreadPriority`?
  - What is the effect of `podAntiAffinity` with `topologyKey: "failure-domain.beta.kubernetes.io/zone"` then?
  - How does it interact with `topologySpreadConstraints` (`1.16`)?
- There is no zone-aware (service) routing as of `1.16`
  - traffic that goes via services might cross zones
  - assumes different zones are located close to each other in the network
  - `topology-aware service routing` is planned for `1.17` (kubernetes/kubernetes#72046)
  - `ingress-nginx` also does not have zone-aware routing (https://github.com/kubernetes/ingress-nginx/blob/master/docs/enhancements/20190815-zone-aware-routing.md)
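As a rough illustration of the pod-spreading questions above, here is a minimal sketch of the `maxSkew` idea behind `topologySpreadConstraints`. It is a simplified model, not the scheduler's code: the function name and the zone names are made up for the example, and it ignores label selectors, node affinity and `whenUnsatisfiable: ScheduleAnyway`.

```python
def allowed_zones(pods_per_zone, max_skew):
    """Return the zones where one more pod keeps the skew within max_skew.

    Skew is modelled as (pods in the candidate zone after placement)
    minus (the current minimum pod count across all zones).
    pods_per_zone must list every eligible zone, including empty ones.
    """
    current_min = min(pods_per_zone.values())
    return sorted(
        zone
        for zone, count in pods_per_zone.items()
        if (count + 1) - current_min <= max_skew
    )


if __name__ == "__main__":
    # Hypothetical cluster: three zones, unevenly loaded.
    pods = {"zone-a": 2, "zone-b": 1, "zone-c": 0}
    print(allowed_zones(pods, max_skew=1))
```

With `max_skew=1`, only the emptiest zone is eligible in this example, which is the behaviour a hard (`DoNotSchedule`) constraint is meant to enforce.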
However, Istio enables locality load balancing by default.
- `region`/`zone`/`sub-zone` are automatically configured from k8s well-known labels.
- a `Service` must be associated with the caller for Istio to determine locality.
- outlier detection must be configured in a `DestinationRule` for each service to determine health.
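The locality preference described above can be sketched as a priority ordering over `region`/`zone`/`sub-zone`. This is only an illustration under assumptions: the `Locality` type, the function names and the `healthy` map (standing in for the result of outlier detection) are hypothetical, and the real proxy shifts traffic gradually between priorities rather than failing over all at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locality:
    region: str
    zone: str
    sub_zone: str = ""


def locality_priority(caller, endpoint):
    """Lower is closer: 0 = same sub-zone, 1 = same zone, 2 = same region, 3 = elsewhere."""
    if caller.region != endpoint.region:
        return 3
    if caller.zone != endpoint.zone:
        return 2
    if caller.sub_zone != endpoint.sub_zone:
        return 1
    return 0


def pick_endpoints(caller, endpoints, healthy):
    """Return the healthy endpoints in the closest locality that has any.

    endpoints maps endpoint name -> Locality; healthy maps name -> bool.
    Endpoints missing from healthy are treated as unhealthy.
    """
    candidates = [
        (locality_priority(caller, loc), name)
        for name, loc in endpoints.items()
        if healthy.get(name, False)
    ]
    if not candidates:
        return []
    best = min(priority for priority, _ in candidates)
    return sorted(name for priority, name in candidates if priority == best)


if __name__ == "__main__":
    caller = Locality("us-east-1", "us-east-1a")
    endpoints = {
        "a1": Locality("us-east-1", "us-east-1a"),
        "b1": Locality("us-east-1", "us-east-1b"),
        "w1": Locality("us-west-2", "us-west-2a"),
    }
    print(pick_endpoints(caller, endpoints, {"a1": True, "b1": True, "w1": True}))
    print(pick_endpoints(caller, endpoints, {"a1": False, "b1": True, "w1": True}))
```

The sketch also shows why health information matters: without it, the selection has no basis for moving away from the local zone.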
Constraints of having an EBS-backed PV in a multi-zone cluster:
Docs:
- https://kubernetes.io/docs/setup/best-practices/multiple-zones/
- https://istio.io/docs/ops/traffic-management/locality-load-balancing/
- https://istio.io/docs/reference/config/istio.mesh.v1alpha1/#LocalityLoadBalancerSetting
- https://github.com/trailofbits/audit-kubernetes Security audit. Look at the threat model and issues raised (https://github.com/trailofbits/audit-kubernetes/issues?q=is%3Aissue+is%3Aclosed).
- https://kubernetes.io/docs/concepts/security/overview/
- https://twitter.com/bgrant0607/status/1153342318277083137
- hashicorp/nomad#606 (comment)
- https://threadreaderapp.com/user/bgrant0607 (in general)
- Good video explaining the problem: https://www.youtube.com/watch?v=UE7QX98-kO0
- k8s issue: kubernetes/kubernetes#67577
- EKS issue: aws/containers-roadmap#175
This is a very good intro to the 4 types of metrics:
- Counter: https://www.robustperception.io/how-does-a-prometheus-counter-work
- Gauge: https://www.robustperception.io/how-does-a-prometheus-gauge-work
- Summary: https://www.robustperception.io/how-does-a-prometheus-summary-work
- Histogram: https://www.robustperception.io/how-does-a-prometheus-histogram-work
How do other metric collection systems integrate with Prometheus metrics?
- https://docs.datadoghq.com/integrations/prometheus/#metrics
  - `<histogram>.count` with `upper_bound` tag.
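To make the histogram type and the `upper_bound` tag more concrete, here is a minimal sketch of cumulative buckets in the Prometheus style, where each bucket counts observations less than or equal to its upper bound (the `le` label). The class and method names are made up for this example and are not the API of any client library.

```python
import math


class Histogram:
    """Toy cumulative histogram: every bucket counts observations <= its bound."""

    def __init__(self, upper_bounds):
        # A final +Inf bucket always exists and ends up equal to the total count.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        self.count += 1
        self.sum += value
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                # Cumulative: increment every bucket whose bound covers the value.
                self.bucket_counts[i] += 1

    def samples(self):
        """Return (upper bound as a string, cumulative count) pairs."""
        return [
            ("+Inf" if math.isinf(bound) else str(bound), count)
            for bound, count in zip(self.upper_bounds, self.bucket_counts)
        ]


if __name__ == "__main__":
    latency = Histogram([0.1, 0.5, 1.0])
    for seconds in (0.05, 0.3, 0.3, 2.0):
        latency.observe(seconds)
    for upper_bound, count in latency.samples():
        print(upper_bound, count)
```

Because the buckets are cumulative, a system that ingests them one series per bucket only needs the bound as a tag on the count, which is presumably what the `upper_bound` tag carries.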
There is also this free course: https://training.robustperception.io/p/introduction-to-prometheus
- alertmanager
- thanos
- cortex
- https://github.com/influxdata/influxdb
- https://github.com/influxdata/chronograf
- https://github.com/influxdata/kapacitor
ingress-nginx 0.26.0+ takes up to 300s (5 minutes) to terminate while waiting for termination of incoming connections. See release notes for: https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.26.0
- How long does it take for a pod scheduled for deletion to be removed from the list of backends across all ingress controller instances? Is it configurable?
  - How long does it take to propagate the removal of a pod to its `Endpoint`s?
  - How long does it take to propagate an `Endpoint` change to the "Lua handler" (https://kubernetes.github.io/ingress-nginx/how-it-works/#avoiding-reloads-on-endpoints-changes)?
  - Can these things be measured? Is there any metric for this?
- TUF (https://github.com/theupdateframework/specification/blob/master/tuf-spec.md)
- in-toto (https://in-toto.io/)
- Grafeas (https://grafeas.io/)
- Kritis (https://github.com/grafeas/kritis)
- Podman (https://podman.io/)
https://github.com/GoogleContainerTools/kaniko
https://azure.microsoft.com/en-us/topic/what-is-kubernetes/
- CNAB (https://github.com/deislabs/cnab-spec)
- Duffle (https://github.com/cnabio/duffle)
- Porter (https://porter.sh/)
- Helm (https://v3.helm.sh/)
- Keda (https://github.com/kedacore/keda) - and https://cloudevents.io/
- OPA (https://www.openpolicyagent.org/)
- Brigade (https://brigade.sh/) - and https://github.com/brigadecore/kashti
- Draft (https://draft.sh/)
- https://daemonza.github.io/2017/02/20/using-helm-to-deploy-to-kubernetes/
- https://medium.com/@gajus/the-missing-ci-cd-kubernetes-component-helm-package-manager-1fe002aac680
- https://cloudblogs.microsoft.com/opensource/2019/05/06/announcing-keda-kubernetes-event-driven-autoscaling-containers/
https://github.com/helm/helm/releases/tag/v3.0.0-rc.3
Spinnaker and Kayenta
Flagger
http://port.us.org/ vs https://github.com/goharbor/harbor/blob/master/README.md
https://gravitational.com/teleport/docs/kubernetes_ssh/ and https://gravitational.com/teleport/docs/architecture/teleport_architecture_overview/
https://github.com/aquasecurity/kube-hunter
https://github.com/GoogleContainerTools/skaffold vs https://www.deployhub.com/ vs https://tilt.dev/ vs https://squash.solo.io/ vs https://www.telepresence.io/ vs https://okteto.com/ vs https://draft.sh/
https://github.com/vmware-tanzu/octant
- "A Tour Through the Visualization Zoo": https://homes.cs.washington.edu/~jheer//files/zoo/
- "Metric graphs 101: Timeseries graphs": https://www.datadoghq.com/blog/timeseries-metric-graphs-101/
- https://accelazh.github.io/datamining/Time-Series-Learning-Algorithms-Candidates
- https://srcco.de/posts/how-zalando-manages-140-kubernetes-clusters.html
- https://www.youtube.com/watch?v=1xHmCrd8Qn8&t=197s
- https://blizzard.cs.uwaterloo.ca/keshav/home/Papers/data/07/paper-reading.pdf
- https://accelazh.github.io/technology/Roadmap-to-Technical-Leadership
- https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
- https://colin-scott.github.io/blog/2016/03/04/technologies-for-testing-and-debugging-distributed-systems/
- https://accelazh.github.io/transaction/Distributed-Transaction-ACID-Study
- https://blog.spinnaker.io/managed-delivery-evolving-continuous-delivery-at-netflix-eb74877fb33c
- https://github.com/spinnaker/keel
- https://docs.google.com/document/d/1cgKBdT5xVFvMwut7Wji_-bC_12GoQtyZ2MQ958LDcOY/edit#heading=h.v59gzsv79kfc
- https://techbeacon.com/app-dev-testing/how-airbnb-scaled-its-migration-continuous-delivery-spinnaker
- https://blog.spinnaker.io/how-netflix-has-extended-spinnaker-baf1a9d6b6e3
- https://blog.spinnaker.io/introducing-rollout-strategies-in-the-kubernetes-v2-provider-8bbffea109a
- https://glasnostic.com/blog/how-canary-deployments-work-2-developer-vs-operator-concerns
- https://github.com/weaveworks/flagger
- https://github.com/spinnaker/kayenta
- https://medium.com/@NetflixTechBlog/tips-for-high-availability-be0472f2599c
- https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1
- https://accelazh.github.io/storage/Engineering-Reliability-Practices
- https://segment.com/blog/goodbye-microservices/
- https://8thlight.com/blog/colin-jones/2018/09/18/microservices-arent-magic-handling-timeouts.html
- https://medium.com/@marcus.cavalcanti/lessons-learned-about-run-microservices-b360347c8a77
- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/monitoring_architecture.md
- (what's Infrastore): kubernetes/kubernetes#44095 (comment)
- https://github.com/kubernetes/metrics
- kubernetes-sigs/metrics-server#7 (comment)
- https://web.archive.org/web/20180530051700/https://kubernetes.io/docs/tasks/debug-application-cluster/core-metrics-pipeline/
- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/resource-metrics-api.md
- https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/metrics-server.md