REST APIs
- built-in go-based APIs
- custom resources
- x-k8s.io - experimental, fast prototyping
- k8s.io - "official", get API reviewed
- most difficult to change over time
- all (non-alpha) versions have to round-trip to each other losslessly
- all additions (to non-alpha version) have to preserve existing semantics for previous definitions
- most visible to users
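The round-trip requirement above can be illustrated with a minimal sketch. The `WidgetV1beta1` and `WidgetV1` types and their fields are hypothetical and exist only for illustration; real Kubernetes conversions are generated and go through an internal type.

```go
package main

import "fmt"

// WidgetV1beta1 and WidgetV1 are hypothetical versions of one API type.
// v1 renamed Replicas to Size; both versions carry every field.
type WidgetV1beta1 struct {
	Name     string
	Replicas int
	Color    string
}

type WidgetV1 struct {
	Name  string
	Size  int
	Color string
}

// toV1 converts a v1beta1 object to v1 without dropping any field.
func toV1(in WidgetV1beta1) WidgetV1 {
	return WidgetV1{Name: in.Name, Size: in.Replicas, Color: in.Color}
}

// toV1beta1 converts a v1 object back to v1beta1.
func toV1beta1(in WidgetV1) WidgetV1beta1 {
	return WidgetV1beta1{Name: in.Name, Replicas: in.Size, Color: in.Color}
}

// roundTrips reports whether v1 -> v1beta1 -> v1 returns the original object.
func roundTrips(in WidgetV1) bool {
	return toV1(toV1beta1(in)) == in
}

func main() {
	w := WidgetV1{Name: "demo", Size: 3, Color: "red"}
	fmt.Println("round-trips losslessly:", roundTrips(w))
}
```

If `Color` existed only in v1, the conversion to v1beta1 would drop it and the check would fail, which is why a new field generally has to be representable in every served version.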
Command line flags
- distinction between admin-facing (kube-apiserver) and user-facing (kubectl)
Config files
- can provide defaults
- can be versioned
- give a way to improve defaults over time at a config file version boundary
- indicate stability level of a feature/config format
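As a rough sketch of improving defaults at a config file version boundary: the `Config` type, the `example.config/...` version strings, and the default values below are all hypothetical.

```go
package main

import "fmt"

// Config is a hypothetical component configuration file.
type Config struct {
	APIVersion string
	// CacheSize is a pointer so "unset" can be told apart from an explicit 0.
	CacheSize *int
}

// applyDefaults fills unset fields with the default for the file's version.
// Existing v1beta1 files keep the old default; v1 files get the improved one.
func applyDefaults(c Config) (Config, error) {
	if c.CacheSize != nil {
		return c, nil // an explicit value always wins
	}
	var def int
	switch c.APIVersion {
	case "example.config/v1beta1":
		def = 100
	case "example.config/v1":
		def = 1000
	default:
		return c, fmt.Errorf("unknown config version %q", c.APIVersion)
	}
	c.CacheSize = &def
	return c, nil
}

func main() {
	for _, v := range []string{"example.config/v1beta1", "example.config/v1"} {
		c, err := applyDefaults(Config{APIVersion: v})
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(v, "default cache size:", *c.CacheSize)
	}
}
```

A deployer who has not touched their v1beta1 file sees no behavior change; the better default only applies once they opt in to the newer version.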
"Backend" APIs
- grpc
- container runtime interface (CRI)
- container networking interface (CNI)
- container storage interface (CSI)
- kube-apiserver storage transformers
- kube-apiserver network proxy
- exec+json
- client-go exec credential plugins
- kubelet exec credential plugins
As leads:
- Know what APIs are in your area
- Ensure people working on APIs in your area are familiar with API conventions
- If we do our job, people build things to integrate with the APIs we make
- Clients call REST APIs
- Integrations build support for backend APIs
- Deployers script and configure command lines and config files
Alpha
- not enabled by default
- can increment and drop previous alpha versions without migration
- lessons learned
- always have a clear picture of how you will transition to beta if the alpha goes well
- alpha annotations were a disaster to transition to API fields; ended up supporting both in parallel, poorly
- be explicit that something is alpha... make someone work to enable it
- things that work that we enabled by default, didn't make clear were alpha, and left unchanged for years are treated as GA (--node-labels)
- alpha is the time for fast iteration without the compatibility tax
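A minimal sketch of "make someone work to enable it": an alpha feature defaults to off and only turns on through an explicit override. The feature names and the `featureGates` type are hypothetical; Kubernetes components expose a comparable opt-in through the `--feature-gates` flag, whose real implementation this sketch does not use.

```go
package main

import "fmt"

type stage string

const (
	alpha stage = "ALPHA"
	beta  stage = "BETA"
)

// featureSpec records a feature's stability level and default state.
type featureSpec struct {
	Stage   stage
	Default bool
}

// knownFeatures lists hypothetical features; the alpha one is off by default.
var knownFeatures = map[string]featureSpec{
	"ShinyNewThing": {Stage: alpha, Default: false},
	"SettledThing":  {Stage: beta, Default: true},
}

// featureGates holds explicit overrides supplied by an administrator.
type featureGates struct {
	overrides map[string]bool
}

func newFeatureGates() *featureGates {
	return &featureGates{overrides: map[string]bool{}}
}

// Set records an explicit opt-in or opt-out for a known feature.
func (g *featureGates) Set(name string, enabled bool) error {
	if _, ok := knownFeatures[name]; !ok {
		return fmt.Errorf("unknown feature %q", name)
	}
	g.overrides[name] = enabled
	return nil
}

// Enabled returns the override if one was set, else the feature's default.
func (g *featureGates) Enabled(name string) bool {
	if v, ok := g.overrides[name]; ok {
		return v
	}
	return knownFeatures[name].Default
}

func main() {
	g := newFeatureGates()
	fmt.Println("before opt-in:", g.Enabled("ShinyNewThing"))
	if err := g.Set("ShinyNewThing", true); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("after opt-in:", g.Enabled("ShinyNewThing"))
}
```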
Beta
- typically enabled by default
- must be forward convertible to next beta version or to GA version
- lessons learned
- keep focus on improving and moving towards GA
- CRDs took 2.5 years (10 releases) to go from v1beta1 to v1, accumulated enormous use on flawed beta versions in the meantime, and another 2 years (6 releases) to deprecate and stop serving the v1beta1 version
- perma-beta effectively gets treated as GA (people run businesses on these things)
- limited lifetime (3 releases until deprecation, 3 releases until removal)
- be confident you've resolved usability/scale/expressiveness issues before going from alpha to beta
- backwards compatibility and round-tripping with flawed beta versions is hard
GA
- Indicates long-term support, should be a signal for high quality (well-tested in multiple dimensions: functionality, scale, etc.)
- Take compatibility super seriously for changes to an API once it is GA
- Always be asking the question: is there any way this change/addition could break a currently successful user?
- Example: tightening validation
- we err on the side of continuing to allow flawed values we previously accepted if it was possible to be successful with those flawed values
- kubernetes/kubernetes#64841
- Compatibility guarantees / minimum timelines for changing
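One common way to tighten validation without breaking a currently successful user is to enforce the stricter rule on create and on changed values, while tolerating an already-persisted value that stays unchanged on update. The sketch below assumes a hypothetical lowercase-only naming rule; it is not a description of the linked issue.

```go
package main

import (
	"fmt"
	"strings"
)

// validName is a hypothetical rule added after GA: no uppercase letters.
func validName(name string) bool {
	return name == strings.ToLower(name)
}

// validateCreate applies the tightened rule to all new objects.
func validateCreate(name string) error {
	if !validName(name) {
		return fmt.Errorf("name %q must be lowercase", name)
	}
	return nil
}

// validateUpdate tolerates a previously accepted value as long as it is
// unchanged, and applies the tightened rule to any new value.
func validateUpdate(oldName, newName string) error {
	if newName == oldName {
		return nil // already persisted; rejecting it would break the update
	}
	return validateCreate(newName)
}

func main() {
	fmt.Println(validateCreate("Legacy"))           // rejected for new objects
	fmt.Println(validateUpdate("Legacy", "Legacy")) // existing object still updatable
	fmt.Println(validateUpdate("Legacy", "Other"))  // a changed value must be valid
}
```

The existing object stays usable, and invalid values cannot spread to new objects or new edits.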
As leads:
- Know the stability level of the APIs in your area
- In general, prioritize stabilizing/graduating those APIs (or deprecating/dropping non-GA APIs) over introducing new features (parallel work is fine, but lots of new alpha APIs without progressing existing ones accumulates poorly supported features)
The bigger the surface area:
- the harder it is to test thoroughly
- the harder it is for users to learn/use
- the more unanticipated combinations/interactions there can be
- the harder it is to support and evolve while staying compatible
Lessons learned
- PodSecurityPolicy
- tried to provide super expressive, fine-grained policy and defaulting control over a big chunk of the Pod API (which itself is very big)
- ran into trouble staying backwards compatible while adding support for new Pod capabilities while remaining usable
- some fields defaulted permissive for compatibility (controlling new Pod fields that allowed lowering permissions)
- some fields defaulted restrictive for compatibility (controlling new Pod fields that allowed raising permissions)
- replacement (in progress) has a much smaller surface area (level=privileged|baseline|restricted, optional version)
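To show how small the replacement's surface is, the whole policy can be modeled as a level plus an optional version. The `Policy` type and `parsePolicy` function are hypothetical, and treating an empty version as "latest" is an assumption of this sketch.

```go
package main

import "fmt"

// Policy is a hypothetical model of the replacement's entire input.
type Policy struct {
	Level   string
	Version string
}

// parsePolicy accepts one of three levels and an optional version.
func parsePolicy(level, version string) (Policy, error) {
	switch level {
	case "privileged", "baseline", "restricted":
	default:
		return Policy{}, fmt.Errorf("unknown level %q", level)
	}
	if version == "" {
		version = "latest" // assumed default for this sketch
	}
	return Policy{Level: level, Version: version}, nil
}

func main() {
	p, err := parsePolicy("baseline", "")
	fmt.Println(p, err)
}
```

Three levels and a version are far easier to test exhaustively than per-field controls over the Pod API.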
As leads:
- push back on introducing complexity (sometimes unavoidable, but always worth questioning)
- push towards layers instead of options (a simple boolean option can ~double the test matrix for a component)
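The test-matrix point is plain arithmetic: n independent boolean options give 2^n combinations, so each added option doubles the count. A small sketch that enumerates them, using hypothetical option names:

```go
package main

import "fmt"

// combinations returns every on/off assignment for the given boolean options.
func combinations(options []string) []map[string]bool {
	n := len(options)
	out := make([]map[string]bool, 0, 1<<n)
	for mask := 0; mask < 1<<n; mask++ {
		c := make(map[string]bool, n)
		for i, o := range options {
			// bit i of mask decides whether option i is enabled
			c[o] = mask&(1<<i) != 0
		}
		out = append(out, c)
	}
	return out
}

func main() {
	opts := []string{"optA", "optB", "optC"} // hypothetical option names
	fmt.Println(len(combinations(opts)))
	fmt.Println(len(combinations(append(opts, "optD"))))
}
```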
Especially true for REST APIs, but most of these are true to some degree for most types of versioned APIs (REST, config, backend)
- Time to design
- Time to change
- Time to implement
- Time to review
- Target completing API implementations in the first few weeks of a development cycle
- Actively coordinate with API reviewer to set up time for review
- Allow O(week) for initial API review
- Not uncommon to have several review cycles
- Time to {unit,integration,e2e,scale} test
- Time to get feedback from users on alpha versions
- at least a release
- Time to promote to beta
- at least a release
- Time to document
- Give examples and use cases for typical user scenarios
- Writing good docs in parallel with implementation often exposes user experience difficulties at a point where we can actually react by improving the implementation
- https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-request-and-response
- https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-configuration
- Time to promote to GA
- at least a release
- Time to conformance test (REST API specific)
- general expectation is that new built-in REST APIs will be included in conformance
- if they are not generally safe or feasible for all clusters to enable, or are not broadly applicable enough to be in conformance, it might be a sign that they should not be built in
As leads, when planning a feature that involves an API:
- coordinate timing and bandwidth on the implementation and reviewer side
- ideally in the KEP phase ahead of a development cycle
- ensure there's a plausible roadmap from alpha to GA
- baked into the KEP and PRR processes, but actually think about the questions as an author or reviewer
- understand which steps require release boundaries
- for those steps, prioritize them early in a release cycle; the slower cadence of 3 releases a year makes missing a planned release more significant
- have a plan for who will be doing the work across releases (new API is a ~year process)
Kill flakes: https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0
Overview/requirements
GA bar is "does it work well, is it supportable, scalable, bug free, well tested, etc"
Conformance bar is higher: "GA + users expect this to be enabled in 100% of clusters + cluster providers can reasonably enable this"
No alpha or beta features can be in conformance
Tips:
As leads:
- When planning features, understand if they will be included in conformance testing
- baked into the KEP process, but actually think about conformance implications from a cluster provider and user perspective
- Ensure test plans get conformance-eligible tests in place early
- Structure tests so it is easy to switch from beta to GA endpoint (e.g. import myapi "k8s.io/api/myapi/v1beta1") without rewriting entire test
- Pay attention to test flakiness (always good, but required for conformance tests)
- Pay attention to test coverage during beta
- Aim for 100% non-flaky coverage during beta
- Makes switching test to v1 trivial
- https://apisnoop.cncf.io/
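The import-alias tip for tests can be sketched without pulling in Kubernetes modules. Here a Go type alias stands in for the aliased import, so this illustrates the idea rather than real test code; the `Widget` types and `scaleUp` function are hypothetical.

```go
package main

import "fmt"

// WidgetV1beta1 and WidgetV1 stand in for the same type in two API packages.
type WidgetV1beta1 struct {
	Name     string
	Replicas int
}

type WidgetV1 struct {
	Name     string
	Replicas int
}

// Widget is the only name the test refers to. Pointing it at WidgetV1 at GA
// plays the role of changing the import path behind the myapi alias.
type Widget = WidgetV1beta1

// scaleUp is a hypothetical operation under test.
func scaleUp(w Widget) Widget {
	w.Replicas++
	return w
}

// checkScaleUp is the test body; it never names a specific version.
func checkScaleUp() error {
	got := scaleUp(Widget{Name: "demo", Replicas: 1})
	if got.Replicas != 2 {
		return fmt.Errorf("expected 2 replicas, got %d", got.Replicas)
	}
	return nil
}

func main() {
	fmt.Println("check result:", checkScaleUp())
}
```

Because the test body only mentions `Widget`, promoting the test to the GA version is a one-line change.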
"internal"
- https://github.com/kubernetes/kubernetes/
- holds core Kubernetes binaries (kube-apiserver, kube-controller-manager, kubelet, kube-scheduler, kube-proxy, etc)
- not intended for use as a library by applications outside kubernetes/kubernetes
"staging"
- https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io
- published as stand-alone modules
- can be used as libraries by other applications
As leads:
- pay attention to where code is going
- things in "staging" should be expected to be consumed outside kubernetes/kubernetes
"vendor"
- Dependencies, Dependencies, Dependencies, Dependencies
- we have too many dependencies
- they cause gridlock
- A -> Cv1, B -> Cv2
- increase surface area for security issues
- makes resolving those security issues more difficult
- increases likelihood of depending on unmaintained things
- process for updating dependencies
- improvements we're trying to make
- visibility to whether a dependency change is good or bad
- check-dependency-stats CI job (highlights delta impact of a PR)
- https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/103099/check-dependency-stats/1408279189122977792
- work with upstreams to reduce dependencies. random examples:
- cadvisor:
- go-kit/kit: prometheus/common#304
As leads:
- be aware of key dependencies your area has
- node: depends on cadvisor, runc
- api-machinery: depends on json/yaml libraries
- etc
- work on processes for picking up security/bugfix issues in those dependencies
- be aware of problematic characteristics, plan to isolate and drop those
- cloud provider: extract to standalone binaries --> drop cloud provider dependencies
- storage: volume plugin extraction to CSI --> drop volume plugin dependencies
- node: dockershim deprecation to CRI --> drop docker dependencies