API Changes

What APIs?

REST APIs
- built-in Go-based APIs
- custom resources
  - x-k8s.io - experimental, fast prototyping
  - k8s.io - "official", get API reviewed
- most difficult to change over time
  - all (non-alpha) versions have to round-trip to each other losslessly
  - all additions (to non-alpha version) have to preserve existing semantics for previous definitions
- most visible to users
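
The round-trip requirement above can be illustrated with a small sketch. Everything here is hypothetical (the Widget types, their fields, and the conversion helpers are not real Kubernetes APIs); it only shows the shape of a check that converting between two served versions and back loses nothing.

```go
package main

import (
	"fmt"
	"reflect"
)

// WidgetV1beta1 and WidgetV1 are hypothetical versions of one resource.
type WidgetV1beta1 struct {
	Name  string
	Port  int   // original single-port field
	Ports []int // added later so multi-port v1 objects survive a round trip
}

type WidgetV1 struct {
	Name  string
	Ports []int
}

// toV1beta1 converts a v1 object to the older version without dropping data.
func toV1beta1(in WidgetV1) WidgetV1beta1 {
	out := WidgetV1beta1{Name: in.Name, Ports: append([]int(nil), in.Ports...)}
	if len(in.Ports) > 0 {
		out.Port = in.Ports[0] // keep the legacy field populated for old clients
	}
	return out
}

// toV1 converts a v1beta1 object to v1.
func toV1(in WidgetV1beta1) WidgetV1 {
	ports := append([]int(nil), in.Ports...)
	if len(ports) == 0 && in.Port != 0 {
		ports = []int{in.Port} // object written by a client that only knows Port
	}
	return WidgetV1{Name: in.Name, Ports: ports}
}

// roundTrips reports whether v1 -> v1beta1 -> v1 preserves the object.
func roundTrips(in WidgetV1) bool {
	return reflect.DeepEqual(in, toV1(toV1beta1(in)))
}

func main() {
	w := WidgetV1{Name: "web", Ports: []int{80, 443}}
	fmt.Println("round-trips losslessly:", roundTrips(w))
}
```

If the beta type had kept only Port, the second port would be lost on the way back, which is the kind of flaw that is expensive to fix once a version is served by default.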
Command line flags
- distinction between admin-facing (kube-apiserver) and user-facing (kubectl)
Config files
- can provide defaults
- can be versioned
- give a way to improve defaults over time at a config file version boundary
- indicate stability level of a feature/config format
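
As a rough sketch of improving a default at a config version boundary: the config type, the version strings, and the default values below are all made up for illustration. The point is only that a new file version can pick a better default without changing behavior for files written against the old version.

```go
package main

import "fmt"

// Config is a hypothetical component config; the field names and
// version strings are illustrative only.
type Config struct {
	APIVersion string
	Workers    *int // nil means "not set by the user"
}

// applyDefaults fills unset fields with the default for the file's version.
func applyDefaults(c Config) (Config, error) {
	var def int
	switch c.APIVersion {
	case "example.config/v1alpha1":
		def = 1 // original default, kept for existing files
	case "example.config/v1beta1":
		def = 4 // improved default, only for the newer version
	default:
		return c, fmt.Errorf("unsupported config version %q", c.APIVersion)
	}
	if c.Workers == nil {
		c.Workers = &def
	}
	return c, nil
}

func main() {
	for _, v := range []string{"example.config/v1alpha1", "example.config/v1beta1"} {
		c, err := applyDefaults(Config{APIVersion: v})
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s defaults Workers to %d\n", v, *c.Workers)
	}
}
```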
"Backend" APIs
- gRPC
  - Container Runtime Interface (CRI)
  - Container Storage Interface (CSI)
  - kube-apiserver storage transformers
  - kube-apiserver network proxy
- exec+json
  - Container Network Interface (CNI) plugins
  - client-go exec credential plugins
  - kubelet exec credential plugins
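
To make the exec+json style concrete, here is a minimal sketch of a client-go exec credential plugin: a binary that prints an ExecCredential object as JSON on stdout. The structs are hand-written stand-ins rather than the client-go types, the token is a placeholder, and the apiVersion shown is an assumption about which version the client is configured to request.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// execCredential is a hand-written stand-in for the ExecCredential object;
// a real plugin would more likely use the client-go types.
type execCredential struct {
	APIVersion string               `json:"apiVersion"`
	Kind       string               `json:"kind"`
	Status     execCredentialStatus `json:"status"`
}

type execCredentialStatus struct {
	Token string `json:"token"`
}

// newExecCredential wraps a bearer token in the response object.
func newExecCredential(token string) execCredential {
	return execCredential{
		// Assumed version; it has to match what the client asks for.
		APIVersion: "client.authentication.k8s.io/v1beta1",
		Kind:       "ExecCredential",
		Status:     execCredentialStatus{Token: token},
	}
}

func main() {
	// Placeholder token; a real plugin would obtain it from an identity provider.
	cred := newExecCredential("example-token")
	if err := json.NewEncoder(os.Stdout).Encode(cred); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The JSON document is the API here: once integrations parse it, its fields carry the same compatibility obligations as a REST type.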

As leads:

Know what APIs are in your area
Ensure people working on APIs in your area are familiar with API conventions

(Good) APIs are stable

If we do our job, people build things to integrate with the APIs we make
Clients call REST APIs
Integrations build support for backend APIs
Deployers script and configure command lines and config files

Be super clear about stability levels

Alpha
- not enabled by default
- can increment and drop previous alpha versions without migration
- lessons learned
  - always have a clear picture of how you will transition to beta if the alpha goes well
    - alpha annotations were a disaster to transition to API fields; ended up supporting both in parallel, poorly
  - be explicit that something is alpha... make someone work to enable it
    - features that worked, were enabled by default, were never clearly marked alpha, and went unchanged for years end up being treated as GA (e.g. --node-labels)
  - alpha is the time for fast iteration without the compatibility tax
Beta
- typically enabled by default
- must be forward convertible to next beta version or to GA version
- lessons learned
  - keep focus on improving and moving towards GA
    - CRDs took 2.5 years (10 releases) to go from v1beta1 to v1, accumulated enormous use on flawed beta versions in the meantime, and another 2 years (6 releases) to deprecate and stop serving the v1beta1 version
    - perma-beta effectively gets treated as GA (people run businesses on these things)
  - limited lifetime (deprecated after 3 releases, removed 3 releases after that)
    - https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/1635-prevent-permabeta
  - be confident you've resolved usability/scale/expressiveness issues before going from alpha to beta
    - backwards compatibility and round-tripping with flawed beta versions is hard
GA
- Indicates long-term support, should be a signal for high quality (well-tested in multiple dimensions: functionality, scale, etc.)
- Take compatibility super seriously for changes to an API once it is GA
- Always be asking the question: is there any way this change/addition could break a currently successful user?
- Example: tightening validation
  - we err on the side of continuing to allow flawed values we previously accepted if it was possible to be successful with those flawed values
  - kubernetes/kubernetes#64841
- Compatibility guarantees / minimum timelines for changing
  - https://kubernetes.io/docs/reference/using-api/deprecation-policy/
  - https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-flag-or-cli
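
One way to tighten validation without breaking a currently successful user is to apply the stricter rule only to values that are new or changing. The sketch below assumes a made-up Spec type and a made-up "no spaces" rule; it is not taken from the issue referenced above and only illustrates the idea.

```go
package main

import (
	"fmt"
	"strings"
)

// Spec is a hypothetical object with one validated field.
type Spec struct{ Label string }

// checkLabel is the new, stricter rule (illustrative): no spaces allowed.
func checkLabel(label string) error {
	if strings.Contains(label, " ") {
		return fmt.Errorf("label %q must not contain spaces", label)
	}
	return nil
}

// validateCreate applies the strict rule to every new object.
func validateCreate(s Spec) error {
	return checkLabel(s.Label)
}

// validateUpdate tolerates a previously accepted value as long as it is
// unchanged, so existing objects can still be updated in other ways.
func validateUpdate(old, updated Spec) error {
	if updated.Label == old.Label {
		return nil
	}
	return checkLabel(updated.Label)
}

func main() {
	stored := Spec{Label: "legacy value"} // accepted before the rule existed
	fmt.Println("create:", validateCreate(stored))
	fmt.Println("no-op update:", validateUpdate(stored, stored))
	fmt.Println("change to bad value:", validateUpdate(stored, Spec{Label: "still bad"}))
}
```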

As leads:

Know the stability level of the APIs in your area
In general, prioritize stabilizing/graduating those APIs (or deprecating/dropping non-GA APIs) over introducing new features. Parallel work is fine, but adding lots of new alpha APIs without progressing existing ones accumulates poorly supported features.

Good APIs are as small as possible

The bigger the surface area:

the harder it is to test thoroughly
the harder it is for users to learn/use
the more unanticipated combinations/interactions there can be
the harder it is to support and evolve while staying compatible

Lessons learned

PodSecurityPolicy
- tried to provide super expressive, fine-grained policy and defaulting control over a big chunk of the Pod API (which itself is very big)
- ran into trouble staying both backwards compatible and usable while adding support for new Pod capabilities
  - some fields defaulted permissive for compatibility (controlling new Pod fields that allowed lowering permissions)
  - some fields defaulted restrictive for compatibility (controlling new Pod fields that allowed raising permissions)
- replacement (in progress) has a much smaller surface area (level=privileged|baseline|restricted, optional version)
  - https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/2579-psp-replacement

As leads:

push back on introducing complexity (sometimes unavoidable, but always worth questioning)
push towards layers instead of options (a simple boolean option can ~double the test matrix for a component)
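
The "~double the test matrix" point is simple arithmetic: n independent boolean options give 2^n configurations. A small sketch, with placeholder option names, that enumerates them:

```go
package main

import "fmt"

// combinations returns every on/off assignment of the given boolean options.
func combinations(options []string) []map[string]bool {
	total := 1 << len(options)
	out := make([]map[string]bool, 0, total)
	for mask := 0; mask < total; mask++ {
		combo := make(map[string]bool, len(options))
		for i, name := range options {
			combo[name] = mask&(1<<i) != 0
		}
		out = append(out, combo)
	}
	return out
}

func main() {
	// Placeholder option names, not real flags.
	opts := []string{"optionA", "optionB", "optionC"}
	for n := 0; n <= len(opts); n++ {
		fmt.Printf("%d options -> %d configurations\n", n, len(combinations(opts[:n])))
	}
}
```

Real options are rarely fully independent, so the true matrix may be smaller, but each added option still has to be tested against the ones it can interact with.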

Good APIs take time

Especially true for REST APIs, but most of these are true to some degree for most types of versioned APIs (REST, config, backend)

Time to design
Time to change
Time to implement
Time to review
- Target completing API implementations in the first few weeks of a development cycle
- Actively coordinate with API reviewer to set up time for review
- Allow O(week) for initial API review
- Not uncommon to have several review cycles
Time to {unit,integration,e2e,scale} test
Time to get feedback from users on alpha versions
- at least a release
Time to promote to beta
- at least a release
Time to document
- Give examples and use cases for typical user scenarios
- Writing good docs in parallel with implementation often exposes user experience difficulties at a point where we can actually react by improving the implementation
- https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-request-and-response
- https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-configuration
Time to promote to GA
- at least a release
Time to conformance test (REST API specific)
- general expectation is that new built-in REST APIs will be included in conformance
- if they are not generally safe or feasible for all clusters to enable, or are not broadly applicable enough to be in conformance, it might be a sign that they should not be built in

As leads, when planning a feature that involves an API:

coordinate timing and bandwidth on the implementation and reviewer side
- ideally in the KEP phase ahead of a development cycle
ensure there's a plausible roadmap from alpha to GA
- baked into the KEP and PRR processes, but actually think about the questions as an author or reviewer
- understand which steps require release boundaries
- prioritize those steps early in a release cycle; with the slower cadence of 3 releases a year, missing a planned release costs more time
- have a plan for who will be doing the work across releases (new API is a ~year process)

Tests and test infrastructure

Kill flakes: https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0

Conformance

Overview/requirements

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md#conformance-test-requirements

GA bar is "does it work well, is it supportable, scalable, bug free, well tested, etc"

Conformance bar is higher: "GA + users expect this to be enabled in 100% of clusters + cluster providers can reasonably enable this"

No alpha or beta features can be in conformance

https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/1333-conformance-without-beta

Tips:

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/writing-good-conformance-tests.md

As leads:

When planning features, understand if they will be included in conformance testing
- baked into the KEP process, but actually think about conformance implications from a cluster provider and user perspective
Ensure test plans get conformance-eligible tests in place early
Structure tests so it is easy to switch from the beta to the GA endpoint (e.g. import myapi "k8s.io/api/myapi/v1beta1") without rewriting the entire test
Pay attention to test flakiness (always good, but required for conformance tests)
Pay attention to test coverage during beta
- Aim for 100% non-flaky coverage during beta
- Makes switching test to v1 trivial
https://apisnoop.cncf.io/

Code Organization

"internal"

https://github.com/kubernetes/kubernetes/
holds core Kubernetes binaries (kube-apiserver, kube-controller-manager, kubelet, kube-scheduler, kube-proxy, etc)
not intended for use as a library by applications outside kubernetes/kubernetes

"staging"

https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io
published as stand-alone modules
can be used as libraries by other applications

As leads:

pay attention to where code is going
things in "staging" should be expected to be consumed outside kubernetes/kubernetes

"vendor"

Dependencies, Dependencies, Dependencies, Dependencies
we have too many dependencies
- they cause gridlock
  - A -> Cv1, B -> Cv2
- increase surface area for security issues
- make resolving those security issues more difficult
- increase the likelihood of depending on unmaintained things
process for updating dependencies
- https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/vendor.md
improvements we're trying to make
- visibility to whether a dependency change is good or bad
  - check-dependency-stats CI job (highlights delta impact of a PR)
  - https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/103099/check-dependency-stats/1408279189122977792
- work with upstreams to reduce dependencies. random examples:
  - cadvisor:
    - google/cadvisor#2209
    - google/cadvisor#2437
    - allowed kubernetes/kubernetes#76291
  - go-kit/kit: prometheus/common#304

As leads:

be aware of key dependencies your area has
- node: depends on cadvisor, runc
- api-machinery: depends on json/yaml libraries
- etc
work on processes for picking up security/bugfix issues in those dependencies
be aware of problematic characteristics, plan to isolate and drop those
- cloud provider: extract to standalone binaries --> drop cloud provider dependencies
- storage: volume plugin extraction to CSI --> drop volume plugin dependencies
- node: dockershim deprecation to CRI --> drop docker dependencies

liggitt/2021-06-25-leads.md
