This content was presented at a SIG Testing meeting on 8/25/2020 and is also available as a video recording.
- http://git.k8s.io/community/contributors/devel/sig-testing/flaky-tests.md
- https://github.com/kubernetes/kubernetes/labels/kind%2Fflake
- http://storage.googleapis.com/k8s-metrics/flakes-latest.json
- https://testgrid.k8s.io
- http://go.k8s.io/triage
- https://godoc.org/golang.org/x/tools/cmd/stress
- https://golang.org/doc/articles/race_detector.html
0:41 - Deflaking Mindset
A flake means there's a problem in one or more of these places:
- in the thing being tested
- in the test
- in the thing running the test (infrastructure)
Temptation: assume flakes are always infrastructure (#3)
We've worked really hard to improve CI infrastructure consistency so this is less often the issue
Temptation: assume flakes are always a "test-only issue"
Always work to understand why a flake is happening before compensating in the test
Examples:
- before polling/sleeping in a test, make sure the thing you're testing is supposed to be happening asynchronously
- 3:12 - before lengthening a timeout in a test, understand what is taking up the time
- sometimes the longer time is expected and we actually need to tolerate longer delays
- 4:03 sometimes the longer time is not expected and we need to fix a bug
- 5:02 - watch out for changing a test in such a way that it is no longer testing the same thing
- initial change in kubernetes/kubernetes#93622 (comment) fixed the test flake, but did not fix a real underlying issue
- final change in kubernetes/kubernetes#93622 fixed a real bug
5:56 - Finding things to fix
- Reported flakes
  - 6:14 - kubernetes/labels/kind:flake; you can filter this list of flakes by adding a SIG label, e.g. label:sig/node, giving https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+label%3Akind%2Fflake+label%3Asig%2Fnode
  - 6:45 - http://storage.googleapis.com/k8s-metrics/flakes-latest.json
- Tests you see flake in your PRs
  - before you /retest, see if there's an issue for the failed test
- Tests that flake in your SIG's jobs/packages
10:28 - Good flake reports
Example: kubernetes/kubernetes#93358
- if a test is flaking in multiple jobs, that is useful to know, as different jobs capture different runtime and network setups (you can search for the failing test or error message via http://go.k8s.io/triage)
- if a test is flaking in only one job, that is also useful to know, as it could indicate that the flake is caused by a particular runtime setup
- are there multiple tests in the same package/suite failing with the same apparent error?
- link to testgrid history for the jobs, filtered to the relevant tests
- include the failed test output (makes issue searchable)
- link to the triage query
- link to specific failures
- tag the relevant sig, if known
11:55 - Running unit tests to reproduce flakes
Example: kubernetes/kubernetes#93905
Simple attempt:
go test ./pkg/kubelet/config -run TestInvalidPodFiltered
Run bypassing the go test cache:
go test ./pkg/kubelet/config -count=1 -run TestInvalidPodFiltered
If the unit test passes when run locally then you have a flaking test.
Now we need to force it to fail.
12:53 - Run with race detection on:
go test ./pkg/kubelet/config -race -count=1 -run TestInvalidPodFiltered
If all of the above test invocations pass, use the stress tool.
Build a standalone test binary (with race detection enabled) and stress test with https://godoc.org/golang.org/x/tools/cmd/stress:
go get golang.org/x/tools/cmd/stress
go test ./pkg/kubelet/config -race -c
If the above runs still pass, use stress as follows to run many instances in parallel, over and over:
stress ./config.test -test.run TestInvalidPodFiltered
411 runs so far, 0 failures
/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20200825T115041-341977266
--- FAIL: TestInvalidPodFiltered (0.00s)
config_test.go:126: Expected no update in channel, Got types.PodUpdate{Pods:[]*v1.Pod{(*v1.Pod)(0xc00059e400)}, Op:1, Source:"test"}
FAIL
ERROR: exit status 1
815 runs so far, 1 failures
Now that we have a way of reproducing the flake, we can dig in and figure out the root cause.
14:47 - Running integration tests to reproduce flakes
Similar to unit tests, though most integration tests expect a running etcd instance.
Start etcd in another tab:
etcd
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-08-25 11:56:46.975384 I | etcdmain: etcd Version: 3.4.9
...
Example failure/flake: kubernetes/kubernetes#93496 (comment)
go test -c -race ./test/integration/endpointslice
stress ./endpointslice.test -test.run TestEndpointSliceMirroring
...
ls -lat $TMPDIR/go-stress*
cat $TMPDIR/go-stress...
Example package deadlock/timeout: kubernetes/kubernetes#93890
- Run each test individually, figure out what the average runtime is
- Stress each test individually, bounding the timeout to 100x the average run time (avoids having to wait for 2 minutes for a timeout)
- Isolate the particular test that is deadlocking
- Add debug output to figure out what is causing the deadlock
The fix is kubernetes/kubernetes#93893 ("Sync cache before starting scheduler test")
19:07 - Deflaking end-to-end (e2e) tests
When running an e2e test, "the thing being tested" is actually the whole system, which is good and bad.
Good:
- we do need to make sure our system actually works when you run the whole thing
Bad:
- An e2e test of volume subpath behavior can flake because "something" deleted the test namespace (kubernetes/kubernetes#94221)
Takeaways:
- Skew toward more unit and integration tests where possible
- Don't assume the title of the failing e2e test identifies where the issue is. An e2e test can fail because of something totally unrelated to the title.
20:44 - Gather information
Example: kubernetes/kubernetes#87668 (comment)
- Pick relevant components (e.g. test log, API server, controller manager, kubelet)
- Filter logs to relevant info, e.g.:
- things that happened around the time of the failure
- things that happened to relevant objects/namespaces
- Prefix log lines with component
- Sort by time
Example: kubernetes/kubernetes#94159 (comment)
- Log lines pointed at something in runc
- Specific lines correlated to 1.0.0-rc10
- Searching runc issues revealed a relevant issue with retrying cgroup writes
- kubernetes/kubernetes#94159 (comment)
If you don't have visibility into timing or specific code branches/interactions, adding debug logging is fine:
Example: kubernetes/kubernetes#88297 (comment)
Asynchronous operations in the thing under test
- does the test assume something that runs in a goroutine or via a watch happens synchronously?
- to verify the issue, simulate slow operations to reproduce it, e.g. by adding:
  - time.Sleep(time.Second) at the top of a goroutine
  - time.Sleep(time.Second) at the beginning of a watch event handler
  - time.Sleep(time.Second) at the end of a watch event handler
  - time.Sleep(time.Second) at the beginning of a sync loop worker
  - time.Sleep(time.Second) at the end of a sync loop worker
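As a sketch of this technique (handleAsync is a hypothetical stand-in, not real Kubernetes code), the injected delay at the top of a goroutine turns a rare scheduling race into a deterministic failure:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// handleAsync is a hypothetical stand-in for code under test that processes
// an event in a goroutine. The injected delay is the "time.Sleep at the top
// of a goroutine" trick: it simulates a slow CI machine.
func handleAsync(delay time.Duration, onEvent func()) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(delay) // injected delay to simulate a slow goroutine
		onEvent()
	}()
	return &wg
}

func main() {
	events := make(chan struct{}, 1)
	done := handleAsync(100*time.Millisecond, func() { events <- struct{}{} })

	// Buggy assertion: checks immediately, assuming synchronous delivery.
	select {
	case <-events:
		fmt.Println("event observed immediately")
	default:
		fmt.Println("no event yet: the test wrongly assumed synchronous delivery")
	}

	// Correct assertion: wait for the asynchronous work before checking.
	done.Wait()
	<-events
	fmt.Println("event observed after waiting for the goroutine")
}
```

With the delay injected, the buggy branch fires every run instead of once in a thousand, which is exactly what makes the root cause diagnosable.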
Races between the test and the thing under test
- does the test conflict with some asynchronous process?
- to verify the issue, simulate the test losing the race by putting a time.Sleep(time.Second) between test steps
- Example: kubernetes/kubernetes#93496 (comment)
Tests that assume a "fast" operation
- A thing that takes < 1 second locally could take a few seconds in CI environments
- CI environments are generally more resource-constrained
- CI environments often run multiple tests in parallel
- Unless your test is specifically testing performance/timing, don't set tight timing tolerances
- wait.ForeverTestTimeout is a reasonable stand-in for "things that should not take very long"; use it instead of polling for 1-10 seconds
Incorrectly assuming deterministic output:
- map iteration in go is non-deterministic
- random allocation conflicting with static allocation in the same test
- if you are using a fake client with a watcher, it can relist/rewatch at any point
- look for specific actions in the fake client rather than asserting exact content of the full set
Always ask whether the fix belongs in the test or in the thing being tested
Example: http://issue.k8s.io/93922
A test flaked because the explicit health port conflicted with an auto-allocated node port. We could fix it by explicitly setting both in the test, or by fixing the auto-allocation to avoid explicitly specified ports.