juliandescottes · April 17, 2025 08:47
 Tips focused on quickly addressing intermittent issues in the least intrusive way.

 1. Investigation

 Check the platforms: OS and variants are relevant. 

 If it happens on a variety of platforms + variants, it's usually a good sign that this is an actual implementation issue or a basic race condition in the test.
 Try to reproduce locally. If it only fails on debug, use a debug build first. 
 Run the test with --repeat 10; if it doesn't fail, try with --verify (which takes more time).

 If you can reproduce locally & consistently, investigate this as usual.

 If not, you first need to check if you can get it regularly on try.

 If this is failing on a platform for which we have `verify` jobs, run it with rebuild 10:
 > ./mach try fuzzy path/to/test.js --rebuild 10
 > # then select 'verify !gpu !wpt and pick the platforms you want.

 If there are no `verify` jobs (e.g. windows asan), then run the whole test suite:
 > ./mach try fuzzy path/to/test_manifest.toml --rebuild 20
 > # then select the right jobs

 If you don't get any failure on try with this, try to figure out a good skip-if condition and disable the test.

 If you have failures on try, at least you'll be able to reproduce and verify potential fixes.

 First try to identify the line where the test fails. 
 This is rarely called out in the intermittent bug, so adding a comment just pointing at the line where it fails helps.

 If the test didn't fully time out, you should have a screenshot + json dump in the artifacts on try.
 Investigate those.

 If the test timed out, check if we are missing logs to know exactly where it failed. If we are, add logs.
 Note the difference between the waitFor and waitUntil test helpers. They both wait for a condition, but waitFor will time out early, which produces a real failure before the browser shuts down and gives us a screenshot and json dump.

 If the test timed out on a waitUntil call, consider switching to waitFor. Note that waitFor only waits for 5 seconds total by default, and we can't blindly use it instead of all waitUntil calls as that might be too short in some cases.
 But in general, using waitFor is recommended.
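
To make the difference concrete, here is a minimal sketch of the two behaviors. These are simplified stand-ins, not the real helpers: the names `sketchWaitUntil` and `sketchWaitFor`, the polling intervals and the try count are assumptions, chosen so that the bounded version gives up after roughly 5 seconds as described above.

```javascript
// Simplified stand-ins for the two helpers; NOT the real implementations.
// Names, polling intervals and try counts are assumptions for illustration.

// Polls forever: if the condition never becomes true, the promise never
// settles and the test only ends when the harness-level timeout kills it.
function sketchWaitUntil(condition, interval = 10) {
  return new Promise(resolve => {
    const check = () => {
      if (condition()) {
        resolve(true);
      } else {
        setTimeout(check, interval);
      }
    };
    check();
  });
}

// Polls a bounded number of times, then rejects. The rejection surfaces as a
// regular test failure while the browser is still running.
function sketchWaitFor(condition, message = "", interval = 100, maxTries = 50) {
  return new Promise((resolve, reject) => {
    let tries = 0;
    const check = () => {
      const result = condition();
      if (result) {
        resolve(result);
      } else if (++tries >= maxTries) {
        reject(new Error(`Condition not met after ${tries} tries: ${message}`));
      } else {
        setTimeout(check, interval);
      }
    };
    check();
  });
}
```

If a condition legitimately needs more than the default budget, the sketch suggests the obvious knob: pass a larger try count or interval for that one call instead of falling back to the unbounded version.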

 If the test timed out waiting for an event or a random promise, recording a video on try might be a good idea. This only works on Linux and Mac.

 Try to find a regression bug. This is rarely fruitful, because it's hard to pinpoint the specific push where the intermittent started spiking. But it's still worth looking at the dates where it started failing.

 2. Mitigation

 I assume you have looked at the failure, the artifacts, maybe recorded a video, but still can't find a way to fix the root cause, or even understand the root cause. What can you do then, short of disabling the test?

 Try to wait. If a step times out, try waiting a bit before the action which triggered it, for example with `await wait(1000)`, and check if that fixes it.
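
As a rough sketch of where the delay goes: the `wait` below is a stand-in defined only so the snippet runs on its own, and `runFlakyStep`, `prepare` and `triggerAction` are hypothetical names for whatever the real test does around the step that times out.

```javascript
// Stand-in for the `wait` helper so the sketch is self-contained.
function wait(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Hypothetical test step: `prepare` and `triggerAction` stand for whatever
// the real test does before and during the step that times out.
async function runFlakyStep(prepare, triggerAction, delayMs = 1000) {
  await prepare();
  // Mitigation: give pending work time to settle before triggering the action.
  await wait(delayMs);
  return triggerAction();
}
```

If the delay makes the failure go away on try, that is also a hint that the action was racing with something that finishes during the pause.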

 Try to split. Sometimes tests are just too long and a previous step is not properly cleaning up everything. You can investigate the weird race conditions, bad cleanups, etc. for hours. Or you can split. Split a long add_task into several tasks so that each task has its own tab, its own toolbox, etc. Sometimes that's enough to fix an intermittent.
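
The shape of such a split might look like the sketch below. Everything except the `add_task` name is an assumption: the tiny `add_task`/`runTasks` pair only imitates the harness so the snippet runs standalone, and `openTab`/`closeTab` are hypothetical placeholders for the real setup and cleanup helpers.

```javascript
// Minimal stand-in for the test harness so the sketch is runnable on its own.
// In a real test, add_task is provided by the harness.
const tasks = [];
function add_task(fn) {
  tasks.push(fn);
}
async function runTasks() {
  for (const task of tasks) {
    await task();
  }
}

// Hypothetical helpers: a "tab" is just a plain object here.
let nextTabId = 0;
const openTabs = new Set();
async function openTab() {
  const tab = { id: nextTabId++ };
  openTabs.add(tab);
  return tab;
}
async function closeTab(tab) {
  openTabs.delete(tab);
}

// Instead of one long task running step A then step B in the same tab,
// each step gets its own task with its own tab, so leftover state from
// step A cannot leak into step B.
const tabsUsed = [];

add_task(async function testStepA() {
  const tab = await openTab();
  tabsUsed.push(tab.id);
  // ... assertions for step A ...
  await closeTab(tab);
});

add_task(async function testStepB() {
  const tab = await openTab();
  tabsUsed.push(tab.id);
  // ... assertions for step B ...
  await closeTab(tab);
});
```

The cost is some repeated setup per task, which is usually a fair trade for isolation when chasing an intermittent.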

 Another good last resort is to isolate the part of the test which fails in a new file. If a test is very long, tests too many things, but only fails on one specific step, it's a shame to disable the whole test just because of this. Isolate this in a new test file. Check that the remainder of the test now works fine, and proceed to disable your new test. Or maybe it makes it easier to investigate and you can now find a fix.

 In general, avoid tests which are too long. I know there's been an initiative to add longer, more comprehensive tests. If they're rock solid, why not. But as soon as they start being intermittent, it's a pain. Keep your tests small and focused.

 3. Try your fix

 If you reproduced on try, a good tip to get results faster (and cheaper) is to use the `-E` argument when doing a try push.
 It will attempt to reuse the tasks from your previous push (see try fuzzy help for more details). Very useful for jobs requiring slow full builds (looking at you tsan and asan :wave:)