Description of the PR...
[marin-community/marin#5872] [iris] Robust handling of preempt-induced sibling self-exits: worker preempt-watcher + controller atomic slice-preempt
Spun off from #5753 (sibling self-exits being miscounted as TASK_STATE_FAILED). #5753 proposed broadening the TPU_INIT_FAILURE_PATTERNS substring list in lib/iris/src/iris/cluster/worker/tpu_health.py to also catch JAX-distributed-RPC peer-loss / SIGABRT signatures, so that those exits promote to TASK_STATE_WORKER_FAILED instead of counting against max_retries_failure.
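For context, a rough sketch of what stderr substring classification looks like. This is not the code in tpu_health.py: the pattern strings, the `classify_exit` helper, and the `TaskState` enum below are all hypothetical stand-ins, chosen only to show how such a list behaves on stderr it was not written for.

```python
from enum import Enum


class TaskState(Enum):
    # Stand-in for the real task state enum; only the two states discussed here.
    FAILED = "TASK_STATE_FAILED"
    WORKER_FAILED = "TASK_STATE_WORKER_FAILED"


# Hypothetical patterns for illustration only. These are NOT the actual
# contents of TPU_INIT_FAILURE_PATTERNS.
HYPOTHETICAL_PATTERNS = (
    "failed to initialize tpu",
    "coordination service",  # the kind of peer-loss signature one might add
)


def classify_exit(stderr: str, patterns=HYPOTHETICAL_PATTERNS) -> TaskState:
    """Promote to WORKER_FAILED if any pattern occurs anywhere in stderr."""
    lowered = stderr.lower()
    if any(pattern in lowered for pattern in patterns):
        return TaskState.WORKER_FAILED
    return TaskState.FAILED


# A sibling self-exit whose stderr happens to match: classified as intended.
sibling_exit = "Terminating process: coordination service reported a lost peer"

# A genuine user error that merely mentions the same phrase: wrongly promoted.
user_bug = "ValueError: bad config for coordination service address"

# A sibling self-exit with a signature nobody listed yet: still charged as FAILED.
unlisted = "SIGABRT after peer heartbeat timeout"

for text in (sibling_exit, user_bug, unlisted):
    print(classify_exit(text).value, "<-", text)
```

The second and third inputs show the two failure modes of this approach: a legitimate failure suppressed because its stderr overlaps a pattern, and a preempt-induced exit missed because its signature is not in the list yet.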
That pattern-list extension is whack-a-mole and risks suppressing legitimate failures whose stderr happens to overlap. Per @rjpower's review on #5753, the principled fix is two changes that together remove the need for stderr pattern matching for this class of failure.