@arubis
arubis / glitchtip-webhook-review.md
Last active March 25, 2026 01:24
Task review: glitchtip-alert-webhook-failure (v18)

Task Review: glitchtip-alert-webhook-failure (v18)

UUID: 6ec7ee26-e0a2-4bb3-a03a-3a57c15bba4e | Author: peterkay_86616 | Category: platform-eng | Difficulty: hard
Discord: https://discord.com/channels/1427397917685321919/1483839196408840232

Verdict: NEEDS_WORK

Solid task — bug design, mock services, and decoys are all well-crafted. Three grading structure fixes needed, then re-eval on biggie. Bug set and setup are fine as-is.


@arubis
arubis / airgap-imagepull-cascade.md
Last active March 24, 2026 21:25
Air-Gap Cascade: imagePullPolicy Always destabilizes ArgoCD in Nebula

Air-Gap Hazard: imagePullPolicy: Always and ArgoCD Repo-Server Stability

When a task's setup.sh installs Helm releases with imagePullPolicy: Always in Nebula's air-gapped environment, the pods start successfully on idle machines — but under resource pressure (hosted infra, CI runners, concurrent containers), the registry pull timeout can trigger a crash-loop cascade that takes down ArgoCD's repo-server.

Discovered during review of broken-canary-gitops-migration-recovery (v22), where test-solution scored 0.0 despite the solution completing correctly.


Reproduction

@arubis
arubis / broken-canary-review.md
Last active March 24, 2026 23:05
Task Review: broken-canary-gitops-migration-recovery (v22)

Task Review: broken-canary-gitops-migration-recovery (v25)

UUID: 72edf9a5-6e74-4b21-ad8d-6dc1fce79813
Author: yashpatil2000
Category: platform | Difficulty: medium


Summary

@arubis
arubis / statefulset-diagnostic-linearity-proof.md
Created March 24, 2026 19:18
Empirical proof: PostgreSQL StatefulSet replica recovery has a linear diagnostic path (apex-arena task review)

Empirical Proof: PostgreSQL StatefulSet Replica Recovery Has a Linear Diagnostic Path

Context: This document proves that the "StatefulSet Ordinal Disruption" task scenario (PostgreSQL replica WAL/pg_control corruption after pod eviction) cannot be hardened to appropriate difficulty for apex-arena evaluation. The diagnostic path from broken state to fix is a straight line with zero decision branching.

Method: We reproduced the exact broken state inside a running Nebula environment, then traced every diagnostic command an agent would run, capturing real outputs. Each step's output unambiguously points to exactly one next step.

@arubis
arubis / grader-timeout-score-bug.md
Created March 24, 2026 15:39
apex-arena: grader timeout recorded as successful run with score 0

Grader Timeout Recorded as Successful Run (score=0)

Summary

When a grader.py times out during a hosted evaluation, the system records it as a successful run with score 0.0 rather than an error. This skews aggregate scores because timeout runs are included in the mean calculation instead of being excluded or flagged.
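The size of the skew is easy to see with a small worked example. The sketch below is illustrative only: the run records, the `timed_out` field, and both helper functions are hypothetical and are not taken from the apex-arena codebase. It compares a mean that counts timeouts as 0.0 with one that excludes them.

```python
from statistics import mean

# Hypothetical run records; the field names are assumptions for illustration.
runs = [
    {"score": 1.0, "timed_out": False},
    {"score": 0.5, "timed_out": False},
    {"score": 0.0, "timed_out": True},   # grader timeout recorded as score 0.0
    {"score": 0.0, "timed_out": True},   # grader timeout recorded as score 0.0
]


def mean_including_timeouts(records):
    """Mean over every run, which is the behavior described above."""
    return mean(r["score"] for r in records)


def mean_excluding_timeouts(records):
    """Mean over runs that actually produced a grade; None if there are none."""
    valid = [r["score"] for r in records if not r["timed_out"]]
    return mean(valid) if valid else None


print(mean_including_timeouts(runs))  # 0.375
print(mean_excluding_timeouts(runs))  # 0.75
```

With two of four runs timing out, the reported mean is half of what the graded runs alone would give.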

Observed Behavior

@arubis
arubis / oidc-token-replay-review-v25.md
Last active March 19, 2026 04:03
Task Review: oidc-token-replay-mitigation v25 — NEEDS_WORK (Keycloak realm must be specified)

Task Review: oidc-token-replay-mitigation (v25)

UUID: 3b31f3b1-2f04-4033-acbe-e3f9dd8c6343
Version: 25
Category: Security & Authentication
Difficulty: Hard
Reviewer: Dylan Fitzgerald
Status: NEEDS_WORK, with one required change (Keycloak realm specification) plus minor recommendations
@arubis
arubis / keycloak-task-review.md
Created March 17, 2026 22:42
Feedback: keycloak-oidc-token-signing-key-rotation (v8)

Feedback: keycloak-oidc-token-signing-key-rotation (v8)

Task: chrorrala
Reviewer: Dylan Fitzgerald
Version: v8 (UUID bd24c35b-157b-400b-bcdb-88e539b2467c)
Rollouts: 10 × biggie-nebula


Current State

@arubis
arubis / minio-code-and-disk-difficulty-recommendations.md
Last active March 13, 2026 02:00
Difficulty tuning recommendations for minio-code-and-disk task (v34)

Difficulty Tuning Recommendations: minio-code-and-disk (v34)

Bottom Line

The setup corrupts both format.json and xl.meta, but xl.meta corruption is detectable at runtime — so mc admin info on the running pod immediately shows data1=corrupt, data3=corrupt. Every agent (10/10) runs this command early, gets the unambiguous answer, and ignores the monitoring artifacts entirely. The diagnostic puzzle provides zero signal.

The fix has three parts, all low-effort changes to setup.sh:

  1. Fix the corruption method — corrupt only format.json (not xl.meta), using valid JSON with wrong disk UUIDs (not random bytes). This blocks both mc admin info and filesystem inspection on the running pod.
  2. Make the monitoring classifier confidently wrong — point it at data1+data4 (truth is data1+data3), creating a false consensus trap that agents must see through after restart.
@arubis
arubis / pvc-snapshot-review.md
Last active March 13, 2026 00:52
Review: Cross-Service PVC Snapshot Orchestration (59932490)

AC6 Proposal: Write Isolation as Difficulty Ratchet

Task: Cross-Service PVC Snapshot Orchestration | UUID: 59932490-21ef-4882-81c1-64a2052d8db1 | Version: 25

Context

V25 mean score is 0.458 with grader bugs in AC2 and AC5 accounting for most failures. Once those are fixed, scores will likely rise above the 0.70 threshold. AC6 — currently 0/8 pass, also due to a grader bug — is the natural place to add genuine difficulty to compensate.

The current AC6 compares latest data timestamps across restored databases and checks they're within 30 seconds. This is nearly redundant with AC3 (which already validates snapshot timing) and fails today only because of the same MongoDB readiness bug as AC5. Once that's fixed, AC6 becomes a near-freebie, since the 30-second threshold is too generous to distinguish quiesced from unquiesced snapshots.
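A minimal sketch of the kind of check described above shows why the threshold does not discriminate. This is not the task's actual grader: the function name, the database names, and the 12-second skew are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def timestamps_within(latest_by_db, threshold=timedelta(seconds=30)):
    """True if the newest and oldest 'latest data' timestamps differ by at most threshold."""
    values = list(latest_by_db.values())
    return max(values) - min(values) <= threshold


base = datetime(2026, 3, 13, 0, 0, 0)

# Quiesced: writes were paused, so every database stops at the same instant.
quiesced = {"postgres": base, "mongodb": base, "redis": base}

# Unquiesced (hypothetical skew): snapshots taken a few seconds apart under live writes.
unquiesced = {
    "postgres": base,
    "mongodb": base + timedelta(seconds=5),
    "redis": base + timedelta(seconds=12),
}

print(timestamps_within(quiesced))    # True
print(timestamps_within(unquiesced))  # True
```

Both cases pass under a 30-second window, so a check of this shape cannot tell whether writes were actually quiesced.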

@arubis
arubis / redis-cluster-author-response.md
Last active March 12, 2026 00:49
Addressing "undisclosed spec" issue from nebula-reviewer

Are the "undisclosed spec" findings real? Yes — here's the evidence

Review feedback for Redis Cluster Slot Migration Deadlock (f925de8b, v70). The author asserted that the reviewer bot's findings about undisclosed requirements were false and did not impact the solution. We re-examined the grader, the environment, and all 10 eval transcripts.

A full task review with per-check breakdown and score analysis is also available.


"The bot's recommendations about undisclosed specs are false"