Conner Swann yourbuddyconner

Retry-After Headers for Transient Errors

Problem Statement

Today, when requests fail due to transient conditions (downstream timeouts, rate-limit blocks), callers have no machine-readable signal for how long to wait before retrying. Messages like "Resource exhausted; please wait a minute and try again" are human-readable but not actionable by SDKs or CI harnesses. This forces clients to use fixed sleep times or blind retry loops.

Note: the codebase already has error handling patterns that preserve meaningful messages for user-facing error codes (via IsUserError()). Sanitization to opaque "internal server error (UUID)" only applies to internal/non-user errors. Whether error messages are consistently useful across all code paths is a separate tech debt question.

The standard HTTP Retry-After header tells callers "this failed, but try again after N seconds." That's the missing piece.
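On the caller side, consuming the header could look roughly like the sketch below. It uses only the Python standard library and handles both forms the HTTP specification allows for Retry-After (a number of seconds, or an HTTP date). The function name `retry_after_seconds` and the choice to return `None` when the header is absent or unparseable are assumptions made for illustration, not part of any existing SDK.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    header_value: Optional[str], now: Optional[datetime] = None
) -> Optional[float]:
    """Return the suggested wait in seconds, or None if there is no usable hint.

    Hypothetical helper: the name and the None fallback are illustrative.
    """
    if header_value is None:
        return None
    value = header_value.strip()

    # Form 1: delta-seconds, e.g. "Retry-After: 120"
    if value.isascii() and value.isdigit():
        return float(value)

    # Form 2: HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        # Treat a date without an offset as UTC.
        retry_at = retry_at.replace(tzinfo=timezone.utc)

    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative wait.
    return max(0.0, (retry_at - now).total_seconds())
```

A client would typically fall back to its own backoff policy when this returns `None`, and might cap the returned value so that a misbehaving server cannot stall it indefinitely.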

GitHub App Manifest Installation Flow

Date: 2026-04-07
Status: Draft
Scope: Admin GitHub App setup via manifest flow, post-setup management UI, single-installation model

Problem

The current GitHub App setup requires an admin to manually create a GitHub App on github.com, copy the App ID and PEM private key, paste them into the Valet settings form, and click "Verify" to discover installations. This is high-friction and error-prone — especially for read-only repo access, which should be a two-click operation.

Proposal: Retry-After headers for transient errors + e2e flakiness SLO

Context

Investigation into e2e flakiness on main (original writeup) found that a significant portion of test failures are caused by transient errors (services not ready, RPC providers briefly unreachable, etc.) that get sanitized into opaque "internal server error (UUID)" responses. Callers — both e2e tests and production users — can't distinguish transient from permanent failures, so they can't make informed retry decisions.

After discussion with @zane, @Bijan, @Mohammad, @Mohammed, and @omkar, we aligned on two proposals:

1. A mechanism for services to signal "this is transient, retry" without exposing internal error details
2. An SLO framework for e2e test reliability that automatically detects flaky tests and routes them to the right team

Error sanitization causes opaque 500s, masking transient failures in e2e tests

The issue

Our error sanitization framework (pkg/errors) replaces internal error messages with "internal server error (UUID)" before they reach callers. This is correct for production security, but it has a side effect: when a transient failure occurs (RPC provider briefly unreachable, service not yet warmed up, etc.), the caller gets the same opaque response as a genuine internal bug. Tests — and users — can't distinguish between the two, and can't make informed retry decisions.
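One way to keep the security property while restoring the missing signal is to let sanitization hide the message but preserve the transient/permanent distinction. The sketch below illustrates the idea in Python only; it is not the pkg/errors implementation. `TransientError`, `SanitizedResponse`, `sanitize`, and the use of HTTP 503 with a Retry-After header are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


class TransientError(Exception):
    """Hypothetical marker for failures expected to clear on their own."""

    def __init__(self, message: str, retry_after_s: int):
        super().__init__(message)
        self.retry_after_s = retry_after_s


@dataclass
class SanitizedResponse:
    status: int
    body: str
    headers: Dict[str, str] = field(default_factory=dict)


def sanitize(err: Exception) -> SanitizedResponse:
    """Hide internal details but keep the transient/permanent distinction."""
    # The body stays opaque in both cases: only a correlation ID is exposed.
    body = f"internal server error ({uuid.uuid4()})"
    if isinstance(err, TransientError):
        # Assumed mapping: 503 plus Retry-After tells the caller to retry.
        return SanitizedResponse(503, body, {"Retry-After": str(err.retry_after_s)})
    # Anything else is treated as a genuine internal failure.
    return SanitizedResponse(500, body)
```

With this shape, a call such as `sanitize(TransientError("rpc provider unreachable", 5))` yields a 503 whose body reveals nothing about the RPC provider, while an unclassified exception still produces a plain 500.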

The problem is amplified in the broadcasting path, where there are two independent error classification layers that both need to agree for the real error message to reach the caller:

The RPC client maps EVM errors to gRPC codes (ToGrpcErrorCode)
The broadcaster independently checks if the error is user-attributable ([`isUserBro

	# service.yaml
	service:
	  readiness_probe: /v1/models
	  replicas: 1

	# Fields below describe each replica.
	resources:
	  ports: 8000
	  cpus: 4+
	  accelerators: {A100:1}

	// Define a function to handle the document and return its type
	function discoverDocumentType(document) {
	  // Code to discover the type of document
	  return documentType;
	}

	// Define a function to retrieve pre-built question examples from a database
	function getQuestionExamples(documentType) {
	  if (documentType === "legal contract") {
	    return [

	import docker
	import os
	import subprocess
	import click
	import glob
	import json
	import random
	import re
	import sys
	from pathlib import Path

	#!/usr/bin/env python3

	# script to find common best-tip prefix over a list of nodes using GraphQL query

	import os
	import sys
	import json
	import click
	import subprocess
	import requests