Cached CI for HF proposal: Record‑and‑Bake HTTP Cache

0) Problem statement (today’s flakes)

We’re seeing repeated CI failures in fresh containers when tests make live HTTP calls. For example, today about five reruns failed on:

tests/models/pix2struct/test_image_processing_pix2struct.py::Pix2StructImageProcessingTest::test_expected_patches
tests/models/pix2struct/test_image_processing_pix2struct.py::Pix2StructImageProcessingTest::test_call_vqa

Typical errors:

HTTPError('429 Client Error: Too Many Requests for url: https://huggingface.co/ybelkada/fonts/resolve/main/Arial.TTF')
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x...>
(raised by Image.open(requests.get(..., stream=True).raw) for https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/australia.jpg)

We want fully deterministic, offline CI runs without touching library/test code.

1) Proposed approach (high-level)

Adopt a record → bake → replay workflow:

Record HTTP requests (one-off or on demand): execute the tests once with network access, while transparently recording all HTTP traffic to a local cache.
Bake caches into the Docker image: copy the recorded caches into the image as a layer.
Replay in CI: tests are served from the baked caches, so every recorded URL needs no network call and cannot flake. Only URLs that were never recorded still go to the network (see 3.4).

This covers both kinds of traffic we have:

Hugging Face Hub traffic (fonts, model/dataset files): relies on the hub’s own on-disk cache directories.
Arbitrary requests.get(...) traffic (e.g. direct image URLs): captured and replayed by a global HTTP cache installed via sitecustomize.py.
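
To make the record and replay phases concrete, here is a minimal, self-contained sketch. It is not the proposed implementation (which relies on requests-cache and the Hub cache directories); `ReplayCache`, `fake_fetch`, and the `record` flag are hypothetical names used only for illustration, and the URL is a placeholder.

```python
from typing import Callable, Dict


class ReplayCache:
    """Toy record/replay cache keyed by URL (illustration only)."""

    def __init__(self, fetch: Callable[[str], bytes], record: bool) -> None:
        self._fetch = fetch      # the "live" fetcher, e.g. a network call
        self._record = record    # True while recording, False in replay
        self.store: Dict[str, bytes] = {}

    def get(self, url: str) -> bytes:
        if url in self.store:
            return self.store[url]  # replay: the live fetcher is not called
        if not self._record:
            raise LookupError(f"{url} was never recorded")
        body = self._fetch(url)     # record: fetch once and keep the body
        self.store[url] = body
        return body


calls = []


def fake_fetch(url: str) -> bytes:
    # Stand-in for a real HTTP request so the sketch runs without network.
    calls.append(url)
    return b"payload for " + url.encode()


# Record phase: the first access goes through the live fetcher.
recorder = ReplayCache(fake_fetch, record=True)
recorder.get("https://example.invalid/a.jpg")

# "Bake": hand the recorded store to a replay-only cache.
replayer = ReplayCache(fake_fetch, record=False)
replayer.store = dict(recorder.store)
body = replayer.get("https://example.invalid/a.jpg")
```

The strict replay in this sketch raises on a miss. The setup described in this proposal is more lenient: an unrecorded URL falls through to a live fetch.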

2) Key pieces & exactly what we use

We use both of the following because our tests rely on Hub downloads and direct requests.get(...) calls:

2.1 `sitecustomize.py` + `requests-cache`

Python's site module imports a module named sitecustomize at interpreter startup if one is importable from sys.path (this is skipped when Python is started with -S).
We install a global requests-cache that transparently caches and replays any requests traffic (e.g., direct image URLs used by PIL).
We also force body materialization so stream=True responses get fully cached, avoiding truncated content errors.
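
The auto-import behaviour can be checked in isolation. The sketch below writes a throwaway sitecustomize.py into a temporary directory and starts a child interpreter with that directory on PYTHONPATH; the printed marker strings are made up for the demonstration and have nothing to do with the real file in 3.1.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A throwaway sitecustomize.py that only announces itself.
    with open(os.path.join(tmp, "sitecustomize.py"), "w") as f:
        f.write('print("sitecustomize loaded")\n')

    # Put the directory on the child's import path via PYTHONPATH.
    env = dict(os.environ, PYTHONPATH=tmp)
    result = subprocess.run(
        [sys.executable, "-c", "print('main program')"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    output = result.stdout
```

The marker from sitecustomize.py should appear in `output` before the line printed by the main program, since the import happens during interpreter startup.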

2.2 Hugging Face Hub on-disk caches

We set cache env vars so Hub blobs (fonts, small images, model shards) land in known directories we can bake into the image:
- HF_HOME=/opt/hf_cache
- HF_HUB_CACHE=/opt/hf_cache/hub
- TRANSFORMERS_CACHE=/opt/hf_cache/transformers
- HF_DATASETS_CACHE=/opt/hf_cache/datasets
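
Because the recording step and the image must agree on these paths, a small sanity check can catch drift early. The following is a sketch only: `check_cache_env` and `EXPECTED` are hypothetical names, and the expected values are simply the ones listed above.

```python
from pathlib import PurePosixPath
from typing import List, Mapping

# Variable names and values as listed above.
EXPECTED = {
    "HF_HOME": "/opt/hf_cache",
    "HF_HUB_CACHE": "/opt/hf_cache/hub",
    "TRANSFORMERS_CACHE": "/opt/hf_cache/transformers",
    "HF_DATASETS_CACHE": "/opt/hf_cache/datasets",
}


def check_cache_env(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the env matches."""
    problems = []
    for name, expected in EXPECTED.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set")
        elif PurePosixPath(actual) != PurePosixPath(expected):
            # PurePosixPath comparison ignores a trailing slash.
            problems.append(f"{name}={actual!r}, expected {expected!r}")
    return problems


ok = check_cache_env(EXPECTED)
bad = check_cache_env({"HF_HOME": "/tmp/hf"})
```

In a real run the function would be called with `os.environ`, both when recording and inside the built image.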

3) Implementation sketch

3.1 `sitecustomize.py`

Place this file into the image at site-packages/sitecustomize.py so it auto-loads:

# sitecustomize.py
import os, json, time, atexit
try:
    import requests, requests_cache
except Exception:
    requests = None; requests_cache = None

if requests and requests_cache:
    cache_dir = os.environ.get("HTTP_CACHE_DIR", "/opt/http_cache")
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, "requests_cache")
    log_path = os.path.join(cache_dir, "http_log.ndjson")

    # Never expire; regenerate the cache explicitly when needed
    requests_cache.install_cache(cache_path, backend="sqlite", expire_after=None)

    def _log_and_materialize(r, *_, **__):
        # Ensure streamed responses are fully cached; log basic metadata
        try:
            _ = r.content
        except Exception:
            pass
        try:
            with open(log_path, "a") as f:
                f.write(json.dumps({
                    "ts": time.time(),
                    "method": getattr(r.request, "method", None),
                    "url": r.url,
                    "status": getattr(r, "status_code", None),
                    "from_cache": getattr(r, "from_cache", False),
                }) + "\n")
        except Exception:
            pass

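    try:
        # install_cache() has replaced requests.Session with a cached subclass.
        # Wrap its __init__ so the hook is attached to every session, including
        # the short-lived ones that requests.get() creates internally.
        _orig_init = requests.Session.__init__

        def _init_with_hook(self, *args, **kwargs):
            _orig_init(self, *args, **kwargs)
            self.hooks.setdefault("response", []).append(_log_and_materialize)

        requests.Session.__init__ = _init_with_hook
    except Exception:
        pass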

    @atexit.register
    def _print_log_loc():
        print(f"[http-recorder] Log: {log_path}  Cache: {cache_path}.sqlite")

3.2 Generate the caches (record once, online)

Run the test suite once with network access to populate both the Hub caches and the requests-cache DB. Since slow tests are skipped by default, this will pull only small assets.

# Choose stable cache locations
export HF_HOME=/opt/hf_cache
export HF_HUB_CACHE=/opt/hf_cache/hub
export TRANSFORMERS_CACHE=/opt/hf_cache/transformers
export HF_DATASETS_CACHE=/opt/hf_cache/datasets
export HTTP_CACHE_DIR=/opt/http_cache

# Optional: faster first pull (requires the hf_transfer package to be installed)
export HF_HUB_ENABLE_HF_TRANSFER=1

# Run tests to record everything needed
pytest -q

# Package caches as build artifacts
tar -C /opt -czf hf_cache.tgz hf_cache
tar -C /opt -czf http_cache.tgz http_cache
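
Before packaging, it can be useful to confirm that the caches are as small as expected. The sketch below counts files and bytes under a directory; `dir_stats` is a hypothetical helper, and the demo runs on a temporary tree rather than on the real /opt/hf_cache or /opt/http_cache.

```python
import os
import tempfile
from typing import Tuple


def dir_stats(root: str) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for regular files under root."""
    count = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # The Hub cache links snapshots to blobs; count each blob once.
                continue
            count += 1
            total += os.path.getsize(path)
    return count, total


# Demo on a temporary tree; in practice pass the real cache directories.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "hub"))
    with open(os.path.join(tmp, "hub", "blob"), "wb") as f:
        f.write(b"x" * 1024)
    with open(os.path.join(tmp, "requests_cache.sqlite"), "wb") as f:
        f.write(b"y" * 10)
    stats = dir_stats(tmp)
```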

3.3 Bake caches into the Docker image

Add the recorded caches as layers and include sitecustomize.py so runtime replays from cache automatically.

FROM python:3.11-slim

# Cache locations (must match recording step)
ENV HF_HOME=/opt/hf_cache \
    HF_HUB_CACHE=/opt/hf_cache/hub \
    TRANSFORMERS_CACHE=/opt/hf_cache/transformers \
    HF_DATASETS_CACHE=/opt/hf_cache/datasets \
    HTTP_CACHE_DIR=/opt/http_cache

# Minimal deps used by tests
RUN pip install --no-cache-dir requests requests-cache huggingface_hub pillow

# Auto-loads and patches requests globally
COPY sitecustomize.py /usr/local/lib/python3.11/site-packages/sitecustomize.py

# Bring in recorded caches (created in 3.2)
ADD hf_cache.tgz /opt/
ADD http_cache.tgz /opt/

# Ensure readability for non-root users in CI
RUN chmod -R a+rX /opt/hf_cache /opt/http_cache

WORKDIR /workspace
# COPY your repo here in the real Dockerfile

3.4 CI usage (stay online, rely on cache)

No special flags are required. Keep CI online; the runtime will prefer:

Hub blobs from HF_*_CACHE for models/datasets/fonts/images.
requests-cache for any direct requests.get(...) (e.g., the australia.jpg URL), served from the baked SQLite.

If a new URL appears in tests, it will be fetched live during CI; consider periodically re-running step 3.2 to refresh caches.
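
One way to notice when a refresh is due is to scan the NDJSON log for responses that were not served from cache. This is a sketch under the assumption that the log has the shape written by the hook in 3.1 (fields `url` and `from_cache`); `live_fetches` is a hypothetical helper and the sample entries are made up.

```python
import json
from typing import Iterable, List


def live_fetches(lines: Iterable[str]) -> List[str]:
    """Return distinct URLs not served from cache, in first-seen order."""
    seen: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        if not entry.get("from_cache", False) and entry["url"] not in seen:
            seen.append(entry["url"])
    return seen


# Sample log lines in the shape written by the hook in 3.1 (values made up).
sample = [
    '{"ts": 1.0, "method": "GET", "url": "https://example.invalid/a.jpg", "status": 200, "from_cache": true}',
    '{"ts": 2.0, "method": "GET", "url": "https://example.invalid/new.png", "status": 200, "from_cache": false}',
    '{"ts": 3.0, "method": "GET", "url": "https://example.invalid/new.png", "status": 200, "from_cache": false}',
]
misses = live_fetches(sample)
```

In CI the function could be fed the open log file (for example `$HTTP_CACHE_DIR/http_log.ndjson`); a non-empty result suggests re-running step 3.2.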

4) Notes

This design targets the observed flakes by ensuring that both Hub assets and direct HTTP resources are already cached when CI starts. The network calls that remain are occasional metadata checks and URLs that have not been recorded yet.
Because slow tests are disabled by default, the baked caches should remain small (fonts, tiny images, small model shards).
If flakes reappear because tests start using new URLs, regenerate the caches (3.2) and rebuild the image.

manueldeprada/proposal.md