@blink1073
Last active May 22, 2026 22:35

Plan: Remove data files from scikit-image repo and serve from CDN

Plan: Remove Data Files from Repo and Serve from CDN

TL;DR

  1. GitLab data repo — consolidate all files under a canonical directory structure and create a tagged release (v1)
  2. _registry.py — replace the 65-line hardcoded registry_urls dict with {k: _DATA_REPO_BASE + k for k in registry}
  3. meson.build — remove the 31 bundled legacy data files from the wheel
  4. _fetchers.py — drop the legacy bundled-file fallback; add a stdlib urllib fallback for when pooch isn't installed; improve error messages; clean up the GitHub URL logic
  5. git rm — remove all binary data files from the repo (no history rewrite)
  6. tools/download_data.py — new stdlib-only script for Linux distros to pre-download data and set SKIMAGE_DATADIR
  7. Docs — repackager notice in README.txt + release notes; expanded "Adding Data" section in CONTRIBUTING.md

Context

The scikit-image repo bundles 54+ data files (15 MB) in src/skimage/data/. Of these, 31 are explicitly packaged into the wheel via meson.build. The remainder are downloaded on demand via pooch from a GitLab CDN (https://gitlab.com/scikit-image/data/). The goal is to stop bundling any data files in the wheel/sdist, serve everything from the GitLab CDN, and provide a standalone download script for Linux distributions that need to package data offline.

Key findings:

  • All 64 data/ files have GitLab CDN URLs in _registry.py, but they reference 5 different commit SHAs rather than tagged releases. Some files are also stored under different names or subdirectory paths in the GitLab data repo (e.g., skin.jpg is stored as Normal_Epidermis_and_Dermis_with_Intradermal_Nevus_10x.JPG).
  • 6 orphan files exist in the directory but are not in the registry or referenced anywhere: block.png, simple.fits, multi.fits, chessboard_GRAY_U8.npz, chessboard_RGB_U8.npy, chessboard_RGB_U8.npz
  • Data functions are used in production code ONLY in docstrings (not runtime code paths)
  • download_all() already exists in _fetchers.py but requires pooch
  • Test data in tests/skimage/color/data/ (168 KB) and tests/skimage/registration/data/ (296 KB) is separate and accessed directly — out of scope for this change

Current SHA groups in registry_urls:

  • 5c090b56... — 50 files (bulk of standard images/arrays)
  • 2cdc5ce8... — 12 files (large scientific TIFs: brain, cells3d, kidney, lily, mitosis, etc.)
  • b2bc880f... — 1 file (palisades_of_vogt.tif, stored as in-vivo-cornea-spots.tif)
  • 1e4f62ac... — 1 file (eagle.png)
  • 806548e1... — 1 file (gray_morph_output.npz, stored in a subdirectory)
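
A grouping like the one above can be reproduced with a few lines of stdlib Python. This is a minimal sketch, assuming the URLs follow GitLab's `/-/raw/<ref>/<path>` layout; the helper name `count_by_ref` and the sample refs are hypothetical placeholders, not the real SHAs.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_ref(urls):
    """Count URLs per Git ref, assuming '.../-/raw/<ref>/<path>' URLs."""
    counts = Counter()
    for url in urls.values():
        # The ref (commit SHA or tag) is the first path segment after '/-/raw/'.
        _, marker, tail = urlparse(url).path.partition("/-/raw/")
        ref = tail.split("/", 1)[0] if marker else "<no ref>"
        counts[ref] += 1
    return counts


# Hypothetical sample; the refs below are placeholders, not real commit SHAs.
sample_urls = {
    "data/camera.png": "https://gitlab.com/scikit-image/data/-/raw/aaaaaaaa/camera.png",
    "data/coins.png": "https://gitlab.com/scikit-image/data/-/raw/aaaaaaaa/coins.png",
    "data/eagle.png": "https://gitlab.com/scikit-image/data/-/raw/bbbbbbbb/eagle.png",
}

for ref, n in count_by_ref(sample_urls).most_common():
    print(f"{ref}: {n} file(s)")
```

Run against the real registry_urls, the same helper would be expected to report a single ref once the tagged-release migration is complete.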

Files to Modify

  • src/skimage/data/meson.build — remove legacy data file installation block
  • src/skimage/data/_fetchers.py — remove legacy fallback, add urllib fallback, improve errors
  • pyproject.toml — move pooch from [data] optional extra to a clearly documented optional dep
  • tests/skimage/data/test_data.py — add fetch() guards for formerly-bundled dataset tests
  • tools/download_data.py — new standalone download script

Implementation Steps

0. Create a tagged release in the GitLab data repo (prerequisite)

This must happen before removing files from the scikit-image repo.

The current registry_urls references 5 different commit SHAs. We need to consolidate all files under a single tagged release with canonical filenames.

Steps (done manually in the GitLab data repo):

  1. Gather all data files — they're still in src/skimage/data/ locally
  2. Ensure every file exists in the GitLab repo with its canonical scikit-image name at the root level (e.g., skin.jpg not Normal_Epidermis_and_Dermis_with_Intradermal_Nevus_10x.JPG, gray_morph_output.npz not nested in Tests_besides_Equalize_Otsu/)
  3. Create a tag v1 (or v0.1) on the GitLab data repo pointing to a commit that contains all files
  4. GitLab release URL pattern will be: https://gitlab.com/scikit-image/data/-/raw/v1/{filename}

Update _registry.py to eliminate the hardcoded registry_urls dict entirely:

_DATA_REPO_TAG = "v1"
_DATA_REPO_BASE = f"https://gitlab.com/scikit-image/data/-/raw/{_DATA_REPO_TAG}/"

registry_urls = {k: _DATA_REPO_BASE + k for k in registry}

The entire 65-line hardcoded registry_urls dict is replaced by a one-liner. The GitLab data repo must mirror the registry key paths exactly (e.g., data/camera.png, restoration/astronaut_rl.npy, color/data/lab_array_a_10.npy). When the data repo gets a new release (v2), only _DATA_REPO_TAG needs to change.
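
Because registry keys double as URL paths under this scheme, a malformed key would silently produce a broken URL. The sketch below shows one way the derivation could be guarded; `build_registry_urls` and the two-entry registry are hypothetical, and the hash values are placeholders.

```python
_DATA_REPO_TAG = "v1"
_DATA_REPO_BASE = f"https://gitlab.com/scikit-image/data/-/raw/{_DATA_REPO_TAG}/"

# Hypothetical two-entry registry; the hash values are placeholders.
registry = {
    "data/camera.png": "0" * 64,
    "restoration/astronaut_rl.npy": "1" * 64,
}


def build_registry_urls(registry, base=_DATA_REPO_BASE):
    """Derive one URL per registry key, rejecting keys that would break the URL."""
    for key in registry:
        # Keys are appended to the base URL verbatim, so they must be
        # relative, '/'-separated paths without parent-directory segments.
        if key.startswith("/") or "\\" in key or ".." in key.split("/"):
            raise ValueError(f"registry key is not a clean relative path: {key!r}")
    return {key: base + key for key in registry}


registry_urls = build_registry_urls(registry)
print(registry_urls["restoration/astronaut_rl.npy"])
# → https://gitlab.com/scikit-image/data/-/raw/v1/restoration/astronaut_rl.npy
```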

Files needing canonical name fixes (must match their skimage.data name):

  • data/palisades_of_vogt.tif (currently in-vivo-cornea-spots.tif in GitLab)
  • data/skin.jpg (currently Normal_Epidermis_and_Dermis_with_Intradermal_Nevus_10x.JPG)
  • data/protein_transport.tif (currently NPCsingleNucleus.tif)
  • data/solidification.tif (currently nickel_solidification.tif)
  • data/kidney.tif (currently kidney-tissue-fluorescence.tif)
  • data/lily.tif (currently lily-of-the-valley-fluorescence.tif)
  • data/gray_morph_output.npz (currently in Tests_besides_Equalize_Otsu/ subdirectory)
  • data/rank_filters_tests_3d.npz (currently in Tests_besides_Equalize_Otsu/add18_entropy/)
  • data/pivchallenge-B-B001_1.tif (currently in pivchallenge/B/ subdirectory)
  • data/pivchallenge-B-B001_2.tif (currently in pivchallenge/B/ subdirectory)
  • data/mitosis.tif (currently AS_09125_050116030001_D03f00d0.tif)
  • restoration/astronaut_rl.npy (path is astronaut_rl.npy at root — needs to be in restoration/ dir or kept as flat name)

1. Update meson.build

Remove the entire second py3.install_sources([...]) block (lines 15–53) that installs the 31 legacy data files. Move README.txt into the first python_sources list (keeps it installed in the wheel — needed by _ensure_cache_dir() which copies it to the local cache).

2. Update _fetchers.py

Add _get_default_cache_dir() helper:

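import os
import os.path as osp

def _get_default_cache_dir():
    # An explicit SKIMAGE_DATADIR always wins over the platform default.
    custom = os.environ.get('SKIMAGE_DATADIR')
    if custom:
        return custom
    # Otherwise fall back to the XDG cache location (~/.cache by default).
    xdg_cache = os.environ.get('XDG_CACHE_HOME', osp.expanduser('~/.cache'))
    return osp.join(xdg_cache, 'scikit-image')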

Add _urllib_fetch(url, dest_path, expected_hash) helper:

  • Downloads with urllib.request.urlopen(..., timeout=30)
  • Writes to dest_path + '.part' then atomically renames via os.replace()
  • Validates SHA256 hash; deletes .part file on mismatch
  • Raises ConnectionError with descriptive message pointing to tools/download_data.py and SKIMAGE_DATADIR on failure

Revise _fetch(data_filename, prefix=None):

  • Remove "Case 2: file is present in legacy_data_dir" (lines 231–234)
  • When not cached and need to download: use pooch if available, else _urllib_fetch()
  • Both paths call _skip_pytest_case_requiring_pooch() on ConnectionError (rename this function to _skip_pytest_case_requiring_network() for accuracy)
  • Raise a descriptive KeyError for filenames that are not in registry, and a separate KeyError for files that are in registry but missing from registry_urls

Error message for failed downloads (shown to users):

ConnectionError: Failed to download scikit-image dataset 'data/camera.png'.

To use scikit-image offline, pre-download all datasets:
  Option 1 (with pooch):
    python -c "import skimage.data; skimage.data.download_all('/path/to/cache')"
  Option 2 (no extra deps):
    python tools/download_data.py --dest /path/to/cache

Then set: export SKIMAGE_DATADIR=/path/to/cache

Update _create_image_fetcher():

  • When pooch is absent, return (None, _get_default_cache_dir()) instead of (None, _LEGACY_DATA_DIR)
  • Remove the prefix parameter entirely — the prefix='tests' call at module level is vestigial. Since all 64 data/ files have explicit registry_urls entries pointing to GitLab, the base_url fallback is never used. Remove the GitHub URL logic and set base_url to the GitLab base URL instead (or omit it and rely entirely on registry_urls).
  • Module-level call becomes: _image_fetcher, data_dir = _create_image_fetcher()

Update download_all():

  • Remove the ModuleNotFoundError bail-out when pooch is absent
  • Use _urllib_fetch() as fallback so it works without pooch

Remove module-level names: _LEGACY_DATA_DIR and _DISTRIBUTION_DIR once the legacy Case 2 is gone. Update _ensure_cache_dir() to use osp.join(osp.dirname(__file__), 'README.txt') directly.

3. Create tools/download_data.py (stdlib-only, no pooch)

CLI interface:

python tools/download_data.py --dest /path/to/cache       # download all
python tools/download_data.py --dest /path/to/cache --jobs 4  # parallel
python tools/download_data.py --dest /path/to/cache --check-only  # verify hashes, exit 1 if any fail
python tools/download_data.py --list                       # print all files+URLs
python tools/download_data.py --dest /path/to/cache --file data/camera.png  # single file

Key design:

  • Imports registry and registry_urls from skimage.data._registry (pure Python, no C extensions)
  • Uses urllib.request + hashlib — no extra dependencies
  • Parallel downloads via concurrent.futures.ThreadPoolExecutor when --jobs > 1
  • Prints export SKIMAGE_DATADIR=/path/to/cache at completion
  • Exit codes: 0 = success, 1 = download/hash failure, 2 = usage error
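
The CLI surface and exit codes described above might be wired together along these lines. This is a hedged skeleton, not the script itself: the registry, URL map, and `fetch` callable are passed in as parameters so the sketch stays self-contained, `--file` is made repeatable as an assumption, and the on-disk layout (`<dest>/<registry key>`) is also an assumption.

```python
import argparse
import hashlib
import os
import sys
from concurrent.futures import ThreadPoolExecutor


def build_parser():
    parser = argparse.ArgumentParser(description="Pre-download scikit-image datasets.")
    parser.add_argument("--dest", help="target cache directory")
    parser.add_argument("--jobs", type=int, default=1, help="parallel downloads")
    parser.add_argument("--check-only", action="store_true", help="verify hashes only")
    parser.add_argument("--list", action="store_true", help="print files and URLs")
    parser.add_argument("--file", action="append", help="registry key to process")
    return parser


def sha256_matches(path, expected):
    """Return True if path exists and its SHA256 hex digest equals expected."""
    if not os.path.isfile(path):
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


def main(argv, registry, registry_urls, fetch):
    """Return 0 on success or 1 on failure; usage errors exit with code 2."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.list:
        for key in sorted(registry):
            print(key, registry_urls[key])
        return 0
    # parser.error() prints usage and raises SystemExit(2).
    if not args.dest:
        parser.error("--dest is required unless --list is given")
    if args.jobs < 1:
        parser.error("--jobs must be at least 1")
    keys = args.file or sorted(registry)
    unknown = [k for k in keys if k not in registry]
    if unknown:
        parser.error("unknown file(s): " + ", ".join(unknown))

    def process(key):
        path = os.path.join(args.dest, *key.split("/"))
        if sha256_matches(path, registry[key]):
            return True  # already present and valid
        if args.check_only:
            return False
        try:
            fetch(registry_urls[key], path, registry[key])
        except Exception as exc:  # report and keep going with other files
            print(f"FAILED {key}: {exc}", file=sys.stderr)
            return False
        return True

    with ThreadPoolExecutor(max_workers=args.jobs) as pool:
        results = list(pool.map(process, keys))
    if not all(results):
        return 1
    print(f"export SKIMAGE_DATADIR={args.dest}")
    return 0
```

In the real script, the entry point would presumably call something like `sys.exit(main(sys.argv[1:], registry, registry_urls, fetch))` with the names imported from skimage.data._registry and a urllib-based fetch function. Relying on argparse for usage errors gives exit code 2 without extra handling.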

4. Remove data files from git

Prerequisite: Step 0 (GitLab tagged release) must be complete and _registry.py updated with tag-based URLs before this step.

git rm all binary data files from src/skimage/data/:

  • All files currently listed in meson.build lines 19–52
  • The 6 orphan files not in the registry
  • All other binary files in the directory (remaining ~17 files already CDN-only)
  • Keep: __init__.py, __init__.pyi, _binary_blobs.py, _fetchers.py, _registry.py, README.txt, meson.build

Note: This does NOT rewrite git history (no git filter-repo). The files remain in git history but are removed from the working tree and future commits. A history rewrite is a separate, coordinated action for a later major release.

5. Update pyproject.toml

Move/update the pooch dependency note. The [data] extra currently contains pooch>=1.6.0. Update the description to clarify pooch is strongly recommended but optional (urllib fallback exists). Consider renaming [data] extra to [download] to better reflect its purpose.

6. Update tests/skimage/data/test_data.py

Tests that call formerly-bundled data functions (e.g., data.camera(), data.coins()) now require a network round-trip on first run. Add a fetch('data/camera.png') call at the top of each such test so it is properly skipped in offline environments (consistent with existing pattern for test_eagle() etc.).

7. Add repackager notice

Add a prominent notice to src/skimage/data/README.txt and to the release notes / migration guide explaining the change for downstream repackagers (Linux distros, conda-forge, etc.):

NOTICE FOR REPACKAGERS
======================
As of scikit-image X.Y, data files are no longer bundled in the wheel or
source distribution. All sample datasets are now served from the GitLab CDN
at https://gitlab.com/scikit-image/data.

To package scikit-image for offline use (e.g., in a Linux distribution):

1. Download all data files during the package build step:

       python tools/download_data.py --dest /usr/share/scikit-image/data --jobs 4

2. Set the SKIMAGE_DATADIR environment variable so scikit-image finds them:

       export SKIMAGE_DATADIR=/usr/share/scikit-image/data

   This can be set system-wide (e.g., /etc/profile.d/scikit-image.sh) or
   in the test runner wrapper script.

3. The download script requires only Python stdlib (no pooch). It verifies
   SHA256 hashes for all files and exits with code 1 on any failure.

Also add a .. deprecated:: / .. versionchanged:: notice in the skimage.data module docstring.

8. Update CONTRIBUTING.md — "Adding Data" section

The existing "Adding Data" section (line 883) is brief. Expand it to document the full workflow for contributors adding a new dataset:

New procedure:

  1. Add the file to the GitLab data repo (https://gitlab.com/scikit-image/data) at a path matching the intended registry key (e.g., data/my_image.png)
  2. Create a new tagged release in the GitLab data repo (bump v1 to v2, etc.), or note that the file will be included in the next data release
  3. Update _DATA_REPO_TAG in src/skimage/data/_registry.py to point to the new tag
  4. Add the file's SHA256 hash to the registry dict in _registry.py
  5. Add a loader function in src/skimage/data/_fetchers.py following the existing pattern
  6. Export the function from src/skimage/data/__init__.py and __init__.pyi
  7. Open a PR to the scikit-image GitHub repo with steps 3–6

Also note:

  • registry_urls is now computed automatically from registry — do not add entries there manually
  • To generate the SHA256 hash: python -c "import skimage.data; print(skimage.data.file_hash('path/to/file'))"
  • Data files must never be committed to the scikit-image GitHub repo
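
For contributors who prefer not to import scikit-image just to hash a file, the digest can also be computed with the standard library alone. This is a small sketch; `sha256_of_file` and `registry_line` are hypothetical helper names, and the output format simply mimics a Python dict entry.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 16):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def registry_line(key, path):
    """Format a dict entry suitable for pasting into the registry."""
    return f'    "{key}": "{sha256_of_file(path)}",'
```

Calling `registry_line("data/my_image.png", "path/to/my_image.png")` would produce one line ready to paste into the registry dict.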

Verification

  1. python -m pip install -e ".[optional]" --no-build-isolation to install without bundled data
  2. Confirm src/skimage/data/ has no binary files: ls src/skimage/data/*.png should fail
  3. Confirm skimage.data.camera() downloads from CDN on first call and caches
  4. Confirm SKIMAGE_DATADIR=/tmp/mydata python -c "import skimage.data; skimage.data.camera()" works after pre-seeding /tmp/mydata
  5. Confirm python tools/download_data.py --dest /tmp/mydata --check-only exits 0 after download
  6. Run python -m pytest tests/skimage/data/ -v — all tests should pass (with network) or skip (without)
  7. Build a wheel with python -m build --wheel and confirm no .png/.tif etc. files inside it
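
The wheel check in the last step can be automated, since a wheel is a zip archive. The sketch below lists any archive members with data-like extensions; the helper name `bundled_data_files` and the suffix list are assumptions and may need adjusting if the wheel legitimately ships other files with these extensions.

```python
import zipfile

# Assumed set of extensions that indicate a bundled data file.
DATA_SUFFIXES = (".png", ".jpg", ".jpeg", ".tif", ".tiff", ".gif", ".bmp",
                 ".npy", ".npz", ".fits")


def bundled_data_files(wheel_path, suffixes=DATA_SUFFIXES):
    """Return the sorted archive members whose names end in a data extension."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # str.endswith accepts a tuple, so one call covers every suffix.
        return sorted(
            name for name in wheel.namelist() if name.lower().endswith(suffixes)
        )
```

An empty list from `bundled_data_files("dist/<wheel name>.whl")` would indicate that no data files slipped into the build.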
@stefanv

stefanv commented May 22, 2026

Thanks, this looks good. Not sure we need a fallback to pooch: it's a Python-only dependency. If we build a fallback, do we even need pooch 🤷

This does not address the best mechanism to host the files on GitLab, to ensure that we serve off of a static CDN.

We should also investigate whether using git-lfs is useful for people updating the repo. Otherwise cloning becomes a nightmare.

Filtering: I see there is a decision not to filter, but not sure I am totally opposed to filtering the history + using git-lfs to bring down skimage repo size. Maybe this will break too many things out there.
