- GitLab data repo: consolidate all files under a canonical directory structure and create a tagged release (`v1`)
- `_registry.py`: replace the 65-line hardcoded `registry_urls` dict with `{k: _DATA_REPO_BASE + k for k in registry}`
- `meson.build`: remove the 31 bundled legacy data files from the wheel
- `_fetchers.py`: drop the legacy bundled-file fallback; add a stdlib `urllib` fallback for when pooch isn't installed; improve error messages; clean up the GitHub URL logic
- `git rm`: remove all binary data files from the repo (no history rewrite)
- `tools/download_data.py`: new stdlib-only script for Linux distros to pre-download data and set `SKIMAGE_DATADIR`
- Docs: repackager notice in `README.txt` + release notes; expanded "Adding Data" section in `CONTRIBUTING.md`

The scikit-image repo bundles 54+ data files (15 MB) in src/skimage/data/. Of these, 31 are explicitly packaged into the wheel via meson.build. The remainder are downloaded on demand via pooch from a GitLab CDN (https://gitlab.com/scikit-image/data/). The goal is to stop bundling any data files in the wheel/sdist, serve everything from the GitLab CDN, and provide a standalone download script for Linux distributions that need to package data offline.
Key findings:
- All 64 `data/` files have GitLab CDN URLs in `_registry.py`, but they reference 5 different commit SHAs rather than tagged releases. Some files are also stored under different names or subdirectory paths in the GitLab data repo (e.g., `skin.jpg` → `Normal_Epidermis_and_Dermis_with_Intradermal_Nevus_10x.JPG`).
- 6 orphan files exist in the directory but are not in the registry or referenced anywhere: `block.png`, `simple.fits`, `multi.fits`, `chessboard_GRAY_U8.npz`, `chessboard_RGB_U8.npy`, `chessboard_RGB_U8.npz`
- Data functions are used in production code ONLY in docstrings (not runtime code paths)
- `download_all()` already exists in `_fetchers.py` but requires pooch
- Test data in `tests/skimage/color/data/` (168 KB) and `tests/skimage/registration/data/` (296 KB) is separate and accessed directly; it is out of scope for this change

Current SHA groups in registry_urls:
- `5c090b56...`: 50 files (bulk of standard images/arrays)
- `2cdc5ce8...`: 12 files (large scientific TIFs: brain, cells3d, kidney, lily, mitosis, etc.)
- `b2bc880f...`: 1 file (`palisades_of_vogt.tif`, stored as `in-vivo-cornea-spots.tif`)
- `1e4f62ac...`: 1 file (`eagle.png`)
- `806548e1...`: 1 file (`gray_morph_output.npz`, stored in a subdirectory)

- `src/skimage/data/meson.build`: remove legacy data file installation block
- `src/skimage/data/_fetchers.py`: remove legacy fallback, add urllib fallback, improve errors
- `pyproject.toml`: move pooch from `[data]` optional extra to a clearly documented optional dep
- `tests/skimage/data/test_data.py`: add `fetch()` guards for formerly-bundled dataset tests
- `tools/download_data.py`: new standalone download script

This must happen before removing files from the scikit-image repo.
The current registry_urls references 5 different commit SHAs. We need to consolidate all files under a single tagged release with canonical filenames.
Steps (done manually in the GitLab data repo):
- Gather all data files; they're still in `src/skimage/data/` locally
- Ensure every file exists in the GitLab repo with its canonical scikit-image name at the root level (e.g., `skin.jpg` not `Normal_Epidermis_and_Dermis_with_Intradermal_Nevus_10x.JPG`, `gray_morph_output.npz` not nested in `Tests_besides_Equalize_Otsu/`)
- Create a tag `v1` (or `v0.1`) on the GitLab data repo pointing to a commit that contains all files
- GitLab release URL pattern will be: `https://gitlab.com/scikit-image/data/-/raw/v1/{filename}`

Update _registry.py to eliminate the hardcoded registry_urls dict entirely:

    _DATA_REPO_TAG = "v1"
    _DATA_REPO_BASE = f"https://gitlab.com/scikit-image/data/-/raw/{_DATA_REPO_TAG}/"
    registry_urls = {k: _DATA_REPO_BASE + k for k in registry}

The entire 65-line hardcoded `registry_urls` dict is replaced by a one-liner. The GitLab data repo must mirror the registry key paths exactly (e.g., `data/camera.png`, `restoration/astronaut_rl.npy`, `color/data/lab_array_a_10.npy`). When the data repo gets a new release (`v2`), only `_DATA_REPO_TAG` needs to change.
Files needing canonical name fixes (must match their skimage.data name):
- `data/palisades_of_vogt.tif` (currently `in-vivo-cornea-spots.tif` in GitLab)
- `data/skin.jpg` (currently `Normal_Epidermis_and_Dermis_with_Intradermal_Nevus_10x.JPG`)
- `data/protein_transport.tif` (currently `NPCsingleNucleus.tif`)
- `data/solidification.tif` (currently `nickel_solidification.tif`)
- `data/kidney.tif` (currently `kidney-tissue-fluorescence.tif`)
- `data/lily.tif` (currently `lily-of-the-valley-fluorescence.tif`)
- `data/gray_morph_output.npz` (currently in `Tests_besides_Equalize_Otsu/` subdirectory)
- `data/rank_filters_tests_3d.npz` (currently in `Tests_besides_Equalize_Otsu/add18_entropy/`)
- `data/pivchallenge-B-B001_1.tif` (currently in `pivchallenge/B/` subdirectory)
- `data/pivchallenge-B-B001_2.tif` (currently in `pivchallenge/B/` subdirectory)
- `data/mitosis.tif` (currently `AS_09125_050116030001_D03f00d0.tif`)
- `restoration/astronaut_rl.npy` (path is `astronaut_rl.npy` at root; needs to be in `restoration/` dir or kept as flat name)

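The simple renames in this list could be staged with a short script before tagging. This is only a sketch under assumptions: it presumes a local checkout of the GitLab data repo in which the listed source files sit at the checkout root, `stage_renames` is a hypothetical helper name, and the files that currently live in subdirectories are left out because their exact source paths are not spelled out here.

```python
import shutil
from pathlib import Path

# Current GitLab name -> canonical registry key (subset of the list above).
# Assumption: each source file sits at the root of the data repo checkout.
RENAMES = {
    "in-vivo-cornea-spots.tif": "data/palisades_of_vogt.tif",
    "Normal_Epidermis_and_Dermis_with_Intradermal_Nevus_10x.JPG": "data/skin.jpg",
    "NPCsingleNucleus.tif": "data/protein_transport.tif",
    "nickel_solidification.tif": "data/solidification.tif",
    "kidney-tissue-fluorescence.tif": "data/kidney.tif",
    "lily-of-the-valley-fluorescence.tif": "data/lily.tif",
    "AS_09125_050116030001_D03f00d0.tif": "data/mitosis.tif",
}


def stage_renames(checkout, staging, renames=RENAMES):
    """Copy each source file to its canonical path; return missing sources."""
    checkout, staging = Path(checkout), Path(staging)
    missing = []
    for src_name, key in renames.items():
        src = checkout / src_name
        if not src.is_file():
            missing.append(src_name)
            continue
        dest = staging / key
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy rather than move: the checkout stays intact
    return missing
```

A non-empty return value flags sources that still need to be located by hand before the tag is created.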
Remove the entire second py3.install_sources([...]) block (lines 15–53) that installs the 31 legacy data files. Move README.txt into the first python_sources list (keeps it installed in the wheel — needed by _ensure_cache_dir() which copies it to the local cache).
Add _get_default_cache_dir() helper:

    def _get_default_cache_dir():
        # An explicit SKIMAGE_DATADIR always takes precedence.
        custom = os.environ.get('SKIMAGE_DATADIR')
        if custom:
            return custom
        # Otherwise fall back to $XDG_CACHE_HOME, defaulting to ~/.cache.
        xdg_cache = os.environ.get('XDG_CACHE_HOME', osp.expanduser('~/.cache'))
        return osp.join(xdg_cache, 'scikit-image')

Add `_urllib_fetch(url, dest_path, expected_hash)` helper:
- Downloads with `urllib.request.urlopen(..., timeout=30)`
- Writes to `dest_path + '.part'` then atomically renames via `os.replace()`
- Validates SHA256 hash; deletes `.part` file on mismatch
- Raises `ConnectionError` with descriptive message pointing to `tools/download_data.py` and `SKIMAGE_DATADIR` on failure

Revise _fetch(data_filename, prefix=None):
- Remove "Case 2: file is present in legacy_data_dir" (lines 231–234)
- When the file is not cached and must be downloaded: use pooch if available, else `_urllib_fetch()`
- Both paths call `_skip_pytest_case_requiring_pooch()` on `ConnectionError` (rename this function to `_skip_pytest_case_requiring_network()` for accuracy)
- Add a defensive `KeyError` for unknown filenames and a `KeyError` for files missing from `registry_urls`

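The revised control flow could be sketched roughly as below. To keep the example self-contained, the registry, URL map, cache directory, and both download backends are passed in as arguments; `fetch_path` and its parameters are hypothetical names, the cache check skips hash verification for brevity, and the pytest-skip handling on `ConnectionError` is omitted.

```python
import os


def fetch_path(data_filename, registry, registry_urls, cache_dir,
               pooch_fetcher=None, urllib_fetch=None):
    """Return a local path for `data_filename`, downloading if needed."""
    # Defensive checks: fail early with a clear message.
    if data_filename not in registry:
        raise KeyError(f"Unknown dataset: {data_filename!r}")
    if data_filename not in registry_urls:
        raise KeyError(f"No download URL registered for {data_filename!r}")

    # Already in the local cache (e.g. pre-seeded via SKIMAGE_DATADIR).
    cached = os.path.join(cache_dir, data_filename)
    if os.path.isfile(cached):
        return cached

    # Not cached: prefer pooch when available, else the stdlib fallback.
    if pooch_fetcher is not None:
        return pooch_fetcher.fetch(data_filename)
    return urllib_fetch(registry_urls[data_filename], cached,
                        registry[data_filename])
```

There is no legacy-directory branch left: a file is either in the cache or it is downloaded.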
Error message for failed downloads (shown to users):

    ConnectionError: Failed to download scikit-image dataset 'data/camera.png'.
    To use scikit-image offline, pre-download all datasets:
    Option 1 (with pooch):
    python -c "import skimage.data; skimage.data.download_all('/path/to/cache')"
    Option 2 (no extra deps):
    python tools/download_data.py --dest /path/to/cache
    Then set: export SKIMAGE_DATADIR=/path/to/cache

Update _create_image_fetcher():
- When pooch is absent, return `(None, _get_default_cache_dir())` instead of `(None, _LEGACY_DATA_DIR)`
- Remove the `prefix` parameter entirely; the `prefix='tests'` call at module level is vestigial. Since all 64 `data/` files have explicit `registry_urls` entries pointing to GitLab, the `base_url` fallback is never used. Remove the GitHub URL logic and set `base_url` to the GitLab base URL instead (or omit it and rely entirely on `registry_urls`).
- Module-level call becomes: `_image_fetcher, data_dir = _create_image_fetcher()`

Update download_all():
- Remove the `ModuleNotFoundError` bail-out when pooch is absent
- Use `_urllib_fetch()` as fallback so it works without pooch

Remove module-level names: _LEGACY_DATA_DIR and _DISTRIBUTION_DIR once the legacy Case 2 is gone. Update _ensure_cache_dir() to use osp.join(osp.dirname(__file__), 'README.txt') directly.
CLI interface:

    python tools/download_data.py --dest /path/to/cache                          # download all
    python tools/download_data.py --dest /path/to/cache --jobs 4                 # parallel
    python tools/download_data.py --dest /path/to/cache --check-only             # verify hashes, exit 1 if any fail
    python tools/download_data.py --list                                         # print all files+URLs
    python tools/download_data.py --dest /path/to/cache --file data/camera.png   # single file

Key design:
- Imports `registry` and `registry_urls` from `skimage.data._registry` (pure Python, no C extensions)
- Uses `urllib.request` + `hashlib`, so no extra dependencies
- Parallel downloads via `concurrent.futures.ThreadPoolExecutor` when `--jobs > 1`
- Prints `export SKIMAGE_DATADIR=/path/to/cache` at completion
- Exit codes: 0 = success, 1 = download/hash failure, 2 = usage error

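The design points above could translate into something like the following condensed sketch. It is not the proposed script itself: the helper names `sha256_of` and `ensure_file` are hypothetical, `main` accepts the registry dicts as optional arguments purely so the example can be exercised without scikit-image installed, and a thread pool is used even for a single job to keep the code short.

```python
import argparse
import hashlib
import os
import sys
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def sha256_of(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_file(key, url, expected, dest, check_only=False):
    """Return True if dest/key exists with the expected hash afterwards."""
    path = os.path.join(dest, key)
    if os.path.isfile(path) and sha256_of(path) == expected:
        return True
    if check_only:
        return False
    part = path + ".part"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    try:
        with urllib.request.urlopen(url, timeout=30) as r, open(part, "wb") as f:
            f.write(r.read())
        ok = sha256_of(part) == expected
    except OSError:
        ok = False
    if ok:
        os.replace(part, path)
    elif os.path.exists(part):
        os.remove(part)
    return ok


def main(argv=None, registry=None, registry_urls=None):
    parser = argparse.ArgumentParser(description="Pre-download scikit-image data")
    parser.add_argument("--dest")
    parser.add_argument("--jobs", type=int, default=1)
    parser.add_argument("--check-only", action="store_true")
    parser.add_argument("--list", action="store_true")
    parser.add_argument("--file", action="append")
    args = parser.parse_args(argv)  # argparse exits with code 2 on usage errors

    if registry is None:
        from skimage.data._registry import registry, registry_urls

    keys = args.file or sorted(registry)
    unknown = [k for k in keys if k not in registry]
    if unknown:
        parser.error(f"unknown file(s): {', '.join(unknown)}")
    if args.list:
        for key in keys:
            print(key, registry_urls[key])
        return 0
    if not args.dest:
        parser.error("--dest is required unless --list is given")

    def job(key):
        return ensure_file(key, registry_urls[key], registry[key],
                           args.dest, args.check_only)

    with ThreadPoolExecutor(max_workers=max(args.jobs, 1)) as pool:
        results = list(pool.map(job, keys))
    failed = [k for k, ok in zip(keys, results) if not ok]
    for key in failed:
        print(f"FAILED: {key}", file=sys.stderr)
    if failed:
        return 1
    print(f"export SKIMAGE_DATADIR={args.dest}")
    return 0
```

A real script would end with `sys.exit(main())` under an `if __name__ == "__main__":` guard, so the return value becomes the process exit code.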
Prerequisite: Step 0 (GitLab tagged release) must be complete and _registry.py updated with tag-based URLs before this step.
git rm all binary data files from src/skimage/data/:
- All files currently listed in `meson.build` lines 19–52
- The 6 orphan files not in the registry
- All other binary files in the directory (remaining ~17 files already CDN-only)
- Keep: `__init__.py`, `__init__.pyi`, `_binary_blobs.py`, `_fetchers.py`, `_registry.py`, `README.txt`, `meson.build`

Note: This does NOT rewrite git history (no git filter-repo). The files remain in git history but are removed from the working tree and future commits. A history rewrite is a separate, coordinated action for a later major release.
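To avoid compiling the removal list by hand, the keep-list can be inverted with a few lines of Python. This is a sketch, with `files_to_remove` a hypothetical helper name; it only looks at regular files directly inside the directory and knows nothing about which of them git actually tracks.

```python
from pathlib import Path

# Files that stay in src/skimage/data/ (taken from the "Keep" item above).
KEEP = {
    "__init__.py", "__init__.pyi", "_binary_blobs.py", "_fetchers.py",
    "_registry.py", "README.txt", "meson.build",
}


def files_to_remove(data_dir, keep=KEEP):
    """List regular files directly under `data_dir` that are not in `keep`."""
    return sorted(
        p.name for p in Path(data_dir).iterdir()
        if p.is_file() and p.name not in keep
    )
```

The resulting names can be reviewed and then handed to `git rm`; subdirectories such as `__pycache__` are skipped because only regular files are listed.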
Move/update the pooch dependency note. The [data] extra currently contains pooch>=1.6.0. Update the description to clarify pooch is strongly recommended but optional (urllib fallback exists). Consider renaming [data] extra to [download] to better reflect its purpose.
Tests that call formerly-bundled data functions (e.g., data.camera(), data.coins()) now require a network round-trip on first run. Add a fetch('data/camera.png') call at the top of each such test so it is properly skipped in offline environments (consistent with existing pattern for test_eagle() etc.).
Add a prominent notice to src/skimage/data/README.txt and to the release notes / migration guide explaining the change for downstream repackagers (Linux distros, conda-forge, etc.):

    NOTICE FOR REPACKAGERS
    ======================

    As of scikit-image X.Y, data files are no longer bundled in the wheel or
    source distribution. All sample datasets are now served from the GitLab CDN
    at https://gitlab.com/scikit-image/data.

    To package scikit-image for offline use (e.g., in a Linux distribution):

    1. Download all data files during the package build step:

           python tools/download_data.py --dest /usr/share/scikit-image/data --jobs 4

    2. Set the SKIMAGE_DATADIR environment variable so scikit-image finds them:

           export SKIMAGE_DATADIR=/usr/share/scikit-image/data

       This can be set system-wide (e.g., /etc/profile.d/scikit-image.sh) or
       in the test runner wrapper script.

    3. The download script requires only Python stdlib (no pooch). It verifies
       SHA256 hashes for all files and exits with code 1 on any failure.

Also add a .. deprecated:: / .. versionchanged:: notice in the skimage.data module docstring.
The existing "Adding Data" section (line 883) is brief. Expand it to document the full workflow for contributors adding a new dataset:
New procedure:
1. Add the file to the GitLab data repo (`https://gitlab.com/scikit-image/data`) at a path matching the intended registry key (e.g., `data/my_image.png`)
2. Create a new tagged release in the GitLab data repo (bump `v1` → `v2`, etc.), or note the file will be included in the next data release
3. Update `_DATA_REPO_TAG` in `src/skimage/data/_registry.py` to point to the new tag
4. Add the file's SHA256 hash to the `registry` dict in `_registry.py`
5. Add a loader function in `src/skimage/data/_fetchers.py` following the existing pattern
6. Export the function from `src/skimage/data/__init__.py` and `__init__.pyi`
7. Open a PR to the scikit-image GitHub repo with steps 3–6

Also note:
- `registry_urls` is now computed automatically from `registry`; do not add entries there manually
- To generate the SHA256 hash: `python -c "import skimage.data; print(skimage.data.file_hash('path/to/file'))"`
- Data files must never be committed to the scikit-image GitHub repo

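For contributors without a working scikit-image install, the same hash can be computed with the standard library alone. The sketch below is an equivalent under that assumption; `sha256_hex` and `registry_line` are hypothetical helper names, and the output format simply mimics a dict entry ready to paste into `registry`.

```python
import hashlib


def sha256_hex(path, chunk_size=1 << 16):
    """Return the SHA256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large images do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def registry_line(key, path):
    """Format a `registry` dict entry for `key` from the file at `path`."""
    return f'    "{key}": "{sha256_hex(path)}",'
```

For example, `print(registry_line("data/my_image.png", "my_image.png"))` emits one line to add to `_registry.py`.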
- `python -m pip install -e ".[optional]" --no-build-isolation` to install without bundled data
- Confirm `src/skimage/data/` has no binary files: `ls src/skimage/data/*.png` should fail
- Confirm `skimage.data.camera()` downloads from CDN on first call and caches
- Confirm `SKIMAGE_DATADIR=/tmp/mydata python -c "import skimage.data; skimage.data.camera()"` works after pre-seeding `/tmp/mydata`
- Confirm `python tools/download_data.py --dest /tmp/mydata --check-only` exits 0 after download
- Run `python -m pytest tests/skimage/data/ -v`; all tests should pass (with network) or skip (without)
- Build a wheel with `python -m build --wheel` and confirm no `.png`/`.tif` etc. files inside it

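The last check can be automated, since a wheel is an ordinary zip archive. The following is a minimal sketch: `bundled_data_files` is a hypothetical helper, and both the suffix set and the `skimage/data/` prefix are assumptions about what counts as a data file for this purpose.

```python
import zipfile
from pathlib import PurePosixPath

# Assumed set of data-file extensions; extend as needed.
DATA_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".npy", ".npz",
                 ".fits", ".txt.gz", ".xml"} - {".txt.gz"}


def bundled_data_files(wheel_path, prefix="skimage/data/", suffixes=DATA_SUFFIXES):
    """Return archive members under `prefix` whose suffix looks like data."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name for name in wheel.namelist()
            if name.startswith(prefix)
            and PurePosixPath(name).suffix.lower() in suffixes
        )
```

An empty list for the built wheel means none of the matched file types ended up in `skimage/data/`; `README.txt` is expected to remain and is not matched by the suffix set.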
Thanks, this looks good. Not sure we need a fallback to pooch: it's a Python-only dependency. If we build a fallback, do we even need pooch 🤷
This does not address the best mechanism to host the files on GitLab, to ensure that we serve off of a static CDN.
We should also investigate whether using git-lfs is useful for people updating the repo. Otherwise cloning becomes a nightmare.
Filtering: I see there is a decision not to filter, but not sure I am totally opposed to filtering the history + using git-lfs to bring down skimage repo size. Maybe this will break too many things out there.