@amotl
Last active December 20, 2020 22:12
Investigate problems when scanning huge directory tree of DWD CDC HTTP server
"""
Investigate problems when scanning the huge directory
tree of the DWD CDC HTTP server.

This repro reveals that fsspec seems to randomly
include a single folder in its results list::

    python fsspec-dwd.py | grep -v zip

The output varies between runs and is a single line each time, e.g.:

https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/historical/1996
https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/historical/2010
https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/historical/2004
https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/historical/1996
https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/historical/2006
https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/historical/2000
https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/historical/1999
"""
import os

from fsspec.implementations.http import HTTPFileSystem


def process(url):
    # Recursively list everything below `url` and print one entry per line.
    fs = HTTPFileSystem()
    files = fs.find(url)
    for name in files:
        print(name)


if __name__ == "__main__":
    large_folder = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/precipitation/historical/"
    # Set DWD_URL to scan a different directory.
    url = os.environ.get("DWD_URL", large_folder)
    process(url)
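
To quantify the behaviour described in the docstring without piping through `grep` by hand, the scan could be repeated and the stray non-ZIP entries tallied. The following is only a sketch: the helper names `unexpected_entries` and `tally_unexpected` are hypothetical and not part of fsspec, and the `find` argument is assumed to be any callable that behaves like `HTTPFileSystem().find`.

```python
from collections import Counter
from typing import Callable, Iterable, List


def unexpected_entries(names: Iterable[str]) -> List[str]:
    """Keep entries without "zip" in their name (same filter as `grep -v zip`)."""
    return [name for name in names if "zip" not in name]


def tally_unexpected(
    find: Callable[[str], Iterable[str]], url: str, runs: int = 5
) -> Counter:
    """Call `find(url)` `runs` times and count how often each stray entry shows up.

    `find` is a hypothetical injection point; passing `HTTPFileSystem().find`
    is an assumption about how one would wire this to the repro above.
    """
    counts: Counter = Counter()
    for _ in range(runs):
        counts.update(unexpected_entries(find(url)))
    return counts
```

With the script above, this might be called as `tally_unexpected(HTTPFileSystem().find, url)`. If the stray folder really is picked at random, the resulting counter should spread over several different year folders rather than concentrate on one.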