@manning-ncsa
Last active January 24, 2025 19:18
DES Y6 BAO data set validator

DES Y6 BAO dataset validator

The Python script validator.py (its dependencies are listed in the pip-compatible requirements file) compares the checksums recorded in the manifest file against those of the data files associated with the Y6 BAO data release, which are stored in the S3 bucket at:

https://ncsa.osn.xsede.org/phy240006-bucket01/despublic/y6a2_files/y6_bao/

These files are browsable via an HTTP proxy at https://desdr-server.ncsa.illinois.edu/despublic/y6a2_files/y6_bao/.
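
The checksums compared by validator.py are S3 ETags. For objects uploaded in a single part, an S3 ETag is commonly the hex MD5 digest of the content, whereas multipart uploads produce a value with a "-N" suffix that is not a plain MD5. Whether the files in this bucket were uploaded in a single part is an assumption, so the sketch below only illustrates how a locally downloaded copy could be checked against a manifest entry. The helper names md5_hex, looks_like_plain_md5, and local_copy_matches are hypothetical and not part of validator.py.

```python
import hashlib
import re


def md5_hex(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_plain_md5(etag):
    """True if the ETag is exactly 32 hex characters (no multipart '-N' suffix)."""
    return re.fullmatch(r"[0-9a-fA-F]{32}", etag.strip('"')) is not None


def local_copy_matches(path, etag):
    """Compare a local file with an ETag.

    Returns None when the ETag is not a plain MD5 (assumed multipart),
    because a simple MD5 comparison would be meaningless in that case.
    """
    if not looks_like_plain_md5(etag):
        return None
    return md5_hex(path) == etag.strip('"').lower()
```

A result of None means the comparison could not be made this way, not that the file is corrupt.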

Run on your host machine

To run the validation script, first download the manifest file (y6_bao_manifest.20250124.json, available from the HTTP proxy above), then execute the script as shown below:

$ python validator.py /path/to/downloaded/y6_bao_manifest.20250124.json
100%|█████████████████████████████████████████████| 2014/2014 [00:04<00:00, 409.76it/s]
Dataset is valid.
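
Judging from validator.py, the manifest is a JSON document with a top-level "objects" list whose entries carry "type", "key", and "etag" fields; any other fields it may contain are unknown here. A minimal sketch for inspecting a downloaded manifest before validating, using a hypothetical helper named summarize_manifest:

```python
import json
from collections import Counter


def summarize_manifest(manifest_path):
    """Count manifest entries by type and list file entries that lack an etag."""
    with open(manifest_path) as fp:
        manifest = json.load(fp)
    objects = manifest["objects"]
    # Only entries with type == 'file' are checked by validator.py
    counts = Counter(obj.get("type") for obj in objects)
    missing = [
        obj.get("key")
        for obj in objects
        if obj.get("type") == "file" and not obj.get("etag")
    ]
    return counts, missing
```

Calling summarize_manifest on the downloaded file returns the per-type counts and the keys of any file entries that could not be validated because they have no recorded ETag.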

Run in Docker

Build the image, start a container, and run the script inside it with the commands below:

$ docker build . -t validator
...
 => exporting to image
 => => exporting layers
...

$ docker run --rm -it validator bash

root@bd85371657f5:/tmp# curl -O https://desdr-server.ncsa.illinois.edu/despublic/y6a2_files/y6_bao/y6_bao_manifest.20250124.json

root@bd85371657f5:/tmp# python validator.py y6_bao_manifest.20250124.json 

100%|███████████████████████████████████████████████████████████| 2014/2014 [00:57<00:00, 35.19it/s]
Dataset is valid.
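
validator.py reads its connection settings from environment variables (S3_ENDPOINT_URL, S3_REGION_NAME, S3_BUCKET, AWS_S3_ACCESS_KEY_ID, AWS_S3_SECRET_ACCESS_KEY), with defaults that point at the public bucket. In a container these could be overridden with docker run -e. The endpoint URL has to be split into a bare host and a TLS flag before it is handed to the Minio client. The sketch below shows one way to do that with the standard library; split_endpoint and endpoint_from_env are hypothetical names, and this is an alternative to, not a copy of, the string handling in validator.py.

```python
import os
from urllib.parse import urlsplit


def split_endpoint(url):
    """Split an endpoint URL into (host[:port], secure) or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("endpoint URL must begin with http:// or https://")
    return parts.netloc, parts.scheme == "https"


def endpoint_from_env(environ=os.environ):
    """Read S3_ENDPOINT_URL, falling back to the default used by validator.py."""
    url = environ.get("S3_ENDPOINT_URL", "https://ncsa.osn.xsede.org")
    return split_endpoint(url)
```

Raising an exception on a malformed URL, rather than printing a message and continuing without a client, makes a misconfiguration visible before any objects are checked.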

Dockerfile

FROM python:3.12
WORKDIR /tmp
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY validator.py .

validator.py

import os
import sys
import json
from tqdm import tqdm
from minio import Minio


class ObjectStore:
    def __init__(self) -> None:
        '''Initialize S3 client'''
        self.config = {
            'endpoint-url': os.getenv("S3_ENDPOINT_URL", "https://ncsa.osn.xsede.org"),
            'region-name': os.getenv("S3_REGION_NAME", ""),
            'aws_access_key_id': os.getenv("AWS_S3_ACCESS_KEY_ID"),
            'aws_secret_access_key': os.getenv("AWS_S3_SECRET_ACCESS_KEY"),
            'bucket': os.getenv("S3_BUCKET", "phy240006-bucket01"),
        }
        self.bucket = self.config['bucket']
        self.client = None
        # If endpoint URL is empty, do not attempt to initialize a client
        if not self.config['endpoint-url']:
            return
        # Minio expects a bare host[:port]; the scheme only selects TLS on or off
        if self.config['endpoint-url'].startswith('http://'):
            secure = False
            endpoint = self.config['endpoint-url'][len('http://'):]
        elif self.config['endpoint-url'].startswith('https://'):
            secure = True
            endpoint = self.config['endpoint-url'][len('https://'):]
        else:
            print('endpoint URL must begin with http:// or https://')
            return
        self.client = Minio(
            endpoint=endpoint,
            access_key=self.config['aws_access_key_id'],
            secret_key=self.config['aws_secret_access_key'],
            region=self.config['region-name'],
            secure=secure,
        )

    def object_info(self, path):
        '''Return the object's metadata (including its ETag), or None on failure'''
        try:
            response = self.client.stat_object(
                bucket_name=self.bucket,
                object_name=path)
            return response
        except Exception as err:
            print(f'''Error fetching object info "{path}": {err}''')
            return None


def validate_dataset_against_manifest(root_path='/', manifest_path='/tmp/y6_bao_manifest.20250124.json'):
    s3 = ObjectStore()
    with open(manifest_path) as fp:
        manifest = json.load(fp)
    root_path = root_path.strip('/')
    mismatches = []
    # Only entries of type 'file' carry a checksum to compare
    for obj_manifest in tqdm([obj for obj in manifest['objects'] if obj['type'] == 'file']):
        path = os.path.join(root_path, obj_manifest['key'])
        obj_dataset = s3.object_info(path=path)
        # object_info returns None on error; record it instead of crashing
        if obj_dataset is None:
            mismatches.append(f'''Could not fetch object info for {obj_manifest['key']}.''')
            continue
        if obj_dataset.etag != obj_manifest['etag']:
            mismatches.append(
                f'''Mismatched checksum values: {obj_dataset.etag} != {obj_manifest['etag']} '''
                f'''for {obj_manifest['key']}.''')
    if mismatches:
        print('Dataset is not valid:')
        print('\n'.join(mismatches))
    else:
        print('Dataset is valid.')


if __name__ == '__main__':
    root_path = '/despublic/y6a2_files/y6_bao'
    # Fall back to the default manifest location when no argument is given
    manifest_path = '/tmp/y6_bao_manifest.20250124.json'
    if len(sys.argv) > 1:
        manifest_path = sys.argv[1]
    validate_dataset_against_manifest(root_path=root_path, manifest_path=manifest_path)
    sys.exit()
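
The comparison loop can be exercised without network access by separating the remote lookup from the comparison itself. The sketch below is not the author's code: find_mismatches and the lookup callable are hypothetical names. Here lookup stands in for a call that returns an object's ETag (for example via ObjectStore.object_info) and is assumed to return None when the object cannot be read.

```python
import posixpath


def find_mismatches(manifest, lookup, root_path=""):
    """Return messages for manifest files whose remote ETag is missing or differs.

    `lookup` is any callable mapping an object path to its ETag string,
    or to None when the object cannot be read.
    """
    problems = []
    root = root_path.strip("/")
    for obj in manifest["objects"]:
        if obj["type"] != "file":
            continue  # only file entries carry a checksum
        path = posixpath.join(root, obj["key"])
        remote_etag = lookup(path)
        if remote_etag is None:
            problems.append(f"Missing or unreadable object: {path}")
        elif remote_etag != obj["etag"]:
            problems.append(
                f"Mismatched checksum values: {remote_etag} != {obj['etag']} "
                f"for {obj['key']}.")
    return problems
```

Passing a dictionary's get method as lookup is enough to test the logic against a hand-written manifest; an empty result list corresponds to the "Dataset is valid." outcome.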