- minimal JSON format to encode paths and file types
- cross-language model support for python, c++, and scala
- Testing: can be used in cram tests
- Extend model to download resources in next iteration
File pbdatabundle.json
- filetype: must be a valid PacBio File Metatype
- path: can be supplied relative to the bundle JSON file or as an absolute path
- id: valid format must match [A-z0-9]
- description: a terse description of the file

{
  "bam01": {
    "filetype": "PacBio.SubreadFile.SubreadBamFile",
    "path": "subreadsets/01/file-01.bam",
    "description": "Example BAM file from 3.1.0"
  },
  "barcodeset": {
    "filetype": "PacBio.DataSet.BarcodeSet",
    "path": "barcodes/pacbio-official-barcodeset.xml",
    "description": "Official PacBio barcodes"
  },
  "lambda": {
    "filetype": "PacBio.DataSet.ReferenceSet",
    "path": "lambda/path-to-referenset.xml",
    "description": "Lambda"
  },
  "s01": {
    "filetype": "PacBio.DataSet.SubreadSet",
    "path": "subreadsets/01/subreadset-01.xml",
    "description": "Example SubreadSet v3.0.1"
  },
  "a01": {
    "filetype": "PacBio.DataSet.AlignmentSet",
    "path": "alignmentsets/01/consolidated-alignmentset-01.xml",
    "description": "Example output AlignmentSet from SA 3.1.0"
  },
  "abam": {
    "filetype": "PacBio.AlignmentFile.AlignmentBamFile",
    "path": "alignmentsets/01/file.bam",
    "description": "Example BAM"
  }
}
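
The format is small enough to be read with the Python standard library alone. The following is a minimal sketch of such a loader, not an existing implementation: the names `load_bundle`, `ID_PATTERN` and `REQUIRED_KEYS` are hypothetical, the id check assumes that `[A-z0-9]` is meant as "one or more ASCII letters or digits", and treating all three keys as mandatory is also an assumption.

```python
import json
import os
import re

# Assumption: ids consist of one or more ASCII letters or digits.
ID_PATTERN = re.compile(r"^[A-Za-z0-9]+$")
# Assumption: every entry must carry all three keys shown in the example.
REQUIRED_KEYS = ("filetype", "path", "description")


def load_bundle(bundle_path):
    """Load a bundle JSON file and return {id: entry} with resolved paths."""
    bundle_dir = os.path.dirname(os.path.abspath(bundle_path))
    with open(bundle_path) as handle:
        raw = json.load(handle)

    bundle = {}
    for file_id, entry in raw.items():
        if not ID_PATTERN.match(file_id):
            raise ValueError("invalid id: {!r}".format(file_id))
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError("{}: missing keys {}".format(file_id, missing))
        resolved = dict(entry)
        # Relative paths are resolved against the directory of the bundle
        # file; os.path.join leaves absolute paths unchanged.
        resolved["path"] = os.path.normpath(
            os.path.join(bundle_dir, entry["path"]))
        bundle[file_id] = resolved
    return bundle
```

With the example above saved as tiny.json, `load_bundle("tiny.json")["bam01"]["path"]` would give the absolute location of subreadsets/01/file-01.bam next to the bundle file.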
- only depends on the python stdlib
- not embedded in a python package
- should be its own repo (?), with multiple bundle JSON files, such as tiny.json, huge.json
$> pb-data get bam01 # Returns the path, by default looks in the current dir for pbdatabundle.json
$> pb-data get-type PacBio.DataSet.ReferenceSet # Returns a list of [id, path]
$> # Explicit path to bundle file
$> pb-data get bam01 --bundle=tiny.json
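
The commands above could be covered by a single stdlib-only script. The following is a rough sketch under stated assumptions rather than the actual tool: the function names `build_parser`, `run` and `main` are hypothetical, and it assumes that `get` prints the path resolved against the bundle file's directory and that `get-type` prints its [id, path] pairs as a JSON list.

```python
import argparse
import json
import os
import sys


def build_parser():
    parser = argparse.ArgumentParser(prog="pb-data")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True

    get = subparsers.add_parser("get", help="print the path for a file id")
    get.add_argument("file_id")

    get_type = subparsers.add_parser(
        "get-type", help="list [id, path] pairs for a file type")
    get_type.add_argument("filetype")

    # By default the bundle is looked up in the current directory.
    for sub in (get, get_type):
        sub.add_argument("--bundle", default="pbdatabundle.json")
    return parser


def run(argv):
    """Return a path (get) or a list of [id, path] pairs (get-type)."""
    args = build_parser().parse_args(argv)
    with open(args.bundle) as handle:
        bundle = json.load(handle)
    bundle_dir = os.path.dirname(os.path.abspath(args.bundle))

    def resolve(entry):
        # Assumption: relative paths are reported relative to the bundle file.
        return os.path.join(bundle_dir, entry["path"])

    if args.command == "get":
        return resolve(bundle[args.file_id])
    return [[file_id, resolve(entry)]
            for file_id, entry in sorted(bundle.items())
            if entry["filetype"] == args.filetype]


def main(argv=None):
    result = run(sys.argv[1:] if argv is None else argv)
    print(json.dumps(result) if isinstance(result, list) else result)
```

Calling `main(["get", "bam01", "--bundle=tiny.json"])` mirrors the last command above; wiring `main` to a console entry point is left out of the sketch.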
Alternatively, jq could be used.
$> cat tiny.json | jq --raw-output '.bam01.path'
subreadsets/01/file-01.bam
- It would be useful to add metadata about the size
- For Version 1, I would suggest rsync'ing or scp'ing the files from a common internal directory. Currently, some of the scala tests are using /mnt/secondary/Share/smrtserver-testdata. Version 2 should support a remote URI for each file resource (i.e., add a new key "remote").
- This abstraction could be used by the SL "canned" data to import data using pbservice
- Depending on the implementation of "remote", the SL job services could emit a data bundle.json file. This could be used to download all the job files (or a subset of desired files) from SL services.
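
To make the Version 2 idea more concrete, here is one possible shape for it, sketched with the Python standard library only. Everything in it is an assumption: the function name `fetch_entry` is hypothetical, "remote" is taken to hold a URI that `urllib` can open, and "size" is taken to be an optional byte count used to check the download.

```python
import os
import shutil
import urllib.request


def fetch_entry(entry, dest_dir):
    """Download entry["remote"] to dest_dir/entry["path"]; return the local path."""
    if os.path.isabs(entry["path"]):
        raise ValueError("remote entries are assumed to use relative paths")
    local_path = os.path.join(dest_dir, entry["path"])
    parent = os.path.dirname(local_path)
    if parent:
        os.makedirs(parent, exist_ok=True)

    # Stream the remote resource to disk without loading it into memory.
    with urllib.request.urlopen(entry["remote"]) as response, \
            open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)

    # Hypothetical "size" key: verify the byte count when it is present.
    expected = entry.get("size")
    if expected is not None and os.path.getsize(local_path) != expected:
        raise IOError("size mismatch for {}".format(local_path))
    return local_path
```

Looping `fetch_entry` over every entry of a bundle, or over a chosen subset of ids, would correspond to downloading all or some of the job files.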