PacBio Data Bundle Proposal

Minimal JSON format to encode paths and file types
Cross-language model support for Python, C++, and Scala
Testing: can be used in cram tests
Extend the model to download resources in the next iteration

Example

File pbdatabundle.json

filetype must be a valid PacBio File Metatype
path can be supplied relative to the bundle JSON file or as an absolute path
id (the top-level key of each entry) must match [A-Za-z0-9]+
description a terse description of the file
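
The field rules above could be checked with a small helper. This is a minimal sketch, not part of the proposal: the names `validate_entry`, `ID_PATTERN`, and `REQUIRED_KEYS` are hypothetical, the id rule is assumed to mean one or more ASCII letters or digits, and `description` is assumed to be optional. Checking `filetype` against the registry of PacBio File Metatypes is left out because that registry is not described here.

```python
import re

# Assumption: ids are one or more ASCII letters or digits.
ID_PATTERN = re.compile(r"[A-Za-z0-9]+")
# Assumption: only these two keys are mandatory; "description" is optional.
REQUIRED_KEYS = ("filetype", "path")


def validate_entry(entry_id, entry):
    """Return a list of problems found in one bundle entry (empty if none)."""
    problems = []
    if not ID_PATTERN.fullmatch(entry_id):
        problems.append("invalid id: %r" % entry_id)
    for key in REQUIRED_KEYS:
        value = entry.get(key)
        if not isinstance(value, str) or not value:
            problems.append("%s: missing or empty %r" % (entry_id, key))
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a bundle file in a single pass.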

{
  "bam01": {
    "filetype": "PacBio.SubreadFile.SubreadBamFile",
    "path": "subreadsets/01/file-01.bam",
    "description": "Example BAM file from 3.1.0"
  },
  "barcodeset": {
    "filetype": "PacBio.DataSet.BarcodeSet",
    "path": "barcodes/pacbio-official-barcodeset.xml",
    "description": "Official PacBio barcodes"
  },
  "lambda": {
    "filetype": "PacBio.DataSet.ReferenceSet",
    "path": "lambda/path-to-referenset.xml",
    "description": "Lambda"
  },
  "s01": {
    "filetype": "PacBio.DataSet.SubreadSet",
    "path": "subreadsets/01/subreadset-01.xml",
    "description": "Example SubreadSet v3.0.1"
  },
  "a01": {
    "filetype": "PacBio.DataSet.AlignmentSet",
    "path": "alignmentsets/01/consolidated-alignmentset-01.xml",
    "description": "Example output AlignmentSet from SA 3.1.0"
  },
  "abam": {
    "filetype": "PacBio.AlignmentFile.AlignmentBamFile",
    "path": "alignmentsets/01/file.bam",
    "description": "Example BAM"
  }
}
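
Since a path may be relative to the bundle JSON file or absolute, a consumer has to resolve it against the bundle's directory. The following is one possible sketch using only the stdlib; the function name `load_bundle` is hypothetical and not part of the proposal.

```python
import json
import os


def load_bundle(bundle_path):
    """Load a bundle JSON file and return {id: entry} with resolved paths."""
    bundle_dir = os.path.dirname(os.path.abspath(bundle_path))
    with open(bundle_path) as handle:
        entries = json.load(handle)
    resolved = {}
    for entry_id, entry in entries.items():
        entry = dict(entry)  # copy so the parsed JSON is left untouched
        # os.path.join returns an absolute second argument unchanged,
        # so relative and absolute paths are both handled here.
        entry["path"] = os.path.normpath(os.path.join(bundle_dir, entry["path"]))
        resolved[entry_id] = entry
    return resolved
```

Resolving against the bundle file's directory instead of the current working directory keeps the result the same no matter where the caller is run from.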

Thin Python CLI tool to interact with the file

only depends on the Python stdlib
not embedded in a Python package
should be its own repo (?), with multiple bundle JSON files, such as tiny.json and huge.json

$> pb-data get bam01 # Returns the path, by default looks in the current dir for pbdatabundle.json
$> pb-data get-type PacBio.DataSet.ReferenceSet # Returns a list of [id, path]
$> # Explicit path to bundle file
$> pb-data get bam01 --bundle=tiny.json
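
The commands above could be implemented along the following lines. This is a rough stdlib-only sketch under assumptions, not the actual tool: the function names `build_parser` and `run` are hypothetical, `get` prints the path exactly as stored in the bundle, and `get-type` prints its [id, path] pairs as a JSON list.

```python
import argparse
import json
import sys


def build_parser():
    parser = argparse.ArgumentParser(prog="pb-data")
    sub = parser.add_subparsers(dest="command")
    get = sub.add_parser("get", help="print the path for a bundle id")
    get.add_argument("id")
    get_type = sub.add_parser("get-type", help="list [id, path] for a file type")
    get_type.add_argument("filetype")
    for command in (get, get_type):
        # Default: look for pbdatabundle.json in the current directory.
        command.add_argument("--bundle", default="pbdatabundle.json")
    return parser


def run(argv, out=None):
    """Run one command and return a process exit code."""
    out = out or sys.stdout
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.command is None:
        parser.print_help()
        return 2
    with open(args.bundle) as handle:
        entries = json.load(handle)
    if args.command == "get":
        if args.id not in entries:
            sys.stderr.write("unknown id: %s\n" % args.id)
            return 1
        out.write(entries[args.id]["path"] + "\n")
    else:  # get-type
        pairs = [[entry_id, entry["path"]]
                 for entry_id, entry in sorted(entries.items())
                 if entry["filetype"] == args.filetype]
        out.write(json.dumps(pairs) + "\n")
    return 0
```

A script entry point would then call `sys.exit(run(sys.argv[1:]))`.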

Alternatively, jq could be used.

$> cat tiny.json | jq --raw-output '.bam01.path'
subreadsets/01/file-01.bam

Misc

It would be useful to add metadata about the size of each file
For Version 1, I would suggest rsync'ing or scp'ing the files from a common internal directory. Currently, some of the Scala tests use /mnt/secondary/Share/smrtserver-testdata. Version 2 should support a remote URI for each file resource (i.e., add a new key "remote").
This abstraction could be used by the SL "canned" data to import data using pbservice
Depending on the implementation of "remote", the SL job services could emit a data bundle JSON file. This could be used to download all of the job files (or a subset of the desired files) from SL services.
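
As an illustration of the size metadata idea, the sketch below annotates each entry with its file size. The key name "size", the unit (bytes), and the function name `add_sizes` are all assumptions made for this example; the proposal does not define them.

```python
import os


def add_sizes(entries, bundle_dir):
    """Return a copy of the entries with a hypothetical "size" key in bytes."""
    annotated = {}
    for entry_id, entry in entries.items():
        entry = dict(entry)  # copy so the input is left untouched
        full_path = os.path.join(bundle_dir, entry["path"])
        # "size" is None when the file is not present locally.
        if os.path.isfile(full_path):
            entry["size"] = os.path.getsize(full_path)
        else:
            entry["size"] = None
        annotated[entry_id] = entry
    return annotated
```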

mpkocher/DataBundleProposal.md
