@mpkocher
Last active May 31, 2016 19:36
PacBio Data Bundle Proposal

  • minimal JSON format to encode paths and file types
  • cross-language model support for Python, C++, and Scala
  • Testing: can be used in cram tests
  • Extend model to download resources in next iteration

Example

File pbdatabundle.json

  • filetype must be a valid PacBio File Metatype
  • path can be either relative to the bundle JSON file or absolute
  • id (the key of each entry) must match [A-Za-z0-9]+
  • description is a terse description of the file
{
  "bam01": {
    "filetype": "PacBio.SubreadFile.SubreadBamFile",
    "path": "subreadsets/01/file-01.bam",
    "description": "Example BAM file from 3.1.0"
  },
  "barcodeset": {
    "filetype": "PacBio.DataSet.BarcodeSet",
    "path": "barcodes/pacbio-official-barcodeset.xml",
    "description": "Official PacBio barcodes"
  },
  "lambda": {
    "filetype": "PacBio.DataSet.ReferenceSet",
    "path": "lambda/path-to-referenset.xml",
    "description": "Lambda"
  },
  "s01": {
    "filetype": "PacBio.DataSet.SubreadSet",
    "path": "subreadsets/01/subreadset-01.xml",
    "description": "Example SubreadSet v3.0.1"
  },
  "a01": {
    "filetype": "PacBio.DataSet.AlignmentSet",
    "path": "alignmentsets/01/consolidated-alignmentset-01.xml",
    "description": "Example output AlignmentSet from SA 3.1.0"
  },
  "abam": {
    "filetype": "PacBio.AlignmentFile.AlignmentBamFile",
    "path": "alignmentsets/01/file.bam",
    "description": "Example BAM"
  }
}
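
The rules above (valid id, required keys, relative or absolute path) could be enforced by a small loader. The following is only a sketch using the Python stdlib; the function name load_bundle and the exact id pattern are assumptions, not part of the proposal, and the filetype is not checked against the real list of PacBio file metatypes.

```python
import json
import os
import re

# Assumption: ids are plain alphanumerics, per the id rule in this proposal.
ID_PATTERN = re.compile(r"^[A-Za-z0-9]+$")


def load_bundle(bundle_path):
    """Load a bundle JSON file and return {id: entry} with absolute paths.

    Hypothetical helper: relative paths are resolved against the directory
    that contains the bundle file; absolute paths are kept as they are.
    """
    bundle_dir = os.path.dirname(os.path.abspath(bundle_path))
    with open(bundle_path) as f:
        raw = json.load(f)

    entries = {}
    for file_id, entry in raw.items():
        if not ID_PATTERN.match(file_id):
            raise ValueError("Invalid id: %r" % file_id)
        # "description" is treated as optional here; that is an assumption.
        for key in ("filetype", "path"):
            if key not in entry:
                raise KeyError("Entry %r is missing %r" % (file_id, key))
        path = entry["path"]
        if not os.path.isabs(path):
            path = os.path.join(bundle_dir, path)
        resolved = dict(entry)
        resolved["path"] = os.path.normpath(path)
        entries[file_id] = resolved
    return entries
```

Calling load_bundle("pbdatabundle.json")["bam01"]["path"] would then give the absolute location of the example BAM file, regardless of the current working directory.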

Thin Python CLI tool to interact with the file

  • only depends on the python stdlib
  • not embedded in a python package
  • should be its own repo (?), with multiple bundle JSON files, such as tiny.json, huge.json
$> pb-data get bam01 # Returns the path, by default looks in the current dir for pbdatabundle.json
$> pb-data get-type PacBio.DataSet.ReferenceSet # Returns a list of [id, path]
$> # Explicit path to bundle file
$> pb-data get bam01 --bundle=tiny.json
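
A stdlib-only version of the commands above might look roughly like this. It is a sketch, not the proposed tool: the function names, the tab-separated output of get-type, and the choice to print paths resolved against the bundle file's directory are all assumptions.

```python
import argparse
import json
import os
import sys

DEFAULT_BUNDLE = "pbdatabundle.json"  # looked up in the current directory


def resolve(bundle_path, path):
    """Resolve a relative entry path against the bundle file's directory."""
    if os.path.isabs(path):
        return path
    base = os.path.dirname(os.path.abspath(bundle_path))
    return os.path.normpath(os.path.join(base, path))


def build_parser():
    # Shared option, so --bundle can follow the subcommand as in the examples.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--bundle", default=DEFAULT_BUNDLE,
                        help="path to the bundle JSON file")
    parser = argparse.ArgumentParser(prog="pb-data")
    sub = parser.add_subparsers(dest="command", required=True)
    get = sub.add_parser("get", parents=[common], help="path for an id")
    get.add_argument("file_id")
    get_type = sub.add_parser("get-type", parents=[common],
                              help="ids and paths for a file type")
    get_type.add_argument("filetype")
    return parser


def run(argv):
    """Return the output lines for the given command-line arguments."""
    args = build_parser().parse_args(argv)
    with open(args.bundle) as f:
        bundle = json.load(f)
    if args.command == "get":
        return [resolve(args.bundle, bundle[args.file_id]["path"])]
    # get-type: one "id<TAB>path" line per matching entry, sorted by id
    return ["%s\t%s" % (fid, resolve(args.bundle, entry["path"]))
            for fid, entry in sorted(bundle.items())
            if entry["filetype"] == args.filetype]


def main(argv=None):
    try:
        lines = run(sys.argv[1:] if argv is None else argv)
    except KeyError as err:
        sys.stderr.write("Unknown id or missing key: %s\n" % err)
        return 1
    for line in lines:
        print(line)
    return 0
```

A real script would end by calling sys.exit(main()). Keeping run() separate from the printing makes the tool easy to exercise from tests without spawning a process.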

Alternatively, jq could be used.

$> cat tiny.json | jq --raw-output '.bam01.path'
subreadsets/01/file-01.bam

Misc

  1. It would be useful to add metadata about the size

  2. For Version 1, I would suggest rsync'ing or scp'ing the files from a common internal directory. Currently, some of the Scala tests use /mnt/secondary/Share/smrtserver-testdata. Version 2 should support a remote URI for each file resource (i.e., add a new key "remote").

  3. This abstraction could be used by the SL "canned" data to import data using pbservice

  4. Depending on the implementation of "remote", the SL job services could emit a data bundle.json file. This could be used to download all the job files (or subset of desired files) from SL services.
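
As a rough illustration of points 1 and 4, a service or script could emit a bundle file from a set of known files and record each file's size along the way. This is a sketch only: the "size" key (in bytes) and the helper names make_entry and write_bundle are hypothetical, and the "remote" key is left out because its format is not yet defined.

```python
import json
import os


def make_entry(path, filetype, description, bundle_dir):
    """Build one bundle entry with a path relative to the bundle directory."""
    return {
        "filetype": filetype,
        "path": os.path.relpath(path, bundle_dir),
        "description": description,
        "size": os.path.getsize(path),  # hypothetical key: size in bytes
    }


def write_bundle(bundle_path, files):
    """Write a bundle JSON file.

    files maps each id to a (path, filetype, description) tuple.
    Returns the dict that was written.
    """
    bundle_dir = os.path.dirname(os.path.abspath(bundle_path))
    bundle = {
        file_id: make_entry(os.path.abspath(path), filetype, desc, bundle_dir)
        for file_id, (path, filetype, desc) in files.items()
    }
    with open(bundle_path, "w") as f:
        json.dump(bundle, f, indent=2, sort_keys=True)
    return bundle
```

Storing paths relative to the bundle file keeps the output relocatable, which matters if the bundle and its files are later copied with rsync or scp.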