@mpkocher
Last active June 15, 2016 15:07
Automate Analysis Job Submission to SMRT Link/Analysis

Automation of Analysis Job Submission

Analysis Jobs can be submitted with pbservice using a JSON file that specifies the pipeline template id and the entry-point DataSets by UUID.

pbservice run-analysis --host smrtlink-bihourly --port 8081 --block --debug /path/to/analysis-job.json

The analysis-job.json encodes the pipeline template and the DataSet entry points to use. The DataSets are referenced by UUID, and the entry point ids/labels depend on the specific pipeline id (e.g., pbsmrtpipe.pipelines.resequencing).

The --block option will block the process and poll the job status until completion. Without --block, the job is only submitted and the command returns immediately.

Example analysis-job.json using a SAT Pipeline

{
    "entryPoints": [
        {
            "_comment": "datasetId can be provided as the DataSet UUID or Int",
            "datasetId": "a1ed744a-3305-11e6-9a7e-3c15c2cc8f88",
            "entryId": "eid_ref_dataset",
            "fileTypeId": "PacBio.DataSet.ReferenceSet"
        },
        {
            "datasetId": "b02ee3ea-3305-11e6-ad05-3c15c2cc8f88",
            "entryId": "eid_subread",
            "fileTypeId": "PacBio.DataSet.SubreadSet"
        }
    ],
    "name": "My Job Name",
    "_comment": "The entryId(s) can be obtained by running 'pbsmrtpipe show-template-details {PIPELINE-ID}'",
    "pipelineId": "pbsmrtpipe.pipelines.sat",
    "taskOptions": [],
    "workflowOptions": []
}
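
The JSON payload above can also be generated programmatically, which is convenient when submitting many jobs. A minimal sketch in Python (the helper names are illustrative, not part of pbservice; the UUIDs are the ones from the example above):

```python
import json

def make_entry_point(entry_id, dataset_uuid, file_type_id):
    """Build one entry-point record for an analysis-job.json payload."""
    return {
        "entryId": entry_id,
        "datasetId": dataset_uuid,
        "fileTypeId": file_type_id,
    }

def make_analysis_job(name, pipeline_id, entry_points, task_options=None):
    """Assemble the job description consumed by `pbservice run-analysis`."""
    return {
        "name": name,
        "pipelineId": pipeline_id,
        "entryPoints": entry_points,
        "taskOptions": task_options or [],
        "workflowOptions": [],
    }

job = make_analysis_job(
    "My Job Name",
    "pbsmrtpipe.pipelines.sa3_sat",
    [
        make_entry_point("eid_ref_dataset",
                         "a1ed744a-3305-11e6-9a7e-3c15c2cc8f88",
                         "PacBio.DataSet.ReferenceSet"),
        make_entry_point("eid_subread",
                         "b02ee3ea-3305-11e6-ad05-3c15c2cc8f88",
                         "PacBio.DataSet.SubreadSet"),
    ],
)

# Write the file that pbservice run-analysis will consume
with open("analysis-job.json", "w") as f:
    json.dump(job, f, indent=4, sort_keys=True)
```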

Details about a DataSet (and verification that it has been imported) can be obtained using pbservice.

Here's an example of looking up a SubreadSet by UUID 6fa4ff78-9b6b-472e-a2f2-a66e730ff183

$> pbservice get-dataset 6fa4ff78-9b6b-472e-a2f2-a66e730ff183 --host smrtlink-beta --port 8081 --quiet
{ u'comments': u' ',
  u'createdAt': u'2016-06-14T20:26:53.897Z',
  u'id': 19277,
  u'isActive': True,
  u'jobId': 12491,
  u'md5': u'b5cb1bf74146c9acb12b48491bd2c3c1',
  u'name': u'SMS_FleaBenchmark_15kbEcoli_trypsin_A8_061316',
  u'numRecords': 575772,
  u'path': u'/pbi/collections/315/3150259/r54008_20160614_021104/2_B01/m54008_160614_093245.subreadset.xml',
  u'projectId': 1,
  u'tags': u'subreadset',
  u'totalLength': 3020667250,
  u'updatedAt': u'2016-06-14T20:26:53.897Z',
  u'userId': 1,
  u'uuid': u'6fa4ff78-9b6b-472e-a2f2-a66e730ff183',
  u'version': u'3.0.1'}
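
The UUID used in the lookup above is also stored in the DataSet XML itself, in the UniqueId attribute of the root element. A minimal sketch for extracting it (the stand-in XML file here keeps only the attribute we need; a real .subreadset.xml carries much more):

```python
import xml.etree.ElementTree as ET

def dataset_uuid(path):
    """Return the UUID stored in the UniqueId attribute of a DataSet XML root."""
    return ET.parse(path).getroot().attrib["UniqueId"]

# Stand-in for a real .subreadset.xml, reduced to the attribute of interest
with open("example.subreadset.xml", "w") as f:
    f.write('<SubreadSet UniqueId="6fa4ff78-9b6b-472e-a2f2-a66e730ff183"/>')

print(dataset_uuid("example.subreadset.xml"))
```

The printed UUID is what you would pass to `pbservice get-dataset` or use as a `datasetId` in analysis-job.json.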

Getting Details about Pipelines

Using the pbsmrtpipe exe you can get details about a specific pipeline id, such as the available task options (with default values) and the entry points of the pipeline. In particular, the entry point ids/labels are needed to bind the required DataSets when running the pipeline.

Example of listing the available Pipeline Templates.

$> pbsmrtpipe show-templates | tail -5
 48. Convert BAM to FASTX           pbsmrtpipe.pipelines.sa3_ds_subreads_to_fastx
 49. RS Movie to Subread DataSet    pbsmrtpipe.pipelines.sa3_fetch
 50. Convert RS to BAM              pbsmrtpipe.pipelines.sa3_hdfsubread_to_subread
 51. RS movie Resequencing          pbsmrtpipe.pipelines.sa3_resequencing
 52. Site Acceptance Test (SAT)     pbsmrtpipe.pipelines.sa3_sat

Getting Details about a specific pipeline template by id.

$> pbsmrtpipe show-template-details pbsmrtpipe.pipelines.sa3_sat
Registry Loaded. Number of ToolContracts:142 FileTypes:56 ChunkOperators:17 Pipelines:52
Pipeline id   : pbsmrtpipe.pipelines.sa3_sat
Pipeline name : Site Acceptance Test (SAT)
Description   : 
    Site Acceptance Test - lambda genome resequencing used to validate new
    PacBio installations
    
Tags          : mapping,consensus,reports,sat 
Entry points  : 2
********************
$entry:eid_subread
$entry:eid_ref_dataset

Bindings      : 23
********************
          genomic_consensus.tasks.variantcaller:0 -> genomic_consensus.tasks.gff2vcf:0
    genomic_consensus.tasks.summarize_consensus:0 -> pbreports.tasks.variants_report:1
          genomic_consensus.tasks.variantcaller:0 -> genomic_consensus.tasks.summarize_consensus:1
          genomic_consensus.tasks.variantcaller:0 -> pbreports.tasks.variants_report:2
                pbcoretools.tasks.filterdataset:0 -> pbalign.tasks.pbalign:0
                          pbalign.tasks.pbalign:0 -> pbreports.tasks.summarize_coverage:0
                          pbalign.tasks.pbalign:0 -> pbreports.tasks.mapping_stats:0
          genomic_consensus.tasks.variantcaller:0 -> genomic_consensus.tasks.gff2bed:0
                          pbalign.tasks.pbalign:0 -> genomic_consensus.tasks.variantcaller:0
             pbreports.tasks.summarize_coverage:0 -> pbreports.tasks.coverage_report:1
          genomic_consensus.tasks.variantcaller:0 -> pbreports.tasks.top_variants:0
             pbreports.tasks.summarize_coverage:0 -> genomic_consensus.tasks.summarize_consensus:0
                          pbalign.tasks.pbalign:0 -> pbalign.tasks.consolidate_alignments:0
                pbreports.tasks.variants_report:0 -> pbreports.tasks.sat_report:1
                          pbalign.tasks.pbalign:0 -> pbreports.tasks.sat_report:0
                  pbreports.tasks.mapping_stats:0 -> pbreports.tasks.sat_report:2
                           $entry:eid_ref_dataset -> pbreports.tasks.variants_report:0
                           $entry:eid_ref_dataset -> pbreports.tasks.top_variants:1
                           $entry:eid_ref_dataset -> pbreports.tasks.coverage_report:0
                               $entry:eid_subread -> pbcoretools.tasks.filterdataset:0
                           $entry:eid_ref_dataset -> pbreports.tasks.summarize_coverage:1
                           $entry:eid_ref_dataset -> genomic_consensus.tasks.variantcaller:1
                           $entry:eid_ref_dataset -> pbalign.tasks.pbalign:1
Default Task Options
'pbalign.task_options.concordant' -> True
'genomic_consensus.task_options.diploid' -> False
'pbalign.task_options.algorithm_options' -> --minMatch 12 --bestn 10 --minPctSimilarity 70.0 --refineConcordantAlignments
'genomic_consensus.task_options.algorithm' -> plurality
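
The default task options listed above can be overridden via the taskOptions array in analysis-job.json (left empty in the examples). A sketch of building such overrides; note the record schema (field names like "optionId" and "value") is an assumption here, not shown by the tool output above, so verify it against the analysis-job JSON accepted by your SMRT Link version:

```python
def task_option(option_id, value):
    """Build one task-option override record.

    The field names ("optionId"/"value") are an assumption -- check the
    schema your SMRT Link version accepts before relying on this.
    """
    return {"optionId": option_id, "value": value}

# Override two of the defaults shown by show-template-details
overrides = [
    task_option("genomic_consensus.task_options.algorithm", "plurality"),
    task_option("pbalign.task_options.concordant", True),
]
```

The resulting list would replace the empty `"taskOptions": []` in analysis-job.json.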

Therefore, to use pipeline pbsmrtpipe.pipelines.sa3_sat, the analysis-job.json must provide eid_subread and eid_ref_dataset as entry points.

Here's an example analysis-job.json for a SAT pipeline with SubreadSet b02ee3ea-3305-11e6-ad05-3c15c2cc8f88 and ReferenceSet a1ed744a-3305-11e6-9a7e-3c15c2cc8f88.

{
    "entryPoints": [
        {
            "datasetId": "a1ed744a-3305-11e6-9a7e-3c15c2cc8f88",
            "entryId": "eid_ref_dataset",
            "fileTypeId": "PacBio.DataSet.ReferenceSet"
        },
        {
            "datasetId": "b02ee3ea-3305-11e6-ad05-3c15c2cc8f88",
            "entryId": "eid_subread",
            "fileTypeId": "PacBio.DataSet.SubreadSet"
        }
    ],
    "name": "My Custom SAT Job Name",
    "pipelineId": "pbsmrtpipe.pipelines.sat",
    "taskOptions": [],
    "workflowOptions": []
}

And now submit the job to SMRT Link.

$> pbservice run-analysis --host smrtlink-bihourly --port 8081 --debug analysis-job.json
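
For bulk automation, the submission command can be assembled from Python and handed to subprocess. The sketch below only builds the command line (the flags are the ones used throughout this note; `run_analysis_command` is an illustrative helper, not part of pbservice):

```python
def run_analysis_command(job_json, host, port, block=True):
    """Build the argv list for `pbservice run-analysis`."""
    cmd = ["pbservice", "run-analysis", "--host", host, "--port", str(port)]
    if block:
        # Poll job status until completion instead of returning immediately
        cmd.append("--block")
    cmd.append(job_json)
    return cmd

cmd = run_analysis_command("analysis-job.json", "smrtlink-bihourly", 8081)
# To actually submit (requires pbservice on PATH):
# import subprocess
# subprocess.call(cmd)
```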