Analysis jobs can be submitted using pbservice with a JSON file that specifies the pipeline template id and the entry points as UUIDs.
pbservice run-analysis --host smrtlink-bihourly --port 8081 --block --debug /path/to/analysis-job.json
The analysis-job.json file encodes the pipeline template and the DataSet entry points that should be used. The DataSets are referenced by UUID, and the entry point ids/labels depend on the specific pipeline id (e.g., pbsmrtpipe.pipelines.resequencing).
The --block option will block the process and poll the job status until the job completes. Without --block, the job will only be submitted.
{
"entryPoints": [
{
"_comment": "datasetId can be provided as the DataSet UUID or Int",
"datasetId": "a1ed744a-3305-11e6-9a7e-3c15c2cc8f88",
"entryId": "eid_ref_dataset",
"fileTypeId": "PacBio.DataSet.ReferenceSet"
},
{
"datasetId": "b02ee3ea-3305-11e6-ad05-3c15c2cc8f88",
"entryId": "eid_subread",
"fileTypeId": "PacBio.DataSet.SubreadSet"
}
],
"name": "My Job Name",
"_comment": "The entryId(s) can be obtained by running 'pbsmrtpipe show-template-details {PIPELINE-ID}'",
"pipelineId": "pbsmrtpipe.pipelines.sat",
"taskOptions": [],
"workflowOptions": []
}
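If you are generating many jobs, the settings file above can also be written programmatically. Here's a minimal sketch; the helper names (to_entry_point, to_analysis_job) are illustrative, not part of pbservice, and the UUIDs are the placeholder values from the example above:

```python
import json

def to_entry_point(entry_id, dataset_uuid, file_type_id):
    """Build a single entry-point record for analysis-job.json."""
    return {"entryId": entry_id,
            "datasetId": dataset_uuid,
            "fileTypeId": file_type_id}

def to_analysis_job(name, pipeline_id, entry_points):
    """Build the full settings dict expected by pbservice run-analysis."""
    return {"name": name,
            "pipelineId": pipeline_id,
            "entryPoints": entry_points,
            "taskOptions": [],
            "workflowOptions": []}

job = to_analysis_job(
    "My Job Name",
    "pbsmrtpipe.pipelines.sat",
    [to_entry_point("eid_ref_dataset",
                    "a1ed744a-3305-11e6-9a7e-3c15c2cc8f88",
                    "PacBio.DataSet.ReferenceSet"),
     to_entry_point("eid_subread",
                    "b02ee3ea-3305-11e6-ad05-3c15c2cc8f88",
                    "PacBio.DataSet.SubreadSet")])

with open("analysis-job.json", "w") as f:
    json.dump(job, f, indent=2, sort_keys=True)
```

The resulting file has the same shape as the hand-written example above and can be passed directly to pbservice run-analysis.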
Details about a DataSet (and verification that it has been imported) can be obtained using pbservice.
Here's an example of looking up a SubreadSet by UUID 6fa4ff78-9b6b-472e-a2f2-a66e730ff183
$> pbservice get-dataset 6fa4ff78-9b6b-472e-a2f2-a66e730ff183 --host smrtlink-beta --port 8081 --quiet
{ u'comments': u' ',
u'createdAt': u'2016-06-14T20:26:53.897Z',
u'id': 19277,
u'isActive': True,
u'jobId': 12491,
u'md5': u'b5cb1bf74146c9acb12b48491bd2c3c1',
u'name': u'SMS_FleaBenchmark_15kbEcoli_trypsin_A8_061316',
u'numRecords': 575772,
u'path': u'/pbi/collections/315/3150259/r54008_20160614_021104/2_B01/m54008_160614_093245.subreadset.xml',
u'projectId': 1,
u'tags': u'subreadset',
u'totalLength': 3020667250,
u'updatedAt': u'2016-06-14T20:26:53.897Z',
u'userId': 1,
u'uuid': u'6fa4ff78-9b6b-472e-a2f2-a66e730ff183',
u'version': u'3.0.1'}
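When scripting, the record returned by the services can be sanity-checked before submitting a job. A small sketch, using the field names from the output above (is_imported is a hypothetical helper, not part of pbservice):

```python
def is_imported(record):
    """Return True if a dataset record looks fully imported.

    `record` is a dict with the fields shown in the pbservice
    get-dataset output above (uuid, path, numRecords, isActive).
    """
    required = ("uuid", "path", "numRecords", "isActive")
    if not all(k in record for k in required):
        return False
    return bool(record["isActive"]) and record["numRecords"] > 0

# The SubreadSet from the example output above:
record = {"uuid": "6fa4ff78-9b6b-472e-a2f2-a66e730ff183",
          "path": "/pbi/collections/315/3150259/r54008_20160614_021104/"
                  "2_B01/m54008_160614_093245.subreadset.xml",
          "numRecords": 575772,
          "isActive": True}

print(is_imported(record))  # → True
```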
Using the pbsmrtpipe exe, you can get details about specific pipeline ids, such as the available task options (with default values) and the entry points of the pipeline. Specifically, the entry point ids/labels of the pipeline are important for binding the required DataSets to run the pipeline.
Example of getting a list of the available Pipeline Templates:
$> pbsmrtpipe show-templates | tail -5
48. Convert BAM to FASTX pbsmrtpipe.pipelines.sa3_ds_subreads_to_fastx
49. RS Movie to Subread DataSet pbsmrtpipe.pipelines.sa3_fetch
50. Convert RS to BAM pbsmrtpipe.pipelines.sa3_hdfsubread_to_subread
51. RS movie Resequencing pbsmrtpipe.pipelines.sa3_resequencing
52. Site Acceptance Test (SAT) pbsmrtpipe.pipelines.sa3_sat
Getting details about a specific pipeline template by id:
$> pbsmrtpipe show-template-details pbsmrtpipe.pipelines.sa3_sat
Registry Loaded. Number of ToolContracts:142 FileTypes:56 ChunkOperators:17 Pipelines:52
Pipeline id : pbsmrtpipe.pipelines.sa3_sat
Pipeline name : Site Acceptance Test (SAT)
Description :
Site Acceptance Test - lambda genome resequencing used to validate new
PacBio installations
Tags : mapping,consensus,reports,sat
Entry points : 2
********************
$entry:eid_subread
$entry:eid_ref_dataset
Bindings : 23
********************
genomic_consensus.tasks.variantcaller:0 -> genomic_consensus.tasks.gff2vcf:0
genomic_consensus.tasks.summarize_consensus:0 -> pbreports.tasks.variants_report:1
genomic_consensus.tasks.variantcaller:0 -> genomic_consensus.tasks.summarize_consensus:1
genomic_consensus.tasks.variantcaller:0 -> pbreports.tasks.variants_report:2
pbcoretools.tasks.filterdataset:0 -> pbalign.tasks.pbalign:0
pbalign.tasks.pbalign:0 -> pbreports.tasks.summarize_coverage:0
pbalign.tasks.pbalign:0 -> pbreports.tasks.mapping_stats:0
genomic_consensus.tasks.variantcaller:0 -> genomic_consensus.tasks.gff2bed:0
pbalign.tasks.pbalign:0 -> genomic_consensus.tasks.variantcaller:0
pbreports.tasks.summarize_coverage:0 -> pbreports.tasks.coverage_report:1
genomic_consensus.tasks.variantcaller:0 -> pbreports.tasks.top_variants:0
pbreports.tasks.summarize_coverage:0 -> genomic_consensus.tasks.summarize_consensus:0
pbalign.tasks.pbalign:0 -> pbalign.tasks.consolidate_alignments:0
pbreports.tasks.variants_report:0 -> pbreports.tasks.sat_report:1
pbalign.tasks.pbalign:0 -> pbreports.tasks.sat_report:0
pbreports.tasks.mapping_stats:0 -> pbreports.tasks.sat_report:2
$entry:eid_ref_dataset -> pbreports.tasks.variants_report:0
$entry:eid_ref_dataset -> pbreports.tasks.top_variants:1
$entry:eid_ref_dataset -> pbreports.tasks.coverage_report:0
$entry:eid_subread -> pbcoretools.tasks.filterdataset:0
$entry:eid_ref_dataset -> pbreports.tasks.summarize_coverage:1
$entry:eid_ref_dataset -> genomic_consensus.tasks.variantcaller:1
$entry:eid_ref_dataset -> pbalign.tasks.pbalign:1
Default Task Options
'pbalign.task_options.concordant' -> True
'genomic_consensus.task_options.diploid' -> False
'pbalign.task_options.algorithm_options' -> --minMatch 12 --bestn 10 --minPctSimilarity 70.0 --refineConcordantAlignments
'genomic_consensus.task_options.algorithm' -> plurality
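The "Default Task Options" section is plain text; if you want to inspect the defaults in a script before overriding any of them in taskOptions, a quick parse of those lines works. This sketch only assumes the `'id' -> value` layout shown above:

```python
def parse_default_task_options(lines):
    """Parse lines of the form 'option.id' -> value into a dict.

    Values are kept as strings, except True/False, which become
    booleans; everything after the first '->' is the value.
    """
    options = {}
    for line in lines:
        if "->" not in line:
            continue
        key, _, value = line.partition("->")
        key = key.strip().strip("'")
        value = value.strip()
        if value in ("True", "False"):
            value = (value == "True")
        options[key] = value
    return options

text = """'pbalign.task_options.concordant' -> True
'genomic_consensus.task_options.diploid' -> False
'genomic_consensus.task_options.algorithm' -> plurality"""

opts = parse_default_task_options(text.splitlines())
print(opts["pbalign.task_options.concordant"])          # → True
print(opts["genomic_consensus.task_options.algorithm"])  # → plurality
```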
Therefore, when you want to use the pipeline pbsmrtpipe.pipelines.sat, the analysis-job.json must have eid_subread and eid_ref_dataset as entry points.
Here's an example of an analysis-job.json for a SAT pipeline with a SubreadSet b02ee3ea-3305-11e6-ad05-3c15c2cc8f88 and ReferenceSet a1ed744a-3305-11e6-9a7e-3c15c2cc8f88.
{
"entryPoints": [
{
"datasetId": "a1ed744a-3305-11e6-9a7e-3c15c2cc8f88",
"entryId": "eid_ref_dataset",
"fileTypeId": "PacBio.DataSet.ReferenceSet"
},
{
"datasetId": "b02ee3ea-3305-11e6-ad05-3c15c2cc8f88",
"entryId": "eid_subread",
"fileTypeId": "PacBio.DataSet.SubreadSet"
}
],
"name": "My Custom SAT Job Name",
"pipelineId": "pbsmrtpipe.pipelines.sat",
"taskOptions": [],
"workflowOptions": []
}
And now submit the job to SMRT Link.
$> pbservice run-analysis --host smrtlink-bihourly --port 8081 --debug analysis-job.json
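When driving submissions from a script, it can help to assemble the pbservice command line in one place. A sketch, using the example host and port from this page (build_run_analysis_cmd is an illustrative helper, not part of pbservice):

```python
def build_run_analysis_cmd(settings_json, host, port, block=False, debug=False):
    """Assemble the pbservice run-analysis command line shown above."""
    cmd = ["pbservice", "run-analysis", "--host", host, "--port", str(port)]
    if block:
        cmd.append("--block")
    if debug:
        cmd.append("--debug")
    cmd.append(settings_json)
    return cmd

cmd = build_run_analysis_cmd("analysis-job.json", "smrtlink-bihourly", 8081,
                             block=True, debug=True)
print(" ".join(cmd))
# → pbservice run-analysis --host smrtlink-bihourly --port 8081 --block --debug analysis-job.json

# To actually submit (requires pbservice on PATH and a reachable server):
# import subprocess; subprocess.check_call(cmd)
```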