SA3 Services + Tools configuration Docs
<?xml version="1.0" ?>
<pipeline-template-preset id="MyPreset">

    <!-- Pbsmrtpipe Engine Options -->
    <options>
        <!-- Enable Distributed Mode -->
        <option id="pbsmrtpipe.options.distributed_mode">
            <value>False</value>
        </option>
        <!-- Enable file chunking -->
        <option id="pbsmrtpipe.options.chunk_mode">
            <value>True</value>
        </option>
        <!-- This will be disabled if pbsmrtpipe.options.distributed_mode is False -->
        <option id="pbsmrtpipe.options.cluster_manager">
            <value>/absolute-path/to/cluster-templates/</value>
        </option>
        <!-- Total Number of slots/processors a pbsmrtpipe instance will use -->
        <option id="pbsmrtpipe.options.max_total_nproc">
            <value>1000</value>
        </option>
        <!-- MAX Number of Processors per Task that will be used -->
        <option id="pbsmrtpipe.options.max_nproc">
            <value>24</value>
        </option>
        <!-- MAX Number of Chunks per Chunkable Task that will be used -->
        <option id="pbsmrtpipe.options.max_nchunks">
            <value>24</value>
        </option>
    </options>

    <!-- Default override for task options -->
    <task-options />

</pipeline-template-preset>

SA3 Config

version 0.1.2

Overview

Case #1 Configuration Setup (with Cluster configuration)

  • Create (or Update) a pbsmrtpipe preset XML file to point to the directory that contains your cluster templates. Set the option pbsmrtpipe.options.cluster_manager to the path of that directory.
  • Export the required environment variables or update PATH so that the cluster command (e.g., qsub) or a custom cluster wrapper is in your path.
  • Create (or Update) the Services config.json (details below) to point to your pbsmrtpipe preset XML.
  • Launch the services using $SMRT_ROOT/admin/start-services. This will put the external tools (pbsmrtpipe, sawriter and samtools) in your path before the services are launched. A minimal end-to-end sketch follows this list.
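
For illustration, a minimal sketch of these steps, assuming an SGE-style scheduler; every path below is a placeholder, not a default:

#!/usr/bin/env bash
# Sketch of the cluster-enabled setup; all paths are placeholders.

# 1. The preset XML (referenced by PB_PRESET_XML in config.json) should set
#    pbsmrtpipe.options.cluster_manager to the directory containing
#    start.tmpl and stop.tmpl, e.g. /absolute-path/to/cluster-templates/

# 2. Make the cluster submission command (e.g., qsub) visible to the services
export PATH=/path/to/sge/bin:$PATH   # or source your site's SGE settings

# 3. Start the services; this also puts pbsmrtpipe, sawriter and samtools in PATH
$SMRT_ROOT/admin/start-services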

Configuration Flow and Overview of Job Execution

  • Users can now create job(s) in the UI or from the Services
  • the Service engine writes a job.sh in the job directory
  • job.sh calls pbsmrtpipe to run a pipeline (or other analysis job types are run, e.g., merge-datasets)
  • pbsmrtpipe submits jobs to be run on the cluster resources by creating "cluster.sh"
  • "cluster.sh" renders the configured cluster template to call the underlying task (i.e., resolved tool contract)
  • pbsmrtpipe submits job status back to the Services using the FQDN of the Service host (the cluster nodes must have network access to the host that is running the services)
  • On pbsmrtpipe (or other analysis job type) completion, the Service engine will import the job datastore (and PacBio DataSets) into the db. Newly imported DataSets are accessible to be used in new pipelines.

Case #2 Configuration Setup (non-Cluster configuration)

  • Create (or Update) a pbsmrtpipe preset XML file. For a non-cluster setup, set pbsmrtpipe.options.distributed_mode to False; the pbsmrtpipe.options.cluster_manager option is ignored when distributed mode is disabled.
  • Create (or Update) the Services config.json (details below) to point to your pbsmrtpipe preset XML.
  • Launch the services using $SMRT_ROOT/admin/start-services. This will put the external tools (pbsmrtpipe, sawriter and samtools) in your path before the services are launched.
  • The Services must be configured to run only one analysis job at a time (PB_SERVICES_NWORKERS=1), and pbsmrtpipe.options.max_total_nproc should be set to no more than the NPROC of the host (see details of the pbsmrtpipe options below); otherwise the host can potentially be oversubscribed. A sketch of this sizing follows this list.
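
A small sketch of how these values relate on a single host; the nproc lookup and echoed guidance are purely illustrative:

#!/usr/bin/env bash
# Sizing sketch for the non-cluster case; values are illustrative, not defaults.
HOST_NPROC=$(nproc)   # cores on the host that runs both the services and pbsmrtpipe

# With "PB_SERVICES_NWORKERS": 1 in config.json only one analysis job runs at a
# time, so a single pbsmrtpipe instance bounded by max_total_nproc <= HOST_NPROC
# cannot oversubscribe the host.
echo "Set pbsmrtpipe.options.max_total_nproc to <= ${HOST_NPROC} in the preset XML"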

Configuration Flow and Overview of Job Execution

  • Users create job(s) in the UI or from the Services
  • the Service engine writes a job.sh in the job directory
  • job.sh calls pbsmrtpipe to run a pipeline (or other analysis job types are run, e.g., merge-datasets)
  • pbsmrtpipe runs the job locally on the same host
  • pbsmrtpipe submits job status back to the Services using the FQDN of the Service host
  • On pbsmrtpipe (or other analysis job type) completion, the Service engine will import the job results datastore (and PacBio DataSets) into the db. Newly imported DataSets are accessible to be used in new jobs.
SMRT Link Common Services Config

SLCS supports the job types: import-dataset, merge-dataset

The bundle config supports setting the following options:

  • PB_SERVICES_PORT Port the services will run on
  • PB_JOB_ROOT Root directory in which jobs (types: import-dataset, merge-dataset) will be run (defaults to current working dir + "/jobs-root")
  • PB_TMP_DIR Temp directory used (defaults to the TMP_DIR env var)

Example Services config.json:

{
    "PB_SERVICES_PORT": 8080,
    "PB_SERVICES_NWORKERS": 25,
    "PB_JOB_ROOT": "/absolute-path/to/jobs-root",
    "PB_TMP_DIR": "/path/to/tmp-dir"
}
SMRT Link Analysis Services Config

SLAS supports the job types: import-dataset, merge-dataset, pbsmrtpipe (plus mock-pbsmrtpipe, a simple type for development and testing)

In addition to the SLCS bundle options, the bundle config supports setting the following options:

  • PB_TOOLS_ROOT Path to the root SMRT Analysis tools directory that contains pbsmrtpipe, samtools, sawriter
    • Example: (/path/to/smrtcmds; no default, will inherit the current ENV to find the abspath to the exes)
  • PB_SERVICES_NWORKERS Number of concurrently running jobs (import-dataset, merge-dataset, pbsmrtpipe)
    • Example: (Int, 20)
  • PB_PRESET_XML The preset that will be used to override any pbsmrtpipe engine level options, such as nworkers, max nproc, total nproc, tmp directory, cluster template dir. Details of the supported options are given below.
    • Example: (/path/to/base-pbsmrtpipe-preset.xml, no default)

For running a system that submits jobs to an HPC scheduler, such as SGE, you must first configure PB_PRESET_XML to point to your cluster templates, then put the cluster exes (e.g., qsub) referenced in the cluster templates in your path before starting the SMRT Link Analysis Services.

Example:

$> export PATH=/path/to/sge/bin:$PATH # set SGE_ROOT or equivalent
$> $SMRT_ROOT/admin/start-services

(TODO) Alternatively, call start-services with a setup.sh which can export any environment necessary for the cluster exes defined in your cluster template to be accessible.

$> $SMRT_ROOT/admin/start-services /path/to/my-setup.sh
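
A minimal sketch of what such a setup.sh might contain, assuming an SGE installation; the paths are placeholders:

#!/usr/bin/env bash
# Hypothetical setup.sh sourced before the services start; paths are placeholders.
export SGE_ROOT=/path/to/sge
export PATH=${SGE_ROOT}/bin:${PATH}
# Export any other environment needed by the commands in your cluster templates here.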

Testing that the services started successfully

$> $SMRT_ROOT/admin/services-get-status
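
If a blocking check is needed (for example in a provisioning script), a simple retry loop around the status command can be used; this is only a sketch and the retry count and sleep interval are arbitrary:

#!/usr/bin/env bash
# Poll until the services report a successful status, or give up after 30 attempts.
for attempt in $(seq 1 30); do
    if "$SMRT_ROOT/admin/services-get-status"; then
        echo "Services are up (attempt ${attempt})."
        exit 0
    fi
    sleep 10
done
echo "Services did not come up." >&2
exit 1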

Example of config.json

{
    "PB_SERVICES_PORT": 8080,
    "PB_SERVICES_NWORKERS": 25,
    "PB_JOB_ROOT": "/absolute-path/to/jobs-root",
    "PB_PRESET_XML": "absolute-path/to/pbsmrtipe-preset.xml",
    "PB_TMP_DIR": "/path/to/tmp-dir"
}

Details of Configuration Model

  • The services will be invoked with an environment that contains the pbsmrtpipe, samtools, and sawriter exes in the path. This enables the services to run analysis pipelines using pbsmrtpipe and other analysis, such as fasta to ReferenceSet conversion.
  • For a pbsmrtpipe analysis job, a job.sh will be written to the job directory by the Service engine layer (described in a previous section) with the resolved entry points to the pipeline

Example of the job.sh

#!/bin/bash
# Note: the second --preset-xml will override any values defined in workflow.xml and preset.xml
/absolute-path/to/smrtcmds/pbsmrtpipe pipeline /absolute-path-job-dir/workflow.xml \
    --debug \
    -e "eid_ref_dataset:/mnt/secondary-siv/references/lambdaNEB/lambdaNEB.referenceset.xml" \
    -e "eid_subread:/mnt/secondary-siv/testdata/SA3-DS/lambda/2372215/0007_tiny/Analysis_Results/m150404_101626_42267_c100807920800000001823174110291514_s1_p0.all.subreadset.xml" \
    --preset-xml="/absolute-path-job-dir/preset.xml" \
    --preset-xml="/absolute-path/pb-preset-env.xml" \
    --output-dir="/absolute-path-job-dir/job_output"

The preset.xml contains the values that have been provided at the UI/Services level. When multiple --preset-xml instances are given, the last value supplied overrides the earlier ones. Using PB_PRESET_XML from the services config allows a user to globally override the pipeline engine level options, as the toy example below illustrates.
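
To make the precedence concrete, here is a toy invocation (entry points omitted for brevity; the file names and max_nproc values are purely illustrative):

# preset-a.xml sets pbsmrtpipe.options.max_nproc to 8
# preset-b.xml sets pbsmrtpipe.options.max_nproc to 16
pbsmrtpipe pipeline workflow.xml \
    --preset-xml=preset-a.xml \
    --preset-xml=preset-b.xml \
    --output-dir=job_output
# The resolved max_nproc is 16: the last --preset-xml given wins.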

Example of a PB_PRESET_XML that can be used by pbsmrtpipe. See the details below, or run pbsmrtpipe show-workflow-options from the command line for more information.

<?xml version="1.0" ?>
<pipeline-template-preset>

    <!-- Pbsmrtpipe Engine Options -->
    <options>
        <!-- Enable Distributed Mode -->
        <option id="pbsmrtpipe.options.distributed_mode">
            <value>False</value>
        </option>
        <!-- Enable file chunking -->
        <option id="pbsmrtpipe.options.chunk_mode">
            <value>True</value>
        </option>
        <!-- This will be disabled if pbsmrtpipe.options.distributed_mode is False -->
        <option id="pbsmrtpipe.options.cluster_manager" >
            <value>/absolute-path/to/cluster-templates/</value>
        </option>

        <!-- Total Number of slots/processors a pbsmrtpipe instance will use -->
        <option id="pbsmrtpipe.options.max_total_nproc" >
            <value>1000</value>
        </option>

        <!-- MAX Number of Processors per Task that will be used -->
        <option id="pbsmrtpipe.options.max_nproc">
            <value>24</value>
        </option>

        <!-- MAX Number of Chunks per Chunkable Task that will be used -->
        <option id="pbsmrtpipe.options.max_nchunks">
            <value>24</value>
        </option>

    </options>

    <!-- Default override for task options -->
    <task-options />

</pipeline-template-preset>
  1. Each task in the job directory will either be run "locally" as a subprocess of pbsmrtpipe or on the cluster resources
  2. For local jobs, these lightweight tasks are run by a call to pbtools-runner, which will call the resolved tool contract via a Runnable Job. A Runnable Job is a resolved tool contract plus job specific metadata (e.g., cluster, job id). The environment is inherited from the parent pbsmrtpipe process. When running in the SMRT Analysis Suite, the exe paths are resolved to $SMRT_ROOT/smrtcmds/bin.

"Runnable Job" Example:

pbtools-runner run --debug /absolute-path/job-output/tasks/pbreports.tasks.mapping_stats-0/resolved-tool-contract.json
  3. Cluster jobs (i.e., resolved tool contracts that have distributed = True) will have a cluster.sh

Cluster "Runnable Job" Example:

qsub -S /bin/bash -sync y -V -q default -N job.4361649pbalign.tasks.pbalign \
    -o "/home/UNIXHOME/mkocher/workspaces/mkocher_server_testkit_jobs/testkit-jobs/sa3_pipelines/mapping/tiny_ds/job_output/tasks/pbalign.tasks.pbalign-1/cluster.stdout" \
    -e "/home/UNIXHOME/mkocher/workspaces/mkocher_server_testkit_jobs/testkit-jobs/sa3_pipelines/mapping/tiny_ds/job_output/tasks/pbalign.tasks.pbalign-1/cluster.stderr" \
    -pe smp 5 \
    "/home/UNIXHOME/mkocher/workspaces/mkocher_server_testkit_jobs/testkit-jobs/sa3_pipelines/mapping/tiny_ds/job_output/tasks/pbalign.tasks.pbalign-1/run.sh"

Where the run.sh will call pbtools-runner to run the resolved tool contract.

Explicit Example of run.sh:

/absolute-path/smrtcmds/pbtools-runner run --debug /absolute-path/job-output/tasks/pbreports.tasks.mapping_stats-0/resolved-tool-contract.json

Cluster Template

Details of the cluster manager configuration

The cluster template model allows users to take the task metadata and call the cluster submission layer via a simple template layer. There are two required files, "start.tmpl" and "stop.tmpl" (a sketch of the latter follows the start.tmpl example below).

An example of the "start.tmpl" which dispatches to the "default" queue.

qsub -pe smp ${NPROC} -S /bin/bash -V -q default -N ${JOB_ID} \
    -o ${STDOUT_FILE}\
    -e ${STDERR_FILE}\
    ${CMD}
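
Only start.tmpl is shown above. As a sketch, a stop.tmpl might look like the following, assuming the ${JOB_ID} variable is available to the template and that SGE's qdel is the appropriate deletion command:

qdel ${JOB_ID}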

Custom Cluster Template

The template model allows you to define your own cluster template with a call to run a cluster submission. Often there is a need to take the job resources (e.g., NPROC) and route the job to a specific queue. Adding a simple bash script can enable this. First, define a new cluster template that calls your custom bash script.

Custom Template "start.tmpl"

custom-qsub.sh ${JOB_ID} ${NPROC} ${STDOUT_FILE} ${STDERR_FILE} ${CMD}

Where custom-qsub.sh is a simple bash script that determines the queue to submit to and calls the 'raw' qsub command.

An example that dispatches to small_queue if NPROC is 1; otherwise the queue is set to default_queue.

#!/usr/bin/env bash
set -o errexit
set -o pipefail
set -o nounset
# set -o xtrace

display_usage() {
	echo "Usage: $0 [JOB_ID] [JOB_NPROC] [JOB_STDOUT] [JOB_STDERR] [JOB_CMD]";
}

# Simple example that selects the job queue based on the requested nproc
# arg: nproc
get_job_queue() {
	if [[ "x$1" == "x1" ]]; then
		echo "small_queue"
	else
		echo "default_queue"
	fi
}

if [ $# -eq 5 ]; then
	job_id=$1
	job_nproc=$2
	job_stdout=$3
	job_stderr=$4
	job_cmd=$5
	job_queue=`get_job_queue $job_nproc`
	echo "Running job-id ${job_id} with job-queue $job_queue nproc ${job_nproc}"
	# Actually Call QSUB here
	cluster_cmd="qsub -pe smp ${job_nproc} -S /bin/bash -q ${job_queue} -N ${job_id} -o ${job_stdout} -e ${job_stderr} $job_cmd"
	echo "Mock running command '${cluster_cmd}'"
	exit 0
else
	display_usage
	exit 1
fi

For demonstration purposes, this example has minimal to no error handling. To use this feature, it's imperative to make this custom dispatching layer robust and easy to debug.

JGI

A special use case that needs a configurable generation of the job.sh produced by the services. This should be considered a private method that is only used by JGI.

To solve this, the "cluster" template abstraction (which is effectively a "cmd" template) can be reused to generate the custom job.sh

Service level job.sh "cluster" template

qsub -sync yes -pe smp ${NPROC} -S /bin/bash -V -q default -N ${JOB_ID} ${CMD}

This must block, using SGE's -sync option or equivalent.

Resolved Example job.sh

qsub -sync yes -pe smp 24 -S /bin/bash -V -q default -N j.1234 "/absolute-path/to/pbsmrtpipe pipeline --debug ..."

Details of the resolved values determined at the Services level

  • NPROC value is the max nproc taken from the preset.xml
  • the JOB_ID is formatted similarly to a pbsmrtpipe generated job id
  • CMD is the full path to the pbsmrtpipe command
  • STDOUT_FILE and STDERR_FILE should not be used in the template (FIXME: not quite clear how to handle this)
  • pbsmrtpipe.options.distributed_mode in PB_PRESET_XML must be set to False

(TODO) Configuring the job.sh "cluster" template in the config.json:

  • PB_SERVICES_CMD_TEMPLATE Path to the directory containing start.tmpl (see the sketch below)
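
A hedged sketch of how this (TODO) option might sit alongside the existing settings; the key name is taken from the bullet above, and the paths are placeholders rather than confirmed defaults:

# Hypothetical config.json including the (TODO) job.sh template option.
cat > /absolute-path/to/config.json <<'EOF'
{
    "PB_SERVICES_PORT": 8080,
    "PB_SERVICES_NWORKERS": 25,
    "PB_JOB_ROOT": "/absolute-path/to/jobs-root",
    "PB_PRESET_XML": "/absolute-path/to/pbsmrtpipe-preset.xml",
    "PB_SERVICES_CMD_TEMPLATE": "/absolute-path/to/job-sh-cluster-templates/",
    "PB_TMP_DIR": "/path/to/tmp-dir"
}
EOF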