@ionox0
Last active August 13, 2019 17:49
Debugging a failed job in the Toil jobstore

The Toil log files contain the Toil job ID of the failed job. The excerpt below shows a job being issued for the third time after two failed attempts. It includes the CWL file that defines the job, the job's resource requirements, and the bsub command that is issued.

1.

Further down, the log records the failure of the job with job ID 9ntS9h.

DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:Issued the job command: /home/johnsoni/virtualenvs/pipeline_1.1.14/bin/_toil_worker file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl file:/home/johnsoni/juno_ACCESS/5500-FZ/5500-FZ-1.1.14/tmp/jobstore-2888ebd4-bd4a-11e9-a01d-ec0d9a88a15a u/v/job9ntS9h with job id: 45
INFO:toil.leader:Issued job 'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl' u/v/job9ntS9h with job batch system ID: 45 and cores: 2, disk: 20.5 G, and memory: 15.6 G

...

DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:Running ['bsub', '-cwd', '.', '-J', 'toil_job_45', '-R', 'select[mem > 7] rusage[mem=7]', '-M', '7', '-n', '2', '-W', '1200', '-S', '1', '-app', 'anyOS', '-R', 'select[type==CentOS7]', '/home/johnsoni/virtualenvs/pipeline_1.1.14/bin/_toil_worker file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl file:/home/johnsoni/juno_ACCESS/5500-FZ/5500-FZ-1.1.14/tmp/jobstore-2888ebd4-bd4a-11e9-a01d-ec0d9a88a15a u/v/job9ntS9h']

...

DEBUG:toil.batchSystems.abstractGridEngineBatchSystem:UpdatedJobsQueue Item: (45, 1)
WARNING:toil.leader:Job failed with exit value 1: 'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl' u/v/job9ntS9h
DEBUG:toil.leader:Job 'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl' u/v/job9ntS9h continues to exist (i.e. has more to do)
WARNING:toil.leader:No log file is present, despite job failing: 'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl' u/v/job9ntS9h
WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl' u/v/job9ntS9h with ID u/v/job9ntS9h to 0
DEBUG:toil.leader:Added job: 'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl' u/v/job9ntS9h to active jobs
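When the leader log is long, scanning it by eye for failures is tedious. The following is a minimal sketch, not part of Toil, that pulls failed jobs out of a log. The regular expression is an assumption based only on the WARNING line shown in the excerpt above, and other Toil versions may word the message differently. The helper name `find_failed_jobs` is hypothetical.

```python
import re

# Pattern assumed from the "Job failed with exit value" line in the excerpt;
# this is not an official Toil log format specification.
FAILED_JOB = re.compile(
    r"Job failed with exit value (?P<exit>-?\d+): "
    r"'(?P<name>[^']+)' (?P<store_id>\S+)"
)


def find_failed_jobs(log_lines):
    """Return (exit_value, job_name, job_store_id) for each failure line."""
    failures = []
    for line in log_lines:
        match = FAILED_JOB.search(line)
        if match:
            failures.append(
                (int(match.group("exit")),
                 match.group("name"),
                 match.group("store_id"))
            )
    return failures


# Usage with the WARNING line from the log excerpt.
sample_line = (
    "WARNING:toil.leader:Job failed with exit value 1: "
    "'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/"
    "trimgalore/trimgalore.cwl' u/v/job9ntS9h"
)
failed = find_failed_jobs([sample_line])
```

To run it on a real log, pass an open file object instead of the one-element list; the job store ID in the third field is what the next step searches for.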

2.

Using the job ID, run find to locate the failed job's temporary directory inside the job store.

The file named job in that directory is a Python pickle that holds the job's metadata (job ID, job name, retry count, resource requirements).

(pipeline_1.1.14)  accessbot@juno /home/johnsoni/juno_ACCESS/5500-FZ/5500-FZ-1.1.14/tmp > find . | grep 9ntS9h
./jobstore-2888ebd4-bd4a-11e9-a01d-ec0d9a88a15a/tmp/u/v/job9ntS9h
./jobstore-2888ebd4-bd4a-11e9-a01d-ec0d9a88a15a/tmp/u/v/job9ntS9h/g
./jobstore-2888ebd4-bd4a-11e9-a01d-ec0d9a88a15a/tmp/u/v/job9ntS9h/g/tmpgSv_cO.tmp
./jobstore-2888ebd4-bd4a-11e9-a01d-ec0d9a88a15a/tmp/u/v/job9ntS9h/job
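The same search can be done from Python, which is handy if the next steps are scripted as well. This is a rough equivalent of `find . | grep <job ID>`, sketched with the standard library only. The function name `find_job_paths` is hypothetical.

```python
import os


def find_job_paths(root, job_id):
    """Yield every path under root whose full path contains job_id."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # Substring match on the whole path, like grep on find's output,
            # so files inside the job directory are reported too.
            if job_id in path:
                yield path


if __name__ == "__main__":
    # Run from the directory that contains the jobstore-* directory.
    for path in sorted(find_job_paths(".", "9ntS9h")):
        print(path)
```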

3.

Activate an environment (conda, virtualenv, etc.) that has Toil installed, so that Python can import the Toil classes required to unpickle (deserialize) the failed job object.

accessbot@juno /home/johnsoni/juno_ACCESS/5500-FZ/5500-FZ-1.1.14/tmp > source /home/johnsoni/virtualenvs/pipeline_1.1.14/bin/activate

4.

We can now use the Python interpreter to unpickle and inspect the job object. Here we see the job's name, retry count, and stack of subsequent jobs to run:

>>> import pickle
>>> j = pickle.load(open('./jobstore-2888ebd4-bd4a-11e9-a01d-ec0d9a88a15a/tmp/u/v/job9ntS9h/job', 'rb'))

>>> dir(j)
['__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__hash__', '__init__', '__long__', '__module__', '__native__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_config', '_cores', '_disk', '_memory', '_parseResource', '_preemptable', '_requirements', 'chainedJobs', 'checkpoint', 'checkpointFilesToDelete', 'command', 'cores', 'disk', 'displayName', 'errorJobStoreID', 'filesToDelete', 'fromJob', 'fromJobGraph', 'fromJobNode', 'getLogFileHandle', 'jobName', 'jobStoreID', 'logJobStoreFileID', 'memory', 'next', 'predecessorNumber', 'predecessorsFinished', 'preemptable', 'remainingRetryCount', 'restartCheckpoint', 'services', 'setupJobAfterFailure', 'stack', 'startJobStoreID', 'terminateJobStoreID', 'unitName']

>>> j.jobName
'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl'

>>> j.remainingRetryCount
0

>>> j.stack
[[], [JobNode( **{'jobStoreID': 'Q/S/job0aV3AM', '_config': None, 'displayName': 'JobGraph', 'predecessorNumber': 2, 'unitName': None, '_preemptable': False, 'jobName': 'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/bwa-mem/bwa-mem.cwl', '_disk': 22045261824, '_cores': 4, 'command': None, '_memory': 31457280000} )]]
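Rather than querying attributes one at a time, the fields of interest can be collected in one pass. The sketch below assumes only the attribute names visible in the `dir(j)` output above, and that `stack` is a list of lists of job nodes, as the printed value suggests. The helper names `summarize_job` and `successor_names` are hypothetical and not part of Toil.

```python
def summarize_job(job, fields=("jobName", "jobStoreID",
                               "remainingRetryCount", "logJobStoreFileID")):
    """Collect selected attributes of an unpickled job object into a dict.

    Attributes that are missing on the object are reported as None.
    """
    return {field: getattr(job, field, None) for field in fields}


def successor_names(job):
    """Flatten job.stack (a list of lists of job nodes) into job names."""
    return [node.jobName for level in job.stack for node in level]


# Usage in the interpreter session, where j is the unpickled job object:
#   summarize_job(j)
#   successor_names(j)
```

For the job shown here, `successor_names(j)` would be expected to list the bwa-mem.cwl step, which is the only node in the stack.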

5.

Here is the result of printing the job (after formatting):

JobGraph( **{
	'predecessorNumber': 1
	'startJobStoreID': None
	'_preemptable': False
	'errorJobStoreID': None
	'remainingRetryCount': 0
	'filesToDelete': []
	'checkpointFilesToDelete': None
	'checkpoint': None
	'_cores': 2
	'logJobStoreFileID': 'u/v/job9ntS9h/g/tmpgSv_cO.tmp'
	'jobStoreID': 'u/v/job9ntS9h'
	'unitName': None
	'chainedJobs': ["'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl' u/v/job9ntS9h"]
	'services': []
	'predecessorsFinished': set([])
	'stack': [
		[]
		[
			JobNode( **{
				'jobStoreID': 'Q/S/job0aV3AM'
				'_config': None
				'displayName': 'JobGraph'
				'predecessorNumber': 2
				'unitName': None
				'_preemptable': False
				'jobName': 'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/bwa-mem/bwa-mem.cwl'
				'_disk': 22045261824
				'_cores': 4
				'command': None
				'_memory': 31457280000}
			)
		]
	]
	'_config': None
	'displayName': 'JobGraph'
	'jobName': 'file:///home/johnsoni/pipeline_1.1.14/ACCESS-Pipeline/cwl_tools/trimgalore/trimgalore.cwl'
	'_disk': 22045261824
	'command': '_toil q/P/jobsdmYZF/g/tmpNWSsZJ-_serialiseJob-stream /home/johnsoni/virtualenvs/pipeline_1.1.14/lib/python2.7/site-packages toil.cwl.cwltoil True'
	'_memory': 16777216000
	'terminateJobStoreID': None
} )
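The `_memory` and `_disk` values in the pickle are raw byte counts, while the leader log reports "memory: 15.6 G" and "disk: 20.5 G". A small conversion shows the two agree, assuming that "G" in the log means 2**30 bytes (an assumption, although the numbers are consistent with it). The function name `to_gib` is hypothetical.

```python
def to_gib(num_bytes):
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / float(2 ** 30)


# Values copied from the printed JobGraph above.
memory_gib = to_gib(16777216000)  # 15.625, logged as 15.6 G
disk_gib = to_gib(22045261824)    # 20.53125, logged as 20.5 G
```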