Achieving cross-platform portability requires a combination of standards, platform support and workflow conformance. One of the goals of the challenge is to assess and develop each of these areas: improve standards to avoid interoperability challenges, improve platforms to better handle cases where workflow runs differ, and provide better workflows. This section describes experiences from workflow authors on areas where they found interoperability challenges and how they overcame them.
bcbio provides community-built workflows for variant calling, RNA-seq, ChIP-seq and small RNA analysis. It generates CWL workflows from a standard, user-specified template language rather than relying on hand-written, pre-built workflows. This flexibility enables customization of analyses: for instance, users can choose among multiple SNP, indel and structural variant callers.
Because bcbio is so general, our two main challenges were in representing more complex objects in a general way:
- Combining multiple related files. To handle multiple samples in batches, as in tumor/normal somatic calling, bcbio needed a way to combine multiple related inputs such as the BAM file, associated coverage regions, sample description and supplementary files. Attempting to do this via lists of files proved problematic, since flattening the lists and specifying them on the command line is handled differently between platforms. We used CWL records and structured input files to overcome these limitations.
- Handling complex multi-file inputs. Directories of related files, like bwa indices or snpEff databases, were challenging to pass reliably in non-shared filesystem environments due to the complexities of referencing the files. For sets of related files that aren't explicit secondary files (unlike, for example, bam and bam.bai indices), we combined and compressed the files into a single archive that gets unpacked as part of processing.
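The record-based approach to combining related files can be illustrated with a minimal Python sketch. This is not bcbio's actual code: the function names and the field names (`description`, `align_bam`, `coverage_regions`, `supplementary`) are hypothetical and chosen only for illustration. The point is that each sample's files travel together as one structured object written to a JSON file, so nothing depends on how a platform flattens parallel lists on the command line.

```python
import json
from pathlib import Path


def make_sample_record(description, bam, regions, supplementary=()):
    """Group one sample's related inputs into a single record-like dict.

    All field names here are hypothetical, not bcbio's real schema.
    """
    return {
        "description": description,
        "align_bam": bam,
        "coverage_regions": regions,
        "supplementary": list(supplementary),
    }


def write_batch_inputs(batch_name, records, path):
    """Serialize a batch of sample records to a structured JSON input file."""
    payload = {"batch": batch_name, "samples": records}
    Path(path).write_text(json.dumps(payload, indent=2, sort_keys=True))
    return Path(path)


def read_batch_inputs(path):
    """Load the batch name and sample records back inside a sub-process."""
    payload = json.loads(Path(path).read_text())
    return payload["batch"], payload["samples"]
```

A tumor/normal pair would then be two records in one batch. Because each record keeps its own list of supplementary files, samples with different numbers of extra files stay unambiguous, which parallel flat lists cannot guarantee.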
bcbio workflows ran consistently on local platforms (Rabix Bunny), in high performance computing environments with shared filesystems (Toil) and in cloud environments (DNAnexus, SevenBridges, Arvados). The major improvements we made to our workflows to handle the challenges listed above are:
- CWL records. We incorporated structured objects to represent multiple files passed together, which made our workflows easier to understand.
- Structured input files. Instead of serializing a complex set of inputs on the command line, we pass inputs to our processes using structured JSON input files. These are easy to hand on to sub-processes and also make failed runs easier to debug, since they provide a readable representation of the inputs.
- Compression of complex index inputs. For associated files like snpEff indices, which have a complex set of related files, we now prepare compressed tarballs and then unpack these during the run. This reduces complexity in the workflow files but also requires extra work in preparing and unpacking the tarballs during processing.
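The tarball approach for index directories can be sketched with Python's standard `tarfile` module. This is a minimal illustration under stated assumptions, not the code bcbio uses: the function names `pack_index` and `unpack_index` are hypothetical, and the sketch assumes the whole index lives under a single directory.

```python
import tarfile
from pathlib import Path


def pack_index(index_dir, tarball):
    """Compress a directory of related index files into one gzipped tarball."""
    index_dir = Path(index_dir)
    with tarfile.open(tarball, "w:gz") as tar:
        # Store paths relative to the directory name so the archive unpacks
        # the same way regardless of where it was created.
        tar.add(index_dir, arcname=index_dir.name)
    return Path(tarball)


def unpack_index(tarball, work_dir):
    """Unpack the tarball inside a task's working directory.

    Returns the path of the restored index directory.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as tar:
        top_level = sorted({Path(name).parts[0] for name in tar.getnames()})
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that reject
            # unsafe members such as absolute paths.
            tar.extractall(work_dir, filter="data")
        else:
            tar.extractall(work_dir)
    if len(top_level) != 1:
        raise ValueError("expected a single top-level directory in the tarball")
    return work_dir / top_level[0]
```

The workflow then only has to declare a single file input for the index, at the cost of the packing step beforehand and an unpacking step in every task that needs the index.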