Exome Pipeline

QC Illumina flowcell, demultiplex, and QC mapped BAM

Installation · Usage · Future Improvements

Installation

The main pipeline multi-sample-run-level-pipeline.Snakefile depends on bwa, Picard, Python 3.6+, and R.

Usage

❯ snakemake \
    -s multi-sample-run-level-pipeline.Snakefile \
    --cores ${max_cores} \
    --rerun-incomplete \
    --restart-times ${retry_times} \
    --config \
        run_folder=${run_folder} \
        sample_sheet=${sample_sheet} \
        reference=${reference} \
        bait_intervals=${bait_intervals} \
        run_output=${run_output}

Tool Argument Conversion:

All wrapped tools support on-the-fly Python to tool argument conversion.

For example, the following conversion is applied for a typical Picard tool:

>>> params:
>>>     create_index=True,
>>>     output_extension=None,
>>>     adapters_to_check=['INDEXED', 'FLUIDIGM']
# Represented as the following CLI expansion
"""
picard tool \
    CREATE_INDEX=true \
    OUTPUT_EXTENSION=null \
    ADAPTERS_TO_CHECK=INDEXED \
    ADAPTERS_TO_CHECK=FLUIDIGM
"""

A similar example for bwa:

>>> params:
>>>     v=2,
>>>     p=True
# Represented as the following CLI expansion
'bwa mem -p -v2'
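
The two conversions above can be sketched in plain Python. This is only an illustration of the mapping shown in the examples, not the wrappers' actual code: the function names `to_picard_args` and `to_bwa_flags` are hypothetical, and sorting the bwa flags alphabetically is an assumption made so the output matches the example.

```python
def picard_value(value):
    """Render one Python value the way the Picard example shows it."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def to_picard_args(params):
    """Build KEY=value tokens, repeating the key once per item of a list value."""
    tokens = []
    for key, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            tokens.append(f"{key.upper()}={picard_value(item)}")
    return tokens


def to_bwa_flags(params):
    """Build short flags: True becomes a bare switch, other values are appended."""
    tokens = []
    for key, value in sorted(params.items()):  # sorted order is an assumption
        if value is True:
            tokens.append(f"-{key}")
        elif value is not False and value is not None:
            tokens.append(f"-{key}{value}")
    return tokens


picard_tokens = to_picard_args({
    "create_index": True,
    "output_extension": None,
    "adapters_to_check": ["INDEXED", "FLUIDIGM"],
})
bwa_command = " ".join(["bwa", "mem"] + to_bwa_flags({"v": 2, "p": True}))
```

Here `picard_tokens` holds one token per line of the Picard expansion, and `bwa_command` is the single string shown in the bwa example.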

DAG:

A graph representation of this pipeline for analyzing one sample:

Resources:

All Picard tools support the following Snakemake resource objects. Boolean options must be given as the integers 0 or 1, e.g. use_async_io_read_samtools=1.

-XX:GCHeapFreeLimit={gc_heap_free_limit}
-XX:GCTimeLimit={gc_time_limit}
-Xmx{malloc}m
-Dsamjdk.buffer_size={samjdk_buffer_size}
-Dsamjdk.use_async_io_read_samtools={use_async_io_read_samtools}
-Dsamjdk.use_async_io_write_samtools={use_async_io_write_samtools}
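
As a rough sketch of how these templates might be filled in from a rule's resources, the Python below formats each JVM option from a dictionary. The resource values are hypothetical. The translation of 0/1 into false/true is an assumption about what the wrappers do, and so is the separate `use_async_io_write_samtools` resource for the write flag.

```python
# Resource names follow the list above; every value here is hypothetical.
resources = {
    "gc_heap_free_limit": 10,
    "gc_time_limit": 50,
    "malloc": 4096,
    "samjdk_buffer_size": 131072,
    "use_async_io_read_samtools": 1,
    "use_async_io_write_samtools": 0,
}

# Resources that carry a boolean encoded as 0 or 1.
BOOLEAN_RESOURCES = {
    "use_async_io_read_samtools",
    "use_async_io_write_samtools",
}

JVM_TEMPLATES = [
    "-XX:GCHeapFreeLimit={gc_heap_free_limit}",
    "-XX:GCTimeLimit={gc_time_limit}",
    "-Xmx{malloc}m",
    "-Dsamjdk.buffer_size={samjdk_buffer_size}",
    "-Dsamjdk.use_async_io_read_samtools={use_async_io_read_samtools}",
    "-Dsamjdk.use_async_io_write_samtools={use_async_io_write_samtools}",
]


def jvm_options(resources):
    """Fill each template, rendering 0/1 boolean resources as false/true."""
    rendered = {
        name: ("true" if value else "false") if name in BOOLEAN_RESOURCES else value
        for name, value in resources.items()
    }
    return [template.format(**rendered) for template in JVM_TEMPLATES]


options = jvm_options(resources)
```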

Logging:

All tasks for all samples log to ${run_output}/logs.

Notes:

Duplicates are marked but not removed.
Target intervals are assumed to be the same as the bait intervals.
Per-sample settings for bait intervals, target intervals, and reference genome would be easy to implement but are not yet supported.
A rule cannot have both dynamic and static outputs, so all compressed barcode files go untracked in the DAG.
The basecalls run input directory is hard-coded for the NextSeq platform.
The path to all task wrappers is hard-coded as they have not been pushed to the official Snakemake wrapper repository.
Resources and resource scaling on retry attempt are hard-coded and, at the moment, not configurable through a master configuration file. Resources can be limited with the CLI option --resources.
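
For readers unfamiliar with resource scaling on retry: Snakemake accepts a callable as a resource value and passes it the current `attempt` number, starting at 1. The sketch below shows the general idea only. The base value, the linear scaling, and the name `scaled_malloc` are hypothetical and do not reflect the values hard-coded in this pipeline.

```python
BASE_MALLOC_MB = 2048  # hypothetical starting heap size in megabytes


def scaled_malloc(wildcards, attempt):
    """Resource callable: request more memory on each retry of a failed job."""
    return BASE_MALLOC_MB * attempt


# In a Snakefile rule this would be referenced as:
#     resources:
#         malloc=scaled_malloc
first_try = scaled_malloc(None, attempt=1)
third_try = scaled_malloc(None, attempt=3)
```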

TODO:

These tasks are not necessary but would increase the reproducibility, portability, and ease-of-use of these pipelines: