@clintval
Last active December 10, 2019 21:03
A terrible Exome Pipeline in Snakemake

Exome Pipeline

QC Illumina flowcell, demultiplex, and QC mapped BAM

Installation · Usage · Future Improvements


Installation

The main pipeline, multi-sample-run-level-pipeline.Snakefile, depends on bwa, Picard, Python 3.6+, and R.


Usage

❯ snakemake \
    -s multi-sample-run-level-pipeline.Snakefile \
    --cores ${max_cores} \
    --rerun-incomplete \
    --restart-times ${retry_times} \
    --config \
        run_folder=${run_folder} \
        sample_sheet=${sample_sheet} \
        reference=${reference} \
        bait_intervals=${bait_intervals} \
        run_output=${run_output}
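
The five `--config` keys above reach the Snakefile as entries of Snakemake's `config` dictionary. As a rough sketch, assuming all five are required, they could be checked before any rule runs. The helper name `missing_config_keys` is hypothetical and is not part of this pipeline:

```python
# Hypothetical helper, not part of the pipeline: reports which of the
# --config keys from the usage example are absent or empty.
REQUIRED_KEYS = (
    "run_folder",
    "sample_sheet",
    "reference",
    "bait_intervals",
    "run_output",
)


def missing_config_keys(config):
    """Return the required keys that are missing or empty in `config`."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]
```

Inside a Snakefile, something like `missing_config_keys(config)` could be called near the top and an error raised if the returned list is not empty.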

Tool Argument Conversion:

All wrapped tools convert Python parameter values into tool-specific command-line arguments on the fly.

For example, the following conversion is applied for a typical Picard tool:

>>> params:
>>>     create_index=True,
>>>     output_extension=None,
>>>     adapters_to_check=['INDEXED', 'FLUIDIGM']
# Represented as the following CLI expansion
"""
picard tool \
    CREATE_INDEX=true \
    OUTPUT_EXTENSION=null \
    ADAPTERS_TO_CHECK=INDEXED \
    ADAPTERS_TO_CHECK=FLUIDIGM
"""
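
The wrapper code that performs this conversion is not shown here, so the following is only a minimal sketch of the mapping the example implies: keys are upper-cased, `True`/`False` become `true`/`false`, `None` becomes `null`, and a list repeats the option once per element. The function name `picard_args` is hypothetical:

```python
def picard_args(params):
    """Convert a mapping of Python values into Picard-style KEY=value arguments.

    Illustrative only; the real wrappers may differ in details.
    """
    args = []
    for key, value in params.items():
        name = key.upper()
        # A list or tuple repeats the option once per element.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if item is None:
                text = "null"
            elif isinstance(item, bool):
                text = str(item).lower()
            else:
                text = str(item)
            args.append(f"{name}={text}")
    return args
```

Applied to the parameters above, this returns `CREATE_INDEX=true`, `OUTPUT_EXTENSION=null`, `ADAPTERS_TO_CHECK=INDEXED`, and `ADAPTERS_TO_CHECK=FLUIDIGM`, in that order.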

A similar example for bwa:

>>> params:
>>>     v=2,
>>>     p=True
# Represented as the following CLI expansion
'bwa mem -p -v2'
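
A comparable sketch for the bwa-style flags, with the same caveat that the real wrapper code is not shown. It assumes that `True` yields a bare flag, that `False` and `None` drop the flag entirely, that other values are attached directly to the flag, and that keys are emitted in alphabetical order (which happens to reproduce the order in the example). The function name `bwa_args` is hypothetical:

```python
def bwa_args(params):
    """Convert a mapping of Python values into short single-dash flags.

    Illustrative only; ordering and the handling of False/None are assumptions.
    """
    args = []
    for key, value in sorted(params.items()):
        if value is True:
            args.append(f"-{key}")         # boolean True: bare flag
        elif value is False or value is None:
            continue                       # assumption: flag is omitted
        else:
            args.append(f"-{key}{value}")  # value attached, e.g. -v2
    return args
```

With the parameters above, `" ".join(["bwa", "mem", *bwa_args({"v": 2, "p": True})])` gives `bwa mem -p -v2`.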

DAG:

A graph representation of this pipeline for analyzing one sample:

[Figure: DAG for one sample]

Resources:

All Picard tools support the following Snakemake resources, each of which is substituted into the JVM option shown. Boolean options must be given as the integers 0 or 1, e.g. use_async_io_read_samtools=1.

  • -XX:GCHeapFreeLimit={gc_heap_free_limit}
  • -XX:GCTimeLimit={gc_time_limit}
  • -Xmx{malloc}m
  • -Dsamjdk.buffer_size={samjdk_buffer_size}
  • -Dsamjdk.use_async_io_read_samtools={use_async_io_read_samtools}
  • -Dsamjdk.use_async_io_write_samtools={use_async_io_write_samtools}
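
The substitution above can be sketched as a simple template fill. This is an illustration under assumptions, not the wrappers' actual code: the name `jvm_options` is hypothetical, the write option is assumed to take its own `use_async_io_write_samtools` resource, values are substituted verbatim, and resources that are not set are assumed to be skipped:

```python
# Resource name -> JVM option template, in the order listed above.
# The resource name used for the async-write option is an assumption.
JVM_TEMPLATES = (
    ("gc_heap_free_limit", "-XX:GCHeapFreeLimit={}"),
    ("gc_time_limit", "-XX:GCTimeLimit={}"),
    ("malloc", "-Xmx{}m"),
    ("samjdk_buffer_size", "-Dsamjdk.buffer_size={}"),
    ("use_async_io_read_samtools", "-Dsamjdk.use_async_io_read_samtools={}"),
    ("use_async_io_write_samtools", "-Dsamjdk.use_async_io_write_samtools={}"),
)


def jvm_options(resources):
    """Render the JVM options for whichever resources are present."""
    return [
        template.format(resources[name])
        for name, template in JVM_TEMPLATES
        if name in resources  # assumption: unset resources are skipped
    ]
```

For example, `jvm_options({"malloc": 4096, "use_async_io_read_samtools": 1})` yields `-Xmx4096m` followed by `-Dsamjdk.use_async_io_read_samtools=1`.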

Logging:

All tasks for all samples log to ${run_output}/logs.

Notes:

  • Duplicates are marked but not removed.
  • Target intervals are assumed to be the same as the bait intervals.
  • Per-sample bait, target, and reference genome settings would be easy to implement but are not supported yet.
  • A rule cannot have both dynamic and static outputs, so all compressed barcode files go untracked in the DAG.
  • The basecalls run input directory is hard-coded for the NextSeq platform.
  • The path to all task wrappers is hard-coded as they have not been pushed to the official Snakemake wrapper repository.
  • Resources, and how they scale on each retry attempt, are hard-coded and not yet configurable through a master configuration file. Resource usage can still be capped with the CLI option --resources.
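
On the last point: Snakemake accepts a callable as a resource value and passes it the `attempt` number, starting at 1, which is the usual way to scale resources on retry. The pipeline's actual hard-coded scaling is not shown here, so the sketch below assumes plain linear scaling with an optional cap. The factory name `scaled_resource` and its parameters are hypothetical:

```python
def scaled_resource(base, cap=None):
    """Build a resource callable that grows linearly with the attempt number.

    Illustrative only: linear scaling and the optional cap are assumptions.
    """
    def resource(wildcards, attempt):
        value = base * attempt  # attempt is 1 on the first try
        return min(value, cap) if cap is not None else value

    return resource
```

In a rule this might be used as `resources: malloc=scaled_resource(4096, cap=16384)`, so that a job retried under `--restart-times` requests more memory on each attempt.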

TODO:

These tasks are not necessary but would increase the reproducibility, portability, and ease-of-use of these pipelines:

  • Use a tagged commit from the master branch of the snakemake_wrappers GitHub repository
  • Comment assignments and methods of this script (pre-rule definitions)
  • Split pre-alignment, alignment, and post-alignment into sub-pipelines
  • Mark unmapped and mapped.raw as temp()
  • Abstract away some config settings to a YAML file
  • Use click to provide a CLI for all pipelines
  • Write setup.py installer script
  • Support environment.yml in all wrappers
  • Write tests in all wrappers
  • Lookup bait_intervals, target_intervals, and reference for each sample (flowcells with mixed sample sets)
  • Deploy only using Miniconda and Snakemake