@gdlmx
Last active April 26, 2023 16:55
Scientific workflow management systems

List

Requirements

  1. Dynamic workflow
  2. Environment encapsulation
  3. Graphical editor
  4. HPC batch system support: SGE, LSF, HTCondor, ...
  5. Web integration (curation, sharing, administrative tools)

Libraries & Projects

  • Collective Knowledge (CK), an open-source framework and repository to enable sustainable, collaborative and reproducible research and development to share artifacts as reusable and customizable components with a unified Python JSON API; assemble portable and customizable experimental workflows
  • Toil, scalable, efficient, cross-platform and easy-to-use workflow engine in pure Python
  • DAGMan, meta-scheduler for HTCondor
  • Makeflow, Makefile-like workflow system; HTCondor supported
  • Airflow, Web-based workflow system
  • Nextflow, data-driven computational pipelines; enables scalable and reproducible scientific workflows using software containers and allows the adaptation of pipelines written in the most common scripting languages. HTCondor supported.

Systems

  • Apache Airavata, a general purpose workflow management system
  • Apache Taverna, widely used in bioinformatics, astronomy, biodiversity
  • Arvados, a platform for production data science with very large data sets
  • Askalon, Cloud and Grid Application Development and Computing Environment (dead?)
  • Galaxy, Web-based platform (repo) for biomedical research, initially targeted at genomics.
  • Pegasus, Workflow Management System
  • VisTrails, a scientific workflow system developed in Python
  • Cluster Flow, a pipelining tool to automate and standardise bioinformatics analyses on cluster environments.

Standards

  • CWL, Common Workflow Language, a specification for describing analysis workflows and tools in a way that makes them portable and scalable, from workstations to cluster, cloud, and HPC environments. CWL is designed to meet the needs of data-intensive science.
  • GA4GH Task Execution Schema (TES), a standardized schema and API for describing batch execution tasks

Workflow formats / languages

  • Cuneiform, a workflow language that combines the strong points of functional programming languages, distributed databases, and workflow management systems. It lets you run data-intensive analysis workloads with minimal effort and no lock-in.
  • Workflow Description Language (WDL), a human-readable and -writable way to express tasks and workflows. WDL describes tasks with abstract commands that have inputs. An abstract command is a template in which parts of the command are left for the user to fill in. Once tasks are defined, WDL allows you to construct a workflow from them.
  • YAWL (Yet Another Workflow Language), BPM/Workflow system, based on a concise and powerful modelling language, that handles complex data transformations, and full integration with organizational resources and external Web Services.
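
The "abstract command" idea in the WDL entry above can be illustrated with a small Python sketch. This is an analogy only, not WDL syntax and not how any WDL engine is implemented; the names `ABSTRACT_COMMAND`, `render_task`, and `build_workflow`, and the example commands and file names, are hypothetical and chosen for illustration.

```python
from string import Template

# Hypothetical abstract command: a template whose ${...} parts are task inputs.
ABSTRACT_COMMAND = Template("grep ${pattern} ${infile} > ${outfile}")


def render_task(command, inputs):
    """Fill every placeholder in the template; raises KeyError if an input is missing."""
    return command.substitute(inputs)


def build_workflow(steps):
    """Render an ordered list of (template, inputs) pairs into concrete command lines."""
    return [render_task(command, inputs) for command, inputs in steps]


# Two tasks chained by file name: the second reads what the first writes.
workflow = build_workflow([
    (ABSTRACT_COMMAND, {"pattern": "error", "infile": "run.log", "outfile": "errors.txt"}),
    (Template("wc -l ${infile}"), {"infile": "errors.txt"}),
])

for line in workflow:
    print(line)
```

The sketch only renders command strings; a real workflow language additionally tracks the dependencies between tasks and hands execution to a backend.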

Support of different workflow systems

System    Format
-------   -----------------------------------
Taverna   Scufl
Askalon   AGWL
Pegasus   DAX (Directed Acyclic Graph in XML)
CWL       YAML
Galaxy    JSON

Lectures
