@matsen
Last active October 1, 2015 12:19

Response to "Science Drivers Requiring Capable Exascale High Performance Computing" RFI

We can piggyback on the software development community.

Many good things are happening in open source and in industry, and we face many of the same issues they do. For example, GitHub has provided enormous value to science, both by filling a need and through direct engagement, and it has become almost unbelievably popular in the computational life sciences. However, other tools, such as continuous integration (for example, Travis CI) and containers (for example, Docker), have gained less traction despite what they could offer the scientific community.

We need a strategy to fight bit-rot.

"Bit-rot" refers to software/pipelines that become unusable because the underlying dependencies have changed. Sometimes the old versions disappear completely, meaning that old pipelines cannot be reconstructed without digging through the internet archive. Reproducibility is fundamental to science, and thus this problem is acute.

There is a clear antidote to this problem: software containers. These are lightweight, virtual-machine-like environments that can run on a variety of platforms. Docker is the best known, and the community (including Docker) is coalescing around a standard; see https://www.opencontainers.org/ for details. Using containers, all dependencies are frozen inside the image, and the pipeline can be run reliably into the future.
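
There is no single way to wire this into a pipeline, but the following Python sketch shows the core idea under stated assumptions: the image name `example/phylo-pipeline`, its digest, and the tool `my_tool` are hypothetical placeholders, and the container is launched through the standard `docker run` CLI.

```python
"""Minimal sketch: running one pipeline step inside a pinned container image.

The image name and digest below are hypothetical placeholders; the point is
that pinning by content digest (rather than a mutable tag such as `latest`)
freezes every dependency, so the step can be re-run unchanged years later.
"""
import os
import subprocess

# Hypothetical image, pinned by an immutable content digest rather than a tag.
IMAGE = ("example/phylo-pipeline@sha256:"
         "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef")


def run_step(command):
    """Run one command inside the pinned container.

    The current directory is mounted at /data for inputs and outputs, while
    every software dependency comes from the frozen image.
    """
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", os.getcwd() + ":/data",
         "-w", "/data",
         IMAGE] + command,
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical tool baked into the image.
    run_step(["my_tool", "--input", "reads.fastq", "--output", "results.txt"])
```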

For NIH computational strategies, we need to make sure that any proposed expansion of computing can run software containers. This is not an entirely trivial technical consideration: containers have traditionally assumed root-level access and a long-running daemon, which shared HPC systems are reluctant to grant.

The iPlant Collaborative is already doing a visionary job.

As you are no doubt aware, the iPlant Collaborative environment (http://www.iplantcollaborative.org/) and TACC under the leadership of Dan Stanzione are a remarkable example of smart people, given a chunk of NSF cash, producing an excellent product. Their Agave API (http://agaveapi.co/) points the way to the future.

Parallel architectures require parallel algorithms.

This is self-explanatory. In my field of Bayesian phylogenetics, for example, all algorithms in common use rely on Markov chain Monte Carlo (MCMC), which is inherently serial. If we are to use large-scale architectures, we are going to need algorithms appropriate for those architectures.
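
To make the serial bottleneck concrete, here is a toy Metropolis-Hastings sampler in Python; this is a generic textbook sketch rather than any particular phylogenetics implementation. Each iteration's proposal and accept/reject decision depend on the state produced by the previous iteration, so a single chain cannot simply be sliced across processors the way independent likelihood evaluations can.

```python
"""Toy Metropolis-Hastings sampler targeting a standard normal distribution.

Generic textbook sketch, not tied to any phylogenetics package.  Note the
loop-carried dependence: iteration t needs the state produced by iteration
t-1, which is why a single chain is inherently serial.
"""
import math
import random


def log_target(x):
    # Unnormalized log-density of a standard normal target.
    return -0.5 * x * x


def metropolis_hastings(n_iter, step=1.0, seed=0):
    rng = random.Random(seed)
    x = 0.0          # current state of the chain
    samples = []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)        # depends on the current state
        log_ratio = log_target(proposal) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal                           # next state depends on this outcome
        samples.append(x)
    return samples


if __name__ == "__main__":
    draws = metropolis_hastings(10_000)
    print("estimated mean:", sum(draws) / len(draws))   # should be near 0
```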

Algorithms can give >100-fold improvement without additional infrastructure.

As we have seen recently with the development of kallisto by Bray et al., algorithms can change problems from requiring a cluster to being quite doable on a laptop. [Note: I understand that running kallisto isn't the same as doing a full analysis with Cufflinks, etc., but for many common applications it appears to do a fine job.] Thus, I hope that novel algorithm development for core computational problems will be part of any investment in computing infrastructure.
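
To give a flavor of the kind of algorithmic shortcut at work, the toy Python sketch below indexes transcripts by k-mer and assigns a read by intersecting the transcript sets of its k-mers, with no base-level alignment. This is a deliberate caricature of the pseudoalignment idea, using made-up sequences and a tiny k; kallisto's actual index (a de Bruijn graph of the transcriptome with equivalence classes) is considerably more sophisticated.

```python
"""Toy k-mer lookup illustrating the spirit of pseudoalignment.

Caricature only: real indices use k around 31 and far cleverer data
structures; this sketch just shows why hashing k-mers and intersecting
transcript sets is so much cheaper than computing full alignments.
"""
from collections import defaultdict

K = 5  # toy k-mer size


def kmers(seq, k=K):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(transcripts):
    """Map each k-mer to the set of transcript ids containing it."""
    index = defaultdict(set)
    for tid, seq in transcripts.items():
        for km in kmers(seq):
            index[km].add(tid)
    return index


def assign_read(read, index):
    """Intersect the transcript sets of the read's k-mers; no base-level
    alignment is performed."""
    compatible = None
    for km in kmers(read):
        hits = index.get(km, set())
        compatible = hits if compatible is None else compatible & hits
        if not compatible:
            break
    return compatible or set()


if __name__ == "__main__":
    # Tiny made-up transcripts for illustration.
    transcripts = {
        "t1": "ACGTACGTGGA",
        "t2": "TTACGTACGTA",
        "t3": "GGGGCCCCAAA",
    }
    index = build_index(transcripts)
    print(assign_read("ACGTACGT", index))  # expect {'t1', 't2'}
```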
