Initial Advice

Probably the most important advice I could give right now:

Think about files first. Files are your core abstraction between pipeline steps. You have many languages already, so enforcing a standard on your inputs and outputs will help you keep existing code working while allowing you to create new code. This will become more apparent as you start to deal with more than 3 generations of Operating Systems, programming languages, and grad students.
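As a rough illustration of the "files first" idea, here is a minimal sketch of a pipeline step whose only contract is its input file and its output file. The function name `calibrate`, the column names `timestamp` and `value`, and the `gain` parameter are all hypothetical, and CSV is used only so the sketch runs with the standard library.

```python
import csv
from pathlib import Path


def calibrate(in_path: Path, out_path: Path, gain: float) -> int:
    """Read raw samples from in_path, scale them, and write them to out_path.

    The step touches nothing except its input file and its output file,
    so it can be rerun, replaced, or rewritten in another language.
    Returns the number of rows written.
    """
    rows_written = 0
    with in_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        # Hypothetical output columns; a real pipeline would document these.
        writer = csv.DictWriter(dst, fieldnames=["timestamp", "value"])
        writer.writeheader()
        for row in reader:
            writer.writerow({
                "timestamp": row["timestamp"],
                "value": float(row["value"]) * gain,
            })
            rows_written += 1
    return rows_written
```

A call such as `calibrate(Path("raw.csv"), Path("calibrated.csv"), gain=2.0)` would then be one step, and the next step would read only `calibrated.csv`. Because nothing is shared in memory, the file format is the whole interface between steps.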

Files

In general, you will have 1-3 core file formats, and 3-5 ancillary file formats (which may be unstructured).

Core Formats:

1. Raw, unprocessed data
2. Processed data into a core file format
3. Processed, calibrated, and cleaned data (often repeated)
4. File types 2-3, but in alternative processing formats for easy processing (e.g. row-oriented data to column-oriented, time domain to frequency domain, etc.)

Ancillary format types

Typically include:

1. Calibration Data
2. Provenance Data (if not included in your core formats 2 and 3)
   2a. Logs/Data Quality data
3. Configuration files

By the time you get to core format 3, you typically need provenance to determine a file's lineage and how it has been processed and reprocessed.

For Provenance, check out PROV-JSON, but storing log files centrally is a good idea too.
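One lightweight way to record lineage is a sidecar file written next to each output. The sketch below is an assumption-laden illustration, not PROV-JSON: the field names (`output`, `inputs`, `step`, `version`, `created_utc`) and the `.prov.json` suffix are made up for this example.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_provenance(output: Path, inputs: list, step: str, version: str) -> Path:
    """Write <output>.prov.json describing how `output` was produced."""
    record = {
        # Hypothetical field names; this is not the PROV-JSON vocabulary.
        "output": {"path": output.name, "sha256": sha256_of(output)},
        "inputs": [{"path": p.name, "sha256": sha256_of(p)} for p in inputs],
        "step": step,
        "version": version,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = output.with_name(output.name + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

Hashing the inputs means a reprocessed file can be traced back to the exact bytes it came from, even if files are later renamed or moved.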

File Formats:

Use the following:

HDF5 (or NetCDF)
Parquet

If using HDF5, try to come up with a structure that works easily for Python, C++, and Java users. Standardize on that structure where possible.

Custom Serializations

These are more useful if you need to communicate over a network. In general, they should be avoided for medium to long term data storage UNLESS they are extremely well documented and you can write a validator in a few languages.

Use the following:

Avro
Protobuf
Thrift
JSON/YAML (if using YAML, try not to serialize with language-specific type meta information)
XML (with a custom XML Schema Document, but only prefer this if others in your field can agree on a schema, to improve future interoperability)
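The "validator" mentioned above does not need to be elaborate. Here is a minimal sketch for a JSON-encoded record; the schema (`run_id`, `detector`, `samples`) is entirely hypothetical and stands in for whatever your format documents.

```python
import json

# Hypothetical schema: field name -> required Python type after JSON decoding.
REQUIRED_FIELDS = {
    "run_id": str,
    "detector": str,
    "samples": list,
}


def validate_record(text: str) -> list:
    """Return a list of problems found in one JSON-encoded record.

    An empty list means the record is valid.
    """
    try:
        record = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"field {name} must be {expected.__name__}")

    # Every sample must be a number; bool is a subclass of int, so reject it.
    if isinstance(record.get("samples"), list):
        for i, sample in enumerate(record["samples"]):
            if isinstance(sample, bool) or not isinstance(sample, (int, float)):
                problems.append(f"samples[{i}] is not a number")
    return problems
```

If the rules are simple enough to fit in a function like this, they are simple enough to port to a second language, which is a reasonable test of whether the format is documented well enough for long-term storage.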

Databases

When necessary, also try to use a database. Prefer Postgres over MySQL. Also prefer Postgres + JSON over MongoDB/CouchDB or other databases. When interacting with a database, try to use Python + SQLAlchemy as much as possible.

Pipeline Execution Frameworks

If you don't already, consider using a workflow/pipeline execution framework.

Prefer one of the following:

Related to your field of study and popular:

https://github.com/nipy/nipype (has support for provenance)

If the previous don't work for you, consider the following:

Packaging up your Pipeline Steps

You will need to use Docker/Singularity. Your code is already complex and deployment will be impossible in the coming years as you need to migrate to new Operating Systems, libraries and language versions; introduce new packages, and so on. Consider isolating each pipeline step as its own container, though try to keep all containers built on a common base.

Use Docker first, and always export to Singularity. Developing with Docker is much easier on Windows, Mac, and Linux than dealing with Singularity is, but the HPC centers (Sherlock, SLAC) prefer Singularity right now.

Try to come up with a single container that has most of your code installed in it. This will make it easy to add Jupyter to it and get an interactive environment to program in.

Languages

(Almost Always) Prefer Python
Consider Java if you need to touch a network for some reason or you need Spark/Hadoop sorts of things (Scala may be okay, but don't use it unless you can get two experts in your group)
C if you need to do things at a low level.
C++ only if you have to, and even then only in limited places.

If Python is too slow, consider the following:

Consider rewriting slow Python with Cython.
Consider wrapping existing C or Fortran libraries with CFFI.
Consider rewriting slow Python in C.
As a last resort, consider rewriting slow Python in C++.

You might ask why. C++ development typically raises harder questions that, even today, don't have great answers (build system, dependency and library management, slow compilation times if you use lots of templates, boost version hell, C++11 vs C++14 vs C++17, compiler versions). These issues can seriously slow down research, especially for teams composed mostly of grad students.

If you find the need to more generally adopt a faster language, potentially consider Julia (as it bridges some of the strengths of MATLAB and C++ and Python), but only choose Julia if roughly a third of your students/developers are willing to adopt it.

In general, try to shy away from MATLAB and LabVIEW, since the licensing gets to be tricky unless you are willing to buy many licenses. They are great for prototyping, but they do take work to keep things running smoothly in the long run.

Don't use Go, Rust, or Swift except if you are trying to solve a very specific problem and there's prior art. They really aren't ready for you yet.

Development Processes

Use GitHub.

Create a GitHub organization for your group. GitHub has academic discounts (education.github.com) which can give you a few more features for free.

(Create your organization first, then go here: https://education.github.com/discount_requests/new )

Try to use as few repos as possible, 1 if you can make it all work.
Try to use pull requests and code reviews.
Try to use Travis CI to check your code and build it, if possible.
Try to come up with a release process which includes tagging your repo(s) and creating container images.
Try to get two sets of eyes to glance over every line of code that's merged into the repo.
Try to use GitHub Issues for development of new features and bug fixes.

Coding is a conversation, and people should be involved in it. Everybody is busy, but everyone will become a better developer if they do some fraction of their coding collaboratively.

Programming

Python tools

Use these things when possible:

flake8
pytest
coveralls
https://landscape.io/
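For anyone who has not used pytest, a test is just a plain function containing `assert` statements. The sketch below is a made-up example: `normalize` is a hypothetical helper, and in a real project the tests would live in a file named like `test_normalize.py`.

```python
def normalize(values):
    """Scale a sequence of numbers so the largest absolute value becomes 1.0."""
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0:
        # Covers both empty input and all-zero input.
        return [0.0 for _ in values]
    return [v / peak for v in values]


# By default pytest collects functions whose names start with "test"
# from files named test_*.py or *_test.py.
def test_normalize_scales_to_unit_peak():
    assert normalize([1, -2, 4]) == [0.25, -0.5, 1.0]


def test_normalize_handles_all_zeros():
    assert normalize([0, 0]) == [0.0, 0.0]


def test_normalize_handles_empty_input():
    assert normalize([]) == []
```

Running `pytest` from the repository root discovers and runs these functions, and running `flake8` over the same files checks style, so both fit naturally into an automated check on every pull request.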

Code Formatting

Format your code in a standard way:

If you can get these to work, consider turning them on in your GitHub repositories using Travis CI. When you create a pull request, the checks run automatically, and you prohibit code from being merged into your master branch until the checks pass. I call these "Seatbelts".

Turn these on for new projects, and try to retrofit old projects for them.

Documentation

Your students should know markdown, but also Sphinx+reStructuredText.

pandoc helps a lot! You can convert between Markdown, reStructuredText, HTML, LaTeX, etc.
readthedocs

Building and testing

Each pipeline step may need its own build process.

Try to use the conventional language native processes as much as possible (setuptools/pytest, maven/gradle, make/CMake/ninja/conan). If that fails, consider a more general tool (Bazel), and if that fails, consider a more flexible tool (SCons).

Warning: Don't go to SCons first! SCons gives you a lot of power, but your build script becomes an application in itself, and must be developed and maintained like one.

Packaging and dependency management

(Python and C/C++)

Anaconda
OS libraries package managers
For Python, even if using Anaconda, try to use setuptools when possible. Wrap your setuptools scripts for Anaconda packaging (this is well-defined).

Java

Maven (stick with normal Maven pom.xml scripts, or you can try gradle - but pick one).

Web Applications:

Do them in Python or Java.

See for example: https://hplgit.github.io/web4sciapps/doc/pub/web4sa_plain_all.html

These are most useful for monitoring things.

Communications

Use Slack!

Stanford already has Slack for everyone at Stanford and SLAC. If you don't already, create a workspace for your team. Get your PIs in there as well. If you have people at SLAC that can help, you can invite them to your Slack team and discuss questions (this is almost undoubtedly more effective than email). (Requesting a new workspace: https://stanford.service-now.com/it_services?id=kb_article&sys_id=2f5fbdb8db7cdb004a8f75d88c96198f)

GitHub Gists

GitHub gists are a good way to share code with one another, especially because you can comment on them and they support a lot of formats. You can also update the code at any time.

brianv0/guidelines.md