Probably the most important advice I could give right now:
Think about files first. Files are your core abstraction between pipeline steps. You have many languages already, so enforcing a standard on your inputs and outputs will help you keep existing code working while allowing you to create new code. This will become more apparent as you start to deal with more than 3 generations of Operating Systems, programming languages, and grad students.
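As a rough illustration of files acting as the interface between steps, here is a minimal sketch in Python. The step names (`calibrate`, `summarize`), the file names, and the CSV/JSON layouts are all hypothetical; the point is only that each step reads one file and writes another, so either step could later be rewritten in a different language without touching the other.

```python
import csv
import json
import tempfile
from pathlib import Path


def calibrate(raw_path: Path, out_path: Path, gain: float) -> None:
    """Step 1: read raw CSV samples, apply a gain, write a calibrated CSV."""
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["t", "value"])
        writer.writeheader()
        for row in reader:
            writer.writerow({"t": row["t"], "value": float(row["value"]) * gain})


def summarize(calibrated_path: Path, out_path: Path) -> None:
    """Step 2: read the calibrated CSV and write a small JSON summary."""
    with calibrated_path.open(newline="") as src:
        values = [float(row["value"]) for row in csv.DictReader(src)]
    summary = {"count": len(values), "mean": sum(values) / len(values)}
    out_path.write_text(json.dumps(summary, indent=2))


def run_pipeline(workdir: Path) -> dict:
    """Chain the two steps; they only share files, never in-memory objects."""
    raw = workdir / "raw.csv"
    raw.write_text("t,value\n0,1.0\n1,2.0\n2,3.0\n")  # made-up sample data
    calibrated = workdir / "calibrated.csv"
    summary = workdir / "summary.json"
    calibrate(raw, calibrated, gain=2.0)
    summarize(calibrated, summary)
    return json.loads(summary.read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(Path(tmp)))
```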
In general, you will have 1-3 core file formats, and 3-5 ancillary file formats (which may be unstructured).
1. Raw, unprocessed data
2. Processed data into a core file format
3. Processed, calibrated, and cleaned data (often repeated)
4. File types 2-3, but in alternative processing formats for easy processing (e.g. row oriented data to column oriented, time domain to frequency domain, etc.)
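To make the row-oriented versus column-oriented distinction concrete, here is a small sketch using plain Python containers. The function names are hypothetical, and it assumes every row has the same keys; in practice a columnar format would do this for you.

```python
def rows_to_columns(rows):
    """Convert a list of same-keyed dicts (row oriented) into a dict of lists."""
    if not rows:
        return {}
    keys = list(rows[0])
    for row in rows:
        if set(row) != set(keys):
            raise ValueError("all rows must have the same keys")
    return {key: [row[key] for row in rows] for key in keys}


def columns_to_rows(columns):
    """Convert a dict of equal-length lists (column oriented) back into rows."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]
```

The column layout makes it cheap to read one field across all records, while the row layout makes it cheap to read one whole record.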
Ancillary files typically include:
- Calibration Data
- Provenance Data (if not included in your core formats 2 and 3)
  - Logs/Data Quality data
- Configuration files
By the time you get to file type 3, you typically need provenance to determine how a file has been processed and reprocessed, and its lineage.
For Provenance, check out PROV-JSON, but storing log files centrally is a good idea too.
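As a sketch of what recording lineage can look like, the snippet below writes a JSON sidecar file next to each output. This is a simplified, made-up record layout, not the PROV-JSON schema; the field names, the `.prov.json` suffix, and the function names are all assumptions for illustration.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_provenance(output: Path, inputs: list, step: str, params: dict) -> Path:
    """Write a sidecar describing which step and inputs produced `output`."""
    record = {
        "step": step,  # hypothetical field names, not PROV-JSON
        "params": params,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "inputs": [{"path": str(p), "sha256": sha256_of(p)} for p in inputs],
        "output": {"path": str(output), "sha256": sha256_of(output)},
    }
    sidecar = output.with_name(output.name + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

Because each record stores the hashes of its inputs, you can walk backwards from a cleaned file to the raw data it came from, even after files have been renamed or moved.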
Use the following:
- HDF5 (or NetCDF)
- Parquet

If using HDF5, try to come up with a structure that works easily for Python, C++, and Java users. Standardize on that structure where possible.
The serialization formats listed below are more useful if you need to communicate over a network. In general, they should be avoided for medium to long term data storage UNLESS they are extremely well documented and you can write a validator in a few languages.
Use the following:
- Avro
- Protobuf
- Thrift
- JSON/YAML (if using YAML, try not to serialize with language-specific type meta information)
- XML (with a custom XML Schema Document, but only prefer this if others in your field can agree on a schema, to improve future interoperability)
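To give a feel for what "you can write a validator" means, here is a minimal sketch for a JSON record using only the Python standard library. The required fields (`run_id`, `detector`, `sample_rate_hz`, `samples`) are invented for the example; your own format would define its own.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "run_id": str,
    "detector": str,
    "sample_rate_hz": (int, float),
    "samples": list,
}


def validate_record(text: str) -> list:
    """Return a list of problems; an empty list means the record is valid."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            problems.append(f"wrong type for field: {name}")
    return problems
```

If a check like this is hard to write down, or hard to port to a second language, that is a sign the format is not documented well enough for long term storage.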
When necessary, also try to use a database. Prefer Postgres over MySQL. Also prefer Postgres + JSON over MongoDB/CouchDB or other databases. When interacting with a database, try to use Python + SQLAlchemy as much as possible.
If you don't already, consider using a workflow/pipeline execution framework.
Prefer one of the following:
- Airflow
- Pegasus
- https://www.nextflow.io
Related to your field of study and popular:
- https://github.com/nipy/nipype (has support for provenance)
If the previous don't work for you, consider the following:
- https://github.com/scipipe/scipipe/
- https://nifi.apache.org/
- https://github.com/treasure-data/digdag/
You will need to use Docker/Singularity. Your code is already complex, and deployment will become impossible in the coming years as you migrate to new operating systems, libraries, and language versions, introduce new packages, and so on. Consider isolating each pipeline step as its own container, though try to keep all containers built on a common base.
Use Docker first, and always export to Singularity. Developing with Docker is much easier on Windows, Mac, and Linux than dealing with Singularity is, but the HPC centers (Sherlock, SLAC) prefer Singularity right now.
Try to come up with a single container that has most of your code installed in it. This will make it easy to add Jupyter to it and get an interactive environment to program in.
- (Almost always) prefer Python.
- Consider Java if you need to touch a network for some reason or you need Spark/Hadoop sorts of things (Scala may be okay, but don't use it unless you can get two experts in your group).
- C if you need to do things at a low level.
- C++ only if you have to, and even then only in limited places.
If Python is too slow, consider the following:
- Consider rewriting slow Python with Cython.
- Consider wrapping existing C or Fortran libraries with CFFI.
- Consider rewriting slow Python in C.
- As a last resort, consider rewriting slow Python in C++.
You might ask why. C++ development typically raises harder questions that, even today, don't have great answers (build system, dependency and library management, slow compilation times if you use lots of templates, Boost version hell, C++11 vs C++14 vs C++17, compiler versions). These issues can seriously slow down research, especially for teams composed mostly of grad students.
If you find the need to more generally adopt a faster language, potentially consider Julia (as it bridges some of the strengths of MATLAB, C++, and Python), but only choose Julia if roughly a third of your students/developers are willing to adopt it.
In general, try to shy away from MATLAB and LabVIEW, since the licensing gets to be tricky unless you are willing to buy many licenses. They are great for prototyping, but they do take work to keep things running smoothly in the long run.
Don't use Go, Rust, or Swift except if you are trying to solve a very specific problem and there's prior art. They really aren't ready for you yet.
Use github.
Use github.
Use github.
Create a Github organization for your group. Github has academic discounts (education.github.com) which can give you a few more features for free.
(Create your organization first, then go here: https://education.github.com/discount_requests/new )
- Try to use as few repos as possible, 1 if you can make it all work.
- Try to use pull requests and code reviews.
- Try to use Travis CI to check your code and build it, if possible.
- Try to come up with a release process which includes tagging your repo(s) and creating container images.
- Try to get two sets of eyes to glance over every line of code that's merged into the repo.
- Try to use Github Issues for development of new features and bug fixes.
Coding is a conversation, and people should be involved in it. Everybody is busy, but everybody will grow to be better developers if everyone can do some fraction of their coding collaboratively.
Use these things when possible:
- flake8
- pytest
- coveralls
- https://landscape.io/
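As a rough idea of what a pytest-style test looks like, here is a sketch with a made-up helper (`moving_average`) and three tests. pytest collects functions whose names start with `test_`; the last test uses a plain `try`/`except` so the snippet has no third-party imports, though with pytest installed you would more commonly use `pytest.raises`.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` samples."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def test_moving_average_basic():
    assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]


def test_moving_average_window_longer_than_input():
    # No full window fits, so the result is empty.
    assert moving_average([1, 2], 5) == []


def test_moving_average_rejects_bad_window():
    try:
        moving_average([1, 2], 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```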
Format your code in a standard way:
- https://github.com/ambv/black (Python)
- https://github.com/google/google-java-format, https://github.com/diffplug/spotless (Java)
- https://clang.llvm.org/docs/ClangFormatStyleOptions.html (C/C++)
If you can get these to work, consider turning them on in your GitHub repositories using Travis CI. When you create a pull request, the checks run automatically, and code cannot be merged into your master branch until the checks pass. I call these "Seatbelts".
Turn these on for new projects, and try to retrofit old projects for them.
Your students should know markdown, but also Sphinx+reStructuredText.
- pandoc helps a lot! You can convert things to markdown/reStructuredText/HTML/LaTeX/etc.
- readthedocs
Each pipeline step may need its own build process.
Try to use the conventional language native processes as much as possible (setuptools/pytest, maven/gradle, make/CMake/ninja/conan). If that fails, consider a more general tool (Bazel), and if that fails, consider a more flexible tool (SCons).
Warning: Don't go to SCons first! SCons gives you a lot of power, but your build script becomes an application in itself, and must be developed and maintained like one.
- Anaconda
- OS libraries package managers

For Python, even if using Anaconda, try to use setuptools when possible. Wrap your setuptools scripts for Anaconda packaging (this is well-defined).
For Java, use Maven (stick with normal Maven pom.xml scripts, or you can try Gradle, but pick one).
Do them in Python or Java.
See for example: https://hplgit.github.io/web4sciapps/doc/pub/web4sa_plain_all.html
These are most useful for monitoring things.
Stanford already has Slack for everyone at Stanford and SLAC. If you don't already, create a workspace for your team. Get your PIs in there as well. If you have people at SLAC that can help, you can invite them to your Slack team and discuss questions (this is almost undoubtedly more effective than email). (Requesting a new workspace: https://stanford.service-now.com/it_services?id=kb_article&sys_id=2f5fbdb8db7cdb004a8f75d88c96198f)
GitHub gists are a good way to share code with one another, especially because you can comment on them and they support a lot of formats. You can also update the code at any time.