How To Setup A Linux Supercomputer Account for RAD-seq Analysis

Justin C. Bagley, September 5, 2017, Richmond, VA, USA

What I describe here is a series of steps for setting up a Linux supercomputer account for RAD-seq (e.g. ddRAD-seq, 2bRAD) analysis, essentially assuming that you have been handed a new account and are starting from scratch. Part of the narrative is given in first person, reflecting my experiences when doing this recently on the VCU CHiPC's Godel supercomputer; other parts are written in third person as straightforward procedures/advice. In the code examples that follow, a leading "$" is the UNIX/Linux prompt; this was not typed and shouldn't be typed if following along with this Gist. Within code snippets, lines just below a line starting with the prompt, but that do not start with the prompt, are output to screen and likewise should not be typed as input. The pound sign comments out the remainder of a line, allowing for comments and notes to be added; some of my instructions are given this way.

Minimally, to remotely or interactively run software applications required for standard genomics or molecular ecology analyses of RAD-seq data, around 10 types of software and related packages need to be set up (or at least checked) on the supercomputer:

Appropriate directory structure and computing environment (I only discuss directory setup here, but $PATH and other variables also need to be set and sourced from ~/.bashrc or ~/.bash\_profile)
Compiler(s)* (GCC (GNU Compiler Collection), clang, others)
Python and its scientific computing tools*, in this case using Miniconda (but you could also install the much bigger Anaconda release), which includes core packages like SciPy and NumPy
Important Python packages (dependencies and genetic/bio analysis packages, up to the user).
Perl*
Java*, with multiple JDK and JRE releases available to the user
R environment*
Private and public keys for passwordless SSH access
RAD-seq assemblers (my two favorites are pyRAD (Eaton 2014) or ipyrad (Eaton and Overcast 2016) and dDocent (Puritz et al. 2014), so those are the ones I discuss installing herein).
Other non-Python-based software for genomic / population genetic analyses

\* = Usually installed on Linux supercomputers for all users; however, I prefer to use local installs in most cases.

1. DIRECTORY STRUCTURE SETUP

`$HOME` environment

Setup $HOME environment (installation locations)

First, create the standard set of places, including ~/local and ~/local/bin directories, to install to in your home folder:

$ mkdir ~/local
$ mkdir ~/local/bin
$ mkdir ~/local/include
$ mkdir ~/local/lib
$ mkdir ~/local/opt
$ mkdir ~/local/scripts
$ mkdir ~/local/share
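
The same directory tree can also be created in one pass with mkdir -p, which creates any missing parent directories and does not complain if a directory already exists. What follows is only a sketch; make_local_dirs is a hypothetical helper name introduced here for illustration, not part of any standard tool:

```shell
# make_local_dirs PREFIX: create PREFIX plus the standard install
# subdirectories used in this Gist. (Hypothetical helper name.)
make_local_dirs() {
    prefix=$1
    for sub in bin include lib opt scripts share; do
        # -p creates missing parents and is a no-op for existing directories
        mkdir -p "$prefix/$sub" || return 1
    done
}
```

Calling make_local_dirs "$HOME/local" should reproduce the layout created by the individual mkdir commands above, and it is safe to run more than once.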

Fix ~/.bashrc file (instead of ~/.bash_profile) so it points to install locations

Open ~/.bashrc in nano or another editor (e.g. nano ~/.bashrc), add ~/local/bin to your $PATH environmental variable, and then source the file (source ~/.bashrc). If your Linux setup is exactly the same as mine and you've been following this Gist exactly as per above, then you can do this by adding the following lines to your ~/.bashrc file (replacing jcbagley with your own username):

export PATH=$PATH:/home/jcbagley/local
export PATH=$PATH:/home/jcbagley/local/bin
export PATH=$PATH:/home/jcbagley/local/scripts
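
If you would rather not hard-code a username, the lines above can be written in terms of $HOME. The sketch below also guards against adding the same directory twice, which otherwise happens every time ~/.bashrc is re-sourced. The function name path_append is my own hypothetical helper, not a shell builtin:

```shell
# path_append DIR: append DIR to PATH only if it is not already present.
# (Hypothetical helper; could be placed in ~/.bashrc.)
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already on PATH; do nothing
        *) PATH="$PATH:$1" ;;     # otherwise append to the end
    esac
}

path_append "$HOME/local"
path_append "$HOME/local/bin"
path_append "$HOME/local/scripts"
export PATH
```

Because the directories are appended rather than prepended, system-wide programs of the same name still take precedence, which matches the behavior of the export lines above.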

Other stuff to look into and possibly try later (ignore for now!). URLs:

2. COMPILERS

Linux compilers will already be installed, but you can at least do a local install of GCC. Official GCC install documentation can be found here. However, here is some code for doing a local install (in this example, of GCC 4.6.2 for C, C++, Fortran, and Go) taken from the GNU GCC Wiki:

# First download the compressed tarball of the GCC distro (*.tar.gz) file from the GCC website, then do:
$ tar xzf gcc-4.6.2.tar.gz	# unzip the tarball
$ cd gcc-4.6.2
$ ./contrib/download_prerequisites
$ cd ..
$ mkdir objdir
$ cd objdir
$ $PWD/../gcc-4.6.2/configure --prefix=$HOME/GCC-4.6.2 --enable-languages=c,c++,fortran,go	# Here down: configure, make, install.
$ make
$ make install
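
Once the build finishes, the new compiler still has to be found ahead of the system one. The following is a minimal sketch of lines that could go in ~/.bashrc, assuming the same --prefix as in the configure step above; GCC_PREFIX is a variable name I made up for this example, and the lib64 directory is an assumption that applies to typical 64-bit Linux builds:

```shell
# Must match the --prefix passed to configure above (assumed here).
GCC_PREFIX="$HOME/GCC-4.6.2"

# Put the locally built compiler first on PATH.
export PATH="$GCC_PREFIX/bin:$PATH"

# Let programs built with it find its runtime libraries (e.g. libstdc++).
# The ${VAR:+...} form avoids a trailing colon when the variable is unset.
export LD_LIBRARY_PATH="$GCC_PREFIX/lib64:$GCC_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Report which compiler was installed, if any.
if [ -x "$GCC_PREFIX/bin/gcc" ]; then
    "$GCC_PREFIX/bin/gcc" --version
else
    echo "No gcc found under $GCC_PREFIX; did 'make install' finish?"
fi
```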

3. PYTHON v2.7 AND MINICONDA

For Miniconda, Anaconda, BEAST1 and BEAST2, BEAGLE lib, and other software, I needed a version of at least Python v2.7++ (if not v3.5++); however, VCU's Godel supercomputer comes equipped with Python v2.4.8 on CentOS v5.1.1, and this was a problem for me. So, I downloaded two suitable versions of Python 2.7 to get the interpreter on my system. I recommend that all users do local installs of Python 2.7, as this is very useful and more customizable than the global installs that will be present on the supercomputer already.

For Python 2.7.2:

$ cd ~
$ wget https://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
$ tar -xvzf Python-2.7.2.tgz
$ cd Python-2.7.2
$ ./configure --prefix=$HOME/local
$ make && make install

For Python 2.7.13:

$ cd ~
$ wget https://www.python.org/ftp/python/2.7.13/Python-2.7.13.tgz
$ tar -xvzf Python-2.7.13.tgz
$ cd Python-2.7.13
$ ./configure --prefix=$HOME/local
$ make && make install

Use alias to fix call to python2.7 instead of any other python versions installed

This applies when higher or lower versions come installed by default. In my case, our supercomputer, Godel, comes with python v2.4.8, so I need to set a python alias that points to v2.7. To make the alias, open the ~/.bashrc file and add the line alias python="python2.7".
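
As a small sketch of the alias step, the lines below define the alias and then check that a python2.7 executable can actually be found. Note that aliases are normally only expanded in interactive shells, so this is meant for ~/.bashrc rather than for batch scripts:

```shell
# Line to add to ~/.bashrc: make "python" mean the local 2.7 install.
alias python="python2.7"

# Print the alias definition to confirm it was registered.
alias python

# Confirm that python2.7 itself is reachable via PATH.
command -v python2.7 || echo "python2.7 is not on PATH yet; check ~/local/bin"
```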

MINICONDA

DEPENDENCIES

After conducting the above steps, you will now have an interpreter etc. for Python v2.7. Next, you will need pycosat and PyYAML to make sure you have everything in terms of dependencies prior to installing Miniconda. Here's what I have personally tried for these:

Downloading and installing pycosat from https://pypi.python.org/pypi/pycosat:

$ cd ~
$ wget https://pypi.python.org/packages/76/0f/16edae7bc75b79376f2c260b7a459829785f08e463ecf74a8ccdef62dd4a/pycosat-0.6.1.tar.gz#md5=c1fc35b17865f5f992595ae0362f9f9f --no-check-certificate
$ tar -xvzf pycosat-0.6.1.tar.gz
$ cd pycosat-0.6.1
$ python setup.py install --prefix=$HOME/local

Downloading and installing PyYAML from https://pypi.python.org/pypi/PyYAML:

$ cd ~
$ wget https://pypi.python.org/packages/4a/85/db5a2df477072b2902b0eb892feb37d88ac635d36245a72a6a69b23b383a/PyYAML-3.12.tar.gz#md5=4c129761b661d181ebf7ff4eb2d79950 --no-check-certificate
$ tar -xvzf PyYAML-3.12.tar.gz
$ cd PyYAML-3.12
$ python setup.py install --prefix=$HOME/local

Note: Here is some legacy Python module installation information that will be helpful to any/all beginners: https://docs.python.org/2/install/index.html

INSTALLING MINICONDA

What follows are instructions for the interactive install of miniconda2; the conda documentation also describes a "silent install" option (https://conda.io/docs/user-guide/install/linux.html#install-linux-silent).

$ cd ~
$ wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
$ bash Miniconda2-latest-Linux-x86_64.sh

# Next, hit <ENTER>, then continue holding the <ENTER> key to scroll down through the license agreement...
# Then when you get to the end, answer "yes" to the "Do you approve the license terms? [yes|no]..." prompt.
# The next prompt will say "Miniconda2 will now be installed into this location:". If the listed location
# is OK with you, then press ENTER again.
# As Miniconda was installing, it told me the name of the PREFIX it used; this should match the location
# listed just above.

After following these steps, Miniconda will echo lots of information to screen, for example:

installing: python-2.7.13-0 ...
installing: asn1crypto-0.22.0-py27_0 ...
installing: cffi-1.10.0-py27_0 ...
installing: conda-env-2.6.0-0 ...
installing: cryptography-1.8.1-py27_0 ...
installing: enum34-1.1.6-py27_0 ...
installing: idna-2.5-py27_0 ...
installing: ipaddress-1.0.18-py27_0 ...
installing: libffi-3.2.1-1 ...
installing: openssl-1.0.2l-0 ...
installing: packaging-16.8-py27_0 ...
installing: pycosat-0.6.2-py27_0 ...
installing: pycparser-2.17-py27_0 ...
installing: pyopenssl-17.0.0-py27_0 ...
installing: pyparsing-2.1.4-py27_0 ...
installing: readline-6.2-2 ...
installing: requests-2.14.2-py27_0 ...
installing: ruamel_yaml-0.11.14-py27_1 ...
installing: setuptools-27.2.0-py27_0 ...
installing: six-1.10.0-py27_0 ...
installing: sqlite-3.13.0-0 ...
.
.
.
## (so on...)

This is normal, and installing all of these parts of conda will take time. So, now would be a good moment to go make a coffee or take a break!
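
For reference, the same installer script can also be run without any prompts. The sketch below wraps that in a function; install_miniconda_silent is a hypothetical helper name, while -b (batch mode, which assumes you agree to the license terms) and -p (install prefix) are options of the Miniconda installer script itself:

```shell
# install_miniconda_silent INSTALLER PREFIX
# Run a Miniconda installer script in batch mode. (Hypothetical helper.)
install_miniconda_silent() {
    installer=$1
    prefix=$2
    if [ ! -f "$installer" ]; then
        echo "installer not found: $installer" >&2
        return 1
    fi
    # -b: batch mode, no prompts (license terms are assumed to be accepted)
    # -p: directory to install into
    bash "$installer" -b -p "$prefix"
}
```

For example, install_miniconda_silent ~/Miniconda2-latest-Linux-x86_64.sh "$HOME/miniconda2" would install to the same location used in this Gist. In batch mode the installer does not edit your shell startup files, so $PATH still has to be set by hand.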

Next, add location of miniconda2 to your $PATH environmental variable. I did this by adding the following lines to ~/.bashrc on my account and then sourcing ~/.bashrc:

# added for Miniconda2 4.3.21         
export PATH="/home/jcbagley/miniconda2/bin:$PATH"

Next, restart your command line ssh session (log out and back in), or open another Terminal/cli window and log in to the supercomputing account again, before proceeding.

The first things you should do with conda are check your version and run conda update conda to make sure everything is up to date.
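
A minimal sketch of that first check follows. The function name update_conda is hypothetical; conda --version and conda update are standard conda commands, and --yes simply answers the confirmation prompt in advance:

```shell
# update_conda: print the conda version, then update conda itself.
# (Hypothetical helper name.)
update_conda() {
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH; re-source ~/.bashrc or log in again" >&2
        return 1
    fi
    conda --version            # report the installed version
    conda update --yes conda   # update conda without prompting
}
```

If the function reports that conda is not found, the most likely cause is that the $PATH line for miniconda2 has not been sourced in the current session.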

4. OTHER PYTHON PACKAGES

Stop now and take a few moments to look up, download, and install any other Python packages that you are interested in using now, before moving on to the next step.

5. PERL

Perl is currently up to version 5++ and can be downloaded here if needed; however, downloading and installing Perl is likely to be unnecessary, given that it is virtually always installed on supercomputing clusters and made available for all users. We should, however, check the presence, location, and version of Perl on the system. Do this by logging in and typing which perl. This should return the path /usr/bin/perl or something very similar to screen. If nothing is output to screen, then Perl is not installed so you will have to install it for your user account. First, download the latest source code here, then follow typical Linux installation (unzip, configure, make, install). Alternatively, you can do this using curl by opening a terminal and typing curl -L http://xrl.us/installperlnix | bash (as advised here).
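
The presence, location, and version checks described above can be sketched in a few lines. The function name check_perl is my own; command -v is used here as a portable stand-in for which:

```shell
# check_perl: report where Perl lives and which version it is.
# (Hypothetical helper name.)
check_perl() {
    if perl_path=$(command -v perl); then
        echo "Perl found at: $perl_path"
        # %vd formats the version object $^V as dotted numbers
        perl -e 'printf "Perl version: %vd\n", $^V'
    else
        echo "Perl not found on PATH; a local install is needed"
    fi
}

check_perl
```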

6. JAVA

Java will probably already be installed on the supercomputer but, as with our dealings with Python above, you will benefit from having complete control over Java by adding installs specific to your user account and controlling them with code placed in your ~/.bashrc or ~/.bash\_profile files. See instructions for downloading and installing Java JDKs and JREs for Linux here.

7. R ENVIRONMENT

The R environment should be installed already, and in the case of R a global install will be fine because packages will be set up to automatically be installed locally to your user account if R was set up correctly. You should check this by logging into the supercomputer from a terminal, typing $ R and pressing enter. If R runs, great. Now attempt to use install.packages("ape") to install the APE phylogenetics package, and see if this successfully results in a local install (if APE is already installed, obviously try a different package for this test). If R is not found or packages cannot be installed, do your own local install of R.
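
Before trying a package install, it can help to see which R is on your $PATH and where it will put packages. This is a rough sketch under the assumption that Rscript is installed alongside R; check_r is a hypothetical helper name:

```shell
# check_r: print the R version and the library search paths.
# (Hypothetical helper name.)
check_r() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "R not found on PATH; consider a local install" >&2
        return 1
    fi
    # R.version.string and .libPaths() are base R; the first library
    # path listed is where install.packages() installs by default.
    Rscript -e 'cat(R.version.string, "\n"); print(.libPaths())'
}
```

If the first path printed is inside your home directory, package installs should land in your user account rather than in the system-wide library.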

8. PASSWORDLESS SSH ACCESS

The following steps set up passwordless ssh access, allowing you to ssh onto your supercomputer account or specific nodes without entering your username and password. This is critical for hassle-free interactions such as remotely queuing runs or doing secure copy. In code examples below, replace "<username>" with the username associated with your account on the supercomputer host.

Make public and private keys on the user's local macOS Sierra/UNIX machine:
- $ ssh-keygen -t rsa -b 2048 -f ~/.ssh/godel_rsa # writes godel_rsa (private key) and godel_rsa.pub (public key)
Create the ~/.ssh directory in $HOME on the supercomputer account:
- $ cd ~; mkdir .ssh
Use secure copy to move the user's public key from the Mac into the ~/.ssh directory on the supercomputer (in my case, /home/<username>/.ssh) as a temporary file:
- $ scp ~/.ssh/godel_rsa.pub <username>@godel.vcu.edu:.ssh/temp.pub
Log in to the supercomputer account, make sure the authorized_keys file is writable, then cat your public key to add it to authorized_keys:
- $ ssh <username>@godel.vcu.edu # login to supercomputer
- $ chmod u+w ~/.ssh/authorized_keys # add write permission for your user (skip if the file does not exist yet)
- $ cat ~/.ssh/temp.pub >> ~/.ssh/authorized_keys
On my machine, all supercomputers that have been logged into through the command line are automatically added to ~/.ssh/known\_hosts; however, open this file to make sure that a host line (often the last line) starting with the name of the supercomputer system or node is present.
- open the known\_hosts file with $ nano ~/.ssh/known\_hosts
Set an alias for login, so that only one word needs to be typed at the command line to access the supercomputer account:
- open the local ~/.bash\_profile file and add the line alias godel="ssh -i ~/.ssh/godel_rsa <username>@godel.vcu.edu"
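
The server-side part of the steps above (creating ~/.ssh, appending the key, and setting permissions) can be sketched as a single function to run on the supercomputer. The name install_pubkey is hypothetical. The permission values are the ones sshd commonly insists on: key-based login is typically refused when ~/.ssh or authorized_keys is writable by other users.

```shell
# install_pubkey PUBKEY_FILE [SSH_DIR]
# Append a public key to authorized_keys, skipping it if already present.
# (Hypothetical helper name; SSH_DIR defaults to ~/.ssh.)
install_pubkey() {
    pubkey_file=$1
    ssh_dir=${2:-$HOME/.ssh}

    # Create the directory if needed and restrict it to the owner.
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$ssh_dir/authorized_keys"

    # -x whole-line match, -F fixed string, -f read the pattern from the file
    if ! grep -qxF -f "$pubkey_file" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi

    # Only the owner may read or write the key list.
    chmod 600 "$ssh_dir/authorized_keys"
}
```

With the file names used above, the call would be install_pubkey ~/.ssh/temp.pub, after which temp.pub can be deleted.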

Online Resources:

Here is a list of preferred online tutorials for setting up passwordless ssh access from Mac OS X to a remote, Linux-based supercomputing cluster:

9. ASSEMBLERS

If you're going to be analyzing RAD-seq data, then of course you need to install genomic/SNP data assemblers capable of creating assemblies of contigs and doing SNP calling and data filtering. I recommend that you install two of my favorites: (1) pyRAD (Eaton 2014) or ipyrad (Eaton and Overcast 2016) and (2) dDocent (Puritz et al. 2014).

PYRAD / IPYRAD

Now that you have conda installed and running, it is possible to quickly and easily install a range of software. The main program I am interested in is pyRAD (Eaton 2014) or its successor ipyrad (Eaton and Overcast 2016).

Here are the Linux instructions for installing ipyrad using conda, copied verbatim from here:

$ conda install -c ipyrad ipyrad     ## installs the latest release

Running this will list a large number of new packages that need to be installed and ask you if you want to install them. Confirm you wish to proceed by typing "y", then wait for all of the installs to finish. To me, it is most notable that, on first install on a clean supercomputer account (where we expect little to be available a priori), this installs many important packages/libraries such as libgcc*, libgfortran, libiconv, mpich2, numpy, sphinx, ipython, and jupyter. The last of these will be ipyrad, as follows:

.
.
.
pyqt-5.6.0-py2 100% |###########################################################################################################################| Time: 0:00:01   4.67 MB/s
ipyparallel-6. 100% |###########################################################################################################################| Time: 0:00:00   4.73 MB/s
jupyter_consol 100% |###########################################################################################################################| Time: 0:00:00   1.04 MB/s
notebook-5.0.0 100% |###########################################################################################################################| Time: 0:00:01   4.27 MB/s
qtconsole-4.3. 100% |###########################################################################################################################| Time: 0:00:00   4.79 MB/s
widgetsnbexten 100% |###########################################################################################################################| Time: 0:00:00   9.80 MB/s
ipywidgets-6.0 100% |###########################################################################################################################| Time: 0:00:00   1.46 MB/s
jupyter-1.0.0- 100% |###########################################################################################################################| Time: 0:00:00 117.04 kB/s
ipyrad-0.7.13- 100% |###########################################################################################################################| Time: 0:00:06   3.28 MB/s

DDOCENT

Now, with python2.7 and conda installed, we can also relatively easily install dDocent (Puritz et al. 2014). Below, I provide instructions for doing this, copied essentially verbatim from the Bioconda install methods described here.

## Add the bioconda channel:
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels r
conda config --add channels bioconda

## Create a dDocent conda environment:
conda create -n ddocent_env ddocent

## Activate the dDocent environment:
source activate ddocent_env

## Run dDocent:
dDocent

## Close the environment when you’re done:
source deactivate

10. OTHER NON-PYTHON-BASED SOFTWARE

Last, install any other non-Python software that you think you'll need. Of course, this is totally up to you, the user.

Good luck! E-mail me at jcbagley (at) vcu.edu if you have any questions.

Cheers ~J

REFERENCES

Eaton DAR (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 30, 1844-1849.
Eaton DAR, Overcast I (2016) ipyrad: interactive assembly and analysis of RADseq data sets. Available at: http://ipyrad.readthedocs.io/.
Puritz JB, Hollenbeck CM, Gold JR (2014) dDocent: a RADseq, variant-calling pipeline designed for population genomics of non-model organisms. PeerJ, 2, e431.