Justin C. Bagley, September 5, 2017, Richmond, VA, USA
What I describe here is a series of steps for setting up a Linux supercomputer account for RAD-seq (e.g. ddRAD-seq, 2bRAD) analysis, essentially assuming that you have been handed a new account and are starting from scratch. Part of the narrative is given in first person, reflecting my experiences when doing this recently on the VCU CHiPC's Godel supercomputer; other parts are written in third person as straightforward procedures/advice. In all code examples that follow, "$" is the UNIX/Linux prompt; it was not typed and shouldn't be typed if following along with this Gist. Within code snippets, lines just below a line starting with the prompt, but that do not start with the prompt, are output to screen and likewise should not be typed as input. The pound sign (#) comments out the remainder of a line, allowing for comments and notes to be added; some of my instructions are given this way.
Minimally, to remotely or interactively run software applications required for standard genomics or molecular ecology analyses of RAD-seq data, around 10 types of software and related packages need to be set up (or at least checked) on the supercomputer:
- Appropriate directory structure and computing environment (I only discuss directory setup here, but $PATH and other variables also need to be set and sourced from ~/.bashrc or ~/.bash_profile)
- Compiler(s)* (GCC (GNU Compiler Collection), clang, others)
- Python and its scientific computing tools*, in this case using Miniconda (but you could also install the much bigger Anaconda release), which includes core packages like SciPy and NumPy
- Important Python packages (dependencies and genetic/bio analysis packages, up to the user)
- Perl*
- Java*, with multiple JDK and JRE releases available to the user
- R environment*
- Private and public keys for passwordless SSH access
- RAD-seq assemblers (my two favorites are pyRAD (Eaton 2014) or ipyrad (Eaton and Overcast 2016) and dDocent (Puritz et al. 2014), so those are the ones I discuss installing herein)
- Other non-Python-based software for genomic / population genetic analyses
\* = Usually installed on Linux supercomputers for all users; however, I prefer to use local installs in most cases.
Setup $HOME environment (installation locations)
First, create the standard set of places, including ~/local and ~/local/bin directories, to install to in your home folder:
$ mkdir ~/local
$ mkdir ~/local/bin
$ mkdir ~/local/include
$ mkdir ~/local/lib
$ mkdir ~/local/opt
$ mkdir ~/local/scripts
$ mkdir ~/local/share
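The same directory tree can also be created in one repeatable step. What follows is only a minimal sketch wrapping the commands above: the function name `make_local_tree` is hypothetical, and it relies on `mkdir -p`, which creates missing parent directories and does nothing if a directory already exists.

```shell
# Create the install tree under a prefix (default: ~/local).
# make_local_tree is a hypothetical helper name, not a standard command.
make_local_tree() {
    prefix="${1:-$HOME/local}"
    for sub in bin include lib opt scripts share; do
        # -p creates parents as needed and is a no-op if the directory exists
        mkdir -p "$prefix/$sub" || return 1
    done
}

# Usage:
#   make_local_tree             # builds ~/local/bin, ~/local/include, ...
#   make_local_tree /tmp/demo   # builds the same tree under another prefix
```

Because the function is safe to re-run, it can be kept in a setup script and called again whenever a new account is created.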
Fix ~/.bashrc file (instead of ~/.bash_profile) so it points to install locations
Open ~/.bashrc in nano or another editor (e.g. nano ~/.bashrc), add ~/local/bin to your $PATH environmental variable, and then source the file (source ~/.bashrc). If your Linux setup is exactly the same as mine and you've been following this Gist exactly as per above, then you can do this by adding the following lines to your ~/.bashrc file (replacing jcbagley with your own username):
export PATH=$PATH:/home/jcbagley/local
export PATH=$PATH:/home/jcbagley/local/bin
export PATH=$PATH:/home/jcbagley/local/scripts
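Each time ~/.bashrc is sourced, plain export lines like those above append the same directories to $PATH again. A minimal sketch of one way to avoid the duplicates is given here; `path_append` is a hypothetical helper name, and `$HOME` is used in place of the hard-coded /home/jcbagley so the lines work for any username.

```shell
# Append a directory to PATH only if it is not already listed.
# path_append is a hypothetical helper name.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present: do nothing
        *) PATH="$PATH:$1" ;;     # otherwise append at the end
    esac
    export PATH
}

path_append "$HOME/local/bin"
path_append "$HOME/local/scripts"
```

Wrapping $PATH in colons before matching means a directory is only treated as present when it appears as a whole entry, not as part of a longer path.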
Other stuff to look into and possibly try later (ignore for now!). URLs:
- https://wiki.gentoo.org/wiki/Project:Prefix
- https://gobolinux.org/index.html#content
- http://linuxbrew.sh/
Linux compilers will already be installed, but you can at least do a local install of GCC. Official GCC install documentation can be found here. However, here is some code for doing a local install (in this example, of GCC 4.6.2 for C, C++, Fortran, and Go) taken from the GNU GCC Wiki:
# First download the compressed tarball of the GCC distro (*.tar.gz) file from the GCC website, then do:
$ tar xzf gcc-4.6.2.tar.gz # unzip the tarball
$ cd gcc-4.6.2
$ ./contrib/download_prerequisites
$ cd ..
$ mkdir objdir
$ cd objdir
$ $PWD/../gcc-4.6.2/configure --prefix=$HOME/GCC-4.6.2 --enable-languages=c,c++,fortran,go # Here down: configure, make, install.
$ make
$ make install
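After make install, the new compiler lives under the --prefix directory and is not used until the shell can find it. Here is a minimal sketch of that last step, assuming the $HOME/GCC-4.6.2 prefix from the configure line above; `use_local_gcc` is a hypothetical helper name, and whether the runtime libraries end up in lib, lib64, or both depends on the platform.

```shell
# Put a locally installed GCC first on PATH and expose its runtime libraries.
# use_local_gcc is a hypothetical helper; the default prefix matches the
# --prefix value used in the configure command above.
use_local_gcc() {
    gcc_prefix="${1:-$HOME/GCC-4.6.2}"
    if [ ! -d "$gcc_prefix/bin" ]; then
        echo "no bin/ directory under $gcc_prefix" >&2
        return 1
    fi
    export PATH="$gcc_prefix/bin:$PATH"
    # Runtime libraries (e.g. libstdc++) may be in lib, lib64, or both
    for libdir in "$gcc_prefix/lib" "$gcc_prefix/lib64"; do
        if [ -d "$libdir" ]; then
            export LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
        fi
    done
}

# Usage: use_local_gcc && gcc --version
```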
For Miniconda, Anaconda, BEAST1 and BEAST2, BEAGLE lib, and other software, I needed a version of at least Python v2.7++ (if not v3.5++); however, VCU's Godel supercomputer comes equipped with Python v2.4.8 on CentOS v5.1.1, and this was a problem for me. So, I downloaded two suitable versions of Python 2.7 to get the interpreter on my system. I recommend that all users do local installs of Python 2.7, as this is very useful and more customizable than the global installs that will be present on the supercomputer already.
For Python 2.7.2:
$ cd ~
$ wget https://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
$ tar -xvzf Python-2.7.2.tgz
$ cd Python-2.7.2
$ ./configure --prefix=$HOME/local
$ make && make install
For Python 2.7.13:
$ cd ~
$ wget https://www.python.org/ftp/python/2.7.13/Python-2.7.13.tgz
$ tar -xvzf Python-2.7.13.tgz
$ cd Python-2.7.13
$ ./configure --prefix=$HOME/local
$ make && make install
Use alias to fix call to python2.7 instead of any other python versions installed
This applies when higher or lower versions come installed by default. In my case, our supercomputer, Godel, comes with python v2.4.8, so I need to set a python alias that points to v2.7. To make the alias, open the ~/.bashrc file and add the line alias python="python2.7".
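It is worth confirming that the interpreter you end up calling really meets the v2.7 minimum. The sketch below is one way to check; both function names are hypothetical, `version_at_least` assumes GNU sort with the -V (version sort) option, and the interpreter is called by name because bash does not expand aliases in non-interactive scripts by default.

```shell
# Succeed if version $1 is >= version $2 (dotted versions, e.g. 2.7.13).
# version_at_least is a hypothetical helper; it assumes GNU sort with -V.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Print the version number of a Python interpreter.
# Python 2 writes its version banner to stderr, hence the 2>&1.
python_version() {
    "$1" --version 2>&1 | awk '{print $2}'
}

# Usage (in a shell where python2.7 is on your PATH):
#   v=$(python_version python2.7)
#   if version_at_least "$v" 2.7; then echo "OK: $v"; else echo "too old: $v"; fi
```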
After conducting the above steps, you will now have an interpreter etc. for Python v2.7. Next, you will need pycosat and PyYAML to make sure you have everything in terms of dependencies prior to installing Miniconda. Here's what I have personally tried for these:
Downloading and installing pycosat from https://pypi.python.org/pypi/pycosat:
$ cd ~
$ wget https://pypi.python.org/packages/76/0f/16edae7bc75b79376f2c260b7a459829785f08e463ecf74a8ccdef62dd4a/pycosat-0.6.1.tar.gz#md5=c1fc35b17865f5f992595ae0362f9f9f --no-check-certificate
$ tar -xvzf pycosat-0.6.1.tar.gz
$ cd pycosat-0.6.1
$ python setup.py install --prefix=$HOME/local
Downloading and installing PyYAML from https://pypi.python.org/pypi/PyYAML:
$ cd ~
$ wget https://pypi.python.org/packages/4a/85/db5a2df477072b2902b0eb892feb37d88ac635d36245a72a6a69b23b383a/PyYAML-3.12.tar.gz#md5=4c129761b661d181ebf7ff4eb2d79950 --no-check-certificate
$ tar -xvzf PyYAML-3.12.tar.gz
$ cd PyYAML-3.12
$ python setup.py install --prefix=$HOME/local
Note: Here is some legacy Python module installation information that will be helpful to any/all beginners: https://docs.python.org/2/install/index.html
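The #md5=... fragment at the end of each PyPI URL above records the expected MD5 checksum of the tarball. Since the downloads are made with --no-check-certificate, comparing the checksum before unpacking is a reasonable extra precaution. This is a minimal sketch, assuming GNU md5sum is available; `verify_md5` is a hypothetical helper name.

```shell
# Compare a file's MD5 checksum with an expected value.
# verify_md5 is a hypothetical helper; it assumes GNU md5sum is available.
verify_md5() {
    actual=$(md5sum "$1" | awk '{print $1}')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual, expected $2)" >&2
        return 1
    fi
}

# Usage, with the checksums taken from the download URLs above:
#   verify_md5 pycosat-0.6.1.tar.gz c1fc35b17865f5f992595ae0362f9f9f
#   verify_md5 PyYAML-3.12.tar.gz   4c129761b661d181ebf7ff4eb2d79950
```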
What follows are instructions for installing miniconda2 with the interactive installer; the conda documentation also describes a "silent install" option (https://conda.io/docs/user-guide/install/linux.html#install-linux-silent).
$ cd ~
$ wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
$ bash Miniconda2-latest-Linux-x86_64.sh
# Next, hit <ENTER>, then continue holding the <ENTER> key to scroll down through the license agreement...
# Then when you get to the end, answer "yes" to the "Do you approve the license terms? [yes|no]..." prompt.
# The next prompt will say "Miniconda2 will now be installed into this location:". If the listed location
# is OK with you, then press ENTER again.
# As Miniconda was installing, it told me the name of the PREFIX it used; this should match the location
# listed just above.
After following these steps, Miniconda will echo lots of information to screen, for example:
installing: python-2.7.13-0 ...
installing: asn1crypto-0.22.0-py27_0 ...
installing: cffi-1.10.0-py27_0 ...
installing: conda-env-2.6.0-0 ...
installing: cryptography-1.8.1-py27_0 ...
installing: enum34-1.1.6-py27_0 ...
installing: idna-2.5-py27_0 ...
installing: ipaddress-1.0.18-py27_0 ...
installing: libffi-3.2.1-1 ...
installing: openssl-1.0.2l-0 ...
installing: packaging-16.8-py27_0 ...
installing: pycosat-0.6.2-py27_0 ...
installing: pycparser-2.17-py27_0 ...
installing: pyopenssl-17.0.0-py27_0 ...
installing: pyparsing-2.1.4-py27_0 ...
installing: readline-6.2-2 ...
installing: requests-2.14.2-py27_0 ...
installing: ruamel_yaml-0.11.14-py27_1 ...
installing: setuptools-27.2.0-py27_0 ...
installing: six-1.10.0-py27_0 ...
installing: sqlite-3.13.0-0 ...
.
.
.
## (so on...)
This is normal, and installing all of these parts of conda will take time. So, now would be a good moment to go make a coffee or take a break!
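The same installer can also be run without any prompts. Here is a minimal sketch of that, assuming the installer's documented -b (batch mode, which implies acceptance of the license terms) and -p (install prefix) options; `install_miniconda_silent` is a hypothetical wrapper name, and the default prefix is only an example.

```shell
# Run a Miniconda installer without prompts.
#   -b  batch mode: no prompts; the license terms are taken as accepted
#   -p  install prefix; the installer expects this directory not to exist yet
# install_miniconda_silent is a hypothetical helper name.
install_miniconda_silent() {
    installer="$1"
    conda_prefix="${2:-$HOME/miniconda2}"
    if [ -e "$conda_prefix" ]; then
        echo "refusing to install: $conda_prefix already exists" >&2
        return 1
    fi
    bash "$installer" -b -p "$conda_prefix"
}

# Usage: install_miniconda_silent Miniconda2-latest-Linux-x86_64.sh
```

In batch mode the installer does not edit ~/.bashrc, so the $PATH step that follows is still needed.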
Next, add the location of miniconda2 to your $PATH environmental variable. I did this by adding the following lines to ~/.bashrc on my account and then sourcing ~/.bashrc:
# added for Miniconda2 4.3.21
export PATH="/home/jcbagley/miniconda2/bin:$PATH"
Next, restart your command line ssh session, or open another Terminal/cli window and log in to the supercomputing account again, before proceeding.
The first things you should do with conda are check your version (conda --version) and run conda update conda to make sure everything is up to date.
Stop now and take a few moments to look up, download, and install any other Python packages that you are interested in using now, before moving on to the next step.
Perl is currently up to version 5++ and can be downloaded here if needed; however, downloading and installing Perl is likely to be unnecessary, given that it is virtually always installed on supercomputing clusters and made available for all users. We should, however, check the presence, location, and version of Perl on the system. Do this by logging in and typing which perl (and perl -v for the version). The first command should return the path /usr/bin/perl or something very similar to screen. If nothing is output to screen, then Perl is not installed, so you will have to install it for your user account. First, download the latest source code here, then follow typical Linux installation (unzip, configure, make, install). Alternatively, you can do this using curl by opening a terminal and typing curl -L http://xrl.us/installperlnix | bash (as advised here).
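The presence check described above can be scripted so that it gives a clear answer either way. A minimal sketch follows; `report_tool` is a hypothetical helper name, and it uses the shell's built-in `command -v` rather than which.

```shell
# Report where a program is found on PATH, or say that it is missing.
# report_tool is a hypothetical helper name.
report_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        echo "$1: $tool_path"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Usage:
#   report_tool perl && perl -v | head -n 2   # location, then version banner
```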
Java will probably already be installed on the supercomputer but, as with our dealings with Python above, you will benefit from having complete control over Java by adding installs specific to your user account and controlling them with code placed in your ~/.bashrc or ~/.bash_profile files. See instructions for downloading and installing Java JDKs and JREs for Linux here.
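As an illustration of the kind of code that could go in ~/.bashrc for this, here is a minimal sketch of a switcher between several local Java installs. The function name `use_java` and the example directory are hypothetical; JAVA_HOME is the conventional variable many Java tools consult.

```shell
# Point the current shell at one particular Java install.
# use_java is a hypothetical helper; example paths below are placeholders.
use_java() {
    if [ ! -e "$1/bin/java" ]; then
        echo "no bin/java under $1" >&2
        return 1
    fi
    export JAVA_HOME="$1"
    export PATH="$JAVA_HOME/bin:$PATH"
}

# Usage, assuming a JDK was unpacked under ~/local/opt (hypothetical path):
#   use_java "$HOME/local/opt/jdk1.8.0" && java -version
```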
The R environment should be installed already, and in the case of R a global install will be fine because packages will be set up to automatically be installed locally to your user account if R was set up correctly. You should check this by logging into the supercomputer from a terminal, typing R, and pressing enter. If R runs, great. Now attempt to use install.packages("ape") to install the APE phylogenetics package, and see if this successfully results in a local install (if APE is already installed, obviously try a different package for this test). If R is not found or packages cannot be installed, do your own local install of R.
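If R runs but package installs fail for lack of a writable library, one thing to try before a full local install is giving R a per-user library. This is a minimal sketch, assuming R's R_LIBS_USER environment variable; `setup_r_user_lib` is a hypothetical helper name and the default path is a placeholder.

```shell
# Create a per-user R package library and tell R to use it.
# setup_r_user_lib is a hypothetical helper; the default path is a placeholder.
setup_r_user_lib() {
    r_lib="${1:-$HOME/local/lib/R/library}"
    mkdir -p "$r_lib" || return 1     # R ignores library paths that do not exist
    export R_LIBS_USER="$r_lib"
}

# Usage:
#   setup_r_user_lib
#   Rscript -e 'install.packages("ape", repos = "https://cloud.r-project.org")'
```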
The following steps set up passwordless ssh access, allowing you to ssh onto your supercomputer account or specific nodes without entering your username and password. This is critical for hassle-free interactions such as remotely queuing runs or doing secure copy. In code examples below, replace "<username>" with the username associated with your account on the supercomputer host.
- Make public and private keys on user's local macOS Sierra/UNIX machine:
$ ssh-keygen -t rsa -b 2048 # the scp step below assumes the key was saved as ~/.ssh/godel_rsa
- Create the ~/.ssh directory in $HOME on supercomputer account:
$ cd ~; mkdir .ssh
- Use secure copy to move user's public key from Mac onto the supercomputer, so that it can be added to the authorized_keys file (in my case, in /home/<username>/.ssh):
$ scp ~/.ssh/godel_rsa.pub <username>@godel.vcu.edu:.ssh/temp.pub
- Log in to supercomputer account, change authorized_keys file permissions, then cat your public key to add it to authorized_keys:
$ ssh <username>@godel.vcu.edu # login to supercomputer
$ chmod u+w ~/.ssh/authorized_keys # make sure you have write permission on the file
$ cat ~/.ssh/temp.pub >> ~/.ssh/authorized_keys
- On my machine, all supercomputers that have been logged into through the command line are automatically added to ~/.ssh/known_hosts; however, open this file to make sure that a host line (often the last line) starting with the name of the supercomputer system or node is present:
$ nano ~/.ssh/known_hosts
- Set alias for login, so that only one word needs to be typed at the command line to access the supercomputer account: open the local ~/.bash_profile file and add the line alias godel="ssh <username>@godel.vcu.edu"
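The steps carried out on the supercomputer (creating ~/.ssh and appending the public key) can be made repeatable. The sketch below is one way to do so; `install_pubkey` is a hypothetical helper name, and the 700/600 permissions reflect the fact that sshd typically refuses keys when these files are writable by other users.

```shell
# Append a public key to authorized_keys once, with conservative permissions.
# install_pubkey is a hypothetical helper, meant to be run on the supercomputer.
install_pubkey() {
    pubkey_file="$1"
    ssh_dir="${2:-$HOME/.ssh}"
    auth="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
    touch "$auth" && chmod 600 "$auth" || return 1
    # -x whole-line match, -F fixed string, -f read the pattern from the key file
    if ! grep -qxF -f "$pubkey_file" "$auth"; then
        cat "$pubkey_file" >> "$auth"
    fi
}

# Usage (after the scp step above): install_pubkey ~/.ssh/temp.pub
```

Checking for the key before appending means that running the helper twice does not leave duplicate lines in authorized_keys.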
Here is a list of preferred online tutorials for setting up passwordless ssh access from Mac OS X to a remote, Linux-based supercomputing cluster:
- https://www.msi.umn.edu/support/faq/how-do-i-setup-ssh-keys
- https://coolestguidesontheplanet.com/make-passwordless-ssh-connection-osx-10-9-mavericks-linux/
- https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/
If you're going to be analyzing RAD-seq data, then of course you need to install genomic/SNP data assemblers capable of creating assemblies of contigs and doing SNP calling and data filtering. I recommend that you install two of my favorites: (1) pyRAD (Eaton 2014) or ipyrad (Eaton and Overcast 2016) and (2) dDocent (Puritz et al. 2014).
Now that you have conda installed and running, it is possible to quickly and easily install a range of software; the main programs I am interested in are pyRAD (Eaton 2014) and ipyrad (Eaton and Overcast 2016).
Here are the Linux instructions for installing ipyrad using conda, copied verbatim from here:
$ conda install -c ipyrad ipyrad ## installs the latest release
Running this will list a large number of new packages that need to be installed and ask you if you want to install them. Confirm you wish to proceed by typing "y", then wait for all of the installs to finish. To me, it is most notable that, on first install on a clean supercomputer account (where we expect little to be available a priori), this installs many important packages/libraries such as libgcc*, libgfortran, libiconv, mpich2, numpy, sphinx, ipython, and jupyter. The last package to be installed will be ipyrad, as follows:
.
.
.
pyqt-5.6.0-py2 100% |###########################################################################################################################| Time: 0:00:01 4.67 MB/s
ipyparallel-6. 100% |###########################################################################################################################| Time: 0:00:00 4.73 MB/s
jupyter_consol 100% |###########################################################################################################################| Time: 0:00:00 1.04 MB/s
notebook-5.0.0 100% |###########################################################################################################################| Time: 0:00:01 4.27 MB/s
qtconsole-4.3. 100% |###########################################################################################################################| Time: 0:00:00 4.79 MB/s
widgetsnbexten 100% |###########################################################################################################################| Time: 0:00:00 9.80 MB/s
ipywidgets-6.0 100% |###########################################################################################################################| Time: 0:00:00 1.46 MB/s
jupyter-1.0.0- 100% |###########################################################################################################################| Time: 0:00:00 117.04 kB/s
ipyrad-0.7.13- 100% |###########################################################################################################################| Time: 0:00:06 3.28 MB/s
Now, with python2.7 and conda installed, we can also relatively easily install dDocent (Puritz et al. 2014). Below, I provide instructions for doing this, copied essentially verbatim from the Bioconda install methods described here.
## Add the bioconda channel:
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels r
conda config --add channels bioconda
## Create a dDocent conda environment:
conda create -n ddocent_env ddocent
## Activate the dDocent environment:
source activate ddocent_env
## Run dDocent:
dDocent
## Close the environment when you’re done:
source deactivate
Last, install any other non-Python software that you think you'll need. Of course, this is totally up to you, the user.
Good luck! E-mail me at jcbagley (at) vcu.edu if you have any questions.
Cheers ~J
- Eaton DAR (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 30, 1844-1849.
- Eaton DAR, Overcast I (2016) ipyrad: interactive assembly and analysis of RADseq data sets. Available at: http://ipyrad.readthedocs.io/.
- Puritz JB, Hollenbeck CM, Gold JR (2014) dDocent: a RADseq, variant-calling pipeline designed for population genomics of non-model organisms. PeerJ, 2, e431.