orthofinder on magnus/zeus/zythos

Install dlcpar:

wget https://www.cs.hmc.edu/~yjw/software/dlcpar/pub/sw/dlcpar-1.0.tar.gz
tar xzvf dlcpar-1.0.tar.gz

Then, inside the new folder dlcpar-1.0, make a local installation of dlcpar in your home directory:

python setup.py install --user

Git clone orthofinder somewhere:

git clone https://github.com/davidemms/OrthoFinder.git

First run (on magnus or zeus)

#!/bin/bash -l

#SBATCH --job-name=orthoF
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=24:00:00
#SBATCH --clusters=zeus
#SBATCH --partition=workq
#SBATCH --account=pawsey0149
# Example --mem=1G
#SBATCH --mem=20G

module load scipy
module load mcl
export PATH=/home/pbayer/.local/bin/:$PATH # has dlcpar
export PATH=`pwd`:$PATH

python OrthoFinder/orthofinder/orthofinder.py -f Sequences/ -op > STDOUT 2> STDERR

The folder Sequences/ contains all proteomes in fasta format, one file per species.
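
For illustration, that layout is simply one protein fasta per species; the file names here are made up, and OrthoFinder uses them as species labels:

ls Sequences/
# Arabidopsis_thaliana.fa  Brassica_napus.fa  Brassica_oleracea.fa  ...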

This will prepare all input sequences, run a whole bunch of makeblastdb commands, and print all BLAST jobs to be run at the end of STDOUT. It takes a few hours depending on the size of the input data (roughly an hour with 37 proteomes). Take those commands and put them into a job array, as sketched below.
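
A rough sketch of such a job array follows. Here blast_commands.txt is a hypothetical file holding the printed BLAST commands, one per line (copied from the end of STDOUT); the array size and the BLAST module name are assumptions you'd adjust to your own run:

#!/bin/bash -l
#SBATCH --job-name=orthoBlast
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=24:00:00
#SBATCH --clusters=zeus
#SBATCH --partition=workq
#SBATCH --account=pawsey0149
#SBATCH --array=1-1369          # 37 proteomes = 37*37 all-vs-all BLAST jobs

module load blast               # assumption: check the actual module name on your cluster

# pick the command matching this array task (one BLAST command per line)
CMD=$(sed -n "${SLURM_ARRAY_TASK_ID}p" blast_commands.txt)
eval "$CMD"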


BIG CAVEATS HERE:

  1. OrthoFinder runs a few test commands at the beginning to check whether everything works properly, and it stops with an error if a program prints anything to STDERR. Sadly, on magnus blastp prints this:

    blastp: /usr/lib64/libidn.so.11: no version information available (required by blastp)

It's not a real error, but it's enough for OrthoFinder to stop. To circumvent this, use the Python version cloned from OrthoFinder's repository and change this code in scripts/util.py:

def CanRunCommand(command, qAllowStderr = False, qPrint = True):

to

def CanRunCommand(command, qAllowStderr = True, qPrint = True):

That allows output on STDERR, as the parameter name says.
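
If you don't want to edit the file by hand, a one-liner like this should do the same (the path assumes the clone from above; double-check it first):

sed -i 's/qAllowStderr = False/qAllowStderr = True/' OrthoFinder/orthofinder/scripts/util.py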

  2. You should also clean up your input fasta files before running this step. For example, makeblastdb prints a warning for every invalid amino acid character in your input data, so every stop codon encoded as '*' or '.' triggers an error message. The problem is that OrthoFinder keeps all these error messages in memory, so if you have many, the first step runs much, much longer. Replace stop codons and other oddities first; see the sketch below.
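
A minimal cleanup sketch, assuming your proteomes end in .fasta; this edits the files in place, so keep a backup:

for f in Sequences/*.fasta; do
    sed -i '/^>/! s/[*.]//g' "$f"   # strip '*' and '.' from sequence lines, leave headers alone
done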

After your BLAST runs have finished, you can run OrthoFinder like this on zeus or zythos:

#!/bin/bash -l
#SBATCH --job-name=orthoF
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=48:00:00
#SBATCH --clusters=zeus
#SBATCH --partition=zythos
#SBATCH --account=pawsey0149
# Example --mem=1G
#SBATCH --mem=700G

#SBATCH --export=NONE 

echo "========================================="
echo "SLURM_JOB_ID = $SLURM_JOB_ID"
echo "SLURM_NODELIST = $SLURM_NODELIST"
echo "========================================="

module load scipy
module load mcl
export PATH=/home/pbayer/.local/bin/:$PATH
export PATH=`pwd`:$PATH

python OrthoFinder/orthofinder/orthofinder.py -a 16  -b Sequences/Results_Jun07_1/WorkingDirectory/ > STDOUT 2> STDERR

Obviously, change the directory given to the -b flag to yours; it has to contain the BLAST results. -a sets the number of CPUs: the more you use, the faster it runs, but you will run into crashes once it runs out of memory.


CAVEATS HERE:

  1. I ran into this error message:

davidemms/OrthoFinder#61

Following the fix there (increasing the recursion depth) made it work for me.
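
For reference, raising the recursion depth is a two-line change in Python; where exactly to put it is described in the issue, and the limit value here is just an example:

python -c 'import sys; print sys.getrecursionlimit()'   # default is usually 1000

# then add near the top of the affected OrthoFinder script:
#   import sys
#   sys.setrecursionlimit(10000)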

  2. I ran this with 700GB of memory and 37 proteomes, and the phylogenies for the three biggest orthogroups still ran out of memory and crashed with this error message:

Error: Low memory! nb 14828 size 8

followed by a

Traceback (most recent call last):
  File "/home/pbayer/.local/bin/dlcpar_search", line 210, in <module>
    sys.exit(main())
  File "/home/pbayer/.local/bin/dlcpar_search", line 161, in main
    coal_trees = list(treelib.iter_trees(treefile))
  File "/home/pbayer/.local/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 638, in iter_trees
    infile = util.open_stream(treefile)
  File "/home/pbayer/.local/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1171, in open_stream
    stream = open(filename, mode)
IOError: [Errno 2] No such file or directory: 'Results_Jun07_1/WorkingDirectory/Orthologues_Jun12/WorkingDirectory/dlcpar/OG0000000_tree_id.txt'

I got these two error messages three times for the three biggest groups.

However, the orthogroups output was already written at that stage, and the species tree itself looked fine too. As I didn't want those three phylogenies anyway (phylogenies of huge groups are not that useful), I just ignored the errors and carried on. Lowering the number of threads/runs via the -a flag could have fixed this.
