OrthoMCL
is the leading piece of software for inferring orthologs across several organisms. In this tutorial I will provide detailed instructions for running a set of protein annotations through OrthoMCL
.
OrthoMCL
, and it's dependencies, must be installed. Detailed information on this tool and its installation can be found here. I actually used a slightly modified version ofOrthoMCL
that was made available by the author of theorthomcl-pipeline
(see below). There isn't much details on the ways this is different from the existingOrthoMCL
, but this is available here.orthmcl-pipeline
must also be installed, as this is how we will automate theOrthoMCL
process. Detailed information on this tool and its installation can be found here. There is also a nice overview of the processes we'll be essentially going through below located here.- Protein sequences from desired organisms must also be assembled. In this example, I am using a de novo annotation that I produced for one species and am combining these data with protein annotations retrieved from NCBI for 4 other species, as follows:
- Boa constrictor - My own annotation produced from MAKER. If wanting to recreate this analysis, you can leave this sample off and run with the others.
- Anolis carolinensis
- Python molurus
- Thamnophis sirtalis
- Protobothrops mucrosquamatus
We will be specifically downloading the protein sequences in FASTA format (see banner at top of the page) and will also follow the link to see the tabular annotation data, which we can download on the next page (you'll see why in a minute).
We will be using a MySQL database to store data when running OrthoMCL
, so we need to set this up for the first time. First, we must change the default storage engine used by MySQL
to be myisam
and not innodb
. This is because of an error that may appear later that is documented here. Note that I tried increasing the buffer size for innodb
, as documented on that page, but that this didn't work. The details of this aren't important, but myisam
works based on my limited experience. To do so, open the my.cnf
file (stored in /etc/mysql/
on Linux systems) that stores settings for MySQL
and append the following lines beneath the [mysqld]
tag.
#
# * Set database engine
#
default-storage-engine = myisam
It is easy to make this change by running nano
with root access.
nano /etc/mysql/my.cnf`
This has the effect of making myisam
the default storage engine for any new database. We must restart MySQL
so that this change can take effect, which can be done by issuing a command like this:
sudo service mysql restart
Now we should create a new MySQL
user so that OrthoMCL
can create the desired database. To do this, we must log into MySQL
as the root
user.
mysql -u root
# a password usually isn't necessary if you've never used mysql
Now we can create a generic user by issuing the following command at the MySQL
prompt. I just used these simple (but unsecure) settings, but feel free to customize the username (orthomcl
) and password (password
) as desired.
CREATE USER 'orthomcl'@'localhost' IDENTIFIED BY 'password';
You can issue the command exit
to get out of MySQL
. Now let's create the database we would like to write our OrthoMCL
data to. First make sure you are in the desired working directory for the downstream MySQL
run. Now we can use a script from orthomcl-pipeline
to automatically create the database. In this command we are supplying the username and password we created above and also naming the database (I'm using the generic name orthomcl
but customize as desired). We just also write out settings to a file that will be used again later using --outfile
.
orthomcl-setup-database.pl --user orthomcl --password password --host localhost --database orthomcl --outfile orthomcl.conf
Now let's do one more step to make sure that OrthoMCL
will have the proper permissions it needs to write the data it generates to our orthomcl
database.
mysql -u root
You can now make sure our orthomcl
user has full privileges to this database.
use orthomcl;
GRANT ALL PRIVILEGES ON orthomcl.* TO 'orthomcl'@'localhost' IDENTIFIED BY 'password';
It is also worth checking to make sure our database has the correct storage engine set (myisam
).
SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES where TABLE_SCHEMA = 'database';
Our database is now ready for us to run OrthoMCL
.
While the orthomcl-pipeline
makes running OrthoMCL
pretty straight-forward, we first need to prepare our data a little bit. First, the headers for the FASTA protein sequences currently aren't conducive to running OrthoMCL
. We instead want them in the format organism_ID|protein_ID
, where organism_ID
is a short ID that is unique to each FASTA file and protein_ID
is the protein ID set by NCBI (usually has a NP_
or XP_
prefix) or by MAKER. We can easily do this using bioawk
. Here is how we would do this using our Anolis proteins.
bioawk -c fastx '{ print ">Acar|"$name; print $seq }' GCF_000090745.1_AnoCar2.0_protein.faa
Now or organism_ID
is always a short, easy to discern Acar
and the protein_ID
is the first part of the existing FASTA header before the first space (i.e., the NP_
portion).
Beyond formatting the FASTA headers, we also want to filter the protein data further. Let's only work with longer proteins of 50 amino acids or longer. We can use seqkit
to easily filter our protein FASTA files, which excludes some proteins.
seqkit seq -m 50 GCF_000090745.1_AnoCar2.0_protein.faa
Finally, in the case of NCBI proteins (and likely proteins from other sources), we will have some redundancy in the form of different isoforms for the same gene. We would like to collapse this down and only keep a single isoform for each gene for the OrthoMCL
analysis. This is where the tabular annotation data from NCBI will come into play. The following command will extract the longest protein product (i.e., isoform) for each distinct Gene ID to produce a new, somewhat smaller protein FASTA file with no redundancy.
tail -n +2 GCF_000090745.1_AnoCar2.0_protein_table.txt | sort -k9,9nr | sort -ut $'\t' -k6,6 | cut -f 8 | seqkit grep -f - GCF_000090745.1_AnoCar2.0_protein.faa
We can easily pipe all of these commands together to produced the processed FASTA protein files that we need to run OrthoMCL
. Let's say we have our protein FASTA files (i.e., _protein.faa
) and our annotation data (stored with a _protein_table.txt
suffix) stored in a directory called raw
in our working directory. OrthoMCL
will run on all FASTA files in a specified directory, so let's write our processed protein FASTAs to a new directory called processed
with the extension .fasta
(required by OrthoMCL
).
# directory structure
--/path/to/working/directory
|
-----raw/
|
-----processed/
# example with Anolis
tail -n +2 raw/GCF_000090745.1_AnoCar2.0_protein_table.txt | \
sort -k9,9nr | sort -ut $'\t' -k6,6 | cut -f 8 | \
seqkit grep -f - raw/GCF_000090745.1_AnoCar2.0_protein.faa | \
seqkit seq -m 50 | bioawk -c fastx '{ print ">Acar|"$name; print $seq }' \
> /processed/Acar.fasta
And here are the commands I used for all 5 samples (note that the processing for the Boa genome is simpler because we don't have issues with isoforms like we do with the NCBI genomes). In each case I'm using 4 letter organism_ID
prefixes.
tail -n +2 raw/GCF_000090745.1_AnoCar2.0_protein_table.txt | \
sort -k9,9nr | sort -ut $'\t' -k6,6 | cut -f 8 | \
seqkit grep -f - raw/GCF_000090745.1_AnoCar2.0_protein.faa | \
seqkit seq -m 50 | bioawk -c fastx '{ print ">Acar|"$name; print $seq }' \
> processed/Acar.fasta
tail -n +2 raw/GCF_000186305.1_Python_molurus_bivittatus-5.0.2_protein_table.txt | \
sort -k9,9nr | sort -ut $'\t' -k6,6 | cut -f 8 | \
seqkit grep -f - raw/GCF_000186305.1_Python_molurus_bivittatus-5.0.2_protein.faa | \
seqkit seq -m 50 | bioawk -c fastx '{ print ">Pmol|"$name; print $seq }' \
> processed/Pmol.fasta
tail -n +2 raw/GCF_001077635.1_Thamnophis_sirtalis-6.0_protein_table.txt | \
sort -k9,9nr | sort -ut $'\t' -k6,6 | cut -f 8 | \
seqkit grep -f - raw/GCF_001077635.1_Thamnophis_sirtalis-6.0_protein.faa | \
seqkit seq -m 50 | bioawk -c fastx '{ print ">Tsir|"$name; print $seq }' \
> processed/Tsir.fasta
tail -n +2 raw/GCF_001527695.2_P.Mucros_1.0_protein_table.txt | \
sort -k9,9nr | sort -ut $'\t' -k6,6 | cut -f 8 | \
seqkit grep -f - raw/GCF_001527695.2_P.Mucros_1.0_protein.faa | \
seqkit seq -m 50 | bioawk -c fastx '{ print ">Pmuc|"$name; print $seq }' \
> processed/Pmuc.fasta
seqkit seq -m 50 raw/Bcon_rnd3.all.maker.proteins.fasta | \
bioawk -c fastx '{ print ">Bcon|"$name; print $seq }' \
> processed/Bcon.fasta
We're finally ready to run OrthoMCL
using the orthomcl-pipeline
. This is the easiest portion of this tutorial. Let's first create a directory for the output in our working directory.
mkdir orthomcl_out
Now we can construct our command and put it into a shell script. We'll use settings that tell orthomcl-pipeline
that our data is compliant with the required formatting (see above), that keeps all intermediate data (in case we want it), that splits our data into 10 chunks for running BLAST
(I have 12 cores available, so I set this to 10), and we'll pass in the orthomcl.conf
file we created earlier with the names of the input and output directories.
cat orthomcl_run.sh
# output:
# orthomcl-pipeline --no-cleanup --compliant -s 10 -m orthomcl.conf -i processed -o orthomcl_out
We can now run the pipeline, using screen
to manage the terminal and writing a log file.
screen -S orthomcl
bash ./orthomcl_run.sh 2>&1 | tee orthomcl_run.log
That's it! With these data and my system, this only took about a day to run. Adding more data (i.e., species) will increase the run time.
It turns out there isn't much out there in the way of tutorial and tools for summarizing the results of OrthoMCL
. So, here are some commands that will help when OrthoMCL
is finished.
Proteins that are not clustered into othologous groups (i.e., present in at least 2 species) are exluced from the resulting groups.txt
output file. Here is a command that will output a single line for all of the unclustered genes in all species, with unclustered:
in the first column and the gene IDs for all species in subsequent columns (space delimited). You will have to adjust for the organism IDs you are using and the location of the input compliant FASTA sequences. You can put this in its own file or append it to the existing groups.txt
file. I will do the latter and create a new file called groups.all.txt
.
for species in Acar Bcon Pmol Pmuc Tsir; \
do cat \
<(cat groups/groups.txt | \
awk -v species="$species" '{ for(i=1; i<=NF; i++) if ($i ~ species) print $i }') \
<(bioawk -c fastx '{ print $name }' ../processed/$species.fasta) | \
sort | uniq -u; \
done | \
paste -s -d ' ' | \
paste -d ' ' <(echo "unclustered:") - | \
cat groups/groups.txt - > groups/groups.all.txt
It is also useful to have counts for each cluster for each species, rather than a long list of the proteins. That's pretty easy to do as well.
cat groups/groups.all.txt | \
awk '{ acar=bcon=pmol=pmuc=tsir=0; for(i=2; i<=NF; i++) \
if ($i ~ "Acar") acar += 1; \
else if ($i ~ "Bcon") bcon += 1; \
else if ($i ~ "Pmol") pmol += 1; \
else if ($i ~ "Pmuc") pmuc += 1; \
else if ($i ~ "Tsir") tsir += 1; \
print $1, acar, bcon, pmol, pmuc, tsir }' > groups/groups.all.counts.txt
Let's go further and tabulate some basic information about orthologs.
- How many ortholog clusters have a single gene in each species (full 1:1 orthologs)?
cat groups/groups.all.counts.txt | \
awk '{ if ($2 == 1 && $3 == 1 && $4 == 1 && $5 == 1 && $6 == 1) print $0 }' | \
wc -l
- How many have a single gene in at least two of the species (1:1 orthologs)?
cat groups/groups.all.counts.txt | \
awk '{ if (($2 == 1 || $2 == 0) && ($3 == 1 || $3 == 0) && ($4 == 1 || $4 == 0) && \
($5 == 1 || $5 == 0) && ($6 == 1 || $6 == 0)) print $0 }' | \
wc -l
- How about how many of the above 1:1 orthologs are found in each genome? (The 1st column is the column number and the 2nd is the count.)
cat groups/groups.all.counts.txt | \
awk '{ if (($2 == 1 || $2 == 0) && ($3 == 1 || $3 == 0) && ($4 == 1 || $4 == 0) && \
($5 == 1 || $5 == 0) && ($6 == 1 || $6 == 0)) print $0 }' | \
awk '{ for (i=2; i<=NF; i++) sum[i] += $i } END { for (i in sum) print i, sum[i] }' | \
sort -k1,1
- Now let's figure out how many groups contain at least two paralogs that are found in only 1 of the species. (The 1st column is the column number and the 2nd is the count.)
cat groups/groups.all.counts.txt | \
awk '{ count=0; for (i=2; i<=NF; i++) if ($i == 0) count += 1; if (count == NF - 2) print $0 }' | \
awk '{ for (i=2; i<=NF; i++) sum[i] += $i } END {for (i in sum) print i, sum[i] }' | \
sort -k1,1
- How many groups are expanded (size > 1) in each species? (Note we have to factor out the last row of unclustered genes. The 1st column is the column number and the 2nd is the count.)
cat groups/groups.all.counts.txt | \
awk '{ count=0; for (i=2; i<=NF; i++) if ($i == 0) count += 1; if (count < NF - 2) print $0 }' | \
awk '{ for (i=2; i<=NF; i++) if ($i > 1) sum[i] += 1 } END { for (i in sum) print i, sum[i] - 1 }' | \
sort -k1,1
- How many groups are expanded (size = 1) in each species? This encompasses 1:1 orthologs from above. (The 1st column is the column number and the 2nd is the count.)
cat groups/groups.all.counts.txt | \
awk '{ count=0; for (i=2; i<=NF; i++) if ($i == 0) count += 1; if (count < NF - 2) print $0 }' | \
awk '{ for (i=2; i<=NF; i++) if ($i == 1) sum[i] += 1 } END { for (i in sum) print i, sum[i] }' | \
sort -k1,1
These types of data are usually plotted as stacked barplots. Also see here for another interesting way of displaying these data.