We need the MMSEQS2 file with the clusters in fasta format, the data and index files:
- marine_hmp_db_03112017_clu_fa
- marine_hmp_db_03112017_clu_fa.index
- marine_hmp_db_03112017_clu_rep.tsv
- marine_hmp_db_03112017_clu_result2rep{.index}
We need to rename the index first column to be able to query the ffdata files using the cluster names.
Get the cluster representative from MMSEQS2,
paste <(awk '{print $1}' /bioinf/projects/megx/UNKNOWNS/2017_11/clustering/results/marine_hmp_db_03112017_clu_result2rep.index) \
<(zcat /bioinf/projects/megx/UNKNOWNS/2017_11/clustering/results/marine_hmp_db_03112017_clu_result2rep.tsv.gz) > mmseqs_reps.tsv
We defined the cluster names based on the line number of the wide formatted file we created from the MMSEQS2 clustering TSV results
join -1 2 -2 2 <(sort -k2,2 --parallel=32 -S20% mmseqs_reps.tsv) \
<(sort -k2,2 --parallel=32 -S20% <(awk '{print NR"\t"$1}' marine_hmp_db_03112017_clu_rep.tsv)) \
| sort -k2,2n --parallel=32 -S25% > mmseqs_reps_clname.tsv
We will combine both files to be able to have index files with our cluster names, will be easy to query and process using ffindex
join -1 2 -2 1 mmseqs_reps_clname marine_hmp_db_03112017_clu_fa.index > mmseqs_reps_clname_clu_index.tsv
awk '{print $3"\t"$4"\t"$5}' mmseqs_reps_clname_clu_index.tsv > marine_hmp_db_03112017_clu_fa_clname.index
ln -s marine_hmp_db_03112017_clu_fa marine_hmp_db_03112017_clu_fa_clname
Once we have the new indices created we can easily createdb specific DBs using MMSEQS2 createsubdb and the names of the clusters.
We will use KWP as example
~/opt/MMseqs2/build/bin/mmseqs createsubdb kwp_ids.txt marine_hmp_db_03112017_clu_fa_clname marine_hmp_db_03112017_kwp
We can use ffindex_apply_mpi to create all the files we need:
${OPENMPI_HOME}/bin/mpirun -np 64 --hostfile ~/.cluster_bigmem_go \
~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp marine_hmp_db_03112017_kwp.index \
-i marine_hmp_db_03112017_kwp_aln.ffindex -d marine_hmp_db_03112017_kwp_aln.ffdata \
-- ~/opt/FAMSA/famsa STDIN STDOUT 2> /dev/null | pv -l | wc -l
${OPENMPI_HOME}/bin/mpirun -n 64 --hostfile ~/.cluster_bigmem_go \
~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp_aln.ff{data,index} \
-i marine_hmp_db_03112017_kwp_a3m.ffindex -d marine_hmp_db_03112017_kwp_a3m.ffdata \
-- ~/opt/scripts/reformat_file.sh | pv -l | wc -l
${OPENMPI_HOME}/bin/mpirun -n 64 --hostfile ~/.cluster_bigmem_go \
~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp_aln.ff{data,index} \
-i marine_hmp_db_03112017_kwp_cons.ffindex -d marine_hmp_db_03112017_kwp_cons.ffdata \
-- ~/opt/scripts/consensus.sh | pv -l | wc -l
OMP_NUM_THREADS=64 cstranslate -A ${HHLIB}/data/cs219.lib -D ${HHLIB}/data/context_data.lib \
-x 0.3 -c 4 -f -i marine_hmp_db_03112017_kwp_aln -o marine_hmp_db_03112017_kwp_cs219 -I fas -b \
| pv -l | wc -l
$OPENMPI_HOME/bin/mpirun -n 64 --hostfile ~/.cluster_bigmem_go \
~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp_aln.ff{data,index} \
-i marine_hmp_db_03112017_kwp_hhm.ffindex -d marine_hmp_db_03112017_kwp_hhm.ffdata \
-- ~/opt/scripts/hhmake.sh | pv -l | wc -l
As decribed in the HHSUITE manual we will optimize the indices:
Needs to be in a scratch disk
~/opt/ffindex_mg/bin/ffindex_build -as marine_hmp_db_03112017_kwp_cs219.ff{data,index}
~/opt/ffindex_mg/bin/ffindex_build -as marine_hmp_db_03112017_kwp_cs219.ff{data,index}
sort -k3 -n marine_hmp_db_03112017_kwp_cs219.ffindex | cut -f1 > kwp_sorting.dat
~/opt/ffindex_mg/bin/ffindex_order kwp_sorting.dat \
marine_hmp_db_03112017_kwp_hhm.ff{data,index} \
marine_hmp_db_03112017_kwp_hhm_ordered.ff{data,index}
~/opt/ffindex_mg/bin/ffindex_order kwp_sorting.dat \
marine_hmp_db_03112017_kwp_a3m.ff{data,index} \
marine_hmp_db_03112017_kwp_a3m_ordered.ff{data,index}
All needed scripts can be found here