Skip to content

Instantly share code, notes, and snippets.

@genomewalker
Last active February 13, 2018 13:22
Show Gist options
  • Save genomewalker/fa9c4924aec8758cb935c950b206f2fa to your computer and use it in GitHub Desktop.
Save genomewalker/fa9c4924aec8758cb935c950b206f2fa to your computer and use it in GitHub Desktop.

Create FFINDEX files for each class

Get clustering file

We need the MMSEQS2 file with the clusters in fasta format, the data and index files:

  • marine_hmp_db_03112017_clu_fa
  • marine_hmp_db_03112017_clu_fa.index
  • marine_hmp_db_03112017_clu_rep.tsv
  • marine_hmp_db_03112017_clu_result2rep{.index}

Rename index to have proper cluster names in index

We need to rename the index first column to be able to query the ffdata files using the cluster names.

Get the cluster representative from MMSEQS2,

paste <(awk '{print $1}' /bioinf/projects/megx/UNKNOWNS/2017_11/clustering/results/marine_hmp_db_03112017_clu_result2rep.index) \
    <(zcat /bioinf/projects/megx/UNKNOWNS/2017_11/clustering/results/marine_hmp_db_03112017_clu_result2rep.tsv.gz) > mmseqs_reps.tsv

Combine our cluster names with the original MMSEQS2 names

We defined the cluster names based on the line number of the wide formatted file we created from the MMSEQS2 clustering TSV results

join -1 2 -2 2 <(sort -k2,2 --parallel=32 -S20% mmseqs_reps.tsv) \
    <(sort -k2,2 --parallel=32 -S20% <(awk '{print NR"\t"$1}' marine_hmp_db_03112017_clu_rep.tsv)) \
    | sort -k2,2n --parallel=32 -S25% > mmseqs_reps_clname.tsv

Combine it with the cluster results index

We will combine both files to be able to have index files with our cluster names, will be easy to query and process using ffindex

join -1 2 -2 1 mmseqs_reps_clname marine_hmp_db_03112017_clu_fa.index > mmseqs_reps_clname_clu_index.tsv
awk '{print $3"\t"$4"\t"$5}' mmseqs_reps_clname_clu_index.tsv > marine_hmp_db_03112017_clu_fa_clname.index
ln -s marine_hmp_db_03112017_clu_fa marine_hmp_db_03112017_clu_fa_clname

Create cluster DB

Once we have the new indices created we can easily createdb specific DBs using MMSEQS2 createsubdb and the names of the clusters.

We will use KWP as example

~/opt/MMseqs2/build/bin/mmseqs createsubdb kwp_ids.txt marine_hmp_db_03112017_clu_fa_clname marine_hmp_db_03112017_kwp

Create alignments

We can use ffindex_apply_mpi to create all the files we need:

${OPENMPI_HOME}/bin/mpirun -np 64 --hostfile ~/.cluster_bigmem_go \
      ~/opt/ffindex_mg/bin/ffindex_apply_mpi  marine_hmp_db_03112017_kwp marine_hmp_db_03112017_kwp.index \
      -i marine_hmp_db_03112017_kwp_aln.ffindex -d marine_hmp_db_03112017_kwp_aln.ffdata \
      -- ~/opt/FAMSA/famsa STDIN STDOUT 2> /dev/null | pv -l | wc -l

${OPENMPI_HOME}/bin/mpirun -n 64 --hostfile ~/.cluster_bigmem_go \
      ~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp_aln.ff{data,index} \
      -i marine_hmp_db_03112017_kwp_a3m.ffindex -d marine_hmp_db_03112017_kwp_a3m.ffdata \
      -- ~/opt/scripts/reformat_file.sh | pv -l | wc -l

${OPENMPI_HOME}/bin/mpirun -n 64 --hostfile ~/.cluster_bigmem_go \
      ~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp_aln.ff{data,index} \
      -i marine_hmp_db_03112017_kwp_cons.ffindex -d marine_hmp_db_03112017_kwp_cons.ffdata \
      -- ~/opt/scripts/consensus.sh | pv -l | wc -l

OMP_NUM_THREADS=64 cstranslate -A ${HHLIB}/data/cs219.lib -D ${HHLIB}/data/context_data.lib \
      -x 0.3 -c 4 -f -i marine_hmp_db_03112017_kwp_aln -o marine_hmp_db_03112017_kwp_cs219 -I fas -b \
      | pv -l | wc -l

$OPENMPI_HOME/bin/mpirun -n 64 --hostfile ~/.cluster_bigmem_go \
      ~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp_aln.ff{data,index} \
      -i marine_hmp_db_03112017_kwp_hhm.ffindex -d marine_hmp_db_03112017_kwp_hhm.ffdata \
      -- ~/opt/scripts/hhmake.sh | pv -l | wc -l

Optimize

As decribed in the HHSUITE manual we will optimize the indices:

Needs to be in a scratch disk

~/opt/ffindex_mg/bin/ffindex_build -as marine_hmp_db_03112017_kwp_cs219.ff{data,index}
~/opt/ffindex_mg/bin/ffindex_build -as marine_hmp_db_03112017_kwp_cs219.ff{data,index}
sort -k3 -n marine_hmp_db_03112017_kwp_cs219.ffindex | cut -f1 > kwp_sorting.dat
~/opt/ffindex_mg/bin/ffindex_order kwp_sorting.dat \
      marine_hmp_db_03112017_kwp_hhm.ff{data,index} \
      marine_hmp_db_03112017_kwp_hhm_ordered.ff{data,index}
~/opt/ffindex_mg/bin/ffindex_order kwp_sorting.dat \
      marine_hmp_db_03112017_kwp_a3m.ff{data,index} \
      marine_hmp_db_03112017_kwp_a3m_ordered.ff{data,index}

All needed scripts can be found here

#!/bin/bash
source ~/opt/scripts/activate_mpi
export HHLIB=$HOME/opt/hhsuite
export PATH=$PATH:$HHLIB/bin:$HHLIB/scripts
hhconsensus -maxres 65535 -i stdin -s stdout -v 0 \
| awk -v name="${FFINDEX_ENTRY_NAME}" 'NR==1{print ">"name; next}{print $0}'
#/!bin/bash
export HHLIB=/home/afernand/opt/hhsuite
export PATH=$PATH:$HHLIB/bin:$HHLIB/scripts
hhmake -M 50 -nocontxt -diff 1000 -add_cons -i stdin -o stdout -v 0 -maxres 65535 -name "${FFINDEX_ENTRY_NAME}"
#!/bin/bash
export HHLIB=/home/afernand/opt/hhsuite
export PATH=$PATH:$HHLIB/bin:$HHLIB/scripts
SEQS=$(perl -ne 'print $_')
TMP=$(mktemp -q)
/home/afernand/opt/hhsuite/scripts/reformat.pl -v 0 -M 50 fas a3m <(echo "${SEQS}") ${TMP}
cat ${TMP}
rm ${TMP}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment