genomewalker/README.md

Create FFINDEX files for each class

Get clustering file

We need the MMSEQS2 file with the clusters in fasta format, the data and index files:

marine_hmp_db_03112017_clu_fa
marine_hmp_db_03112017_clu_fa.index
marine_hmp_db_03112017_clu_rep.tsv
marine_hmp_db_03112017_clu_result2rep{.index}

Rename index to have proper cluster names in index

We need to rename the index first column to be able to query the ffdata files using the cluster names.

Get the cluster representative from MMSEQS2,

paste <(awk '{print $1}' /bioinf/projects/megx/UNKNOWNS/2017_11/clustering/results/marine_hmp_db_03112017_clu_result2rep.index) \
    <(zcat /bioinf/projects/megx/UNKNOWNS/2017_11/clustering/results/marine_hmp_db_03112017_clu_result2rep.tsv.gz) > mmseqs_reps.tsv

Combine our cluster names with the original MMSEQS2 names

We defined the cluster names based on the line number of the wide formatted file we created from the MMSEQS2 clustering TSV results

join -1 2 -2 2 <(sort -k2,2 --parallel=32 -S20% mmseqs_reps.tsv) \
    <(sort -k2,2 --parallel=32 -S20% <(awk '{print NR"\t"$1}' marine_hmp_db_03112017_clu_rep.tsv)) \
    | sort -k2,2n --parallel=32 -S25% > mmseqs_reps_clname.tsv

Combine it with the cluster results index

We will combine both files to be able to have index files with our cluster names, will be easy to query and process using ffindex

join -1 2 -2 1 mmseqs_reps_clname marine_hmp_db_03112017_clu_fa.index > mmseqs_reps_clname_clu_index.tsv
awk '{print $3"\t"$4"\t"$5}' mmseqs_reps_clname_clu_index.tsv > marine_hmp_db_03112017_clu_fa_clname.index
ln -s marine_hmp_db_03112017_clu_fa marine_hmp_db_03112017_clu_fa_clname

Create cluster DB

Once we have the new indices created we can easily createdb specific DBs using MMSEQS2 createsubdb and the names of the clusters.

We will use KWP as example

~/opt/MMseqs2/build/bin/mmseqs createsubdb kwp_ids.txt marine_hmp_db_03112017_clu_fa_clname marine_hmp_db_03112017_kwp

Create alignments

We can use ffindex_apply_mpi to create all the files we need:

${OPENMPI_HOME}/bin/mpirun -np 64 --hostfile ~/.cluster_bigmem_go \
      ~/opt/ffindex_mg/bin/ffindex_apply_mpi  marine_hmp_db_03112017_kwp marine_hmp_db_03112017_kwp.index \
      -i marine_hmp_db_03112017_kwp_aln.ffindex -d marine_hmp_db_03112017_kwp_aln.ffdata \
      -- ~/opt/FAMSA/famsa STDIN STDOUT 2> /dev/null | pv -l | wc -l

${OPENMPI_HOME}/bin/mpirun -n 64 --hostfile ~/.cluster_bigmem_go \
      ~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp_aln.ff{data,index} \
      -i marine_hmp_db_03112017_kwp_a3m.ffindex -d marine_hmp_db_03112017_kwp_a3m.ffdata \
      -- ~/opt/scripts/reformat_file.sh | pv -l | wc -l

${OPENMPI_HOME}/bin/mpirun -n 64 --hostfile ~/.cluster_bigmem_go \
      ~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp_aln.ff{data,index} \
      -i marine_hmp_db_03112017_kwp_cons.ffindex -d marine_hmp_db_03112017_kwp_cons.ffdata \
      -- ~/opt/scripts/consensus.sh | pv -l | wc -l

OMP_NUM_THREADS=64 cstranslate -A ${HHLIB}/data/cs219.lib -D ${HHLIB}/data/context_data.lib \
      -x 0.3 -c 4 -f -i marine_hmp_db_03112017_kwp_aln -o marine_hmp_db_03112017_kwp_cs219 -I fas -b \
      | pv -l | wc -l

$OPENMPI_HOME/bin/mpirun -n 64 --hostfile ~/.cluster_bigmem_go \
      ~/opt/ffindex_mg/bin/ffindex_apply_mpi marine_hmp_db_03112017_kwp_aln.ff{data,index} \
      -i marine_hmp_db_03112017_kwp_hhm.ffindex -d marine_hmp_db_03112017_kwp_hhm.ffdata \
      -- ~/opt/scripts/hhmake.sh | pv -l | wc -l

Optimize

As decribed in the HHSUITE manual we will optimize the indices:

Needs to be in a scratch disk

~/opt/ffindex_mg/bin/ffindex_build -as marine_hmp_db_03112017_kwp_cs219.ff{data,index}

~/opt/ffindex_mg/bin/ffindex_build -as marine_hmp_db_03112017_kwp_cs219.ff{data,index}

sort -k3 -n marine_hmp_db_03112017_kwp_cs219.ffindex | cut -f1 > kwp_sorting.dat

~/opt/ffindex_mg/bin/ffindex_order kwp_sorting.dat \
      marine_hmp_db_03112017_kwp_hhm.ff{data,index} \
      marine_hmp_db_03112017_kwp_hhm_ordered.ff{data,index}

~/opt/ffindex_mg/bin/ffindex_order kwp_sorting.dat \
      marine_hmp_db_03112017_kwp_a3m.ff{data,index} \
      marine_hmp_db_03112017_kwp_a3m_ordered.ff{data,index}

All needed scripts can be found here

	#!/bin/bash
	source ~/opt/scripts/activate_mpi
	export HHLIB=$HOME/opt/hhsuite
	export PATH=$PATH:$HHLIB/bin:$HHLIB/scripts

	hhconsensus -maxres 65535 -i stdin -s stdout -v 0 \
	\| awk -v name="${FFINDEX_ENTRY_NAME}" 'NR==1{print ">"name; next}{print $0}'

	#/!bin/bash

	export HHLIB=/home/afernand/opt/hhsuite
	export PATH=$PATH:$HHLIB/bin:$HHLIB/scripts

	hhmake -M 50 -nocontxt -diff 1000 -add_cons -i stdin -o stdout -v 0 -maxres 65535 -name "${FFINDEX_ENTRY_NAME}"

	#!/bin/bash
	export HHLIB=/home/afernand/opt/hhsuite
	export PATH=$PATH:$HHLIB/bin:$HHLIB/scripts

	SEQS=$(perl -ne 'print $_')
	TMP=$(mktemp -q)
	/home/afernand/opt/hhsuite/scripts/reformat.pl -v 0 -M 50 fas a3m <(echo "${SEQS}") ${TMP}
	cat ${TMP}
	rm ${TMP}