We will screen the RefSeq genomes for genoimc unknowns. We will use the non-redundant set of genomes identified by MASH+t-SNE+PAM clustering.
cd /bioinf/home/afernand/SANDBOX/unk_vs_refseq/
mkdir genomes
rsync -Pauvz /bioinf/projects/megx/UNKNOWNS/chiara/NETWORK/RefSeq82.ref.gen/proteomes/downloads/ genomes/
find genomes/ -name '*gz' \
| while read file; do NAM=$(basename $file _protein.faa.gz); zcat $file \
| mawk -vL=$NAM '$0 ~/^>/{gsub(">", "", $0); print L"\t"$0}'; done >> all_refseq82_nr_names.tsv
And rename the headers:
mkdir genomes_renamed
cd genomes_renamed
parallel -j 252 --progress ../rename_headers.sh ::: /bioinf/home/afernand/SANDBOX/unk_vs_refseq/genomes/*faa
cd ..
find genomes_renamed/ -name '*.faa' \
| while read line; do printf "/bioinf/home/afernand/SANDBOX/unk_vs_refseq/genomes_renamed/$line\n"; done > genomes_list.txt
find genomes/ -name '*gz' \
| while read file; do NAM=$(basename $file _protein.faa.gz); zcat $file | mawk -vL=$NAM '$0 ~/^>/{gsub(">", "", $0); print L"\t"$0}'; done >> all_refseq82_nr_names.tsv
We need the ffindex files with the HHM profiles, and convert them to MMSEQS2 format. First we copy the HHM profiles to scratch and we convert them:
mkdir /scratch/antonio/mmseqs2_profiles
cd /scratch/antonio/mmseqs2_profiles
cp /bioinf/projects/megx/UNKNOWNS/2017_11/classification/results/ffindex_data/marine_hmp_db_03112017_gu_hhm.ffdata marine_hmp_db_03112017_gu_hhm
cp /bioinf/projects/megx/UNKNOWNS/2017_11/classification/results/ffindex_data/marine_hmp_db_03112017_gu_hhm.ffindex marine_hmp_db_03112017_gu_hhm.index
~/opt/MMseqs2_uv/build/bin/mmseqs convertprofiledb marine_hmp_db_03112017_gu_hhm marine_hmp_db_03112017_gu_hhm_profileDB --threads 16
Once we have them converted we move them to the shared file system, we will use start the jobs in the clusters. One job per genome.
mv *profile* /bioinf/home/afernand/SANDBOX/unk_vs_refseq/
cd /bioinf/home/afernand/SANDBOX/unk_vs_refseq/
rm -rf /scratch/antonio/mmseqs2_profiles
We might need to redo the symlinks to the seq_h files
rm marine_hmp_db_03112017_gu_hhm_profileDB_seq_h
ln -s /bioinf/home/afernand/SANDBOX/unk_vs_refseq/marine_hmp_db_03112017_gu_hhm_profileDB_h marine_hmp_db_03112017_gu_hhm_profileDB_seq_h
ln -s /bioinf/home/afernand/SANDBOX/unk_vs_refseq/marine_hmp_db_03112017_gu_hhm_profileDB_h.index marine_hmp_db_03112017_gu_hhm_profileDB_seq_h.index
We will start an array in the cluster to process each genome independently
mkdir /bioinf/home/afernand/SANDBOX/unk_vs_refseq/results
qsub gu_vs_refseq_sge.sh
Needs to be in any of the machines with access to the cluster (sunrays, linux-desktop, arb-servers)
Once the search is done we will combine results
parallel -j 252 --progress cat {} > gu_vs_refseq82.tsv ::: results/*m8
cut -f 16- -d " " gu_vs_refseq82.tsv > tmp
paste <(cat gu_vs_refseq82.tsv | cut -f 1-15 -d " " | tr " " $'\t') tmp > gu_vs_refseq82_tab.tsv
and parse the hypotheticals and get the taxonomies:
bash ~/opt/scripts/hypo_parse.sh gu_vs_refseq82_tab.tsv ~/opt/scripts/unknown_grep.tsv gu_vs_refseq82_hypo.tsv
cp /bioinf/projects/megx/UNKNOWNS/chiara/NETWORK/RefSeq82.ref.gen/genomes_taxonomy.tsv.gz .
For plotting use the script here