Our workflow uses the groups into which NCBI organizes its data. The group of each assembly is listed in column 25 of the assembly_summary.txt file, as described here. The groups are as follows:
- archaea
- bacteria
- fungi
- invertebrate
- metagenomes
- other
- plant
- protozoa
- vertebrate_mammalian
- vertebrate_other
- viral
The eukaryotic part of our workflow focuses only on the following groups:
- fungi
- invertebrate
- plant
- protozoa
- vertebrate_mammalian
- vertebrate_other
We get the data from the following folders on the NCBI FTP site:
- ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/"${GROUP}"
- ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/"${GROUP}"
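As a rough illustration of how these folders could be used, the sketch below builds the URL of the per-group `assembly_summary.txt` file for each eukaryotic group and database. The helper names (`summary_url`, `list_summary_urls`, `fetch_summaries`) are hypothetical, and the local layout `<group>/assembly_summary_<db>.txt` is an assumption made for illustration, not necessarily how the workflow stores its files.

```shell
# Eukaryotic groups used in the workflow (from the list above).
EUK_GROUPS="fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other"

# Hypothetical helper: URL of the assembly summary for one database and group.
summary_url() {
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/%s/%s/assembly_summary.txt\n' "$1" "$2"
}

# Hypothetical helper: print every URL the workflow would need (no download).
list_summary_urls() {
    for group in $EUK_GROUPS; do
        for db in refseq genbank; do
            summary_url "$db" "$group"
        done
    done
}

# Hypothetical helper: download each summary to <group>/assembly_summary_<db>.txt
# (assumed layout).
fetch_summaries() {
    for group in $EUK_GROUPS; do
        mkdir -p "$group"
        for db in refseq genbank; do
            wget -q -O "${group}/assembly_summary_${db}.txt" "$(summary_url "$db" "$group")"
        done
    done
}

# Dry run: show the URLs only. Call fetch_summaries to actually download.
list_summary_urls
```

The dry run keeps the example free of side effects; `fetch_summaries` is only defined, not called.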
Let's count how many assemblies we have today (2023-06-17) in each group for GenBank and RefSeq:
# REFSEQ
wget -O assembly_summary_refseq-20230617.txt https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
# Keep only the assemblies that have an FTP link
grep -v '^#' assembly_summary_refseq-20230617.txt | grep https | cut -f25 | sort | uniq -c
1517 archaea
291440 bacteria
541 fungi
353 invertebrate
162 plant
96 protozoa
209 vertebrate_mammalian
339 vertebrate_other
14993 viral
# GENBANK
wget -O assembly_summary_genbank-20230617.txt https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
# Keep only the assemblies that have an FTP link
grep -v '^#' assembly_summary_genbank-20230617.txt | grep https | cut -f25 | sort | uniq -c
15314 archaea
1565207 bacteria
14719 fungi
5662 invertebrate
7065 metagenomes
7589 other
3273 plant
1895 protozoa
2765 vertebrate_mammalian
3438 vertebrate_other
71524 viral
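The counts above rely on `grep https` matching anywhere in the line. A slightly stricter variant, sketched below, checks the FTP path column explicitly. It assumes `ftp_path` is column 20 and the group is column 25 of the tab-separated summary file; `count_by_group` is a hypothetical helper name, not part of the workflow.

```shell
# Hypothetical helper: count assemblies per group, keeping only rows whose
# ftp_path (assumed to be column 20) is an http(s) URL.
count_by_group() {
    awk -F'\t' '
        /^#/ { next }                          # skip header/comment lines
        $20 ~ /^https?:\/\// { n[$25]++ }      # group is assumed to be column 25
        END { for (g in n) print n[g], g }
    ' "$@" | sort -k2,2
}

# Example call (file name taken from the wget command above):
# count_by_group assembly_summary_refseq-20230617.txt
```

Sorting on the second field gives the same alphabetical group order as `sort | uniq -c`.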
Our filtering keeps the assemblies with "version_status == latest", "genome_rep == Full" and "ftp_path != na". We allow any assembly level (Contig, Scaffold, Chromosome, Complete Genome), but if a species has multiple assemblies, we always select those at Chromosome or Complete Genome level and discard those at Contig/Scaffold level. The last filter is based on the assembly stats: we remove the outliers in terms of N50 and length.
Let's look at the assemblies we have in our DB (data retrieved 2023-01-01) before and after these filters (dereplication not included):
# REFSEQ
for lib in fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other; do
  grep -v '^#' "${lib}/assembly_summary_refseq.txt" | grep https | awk -v L="${lib}" 'END{print NR,L}'
done
483 fungi
321 invertebrate
159 plant
95 protozoa
200 vertebrate_mammalian
300 vertebrate_other
# GENBANK
for lib in fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other; do
  grep -v '^#' "${lib}/assembly_summary_genbank.txt" | grep https | awk -v L="${lib}" 'END{print NR,L}'
done
13168 fungi
4719 invertebrate
2693 plant
1521 protozoa
2512 vertebrate_mammalian
2967 vertebrate_other
# AFTER FILTERING (REFSEQ + GENBANK)
for lib in fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other; do
  grep -v '^#' "${lib}/assembly_summary_combined.txt" | grep https | awk -v L="${lib}" 'END{print NR,L}'
done
6193 fungi
3583 invertebrate
1481 plant
978 protozoa
636 vertebrate_mammalian
2379 vertebrate_other
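The metadata part of the filtering described above can be sketched in awk. This is a minimal illustration under stated assumptions, not the workflow's actual implementation. It assumes the standard column order of assembly_summary.txt (species_taxid in column 7, version_status in 11, assembly_level in 12, genome_rep in 14, ftp_path in 20), assumes `latest` is the version_status value that marks current assemblies, and reads the per-species rule as "if a species has at least one Chromosome or Complete Genome assembly, drop its Contig/Scaffold assemblies". The N50/length outlier step is left out because those stats are not in the summary file. `filter_summary` is a hypothetical name.

```shell
# Hypothetical helper: apply the metadata filters to one or more summary files.
filter_summary() {
    awk -F'\t' '
        /^#/ { next }                                  # skip header/comment lines
        $11 == "latest" && $14 == "Full" && $20 != "na" {
            n++
            row[n] = $0
            species[n] = $7                            # species_taxid (assumed key)
            high[n] = ($12 == "Chromosome" || $12 == "Complete Genome")
            if (high[n]) has_high[$7] = 1              # species has a high-level assembly
        }
        END {
            # Keep high-level assemblies; keep Contig/Scaffold ones only when
            # the species has nothing better.
            for (i = 1; i <= n; i++)
                if (high[i] || !(species[i] in has_high))
                    print row[i]
        }
    ' "$@"
}

# Example call (input paths follow the loops above):
# filter_summary fungi/assembly_summary_refseq.txt fungi/assembly_summary_genbank.txt
```

All rows are buffered until the end of input because whether a Contig or Scaffold assembly is kept depends on every other assembly of the same species, possibly in another input file.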