@genomewalker
Last active June 20, 2023 04:54
NCBI assembly stats

Our workflow relies on the groups NCBI uses to organize its assemblies. The group of each assembly is recorded in column 25 of the assembly_summary.txt file, as described here. The groups are:

  • archaea
  • bacteria
  • fungi
  • invertebrate
  • metagenomes
  • other
  • plant
  • protozoa
  • vertebrate_mammalian
  • vertebrate_other
  • viral

The eukaryotic part of our workflow focuses only on the following groups:

  • fungi
  • invertebrate
  • plant
  • protozoa
  • vertebrate_mammalian
  • vertebrate_other

We get the data from the following folders on the NCBI FTP site:

- ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/"${GROUP}"
- ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/"${GROUP}"
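
As a minimal sketch of how these folders map to per-group downloads, the snippet below prints one `wget` command per database and eukaryotic group. It assumes each group folder contains a file named `assembly_summary.txt` and that the local copies are stored as `${GROUP}/assembly_summary_refseq.txt` and `${GROUP}/assembly_summary_genbank.txt`. The helper names `summary_url` and `print_downloads` are hypothetical and only used for illustration.

```shell
# Eukaryotic groups used in the workflow (from the list above).
EUK_GROUPS="fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other"

# Hypothetical helper: URL of the per-group summary file.
# Assumes each group folder holds a file named assembly_summary.txt.
summary_url() {
  printf 'https://ftp.ncbi.nlm.nih.gov/genomes/%s/%s/assembly_summary.txt\n' "$1" "$2"
}

# Dry run: print the download commands instead of running them.
print_downloads() {
  for db in refseq genbank; do
    for group in $EUK_GROUPS; do
      printf 'wget -O %s/assembly_summary_%s.txt %s\n' \
        "$group" "$db" "$(summary_url "$db" "$group")"
    done
  done
}
```

Calling `print_downloads` only lists the commands, so they can be inspected first. The target directories would need to exist (for example via `mkdir -p`) before the commands are actually run.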

Let's count how many assemblies we have today (2023-06-17) in each group for GenBank and RefSeq:

# REFSEQ
wget -O assembly_summary_refseq-20230617.txt https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

# We only keep the entries that have a link to the FTP site
grep -v '^#' assembly_summary_refseq-20230617.txt | grep https | cut -f25 | sort | uniq -c
   1517 archaea
 291440 bacteria
    541 fungi
    353 invertebrate
    162 plant
     96 protozoa
    209 vertebrate_mammalian
    339 vertebrate_other
  14993 viral

# GENBANK
wget -O assembly_summary_genbank-20230617.txt https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt 

# We only keep the entries that have a link to the FTP site
grep -v '^#' assembly_summary_genbank-20230617.txt | grep https | cut -f25 | sort | uniq -c
  15314 archaea
1565207 bacteria
  14719 fungi
   5662 invertebrate
   7065 metagenomes
   7589 other
   3273 plant
   1895 protozoa
   2765 vertebrate_mammalian
   3438 vertebrate_other
  71524 viral
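
The counts above depend on the group being in column 25 and on `https` appearing somewhere in the row. A possible variant, sketched below, looks the columns up by name instead. It assumes the header line starts with `#` and names the relevant columns `ftp_path` and `group`; the function name `count_by_group` is hypothetical.

```shell
# Count assemblies per group, locating columns by header name.
# Assumes the header line starts with '#' and contains the column
# names 'ftp_path' and 'group'.
count_by_group() {
  awk -F '\t' '
    /^#/ {
      # Header line: map each column name to its position.
      if ($0 ~ /assembly_accession/) {
        for (i = 1; i <= NF; i++) {
          name = $i
          sub(/^#[ ]*/, "", name)   # drop the leading "#" on the first name
          col[name] = i
        }
      }
      next
    }
    # Stop if no usable header was seen before the first data row.
    !("ftp_path" in col) || !("group" in col) {
      print "header with ftp_path/group not found" > "/dev/stderr"
      exit 1
    }
    # Keep only rows whose ftp_path is an https link.
    $col["ftp_path"] ~ /^https/ { n[$col["group"]]++ }
    END { for (g in n) printf "%7d %s\n", n[g], g }
  ' "$1" | sort -k2,2
}
```

For example, `count_by_group assembly_summary_refseq-20230617.txt` should produce the same kind of listing as the `cut | sort | uniq -c` pipeline, provided the header matches the assumptions.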

Now let's look at the assemblies we have in our DB (data retrieved 2023-01-01) before and after the different filters (dereplication not included):

Our filtering keeps the assemblies with "version_status == Full" and "genome_rep == Full" whose "ftp_path != na". We allow any assembly level (Contig, Scaffold, Chromosome, Complete Genome). If a species has multiple assemblies, we always select those at Chromosome or Complete Genome level and discard the Contig/Scaffold ones. The last filter is based on the assembly stats: we remove the outliers in terms of N50 and length.
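
The assembly-level preference can be sketched with a two-pass `awk` script. This is an illustration under stated assumptions, not the workflow's actual code: "species" is taken to mean the `species_taxid` column, columns are located by header name, species with only Contig/Scaffold assemblies keep them, and the `version_status` check and the N50/length outlier step are left out. The function name `filter_assemblies` is hypothetical.

```shell
# Sketch of the genome_rep/ftp_path filters and the per-species
# assembly-level preference. Assumes header names species_taxid,
# assembly_level, genome_rep and ftp_path; the file is read twice.
filter_assemblies() {
  awk -F '\t' '
    function is_high(level) {
      return level == "Chromosome" || level == "Complete Genome"
    }
    /^#/ {
      # Header line: map each column name to its position.
      if ($0 ~ /assembly_accession/) {
        for (i = 1; i <= NF; i++) {
          name = $i
          sub(/^#[ ]*/, "", name)
          col[name] = i
        }
      }
      if (FNR != NR) print   # second pass: keep comment/header lines
      next
    }
    # Rows failing the basic filters are ignored in both passes.
    $col["genome_rep"] != "Full" || $col["ftp_path"] == "na" { next }
    {
      sp = $col["species_taxid"]
      lvl = $col["assembly_level"]
      # Pass 1: remember species with a Chromosome/Complete Genome assembly.
      if (FNR == NR) { if (is_high(lvl)) high[sp] = 1; next }
      # Pass 2: drop Contig/Scaffold rows only for those species.
      if (!(sp in high) || is_high(lvl)) print
    }
  ' "$1" "$1"
}
```

Something like `filter_assemblies fungi/assembly_summary_genbank.txt > fungi/filtered.txt` would write the retained rows (the output name is made up for this example).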

# REFSEQ
for lib in fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other; do grep -v '^#' ${lib}/assembly_summary_refseq.txt | grep https | awk -vL=${lib} 'END{print NR,L}'; done
483 fungi
321 invertebrate
159 plant
95 protozoa
200 vertebrate_mammalian
300 vertebrate_other

# GENBANK
for lib in fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other; do grep -v '^#' ${lib}/assembly_summary_genbank.txt | grep https | awk -vL=${lib} 'END{print NR,L}'; done
13168 fungi
4719 invertebrate
2693 plant
1521 protozoa
2512 vertebrate_mammalian
2967 vertebrate_other

# AFTER FILTERING (REFSEQ + GENBANK)
for lib in fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other; do grep -v '^#' ${lib}/assembly_summary_combined.txt | grep https | awk -vL=${lib} 'END{print NR,L}'; done
6193 fungi
3583 invertebrate
1481 plant
978 protozoa
636 vertebrate_mammalian
2379 vertebrate_other