GenBank Accession Number Reference Sheet

The International Nucleotide Sequence Database Collaboration (INSDC) consists of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank at NCBI. As part of the Collaboration, all three organizations accept new sequence submissions and share sequence data among the three databases. To facilitate the exchange of data, each member of the Collaboration is assigned certain accession prefixes. In addition to the accession number, GenBank records also have a GI number. The GI number is simply a series of digits assigned consecutively to sequences submitted to NCBI.

Format of GenBank accession numbers:

Type        Format
Nucleotide  1 letter + 5 digits, or 2 letters + 6 digits
Protein     3 letters + 5 digits
WGS         4 letters + 2 digits for WGS assembly version + 6-8 digits
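
The formats in the table above can be checked mechanically. The following is a minimal sketch, not an official validator: the names `ACCESSION_PATTERNS` and `classify_accession` are hypothetical, and the patterns cover only the formats listed in this sheet, applied to a bare accession without a version suffix.

```python
import re

# Patterns transcribed from the format table above (illustrative names).
ACCESSION_PATTERNS = {
    # 1 letter + 5 digits, or 2 letters + 6 digits
    "nucleotide": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    # 3 letters + 5 digits
    "protein": re.compile(r"[A-Z]{3}\d{5}"),
    # 4 letters + 2 digits (assembly version) + 6-8 digits
    "wgs": re.compile(r"[A-Z]{4}\d{2}\d{6,8}"),
}


def classify_accession(accession):
    """Return 'nucleotide', 'protein', 'wgs', or None for a bare accession."""
    text = accession.strip().upper()
    for kind, pattern in ACCESSION_PATTERNS.items():
        # fullmatch requires the whole string to fit the pattern
        if pattern.fullmatch(text):
            return kind
    return None
```

For example, `classify_accession("AF123456")` falls into the nucleotide format, while a RefSeq-style identifier containing an underscore matches none of the three patterns and yields `None`.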

Primary GenBank accession number prefixes:

Prefixes       Data Source
AE, CP, CY     Genome projects (nucleotide)
U, AF, AY, DQ  Direct submissions (nucleotide)
AAAA-AZZZ      Whole genome shotgun sequences (nucleotide)
AAA-AZZ        Protein ID
EAA-EZZ        WGS protein ID
O, P, Q        Swiss-Prot (protein)
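
As a rough illustration of how the prefix table above might be used, the sketch below maps the leading letters of an accession to a data source. The names `EXACT_PREFIXES` and `data_source` are hypothetical, and only the prefixes listed in this sheet are covered; any other prefix returns `None`.

```python
import re

# Prefixes listed individually in the table above.
EXACT_PREFIXES = {
    "AE": "Genome projects (nucleotide)",
    "CP": "Genome projects (nucleotide)",
    "CY": "Genome projects (nucleotide)",
    "U": "Direct submissions (nucleotide)",
    "AF": "Direct submissions (nucleotide)",
    "AY": "Direct submissions (nucleotide)",
    "DQ": "Direct submissions (nucleotide)",
    "O": "Swiss-Prot (protein)",
    "P": "Swiss-Prot (protein)",
    "Q": "Swiss-Prot (protein)",
}


def data_source(accession):
    """Return the data source for an accession's letter prefix, or None."""
    match = re.match(r"[A-Za-z]+", accession.strip())
    if match is None:
        return None
    prefix = match.group(0).upper()
    if prefix in EXACT_PREFIXES:
        return EXACT_PREFIXES[prefix]
    # Ranges from the table: AAAA-AZZZ, AAA-AZZ, EAA-EZZ
    if len(prefix) == 4 and prefix[0] == "A":
        return "Whole genome shotgun sequences (nucleotide)"
    if len(prefix) == 3 and prefix[0] == "A":
        return "Protein ID"
    if len(prefix) == 3 and prefix[0] == "E":
        return "WGS protein ID"
    return None
```

The lookup inspects only the letters at the start of the string, so it does not verify that the digits following the prefix have the expected length.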

Version number suffix: GenBank sequence identifiers consist of the accession number of the record followed by a dot and a version number (i.e. accession.version). The version number is incremented whenever the sequence data in the record changes.
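
Splitting an identifier into its accession and version parts can be sketched as follows. The function name `split_accession_version` is hypothetical, and returning `None` for a missing version is a design choice made for this example rather than an NCBI convention.

```python
def split_accession_version(identifier):
    """Split 'accession.version' into (accession, version).

    The version is returned as an int, or None if no dot is present.
    """
    text = identifier.strip()
    # rpartition splits on the last dot; 'dot' is empty if there is none
    accession, dot, version = text.rpartition(".")
    if not dot:
        return text, None
    if not accession or not version.isdecimal():
        raise ValueError(f"malformed accession.version: {identifier!r}")
    return accession, int(version)
```

With this sketch, `split_accession_version("AF123456.2")` gives the accession `AF123456` and the integer version `2`, and an identifier without a dot is returned unchanged with `None` as its version.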

RefSeq Accession Format: RefSeq accession numbers do not follow the standards set by INSDC. They have a distinct format of 2 letters + underscore + 6 or more digits (e.g. NM_012345). RefSeq records can either be curated (manually reviewed by NCBI staff or collaborators) or automated (records not individually reviewed).

Prefixes  Molecule  Method
NC, NG    Genomic   Curated
NM        mRNA      Curated
NR        RNA       Curated
NP        Protein   Curated
NT, NW    Genomic   Automated
XM        mRNA      Automated
XR        RNA       Automated
XP        Protein   Automated
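
The RefSeq prefix table above lends itself to a small lookup. This is a sketch under stated assumptions: `REFSEQ_PREFIXES`, `REFSEQ_PATTERN` and `describe_refseq` are hypothetical names, only the prefixes in this sheet are covered, and the pattern accepts six or more digits because nine-digit RefSeq accessions also exist. An optional `.version` suffix is tolerated.

```python
import re

# (molecule, method) per prefix, transcribed from the table above.
REFSEQ_PREFIXES = {
    "NC": ("Genomic", "Curated"),
    "NG": ("Genomic", "Curated"),
    "NM": ("mRNA", "Curated"),
    "NR": ("RNA", "Curated"),
    "NP": ("Protein", "Curated"),
    "NT": ("Genomic", "Automated"),
    "NW": ("Genomic", "Automated"),
    "XM": ("mRNA", "Automated"),
    "XR": ("RNA", "Automated"),
    "XP": ("Protein", "Automated"),
}

# 2 letters + underscore + 6 or more digits, with an optional version suffix
REFSEQ_PATTERN = re.compile(r"([A-Z]{2})_(\d{6,})(?:\.(\d+))?")


def describe_refseq(identifier):
    """Return prefix, molecule and method for a RefSeq ID, or None."""
    match = REFSEQ_PATTERN.fullmatch(identifier.strip().upper())
    if match is None:
        return None
    prefix = match.group(1)
    if prefix not in REFSEQ_PREFIXES:
        return None
    molecule, method = REFSEQ_PREFIXES[prefix]
    return {"prefix": prefix, "molecule": molecule, "method": method}
```

Calling `describe_refseq("NM_012345")` reports a curated mRNA record, and identifiers that lack the underscore form, such as INSDC accessions, return `None`.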

The complete list of accession numbers is available at http://www.ncbi.nlm.nih.gov/Sequin/acc.html.
