@donbr
Last active March 24, 2025 01:20

Biological Database Classification

Biological databases are commonly classified in tiers, much like the primary, secondary, and tertiary source distinction used in library science:

  • Primary Databases: Contain raw experimental data directly submitted by researchers
  • Secondary Databases: Contain processed, analyzed, and annotated data derived from primary databases
  • Tertiary Databases: Integrate and synthesize information from multiple primary and secondary sources
  • Mixed Databases: Incorporate aspects of multiple classification levels
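The four tiers can be read as a decision rule over what a resource stores. Here is a minimal Python sketch of that rule, assuming three hypothetical boolean traits (`accepts_submissions`, `derives_from_primary`, `integrates_sources`). These traits are illustrative only and do not come from any database's API, and the logic is one reading of the definitions above rather than an established standard:

```python
from enum import Enum


class Tier(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    TERTIARY = "tertiary"
    MIXED = "mixed"


def classify(accepts_submissions: bool,
             derives_from_primary: bool,
             integrates_sources: bool) -> Tier:
    """Map three illustrative (hypothetical) traits onto a tier."""
    traits = (accepts_submissions, derives_from_primary, integrates_sources)
    if not any(traits):
        raise ValueError("at least one trait must be true")
    # A resource playing more than one role spans several tiers.
    if sum(traits) > 1:
        return Tier.MIXED
    if accepts_submissions:
        return Tier.PRIMARY
    if derives_from_primary:
        return Tier.SECONDARY
    return Tier.TERTIARY


# Example: a resource that both takes submissions and adds derived annotation.
print(classify(True, True, False).value)
```

Real resources rarely fit three booleans cleanly, so treat the result as a starting label to be checked by hand.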

Primary Databases

| Database | Main Focus | Description |
| --- | --- | --- |
| DDBJ | Nucleotide Sequences | DNA Data Bank of Japan, one of the three international nucleotide sequence databases |
| EMBL | Nucleotide Sequences | European Molecular Biology Laboratory nucleotide sequence database (now part of the European Nucleotide Archive) |
| GenBank | Nucleotide Sequences | NIH genetic sequence database, annotated collection of publicly available DNA sequences |
| GEO | Gene Expression | Gene Expression Omnibus, repository for high-throughput gene expression and related data |
| PDB | Protein Structures | Protein Data Bank, repository of 3D structural data of biological macromolecules |
| PubMed Central | Literature | Repository of biomedical and life sciences journal literature |
| RefSeq | Reference Sequences | NCBI Reference Sequence Database, collection of reference sequence standards (curated from archival records, so often classed as secondary) |
| TrEMBL (UniProt) | Protein Sequences | Automatically annotated protein sequence database (part of UniProt) |

Secondary Databases

| Database | Main Focus | Description |
| --- | --- | --- |
| BioGRID | Interaction Data | Biological General Repository for Interaction Datasets, contains curated protein, genetic, and chemical interactions |
| COMPARTMENTS | Protein Localization | Database of protein subcellular localization evidence |
| COG | Orthologous Groups | Clusters of Orthologous Groups, classification of proteins encoded in complete genomes |
| DIP | Protein Interactions | Database of Interacting Proteins, experimentally determined protein-protein interactions |
| DISEASES | Disease Associations | Integrates evidence on disease-gene associations from various sources |
| eggNOG | Orthologous Groups | Evolutionary genealogy of genes: Non-supervised Orthologous Groups, hierarchical classification of proteins |
| Gene Ontology | Functional Annotations | Controlled vocabulary of gene and gene product attributes across species |
| HPRD | Protein Reference | Human Protein Reference Database, curated proteomic information for human proteins |
| HUGO (HGNC) | Gene Nomenclature | HUGO Gene Nomenclature Committee, standardized nomenclature for human genes |
| IntAct | Molecular Interactions | Database of molecular interaction data |
| InterPro | Protein Families | Integrates protein signatures from multiple databases to classify proteins |
| MINT | Molecular Interactions | Molecular INTeraction database, experimentally verified protein-protein interactions |
| OMIM | Human Disease Genes | Online Mendelian Inheritance in Man, catalog of human genes and genetic disorders |
| Pfam | Protein Families | Database of protein families represented by multiple sequence alignments and HMMs |
| proGenomes | Genome Classification | Database of prokaryotic genomes with taxonomic classification |
| ProteomeHD | Co-regulation Data | Contains co-regulation data for human proteins |
| SIMAP | Protein Similarity | Similarity Matrix of Proteins, precalculated similarity relationships between proteins |
| SMART | Protein Domains | Simple Modular Architecture Research Tool, identification and annotation of protein domains |
| SWISS-MODEL | Protein Structures | Database of annotated 3D protein structure homology models |
| TISSUES | Protein Expression | Database of protein expression patterns in tissues |

Tertiary Databases

| Database | Main Focus | Description |
| --- | --- | --- |
| BioCyc | Pathways/Genomes | Collection of organism-specific Pathway/Genome Databases integrating genomic and metabolic pathway data |
| KEGG | Pathways and Systems | Kyoto Encyclopedia of Genes and Genomes, integrates genomic, chemical, and system functional information |
| Reactome | Biological Pathways | Curated and peer-reviewed database of reactions, pathways and biological processes |
| WikiPathways | Biological Pathways | Community-curated open pathway database |

Mixed Databases

| Database | Main Focus | Description |
| --- | --- | --- |
| Ensembl | Genome Browser/Annotation | Genome browser that produces and maintains automatic annotation on selected eukaryotic genomes |
| FlyBase | Model Organism Database | Database of genetic, genomic, and functional data for Drosophila |
| SGD | Model Organism Database | Saccharomyces Genome Database, comprehensive resource for yeast biology |
| Swiss-Prot (UniProt) | Protein Sequences | Manually annotated, high-quality protein sequence database (part of UniProt) |
| UniProt | Protein Sequences/Function | Universal Protein Resource, comprehensive resource of protein sequence and function |
| WormBase | Model Organism Database | Database for Caenorhabditis elegans and related nematodes |

This classification reflects the nature of the data stored in each database, how it's processed, and its relationship to other data sources. Many databases have evolved over time to incorporate aspects of multiple categories as biological data becomes increasingly interconnected.
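For scripted use, the classification can be held as a simple lookup. The sketch below copies a handful of rows from the tables above. The names `TIER_BY_DATABASE` and `databases_in` are illustrative, and the mapping reflects this document's grouping rather than any official registry:

```python
# Subset of the tables above; extend as needed.
TIER_BY_DATABASE = {
    "GenBank": "primary",
    "PDB": "primary",
    "GEO": "primary",
    "Pfam": "secondary",
    "InterPro": "secondary",
    "KEGG": "tertiary",
    "Reactome": "tertiary",
    "Ensembl": "mixed",
    "UniProt": "mixed",
}


def databases_in(tier: str) -> list[str]:
    """Return the database names assigned to a tier, sorted alphabetically."""
    return sorted(name for name, t in TIER_BY_DATABASE.items() if t == tier)


print(databases_in("tertiary"))  # prints ['KEGG', 'Reactome']
```

Because databases shift categories as they evolve, keeping the mapping in one place makes it easy to revise.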

Based on my research, I've classified the biological databases you listed into primary, secondary, tertiary, and mixed categories, with a description of each database's focus and purpose.

Primary databases contain raw experimental data directly submitted by researchers, such as nucleotide sequences (GenBank, EMBL, DDBJ), protein structures (PDB), and gene expression data (GEO).

Secondary databases provide processed and analyzed information derived from primary data, including protein families (InterPro, Pfam), functional annotations (Gene Ontology), and interaction networks (BioGRID, DIP, MINT).

Tertiary databases integrate information across multiple sources to provide higher-level biological insights, particularly pathway and systems information (KEGG, BioCyc, WikiPathways).

Mixed databases span multiple categories, especially model organism databases (FlyBase, WormBase, SGD) and comprehensive resources like UniProt that include both primary sequence data and secondary annotations.

This classification should serve as a reference for the different types of biological databases listed here and how they relate to each other in the information hierarchy.
