@tomsing1
Created March 10, 2022 02:03
Notes on ENA's REST apis for computational retrieval of NGS metadata

ENA APIs

The ENA has multiple APIs. The most important ones are:

  1. ENA Portal API: search ENA's databases using (potentially complex) queries.
  2. ENA Browser API: retrieve entire records programmatically

In addition, quick summaries of metadata and file retrieval locations can be retrieved as ENA file reports.

ENA object relationships

To use the APIs, it is helpful to understand the relationships between different objects in ENA:

  • Run: a lane (or equivalent) on a sequencing machine, used to attach sequence read data to experiments.
  • Experiment: represents the library solution that is created from a sample and used in a sequencing experiment. The experiment object contains details about the sequencing platform and library protocols.
  • Study: A study groups together experiments to allow them to be cited together in a publication.
  • Sample: A biological sample that was used to create a library (= experiment). It is common to have multiple libraries and sequencing experiments for a single sample.

In summary:

  • One or more runs are part of an experiment.
  • One or more experiments are part of a study.
  • One or more experiments are associated with a sample.
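
The relationships above can be sketched as a small data model. This is only an illustration of the hierarchy, not an ENA schema: the class names, attributes, and helper functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    accession: str  # a run accession (ERR/SRR/DRR prefix)


@dataclass
class Experiment:
    accession: str
    sample_accession: str  # the sample the library was created from
    runs: List[Run] = field(default_factory=list)  # one or more runs


@dataclass
class Study:
    accession: str
    experiments: List[Experiment] = field(default_factory=list)


def experiments_for_sample(study: Study, sample_accession: str) -> List[Experiment]:
    """Return all experiments in the study that were created from one sample."""
    return [e for e in study.experiments if e.sample_accession == sample_accession]


def run_count(study: Study) -> int:
    """Count the runs across all experiments of a study."""
    return sum(len(e.runs) for e in study.experiments)
```

Because several experiments can point to the same sample, the sample is stored as a reference on the experiment rather than as a parent object.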

ENA Portal API: search

The ENA Portal API can be used to search ENA's databases. It returns many (but not all) of the fields stored for a record.

Building a query

A query for the ENA REST API can contain multiple fields. The Advanced search web application can be used to explore the different fields and allowed values to construct complex queries.

Each query starts with a result type that determines what fields to search against.

The following query returns a list of all available result types:

https://www.ebi.ac.uk/ena/portal/api/results?dataPortal=ena

At the time of writing, it returned the following:

resultId	description
analysis_study	Studies used for nucleotide sequence analyses from reads
analysis	Nucleotide sequence analyses from reads
assembly	Genome assemblies
coding	Coding sequences
wgs_set	Genome assembly contig sets (WGS)
tsa_set	Transcriptome assembly contig sets (TSA)
tls_set	Targeted locus study contig sets (TLS)
environmental	Environmental samples
noncoding	Non-coding sequences
read_study	Studies used for raw reads
read_experiment	Experiments used for raw reads
read_run	Raw reads
sample	Samples
sequence	Nucleotide sequences
study	Studies
taxon	Taxonomic classification
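
The same list can be retrieved from a script. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and it assumes the endpoint still returns tab-separated text with the `resultId` and `description` columns shown above.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

PORTAL_API = "https://www.ebi.ac.uk/ena/portal/api"


def results_url(data_portal="ena"):
    """Build the URL that lists the available result types."""
    return f"{PORTAL_API}/results?{urlencode({'dataPortal': data_portal})}"


def parse_tsv(text):
    """Parse a tab-separated response into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def list_result_types(timeout=30):
    """Download and parse the result types (performs a network request)."""
    with urlopen(results_url(), timeout=timeout) as response:
        return parse_tsv(response.read().decode("utf-8"))
```

Calling `list_result_types()` should yield one dict per row of the table, e.g. for printing `row["resultId"]` next to `row["description"]`.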

If we are interested in raw reads, we choose the read_run result type. This choice determines which database fields can be queried. The full set of fields that can be searched for a result type is available at the following URL:

https://www.ebi.ac.uk/ena/portal/api/searchFields?result=read_run

At the time of writing, it returned the following:

columnId	description
accession	accession number
altitude	Altitude (m)
assembly_quality	Quality of assembly
assembly_software	Assembly software
base_count	number of base pairs
binning_software	Binning software
bio_material	identifier for biological material including institute and collection code
broker_name	broker name
cell_line	cell line from which the sample was obtained
cell_type	cell type from which the sample was obtained
center_name	Submitting center
checklist	checklist name (or ID)
collected_by	name of the person who collected the specimen
collection_date	date that the specimen was collected
collection_date_submitted	Collection date submitted
completeness_score	Completeness score (%)
contamination_score	Contamination score (%)
country	locality of sample isolation: country names, oceans or seas, followed by regions and localities
cultivar	cultivar (cultivated variety) of plant from which sample was obtained
culture_collection	identifier for the sample culture including institute and collection code

[truncated]

Retrieving search results in XML format

To obtain the search results in XML format, the query can also be used in the ENA Browser API (see below), via its search endpoint:

https://www.ebi.ac.uk/ena/browser/api/xml/search?result=read_run&query=secondary_study_accession="SRP212869"

Searching for raw reads (FASTQ/BAM files)

To return a list of all accessions with raw reads (i.e. runs) for the same study, we can perform a search in the Portal API against the read_run result:

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=secondary_study_accession="SRP212869"
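
When such URLs are built in a script, the double quotes and the `=` inside the query value need to be percent-encoded. A minimal sketch, assuming Python's standard library (the function name `search_url` and the two constants are illustrative):

```python
from urllib.parse import urlencode

PORTAL_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"
BROWSER_XML_SEARCH = "https://www.ebi.ac.uk/ena/browser/api/xml/search"


def search_url(result, query, base=PORTAL_SEARCH):
    """Build a search URL, percent-encoding the query expression."""
    return base + "?" + urlencode({"result": result, "query": query})


# Portal API search for all runs of the study
portal = search_url("read_run", 'secondary_study_accession="SRP212869"')

# Same query against the Browser API's XML search endpoint
browser = search_url(
    "read_run", 'secondary_study_accession="SRP212869"', base=BROWSER_XML_SEARCH
)
```

Here `"` is encoded as `%22` and `=` as `%3D`, so the whole expression travels as the value of the `query` parameter.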

File reports

The filereport API endpoint offers summarised reports about a provided accession. It bypasses the search and fetches information directly from a data cache, increasing the speed of delivery.

Note: The result=read_run parameter must be included in the URL.

For example, the following URL retrieves the file report for study SRP212869 in TSV format, including all run_accessions and the FTP locations of their FASTQ files:

https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRP212869&result=read_run

File reports are available for the following accession types:

  • Study accessions (ERP, SRP, DRP, PRJ prefixes)
  • Experiment accessions (ERX, SRX, DRX prefixes)
  • Sample accessions (ERS, SRS, DRS, SAM prefixes)
  • Run accessions (ERR, SRR, DRR prefixes)
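
The prefix list above can be used to check an accession before requesting its file report. This is a sketch only: the helper names are hypothetical, and the prefix table simply mirrors the list in these notes rather than any official ENA validation rule.

```python
from urllib.parse import urlencode

# Accession prefixes for which file reports are available (from the list above)
ACCESSION_PREFIXES = {
    "study": ("ERP", "SRP", "DRP", "PRJ"),
    "experiment": ("ERX", "SRX", "DRX"),
    "sample": ("ERS", "SRS", "DRS", "SAM"),
    "run": ("ERR", "SRR", "DRR"),
}


def accession_type(accession):
    """Return 'study', 'experiment', 'sample' or 'run', or None if unrecognised."""
    prefix = accession.strip().upper()[:3]
    for kind, prefixes in ACCESSION_PREFIXES.items():
        if prefix in prefixes:
            return kind
    return None


def filereport_url(accession, result="read_run"):
    """Build the file report URL for a supported accession."""
    if accession_type(accession) is None:
        raise ValueError(f"No file report available for accession: {accession!r}")
    params = urlencode({"accession": accession.strip(), "result": result})
    return "https://www.ebi.ac.uk/ena/portal/api/filereport?" + params
```

For example, `filereport_url("SRP212869")` reproduces the URL shown earlier in this section.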

Examples

  1. Find run accessions and descriptions/titles for all runs in the study with secondary_study_accession SRP212869:
https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=secondary_study_accession=SRP212869&format=tsv

We can specify the result format as ‘&format=tsv’ or ‘&format=json’. TSV is the default.

  2. Find the same data using the official ENA study accession PRJNA552470 instead:
https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=study_accession=PRJNA552470&format=tsv
  3. Specify additional fields

Hint: Use the Advanced search web application to explore the available fields.

Look up just a few fields:

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=study_accession=PRJNA552470&fields=sample_accession,experiment_accession,study_accession

or lots of fields:

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=study_accession=PRJNA552470&fields=accession,center_name,description,experiment_accession,experiment_alias,fastq_aspera,fastq_ftp,instrument_model,library_layout,library_selection,library_source,library_strategy,parent_study,read_count,sample_accession,sample_alias,sample_description,sample_title,secondary_sample_accession,secondary_study_accession,scientific_name,study_accession,study_alias,study_title,submission_accession,tax_id,tissue_type&format=tsv
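
Long field lists are easier to maintain in a script than in a hand-written URL. The sketch below builds the same kind of URL from a Python list and parses a TSV response. Two assumptions are made here: that the response contains a `run_accession` column, and that `fastq_ftp` holds one or more locations separated by semicolons. The function names are illustrative.

```python
import csv
import io
from urllib.parse import urlencode


def fields_search_url(study_accession, fields, fmt="tsv"):
    """Build a read_run search URL that requests the given fields."""
    params = {
        "result": "read_run",
        "query": f"study_accession={study_accession}",
        "fields": ",".join(fields),
        "format": fmt,
    }
    # Keep ',' and '=' unescaped so the URL matches the form used in these notes
    return "https://www.ebi.ac.uk/ena/portal/api/search?" + urlencode(params, safe=",=")


def fastq_locations(tsv_text):
    """Map each run accession to its list of FASTQ locations."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        row["run_accession"]: [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        for row in reader
    }
```

Paired-end runs would then map to two locations and single-end runs to one, provided the semicolon assumption holds.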

ENA Browser API: record retrieval

The ENA Browser API can be used to retrieve entire ENA records in EMBL flat file, FASTA or XML format, depending on the record type.

The endpoint to retrieve records by accession in XML format is:

https://www.ebi.ac.uk/ena/browser/api/xml/<accession>

Examples

  1. Return the record for Sample SAMN03401168 in XML format
https://www.ebi.ac.uk/ena/browser/api/xml/SAMN03401168
  2. Return records for multiple Experiment accessions in XML format by concatenating the accessions with commas:
https://www.ebi.ac.uk/ena/browser/api/xml/SRX952421,SRX952422
  3. Return records for multiple Runs:
https://www.ebi.ac.uk/ena/browser/api/xml/SRR11028503,SRR11028504
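
The record URLs above can also be assembled and the returned XML inspected from a script. This is a rough sketch using Python's standard library. It assumes that each record is a direct child of the XML root element and carries an `accession` attribute, which should be checked against a real response. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

BROWSER_XML = "https://www.ebi.ac.uk/ena/browser/api/xml/"


def record_url(*accessions):
    """Build the Browser API URL for one or more comma-separated accessions."""
    if not accessions:
        raise ValueError("At least one accession is required")
    return BROWSER_XML + ",".join(accessions)


def list_records(xml_text):
    """Return (tag, accession) for every direct child of the XML root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.get("accession")) for child in root]
```

`record_url("SRX952421", "SRX952422")` gives the URL from the second example, and `list_records` can be applied to the text downloaded from it.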