The INSDC Reference Sequence Public Dataset provides access to biological reference sequences submitted to the INSDC, with each sequence identified by its checksum. The dataset includes both the raw sequences and their associated metadata.
insdc-reference-sequences
|-- sequence
|   |-- 023e92ccde5f86f31ea0844a92dddb86
|   |-- 8238c4f8a7915991ac98d769837f9b4b91da2a2297598e50
|   |-- bf237796417701948b5f6005d72ca5a0376f3c89e95a1c4f
|   |-- c2424a8ffca9cf8f9ef46cfdd5f69efede74b44e820c178a
|   |-- dbe6100b83178f3ac561d98c2dfc41a0
|   |-- ff734bf70e13affa85a272fda6659a5f
|   |__ *
|__ metadata
    |-- json
    |   |-- 023e92ccde5f86f31ea0844a92dddb86.json
    |   |-- 8238c4f8a7915991ac98d769837f9b4b91da2a2297598e50.json
    |   |-- bf237796417701948b5f6005d72ca5a0376f3c89e95a1c4f.json
    |   |-- c2424a8ffca9cf8f9ef46cfdd5f69efede74b44e820c178a.json
    |   |-- dbe6100b83178f3ac561d98c2dfc41a0.json
    |   |-- ff734bf70e13affa85a272fda6659a5f.json
    |   |__ *.json
    |__ csv
        |-- AAIYXD01.full.csv
        |-- CABIKC01.full.csv
        |-- LVHX01.full.csv
        |__ *.full.csv
Logs of upload processing events are available from s3://PDS/metadata/csv. Each file represents a single load attempt (generally containing all sequences in an assembly), and each line is a loaded sequence. The following table outlines the data columns for each sequence record.
# | Field Name | Description | Example |
---|---|---|---|
1 | trunc512 | Secure Hash Algorithm (SHA) 512-bit hex-string digest of sequence, truncated to 48 characters | b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c |
2 | md5 | Message Digest (MD5) hex-string digest of sequence (32 characters) | cd8d02e2d8af721bed2ba9392a96da0e |
3 | length | Sequence length in base pairs | 1470266 |
4 | sha512 | Secure Hash Algorithm (SHA) 512-bit hex-string digest of sequence (128 characters) | b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c1d527fd895f081d1109da900101f323d142a407ef22cbfb6c2a174eb796217d1afa7fbbe1564787a |
5 | trunc512_base64 | Base64 representation of trunc512 digest (32 characters) | uQRvw_tBfxFNfhCGN8RIshTXi3peNFx8 |
6 | insdc | INSDC Versioned Accession Number | CABIKC010000001.1 |
7 | ena_type | Record type | expanded_con |
8 | species | Human-readable species taxonomic name (i.e. Genus species) | "Saccharomyces cerevisiae" |
9 | biosample | BioSample Accession | SAMEA5816324 |
10 | taxon | NCBI Taxonomy species identifier | 4932 |
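The relationship between the checksum columns can be illustrated with a short Python sketch. The function names are hypothetical, and the normalization step (uppercase ASCII with whitespace removed before hashing) is an assumption rather than something stated on this page. The example values in the table do indicate that trunc512_base64 uses the URL-safe Base64 alphabet (`-` and `_`).

```python
import base64
import hashlib


def trunc512_to_base64(trunc512_hex: str) -> str:
    """Convert a 48-character trunc512 hex digest to URL-safe Base64 (32 characters)."""
    return base64.urlsafe_b64encode(bytes.fromhex(trunc512_hex)).decode("ascii")


def sequence_digests(sequence: str) -> dict:
    """Compute the checksum columns for one sequence string.

    Assumption: digests are taken over the uppercase ASCII sequence
    with all whitespace (including line breaks) removed.
    """
    data = "".join(sequence.split()).upper().encode("ascii")
    sha512 = hashlib.sha512(data).hexdigest()
    trunc512 = sha512[:48]  # first 48 hex characters = first 24 bytes
    return {
        "trunc512": trunc512,
        "md5": hashlib.md5(data).hexdigest(),
        "length": len(data),
        "sha512": sha512,
        "trunc512_base64": trunc512_to_base64(trunc512),
    }


# The trunc512 example from the table maps to its trunc512_base64 example.
print(trunc512_to_base64("b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c"))
```

Because trunc512 is simply a prefix of sha512, either column can be derived from the other's source digest without re-reading the sequence.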
An example CSV of loaded sequences is displayed below as a table. The table shows a subset of sequences from GCA_902192315.1, a Saccharomyces cerevisiae genome assembly.
trunc512 | md5 | length | sha512 | trunc512_base64 | insdc | ena_type | species | biosample | taxon |
---|---|---|---|---|---|---|---|---|---|
b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c | cd8d02e2d8af721bed2ba9392a96da0e | 1470266 | b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c1d527fd895f081d1109da900101f323d142a407ef22cbfb6c2a174eb796217d1afa7fbbe1564787a | uQRvw_tBfxFNfhCGN8RIshTXi3peNFx8 | CABIKC010000001.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
8decdfa7b43090448ae9411a77e2105390855bd0770e0ded | fa6ea9d18d255f0586cf967071bacf8a | 1062691 | 8decdfa7b43090448ae9411a77e2105390855bd0770e0ded4ec8d19bc17e0d2c2af4c9c38c694502d061bd310547020df5ded87641450a20a6e24985bef5904c | jezfp7QwkESK6UEad-IQU5CFW9B3Dg3t | CABIKC010000002.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
e64dd23642d2f4fcd9646eaf844f8b1b66e8dc6ca7199e14 | e6d87174b53bc10a65e8b30363fe994f | 1092091 | e64dd23642d2f4fcd9646eaf844f8b1b66e8dc6ca7199e143e84b54652c583958f642604497bdfcf5491c9879da9b8d3e46b8a3f010859e0d27367ab82f7c808 | 5k3SNkLS9PzZZG6vhE-LG2bo3GynGZ4U | CABIKC010000003.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
b9cc12a05937d362b5a55dc4a38850782c034cdf682bd465 | 178f3cd414e0f23b97397f705fade52d | 912642 | b9cc12a05937d362b5a55dc4a38850782c034cdf682bd465078fc5bbb321b09c4ad3f9717c52778d08ffb03673c317e3ab18836d110f1db86965c1840ee0bb66 | ucwSoFk302K1pV3Eo4hQeCwDTN9oK9Rl | CABIKC010000004.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
ae6c673a1878afc4bf4a2df1ed9667d116e1e294015f8875 | bba885139f4796326100ff9db55b9235 | 815240 | ae6c673a1878afc4bf4a2df1ed9667d116e1e294015f8875c6db3be3b313adb9f854b41d10faf4cc796621640158eef5b9f467dd4ff107b7bd9988f5be105b16 | rmxnOhh4r8S_Si3x7ZZn0Rbh4pQBX4h1 | CABIKC010000005.1 | expanded_con | "Saccharomyces cerevisiae" | SAMEA5816324 | 4932 |
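A load log like the one above can be read into per-sequence records with Python's standard csv module. This is a minimal sketch under a few assumptions: the files are comma-delimited with the ten columns in the order listed, species names are double-quoted, and a header row may or may not be present (the sketch skips one if it matches the column names). `parse_load_log` and `FIELDS` are illustrative names, not part of the dataset.

```python
import csv
import io

# Column order as documented for each sequence record.
FIELDS = ["trunc512", "md5", "length", "sha512", "trunc512_base64",
          "insdc", "ena_type", "species", "biosample", "taxon"]


def parse_load_log(csv_text: str) -> list:
    """Parse the text of one *.full.csv load log into a list of dicts."""
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or row == FIELDS:  # skip blank lines and an optional header row
            continue
        record = dict(zip(FIELDS, row))
        record["length"] = int(record["length"])
        record["taxon"] = int(record["taxon"])
        records.append(record)
    return records


# First row of the example table, written as a single CSV line.
SAMPLE = (
    "b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c,cd8d02e2d8af721bed2ba9392a96da0e,1470266,"
    "b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c1d527fd895f081d1"
    "109da900101f323d142a407ef22cbfb6c2a174eb796217d1afa7fbbe1564787a,"
    'uQRvw_tBfxFNfhCGN8RIshTXi3peNFx8,CABIKC010000001.1,expanded_con,'
    '"Saccharomyces cerevisiae",SAMEA5816324,4932\n'
)

records = parse_load_log(SAMPLE)
print(records[0]["insdc"], records[0]["length"])
```

Keying each record by column name avoids relying on positional indexes when later code looks up identifiers or accessions.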
You can download sequences and metadata with the curl command-line tool. Be sure to include the -L flag so that curl follows redirects: a request for an MD5 id file is redirected to the sequence content, which is stored under the TRUNC512 id file.
curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0
curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json
You can use the requests library in Python to download sequences and metadata.
import requests

# requests follows redirects by default (the equivalent of curl -L)
url_sequence = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0"
response_sequence = requests.get(url_sequence)
response_sequence.raise_for_status()  # raise an exception on HTTP errors
print(response_sequence.text)

url_metadata = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json"
response_metadata = requests.get(url_metadata)
response_metadata.raise_for_status()
print(response_metadata.text)
You can use the httr library in R to download sequences and metadata.
library(httr)
url.sequence <- "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0"
response.sequence <- GET(url.sequence)
content(response.sequence, "text")
url.metadata <- "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json"
response.metadata <- GET(url.metadata)
content(response.metadata, "text")
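You can use the HttpURLConnection class in Java to download sequences and metadata.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Fragment: place the statements below inside a method that declares `throws IOException`
String sequence = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/dbe6100b83178f3ac561d98c2dfc41a0";
URL urlSequence = new URL(sequence);
HttpURLConnection connectionSequence = (HttpURLConnection) urlSequence.openConnection();
connectionSequence.setRequestMethod("GET");
InputStream sequenceStream = connectionSequence.getInputStream(); // sends the request; read the sequence from this stream
String metadata = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/json/dbe6100b83178f3ac561d98c2dfc41a0.json";
URL urlMetadata = new URL(metadata);
HttpURLConnection connectionMetadata = (HttpURLConnection) urlMetadata.openConnection();
connectionMetadata.setRequestMethod("GET");
InputStream metadataStream = connectionMetadata.getInputStream(); // sends the request; read the JSON from this stream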
Given a genome assembly of interest, the csv data can be used to get checksums, and therefore raw sequence, for all sequences in the assembly. For example, to locate all sequences for assembly GCA_902192315.1, we can request the following to access the CSV:
curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/metadata/csv/CABIKC01.full.csv
The first and second columns of the resulting CSV give us the TRUNC512 and MD5 identifiers, respectively, of all sequences in the assembly. We can use either identifier to download each sequence. Given that the first sequence has a TRUNC512 id of b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c, we can request:
curl -L http://ga4gh-refget.s3-website.us-east-2.amazonaws.com/sequence/b9046fc3fb417f114d7e108637c448b214d78b7a5e345c7c
The above process can be repeated for all sequences to collect and reconstruct the entire assembly.
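Repeating the request for every row can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference client: it uses only the standard library (urllib follows redirects by default), assumes the CSV is comma-delimited with TRUNC512 in the first column and the INSDC accession in the sixth, and uses hypothetical function names and a hypothetical `.seq` output suffix.

```python
import csv
import io
import urllib.request
from pathlib import Path

BASE_URL = "http://ga4gh-refget.s3-website.us-east-2.amazonaws.com"


def sequence_urls(csv_text: str) -> list:
    """Return (accession, sequence URL) pairs for every row of a load log."""
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0] != "trunc512":  # skip blank lines and an optional header row
            pairs.append((row[5], f"{BASE_URL}/sequence/{row[0]}"))
    return pairs


def download_assembly(prefix: str, out_dir: str) -> list:
    """Fetch <prefix>.full.csv, then save every listed sequence under out_dir."""
    with urllib.request.urlopen(f"{BASE_URL}/metadata/csv/{prefix}.full.csv") as resp:
        csv_text = resp.read().decode("utf-8")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for accession, url in sequence_urls(csv_text):
        target = out / f"{accession}.seq"  # hypothetical naming scheme
        with urllib.request.urlopen(url) as resp:
            target.write_bytes(resp.read())
        paths.append(target)
    return paths
```

Calling `download_assembly("CABIKC01", "GCA_902192315.1")` would then write one file per sequence, named by accession, into the given directory.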