As part of the 2019 Google Summer of Code
The Global Alliance for Genomics and Health (GA4GH) helps accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading genome institutes and centers with IT industry leaders to create global standards and tools for the secure, privacy-respecting and interoperable sharing of genomic data.
The European Variation Archive (EVA) is an open-access database of all types of genetic variation data from all species. All users can download data from any study, or submit their own data to the archive. They can also query all variants in the EVA by study, gene, chromosomal location or dbSNP identifier using the Variant Browser.
The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute.
The project implements a mechanism to keep the EVA in sync with the latest human data submitted to dbSNP. Given a dbSNP FTP directory with the human variant information, the pipeline parses the JSON for each of the 24 human chromosomes and writes the variants into the EVA archive.
Once imported into the EVA archive, this information can be distributed via the EVA implementations of the GA4GH htsget and Beacon API specifications, as well as through the EVA website, and is available for public use.
The pipeline is a Spring Batch job application with three simple steps: 1) fetch an input JSON line, 2) process the JSON to preserve only the required variant information, 3) persist the information into a data store. The application repeats these steps for all the records in a given `.json.bz2` file. Along with this, there are listeners in place for logging and tracing purposes. A minimal sketch of such a chunk-oriented step is shown below.
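To make the three-step structure concrete, here is a minimal sketch of how such a chunk-oriented Spring Batch step could be wired. The bean names, the `Variant` placeholder and the chunk size are illustrative assumptions, not the actual EVA pipeline classes.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Illustrative placeholder for the pipeline's domain object (fields elided).
class Variant { }

@Configuration
@EnableBatchProcessing
public class ImportJobConfiguration {

    // One chunk-oriented step: read a JSON line, reduce it to a Variant, persist it.
    @Bean
    public Step importVariantsStep(StepBuilderFactory steps,
                                   ItemReader<String> jsonLineReader,
                                   ItemProcessor<String, Variant> variantProcessor,
                                   ItemWriter<Variant> variantWriter) {
        return steps.get("importVariantsStep")
                .<String, Variant>chunk(100)  // commit every 100 records
                .reader(jsonLineReader)       // 1) fetch an input JSON line
                .processor(variantProcessor)  // 2) keep only the required variant information
                .writer(variantWriter)        // 3) persist into the data store
                .build();
    }

    @Bean
    public Job importJob(JobBuilderFactory jobs, Step importVariantsStep) {
        return jobs.get("importJob").start(importVariantsStep).build();
    }
}
```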
Concretely:

- Construct an object model for dbSNP 2.0 JSON and parse the JSONs to objects in that model
  - Related PR
  - Summary: The input source `.json.bz2` is decompressed on the fly. Using the Jackson JSON processing library, the processor prepares a plain old Java object model (see the parsing sketch after this list). The processor returns null when the variant needs to be ignored (for example, on an assembly accession and RefSeq accession mismatch). Table summary of field derivation:

    | Field | JSON tree |
    | --- | --- |
    | Accession | `refsnp_id` |
    | Taxonomy accession | 9606 (ref) |
    | Assembly accession | `primary_snapshot_data` > `placements_with_allele` (list) > `placement_annot` > `seq_id_traits_by_assembly` (list, 1st element) > `assembly_accession` |
    | Contig | `primary_snapshot_data` > `placements_with_allele` (list) > `alleles` (list, 1st element) > `allele` > `spdi` > `seq_id` |
    | Start | `primary_snapshot_data` > `placements_with_allele` (list) > `alleles` (list, 1st element) > `allele` > `spdi` > `position` |
    | Type | `primary_snapshot_data` > `variant_type` |
    | Created date | `create_date` |
- Perform desired business logic on the model and handle edge cases
  - Related PRs
  - Summary: The processor tries to replace the `contig` attribute of the variant with its GenBank equivalent (see the processor sketch after this list). Unit tests cover the edge cases that can be expected here. Step progress listeners report how many variants have been read and written so far.
- Persist the model as a document in a MongoDB collection
  - Related PR
  - Summary: In the final step, the variant writer persists the derived information into the MongoDB data store (collection name: `DbsnpClusteredVariantEntity`). Duplicate records are identified and logged for reporting to NCBI (see the writer sketch after this list).
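As a hedged illustration of the first step, the snippet below parses one line of dbSNP 2.0 JSON into a plain old Java object with Jackson while decompressing the `.json.bz2` on the fly (here via Apache Commons Compress). Only the JSON property names come from the field-derivation table above; the class names, file name and the rest of the wiring are assumptions.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

// Skeleton object model covering a few of the fields from the table above;
// every other JSON property is ignored.
@JsonIgnoreProperties(ignoreUnknown = true)
class RefsnpRecord {
    @JsonProperty("refsnp_id")
    public String refsnpId;                          // -> Accession

    @JsonProperty("create_date")
    public String createDate;                        // -> Created date

    @JsonProperty("primary_snapshot_data")
    public PrimarySnapshotData primarySnapshotData;
}

@JsonIgnoreProperties(ignoreUnknown = true)
class PrimarySnapshotData {
    @JsonProperty("variant_type")
    public String variantType;                       // -> Type
    // placements_with_allele and the rest would be modelled the same way
}

public class RefsnpJsonReaderExample {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Decompress on the fly; each line of the file is one refSNP JSON document.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new BZip2CompressorInputStream(new FileInputStream("refsnp-chr1.json.bz2")),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                RefsnpRecord record = mapper.readValue(line, RefsnpRecord.class);
                System.out.println(record.refsnpId + "\t" + record.primarySnapshotData.variantType);
            }
        }
    }
}
```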
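The contig replacement and the assembly-accession check can be pictured as a Spring Batch `ItemProcessor`: returning null from a processor filters the item out of the chunk, which matches the "return null to ignore the variant" behaviour described above. The `ContigMapping` lookup and the `Variant` accessors here are hypothetical, not the real EVA code; in a real job this processor could be chained after the JSON-parsing one, for example with a `CompositeItemProcessor`.

```java
import org.springframework.batch.item.ItemProcessor;

// Hypothetical RefSeq -> GenBank contig lookup.
interface ContigMapping {
    String getGenbankEquivalent(String refseqContig);
}

// Minimal variant stand-in with only the fields this processor touches.
class Variant {
    private String assemblyAccession;
    private String contig;

    public String getAssemblyAccession() { return assemblyAccession; }
    public void setAssemblyAccession(String assemblyAccession) { this.assemblyAccession = assemblyAccession; }
    public String getContig() { return contig; }
    public void setContig(String contig) { this.contig = contig; }
}

public class ContigProcessor implements ItemProcessor<Variant, Variant> {

    private final String expectedAssembly;
    private final ContigMapping contigMapping;

    public ContigProcessor(String expectedAssembly, ContigMapping contigMapping) {
        this.expectedAssembly = expectedAssembly;
        this.contigMapping = contigMapping;
    }

    @Override
    public Variant process(Variant variant) {
        // Ignore the variant on an assembly accession mismatch:
        // returning null drops the item from the chunk.
        if (!expectedAssembly.equals(variant.getAssemblyAccession())) {
            return null;
        }
        // Replace the RefSeq contig with its GenBank equivalent.
        variant.setContig(contigMapping.getGenbankEquivalent(variant.getContig()));
        return variant;
    }
}
```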
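Finally, a sketch of how a MongoDB item writer can log duplicate records instead of failing the job, relying on Spring Data's translation of duplicate-key errors. The class name, the per-item insert and the collection-name casing are assumptions rather than the actual EVA writer.

```java
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.item.ItemWriter;
import org.springframework.dao.DuplicateKeyException;
import org.springframework.data.mongodb.core.MongoOperations;

// `Variant` is the domain object from the sketches above.
public class VariantMongoWriter implements ItemWriter<Variant> {

    private static final Logger logger = LoggerFactory.getLogger(VariantMongoWriter.class);

    // Collection name as described above (casing assumed).
    private static final String COLLECTION = "dbsnpClusteredVariantEntity";

    private final MongoOperations mongoOperations;

    public VariantMongoWriter(MongoOperations mongoOperations) {
        this.mongoOperations = mongoOperations;
    }

    @Override
    public void write(List<? extends Variant> variants) {
        // Items are inserted one at a time so a duplicate key affects only
        // that record, not the whole chunk.
        for (Variant variant : variants) {
            try {
                mongoOperations.insert(variant, COLLECTION);
            } catch (DuplicateKeyException e) {
                // Logged for later reporting to NCBI instead of failing the job.
                logger.warn("Duplicate record: {}", variant);
            }
        }
    }
}
```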
Team
- Sundar Venkataraman (@sundarvenkata-EBI)
- Cristina Yenyxe Gonzalez Garcia (@cyenyxe)
- Jose Lopez (@jmmut)
- Rohan Kumar (@108krohan)
Code source
- #162 Add ref seq input parameter, ignore variant on assembly accession mismatch
- #156 Add processor to replace contig with Genbank accession
- #154 Mongodb item writer implementation
- #149 DbSNP 2.0 pipeline: add clustered variant reader and processor
Technologies used
- Java 8 development kit
- Spring Batch framework for batch processing
- MongoDB data store
- Maven build automation tool
- Jackson JSON processor
Key challenges for me were understanding the structure of the deeply nested JSON inputs, and understanding the VCFs and their attributes, among others. Analysis of sample JSONs proved really fruitful, as it helped me write manageable unit tests.
An extensive analysis of the data input can be accessed here. This sheet also provides sample analysis and result verification, comparing this pipeline's results with Ensembl's (link).
As I was new to Perl, studying the Ensembl code and matching its results with the new import pipeline was challenging, and at the same time exciting to have completed.
Edge cases are documented in a separate wiki (link).
Prior assumptions were also checked for all chromosomes that are to be ingested by the pipeline; you may find the Python script here.
Infinite thanks to Sundar and Cristina for their continued guidance, mentorship and feedback in driving the project to completion. I am also grateful to the Ensembl team, and in particular Helen, for answering the many wall-of-text doubts I had at the beginning. Thank you to the Global Alliance for Genomics and Health (GA4GH) and Google for the opportunity to contribute to bioinformatics and open source.
- The Global Alliance for Genomics and Health (GA4GH) (link)
- The European Variation Archive (link)
- 2019 Google Summer of Code project page (link)
- Schema + Sample Analysis (link)
- Python script to check assumptions (link)
- Ensembl Perl script (link)
- Wiki edge-case discussion (link)
- Project wiki (link)
- NCBI dbSNP 2.0 schema API spec (link)
- NCBI FTP input (link)
- NCBI FTP chromosome variations input source (link)