
@108krohan
Last active December 5, 2019 10:51
2019 Google Summer of Code project report

Data pipeline for exchange of human genomic variation between public repositories

as part of the 2019 Google Summer of Code

About the organisation

The Global Alliance for Genomics and Health (GA4GH)

The Global Alliance for Genomics and Health (GA4GH) helps accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading genome institutes and centers with IT industry leaders to create global standards and tools for the secure, privacy-respecting and interoperable sharing of genomic data.

European Variation Archive (EVA)

The European Variation Archive (EVA) is an open-access database of all types of genetic variation data from all species. All users can download data from any study, or submit their own data to the archive. They can also query all variants in the EVA by study, gene, chromosomal location or dbSNP identifier using the Variant Browser.

About the project

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute.

The project implements a mechanism to keep the EVA in sync with the latest human data submitted to dbSNP. Given a dbSNP FTP directory containing the human variant information, the pipeline parses a JSON file for each of the 24 human chromosomes and writes the variants from those files into the EVA archive.

Once imported into the EVA archive, this information can be distributed via the EVA implementations of the GA4GH htsget and Beacon API specifications, as well as through the EVA website, making it available for public use.

High level diagram

(high-level diagram image)

Tasks

The pipeline is a Spring Batch job application with three simple steps: 1) read an input JSON line, 2) process the JSON to keep only the required variant information, and 3) persist that information into a data store. The application repeats these steps for every record in a given .json.bz2 file. Listeners are also in place for logging and tracing purposes.
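The read–process–write flow can be sketched as follows. This is a minimal Python illustration of the loop, not the project's actual Spring Batch Java code; the field filtering shown is deliberately simplified.

```python
import bz2
import json

def read_records(path):
    """Stream one JSON record per line from a .json.bz2 file, decompressing on the fly."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield json.loads(line)

def process(record):
    """Keep only the required variant information; return None to skip a record."""
    if "refsnp_id" not in record:
        return None  # record cannot be imported, drop it
    return {"accession": record["refsnp_id"]}

def run_pipeline(path, store):
    """Read -> process -> write for every record in the file."""
    for record in read_records(path):
        variant = process(record)
        if variant is not None:
            store.append(variant)  # stand-in for the MongoDB write step
```

In the real pipeline the processor builds a full object model (see the field-derivation table below) and the writer persists into MongoDB; returning `None`/`null` from the processor is what lets a record be skipped without stopping the batch.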

Concretely

  • Construct an object model for dbSNP 2.0 JSON and parse the JSONs to objects in that model

    • Related PR

    • Summary The input source .json.bz2 is decompressed on the fly. Using the Jackson JSON processing library, the processor maps each record to a plain old Java object model. The processor returns null when a variant must be ignored for some reason (for example, a mismatch between the assembly accession and the RefSeq accession). The table below summarises how each field is derived:

      | Field | JSON tree |
      | --- | --- |
      | Accession | `refsnp_id` |
      | Taxonomy accession | 9606 (ref) |
      | Assembly accession | `primary_snapshot_data` > `placements_with_allele` (list) > `placement_annot` > `seq_id_traits_by_assembly` (list, 1st element) > `assembly_accession` |
      | Contig | `primary_snapshot_data` > `placements_with_allele` (list) > `alleles` (list, 1st element) > `allele` > `spdi` > `seq_id` |
      | Start | `primary_snapshot_data` > `placements_with_allele` (list) > `alleles` (list, 1st element) > `allele` > `spdi` > `position` |
      | Type | `primary_snapshot_data` > `variant_type` |
      | Created date | `create_date` |
  • Perform desired business logic on the model, handle edge-cases

  • Persist the model as a document in a MongoDB collection

    • Related PR
    • Summary In the final step, the variant writer persists the derived information into the MongoDB data store (collection name: DbsnpClusteredVariantEntity). Duplicate records are identified and logged for reporting to NCBI.
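Based on the field-derivation table above, the extraction step can be sketched as follows. This is a Python illustration of the table's paths; the actual project uses a Jackson object model in Java, and the skip condition shown (no placements) is just one example of the conditions under which the processor returns null.

```python
def derive_variant(record):
    """Pull the fields listed in the derivation table out of a dbSNP 2.0 JSON record.

    Returns None when the record should be ignored (here: no placements at all).
    """
    snapshot = record["primary_snapshot_data"]
    placements = snapshot["placements_with_allele"]
    if not placements:
        return None  # nothing to import for this record
    placement = placements[0]
    spdi = placement["alleles"][0]["allele"]["spdi"]  # 1st element, per the table
    return {
        "accession": record["refsnp_id"],
        "taxonomy_accession": 9606,  # human
        "assembly_accession": placement["placement_annot"]["seq_id_traits_by_assembly"][0]["assembly_accession"],
        "contig": spdi["seq_id"],
        "start": spdi["position"],
        "type": snapshot["variant_type"],
        "created_date": record["create_date"],
    }
```

The nested-list indexing (first placement, first allele, first assembly entry) mirrors the "(list, 1st element)" annotations in the table.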

Team

Code

Code source

Project wikis

PRs merged

Technologies

  • Java 8 (development kit)
  • Spring Batch (batch-processing framework)
  • MongoDB (data store)
  • Maven (build automation tool)
  • Jackson (JSON processing)

Challenges

Data

Key challenges for me included understanding the structure of the deeply nested JSON inputs, as well as the VCFs and their attributes, among others. Analysing sample JSONs proved really fruitful, as it helped me write manageable unit tests.

An extensive analysis of the input data can be accessed here. The sheet also provides sample analysis and verification of the results, comparing this pipeline's output against Ensembl's (link).

Edge cases

As I was new to Perl, studying the Ensembl code and matching its results against the new import pipeline was challenging, and at the same time exciting to complete.

Edge cases are documented in a separate wiki (link).

Prior assumptions were also checked for all chromosomes that are to be ingested by the pipeline. You may find the Python script here.
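Such an assumption check can be sketched like this. The specific assertions below are hypothetical (the real checks are in the linked script); they are derived from the field table earlier, e.g. that the "1st element" indexing is safe because each placement annotation carries at most one assembly entry.

```python
import bz2
import json

def check_assumptions(path):
    """Scan every record in one chromosome's .json.bz2 file and assert the
    structural assumptions the import pipeline relies on."""
    for line in bz2.open(path, "rt", encoding="utf-8"):
        record = json.loads(line)
        for placement in record["primary_snapshot_data"]["placements_with_allele"]:
            # assumption: at most one assembly entry per placement annotation
            traits = placement["placement_annot"]["seq_id_traits_by_assembly"]
            assert len(traits) <= 1, record["refsnp_id"]
            # assumption: every allele exposes an SPDI with a contig and position
            for allele in placement["alleles"]:
                spdi = allele["allele"]["spdi"]
                assert "seq_id" in spdi and "position" in spdi, record["refsnp_id"]
```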

End credits

Infinite thanks to Sundar and Cristina for their continued guidance, mentorship and feedback in driving the project to completion. I am also grateful to the Ensembl team, and in particular Helen, for answering the many wall-of-text doubts I had at the beginning. Thank you to the Global Alliance for Genomics and Health (GA4GH) and to Google for the opportunity to contribute to bioinformatics and open source.

References

  • The Global Alliance for Genomics and Health (GA4GH) (link)
  • The European Variation Archive (link)
  • 2019 Google Summer of Code project page (link)
  • Schema + Sample Analysis (link)
  • Python script to check assumptions (link)
  • Ensembl Perl script (link)
  • Wiki edge case discussion (link)
  • Wiki project (link)
  • NCBI dbSNP 2.0 schema API spec (link)
  • NCBI ftp input (link)
  • NCBI ftp chromosome variations input source (link)