
@108krohan
Last active December 5, 2019 10:51
2019 Google Summer of Code project report

Data pipeline for exchange of human genomic variation between public repositories

as part of the 2019 Google Summer of Code

About the organisation

The Global Alliance for Genomics and Health (GA4GH)

The Global Alliance for Genomics and Health (GA4GH) helps accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading genome institutes and centers with IT industry leaders to create global standards and tools for the secure, privacy-respecting and interoperable sharing of genomic data.

European Variation Archive (EVA)

The European Variation Archive (EVA) is an open-access database of all types of genetic variation data from all species. All users can download data from any study, or submit their own data to the archive. They can also query all variants in the EVA by study, gene, chromosomal location or dbSNP identifier using the Variant Browser.

About the project

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute.

The project implements a mechanism to keep the EVA in sync with the latest human data submitted to dbSNP. Given a dbSNP FTP directory containing the human variant information, the pipeline parses a JSON file for each of the 24 human chromosomes and writes the variants from those files into the EVA archive.

Once imported into the EVA archive, this information can be distributed via the EVA implementations of the GA4GH htsget and Beacon API specifications, as well as through the EVA website, making it available for public use.

High level diagram

(high-level diagram image)

Tasks

The pipeline is a Spring Batch job application with three simple steps: 1) read an input JSON line, 2) process the JSON to keep only the required variant information, and 3) persist that information into a data store. The application repeats these steps for every record in a given .json.bz2 file. Listeners are also in place for logging and tracing purposes.
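The read–process–write flow can be sketched as follows. This is a minimal Python illustration of the loop, not the project's actual Spring Batch Java code; the field filtering shown is deliberately simplified.

```python
import bz2
import json

def read_records(path):
    """Stream one JSON record per line from a .json.bz2 file, decompressing on the fly."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield json.loads(line)

def process(record):
    """Keep only the required variant information; return None to skip a record."""
    if "refsnp_id" not in record:
        return None  # record cannot be imported, drop it
    return {"accession": record["refsnp_id"]}

def run_pipeline(path, store):
    """Read -> process -> write for every record in the file."""
    for record in read_records(path):
        variant = process(record)
        if variant is not None:
            store.append(variant)  # stand-in for the MongoDB write step
```

In the real pipeline the processor builds a full object model (see the field-derivation table below) and the writer persists into MongoDB; returning `None`/`null` from the processor is what lets a record be skipped without stopping the batch.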

Concretely

  • Construct an object model for dbSNP 2.0 JSON and parse the JSONs to objects in that model

    • Related PR

    • Summary The input source .json.bz2 is decompressed on the fly. Using the Jackson JSON processing library, the processor maps each record to a plain old Java object model. The processor returns null when a variant must be ignored for some reason (for example, a mismatch between the assembly accession and the RefSeq accession). The table below summarises how each field is derived:

      | Field | JSON tree |
      | --- | --- |
      | Accession | `refsnp_id` |
      | Taxonomy accession | 9606 (ref) |
      | Assembly accession | `primary_snapshot_data` > `placements_with_allele` (list) > `placement_annot` > `seq_id_traits_by_assembly` (list, 1st element) > `assembly_accession` |
      | Contig | `primary_snapshot_data` > `placements_with_allele` (list) > `alleles` (list, 1st element) > `allele` > `spdi` > `seq_id` |
      | Start | `primary_snapshot_data` > `placements_with_allele` (list) > `alleles` (list, 1st element) > `allele` > `spdi` > `position` |
      | Type | `primary_snapshot_data` > `variant_type` |
      | Created date | `create_date` |
  • Perform desired business logic on the model, handle edge-cases

  • Persist the model as a document in a MongoDB collection

    • Related PR
    • Summary In the final step, the variant writer persists the derived information into the MongoDB data store (collection name: DbsnpClusteredVariantEntity). Duplicate records are identified and logged for reporting to NCBI.
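Based on the field-derivation table above, the extraction step can be sketched as follows. This is a Python illustration of the table's paths; the actual project uses a Jackson object model in Java, and the skip condition shown (no placements) is just one example of the conditions under which the processor returns null.

```python
def derive_variant(record):
    """Pull the fields listed in the derivation table out of a dbSNP 2.0 JSON record.

    Returns None when the record should be ignored (here: no placements at all).
    """
    snapshot = record["primary_snapshot_data"]
    placements = snapshot["placements_with_allele"]
    if not placements:
        return None  # nothing to import for this record
    placement = placements[0]
    spdi = placement["alleles"][0]["allele"]["spdi"]  # 1st element, per the table
    return {
        "accession": record["refsnp_id"],
        "taxonomy_accession": 9606,  # human
        "assembly_accession": placement["placement_annot"]["seq_id_traits_by_assembly"][0]["assembly_accession"],
        "contig": spdi["seq_id"],
        "start": spdi["position"],
        "type": snapshot["variant_type"],
        "created_date": record["create_date"],
    }
```

The nested-list indexing (first placement, first allele, first assembly entry) mirrors the "(list, 1st element)" annotations in the table.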

Team

Code

Code source

Project wikis

PRs merged

Technologies

  • Java 8 (development kit)
  • Spring Batch (batch-processing framework)
  • MongoDB (data store)
  • Maven (build automation tool)
  • Jackson (JSON processing)

Challenges

Data

Key challenges for me included understanding the structure of the deeply nested JSON inputs, as well as the VCFs and their attributes, among others. Analysing sample JSONs proved really fruitful, as it helped me write manageable unit tests.

An extensive analysis of the input data can be accessed here. The sheet also provides sample analysis and verification of the results, comparing this pipeline's output against Ensembl's (link).

Edge cases

As I was new to Perl, studying the Ensembl code and matching its results against the new import pipeline was challenging, and at the same time exciting to complete.

Edge cases are documented in a separate wiki (link).

Prior assumptions were also checked for all chromosomes that are to be ingested by the pipeline. You may find the Python script here.
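Such an assumption check can be sketched like this. The specific assertions below are hypothetical (the real checks are in the linked script); they are derived from the field table earlier, e.g. that the "1st element" indexing is safe because each placement annotation carries at most one assembly entry.

```python
import bz2
import json

def check_assumptions(path):
    """Scan every record in one chromosome's .json.bz2 file and assert the
    structural assumptions the import pipeline relies on."""
    for line in bz2.open(path, "rt", encoding="utf-8"):
        record = json.loads(line)
        for placement in record["primary_snapshot_data"]["placements_with_allele"]:
            # assumption: at most one assembly entry per placement annotation
            traits = placement["placement_annot"]["seq_id_traits_by_assembly"]
            assert len(traits) <= 1, record["refsnp_id"]
            # assumption: every allele exposes an SPDI with a contig and position
            for allele in placement["alleles"]:
                spdi = allele["allele"]["spdi"]
                assert "seq_id" in spdi and "position" in spdi, record["refsnp_id"]
```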

End credits

Infinite thanks to Sundar and Cristina for their continued guidance, mentorship and feedback in driving the project to completion. I am also grateful to the Ensembl team, and in particular Helen, for answering the many wall-of-text doubts I had at the beginning. Thank you to the Global Alliance for Genomics and Health (GA4GH) and to Google for the opportunity to contribute to bioinformatics and open source.

References

  • The Global Alliance for Genomics and Health (GA4GH) (link)
  • The European Variation Archive (link)
  • 2019 Google Summer of Code project page (link)
  • Schema + Sample Analysis (link)
  • Python script to check assumptions (link)
  • Ensembl Perl script (link)
  • Wiki edge case discussion (link)
  • Wiki project (link)
  • NCBI dbSNP 2.0 schema API spec (link)
  • NCBI ftp input (link)
  • NCBI ftp chromosome variations input source (link)