- Name: Sarthak Gupta
- Project: Dynamically adding data to DRS
- Organization: Global Alliance for Genomics and Health
- Mentor: Alex Kanitz
This Gist summarises the work I did during the 2020 Google Summer of Code on the ELIXIR Cloud & AAI ecosystem for the Global Alliance for Genomics and Health, mainly under the guidance of my mentor Alex Kanitz.
The project "Dynamically adding data to DRS" that I worked on was one of four projects of the Global Alliance for Genomics and Health selected for the 2020 Google Summer of Code.
The Global Alliance for Genomics and Health (GA4GH) is a "policy-framing and technical standards-setting organization, seeking to enable responsible genomic data sharing within a human rights framework."
Ref: GA4GH Website
Sharing any type of data is feasible only when a set of rules and policies specifies how the data is exchanged. Once these rules and policies are in place, sharing data becomes easier and more convenient.
The Global Alliance for Genomics and Health (GA4GH) is helping to achieve this mission in the field of biomedical research. The sharing of data for biomedical research is of key importance in ensuring continued progress in our understanding of human health and wellbeing.
It establishes general frameworks to enable the sharing of genomic and clinical data and also catalyzes data sharing projects that support this mission.
The GA4GH Cloud Work Stream develops API standards and implementations that make it possible to share data easily and in a consistent way.
There are currently four API standards that allow a person to:
- Share tools/workflows (Tool Registry Service; TRS)
- Interpret/run workflows (Workflow Execution Service; WES)
- Execute atomic computations (Task Execution Service; TES)
- Access data (Data Repository Service; DRS)
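To make the last item more concrete, here is a minimal sketch of how a client might address a DRS object and choose one of its access methods. The route `/ga4gh/drs/v1/objects/{object_id}` and the `access_methods` field follow the GA4GH DRS specification; the host name, the object ID, the sample object and both helper function names are assumptions made up for illustration, and no network request is made.

```python
from typing import Optional
from urllib.parse import quote

# Base path defined by the GA4GH DRS specification.
DRS_BASE_PATH = "/ga4gh/drs/v1"


def drs_object_url(host: str, object_id: str) -> str:
    """Build the URL of the DRS endpoint that returns an object's metadata."""
    return f"{host.rstrip('/')}{DRS_BASE_PATH}/objects/{quote(object_id, safe='')}"


def pick_access_url(drs_object: dict, preferred_type: str = "https") -> Optional[str]:
    """Return the first access URL of the preferred type, or None if absent."""
    for method in drs_object.get("access_methods", []):
        if method.get("type") == preferred_type and "access_url" in method:
            return method["access_url"]["url"]
    return None


# Hypothetical DRS object, shaped like a response from the endpoint above.
example_object = {
    "id": "abc123",
    "size": 1024,
    "access_methods": [
        {"type": "s3", "access_url": {"url": "s3://bucket/abc123"}},
        {"type": "https", "access_url": {"url": "https://data.example.org/abc123"}},
    ],
}

print(drs_object_url("https://drs.example.org/", "abc123"))
print(pick_access_url(example_object))
```

A real client would send a `GET` request to the URL built by `drs_object_url` and pass the decoded JSON response to `pick_access_url`.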
The ELIXIR Cloud & AAI ecosystem is a multinational initiative that is committed to implementing the GA4GH Cloud APIs in an attempt to set up a federated cloud-based compute infrastructure for biomedical academic research and beyond. As such, they have already released relatively mature implementations for both WES (cwl-WES) and TES (TESK).
The goal of this project is to develop an open source, generic (i.e., not tied to any specific data provider), distributable and highly reusable DRS microservice implementation with diverse and unique use cases in the operationalization of the GA4GH Cloud Work Stream.
On analysing the microservices already implemented by ELIXIR Cloud & AAI, it was found that many of them repeat the same code across services. This led to the idea of developing a microservice "archetype" providing tools and utilities for quickly developing microservices based on a defined tool stack.
Therefore, as the first part of the project, I was to design and implement this archetype, together with my mentor and two other ELIXIR Cloud & AAI interns. Upon completion, I would then use the archetype as a backbone for the development of the DRS implementation.
At the end of 3 months of eat-sleep-code-review-repeat, I contributed significantly to the following milestones:
- Release FOCA, which allows developers to quickly build microservices based on OpenAPI, Flask and Connexion
- Release the DRS implementation DRS-Filer
- Release a client (DRS-cli) for the DRS-Filer
A lot of emphasis was put on following good coding practices, and so I am proud to say that all of the above projects are well documented, have continuous integration pipelines set up and are complete with unit tests covering basically 100% of the code! 🌟
- Add Config Modules - Merged
- Add code coverage badge - Merged
- Add config_parser.py tests - Merged
- Add log config test - Merged
- Name change in different files - Merged
- Package project - Merged
- Adding database modules with tests - Merged
- Adding security modules with tests - Merged
- Update documentation - Merged
- Register MongoDB - Merged
- Class-based config handling with validation - Merged
- Support merging of multiple specs - Merged
- Improve support for index creation - Merged
- Init repository - Merged
- Create GET and POST endpoints with FOCA 0.3.0 - Merged
- Add POST and DELETE endpoints for objects - Merged
- Add DRS client code - Merged
Here I replaced some code in available ELIXIR Cloud & AAI repositories with generalized code in FOCA.
- Replace config handler with the external one - Merged
- Replace config with external one - Merged
Most of the milestones specified in the original proposal were covered. A few things that have not been covered are:
- A `PUT` endpoint in DRS-Filer is yet to be implemented. Given that the [DRS specification][ga4gh-drs-specs] does not specify this endpoint, and that it does not add immediate value (the same outcome can be achieved by combining the `POST` and `DELETE` endpoints), it was decided that we should first collect real-world usage experience with DRS-Filer in order to better understand how such an endpoint could best be used.
- Add Gunicorn support to FOCA, so that applications based on it can easily scale in production settings.
- Integration with TESK: as a strong emphasis was put on code quality, and as the exact usage of DRS in connection with WES and TES implementations is still only vaguely defined by the GA4GH, these stretch goals could not be addressed in the given time frame.
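The remark in the first item, that the effect of a `PUT` can be reproduced with the `POST` and `DELETE` endpoints, can be sketched as follows. This is only an illustration under stated assumptions: `InMemoryObjectStore` and `replace_object` are hypothetical names, the store is an in-memory stand-in rather than the DRS-Filer API, and the sketch assumes that registering an object yields a newly generated identifier.

```python
import uuid
from typing import Dict


class InMemoryObjectStore:
    """Hypothetical stand-in for a DRS-like service (not the DRS-Filer API)."""

    def __init__(self) -> None:
        self._objects: Dict[str, dict] = {}

    def post(self, metadata: dict) -> str:
        """Register an object and return its newly generated ID (assumption)."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = dict(metadata)
        return object_id

    def delete(self, object_id: str) -> str:
        """Remove an object; raises KeyError if the ID is unknown."""
        del self._objects[object_id]
        return object_id

    def get(self, object_id: str) -> dict:
        """Return the stored metadata; raises KeyError if the ID is unknown."""
        return self._objects[object_id]


def replace_object(store: InMemoryObjectStore, object_id: str, metadata: dict) -> str:
    """Emulate an update by registering the replacement, then deleting the old object."""
    new_id = store.post(metadata)  # register first so a failure loses no data
    store.delete(object_id)
    return new_id


store = InMemoryObjectStore()
old_id = store.post({"name": "sample.bam", "size": 100})
new_id = replace_object(store, old_id, {"name": "sample.bam", "size": 200})
print(store.get(new_id))
```

Under these assumptions the replacement receives a new identifier and the two calls are not atomic, which is one example of the design questions a dedicated `PUT` endpoint would have to settle.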
One of the major goals of the ELIXIR Cloud & AAI ecosystem is the integration of all GA4GH Cloud Work Stream API components, TRS, WES, TES and DRS, such that data analysis workflows can be executed on any data in a decentralized, federated cloud infrastructure. The addition of DRS-Filer to the ELIXIR Cloud & AAI service stack represents another step towards this ambitious goal.
In my freshman year of university, I got introduced to open-source and fell in love with it. It became a pleasant way for me to code, learn and collaborate with people from all around the world. Since that time, I have tried to remain connected with it.
In late 2019, I was searching for a new organisation where I could contribute and learn more about the technologies that I was interested in. I started my search on the Google Summer of Code archive page, where I came across the Global Alliance for Genomics and Health.
In our initial conversation, I was welcomed by my present mentor Alex Kanitz, and at that moment I decided that this was the community I would be happy to contribute to. After some discussions with my mentor, I started setting up FOCA. During these contributions, I learned to write high-quality code and picked up best practices that any developer should keep in mind.
Later I got selected for the project Dynamically adding data to DRS. 🥳
GSoC'20 has been one of the best experiences of my life, and I will remember it for a long time to come. Over the last few months, apart from writing quality code, I have learned to take ownership of a project. I enjoyed the time working with my mentor and the community in general, and I thank them for giving me an amazing experience. Unlike most of the students who take part in GSoC, my journey with my organisation is not over yet and I feel there is still a long way to go...