GSoC 2020 Final Report

This Gist summarizes my work for the Reactome project "Query Reactome Data in R" during the Google Summer of Code 2020.

I would first like to thank my mentors Justin Cook, Joel Weiser, and Solomon Shorser for their guidance and support this summer. It was a privilege to learn from such a talented and experienced team.

Main objective

The main objective was to create an interface in R to query data from the Reactome knowledgebase.

Reactome data are modeled as an interconnected graph and stored in the Reactome Graph Database. Reactome has also developed the Content Service, an Application Programming Interface (API) that provides access to the Graph Database. My mentors decided that we should create two R packages, one for the Content Service and one for the Graph Database, to keep the codebase logically divided.

Code and documentation

R Interface for Reactome Content Service:
- Repository: reactome/ReactomeContentService4R
- Vignette: https://chilampoon.github.io/projects/ReactomeContentService4R.html
- Submitted to Bioconductor: https://bioconductor.org/packages/devel/bioc/html/ReactomeContentService4R.html
R Interface for Reactome Graph Database:
- Repository: reactome/ReactomeGraph4R
- Vignette: https://chilampoon.github.io/projects/ReactomeGraph4R.html

Differences between the two packages

The vignettes already explain the concepts and function usage in detail, so here I only want to discuss some differences between the ReactomeContentService4R and ReactomeGraph4R packages.

The Content Service is a REST API over the Graph Database, with predefined endpoints that determine what kind of curated data can be retrieved. ReactomeContentService4R is thus an API wrapper for the endpoints that my mentors considered necessary. It relies heavily on the httr package to construct HTTP requests and read responses, while ReactomeGraph4R is built on the R Neo4j driver neo4r to interact with a local Neo4j graph database and access the graph data.

Because the two packages perform different kinds of queries, they have different focuses. ReactomeContentService4R is ideal for retrieving specific pieces of information, for example getting the participants of a Reaction or fetching an orthologous instance in a given species. ReactomeGraph4R, on the other hand, is suited to retrieving data together with Reactome's inherent network structure, answering questions such as which Reactions precede or follow a certain Reaction, or what the Pathway>Reaction>PhysicalEntity>ReferenceEntity hierarchy of a Protein is in Reactome.
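To make the contrast concrete, here is a minimal sketch (in Python, only to keep it dependency-free) of what each kind of request might look like underneath the R wrappers. The endpoint path `/data/participants/{id}`, the `precedingEvent` relationship, and the `stId` property are given as I recall them from the Reactome documentation and should be treated as assumptions to verify; the helper names and the stable ID `R-HSA-0000000` are hypothetical placeholders.

```python
# Assumed base URL of the public Content Service.
CONTENT_SERVICE = "https://reactome.org/ContentService"


def participants_url(stable_id: str) -> str:
    """Build the URL of a fixed REST endpoint: participants of one Event."""
    return f"{CONTENT_SERVICE}/data/participants/{stable_id}"


def preceding_reactions_query(stable_id: str) -> str:
    """Build a Cypher query for the Reactions that directly precede a Reaction."""
    return (
        f"MATCH (r:ReactionLikeEvent {{stId: '{stable_id}'}})"
        "-[:precedingEvent]->(prev:ReactionLikeEvent) "
        "RETURN prev.stId, prev.displayName"
    )


# "R-HSA-0000000" is a placeholder, not a real Reactome identifier.
print(participants_url("R-HSA-0000000"))
print(preceding_reactions_query("R-HSA-0000000"))
```

The first helper can only address what the endpoint exposes, whereas the Cypher pattern in the second can be changed freely, which is the flexibility the Graph Database offers.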

Although using the Graph Database requires some extra setup, querying it is more flexible because the endpoints in the Content Service are fixed. The following are things that ReactomeGraph4R can do but ReactomeContentService4R can't:

- Fetch instances by displayName, the property of the node, and the relationship between nodes
- Find preceding and following Reactions of a specific Reaction
- Find referrals
- Visualize network graphs

There are also things that only ReactomeContentService4R can do:

- Search for instances in a database-wide manner
- Export diagrams and files of Events (with analysis results overlaid)

Shortcomings

To be honest, it was much harder to develop the Graph Database package than the Content Service one, probably because neo4r is still under development and I was unfamiliar with the Neo4j query language Cypher at the beginning. I ran into many issues when building ReactomeGraph4R. Most of them have been solved, but two remain:

- Cannot handle single quotes in character inputs
  - Issue #16, it's from neo4r
- Cannot parse path data (e.g. MATCH p=()-[r:output]->() RETURN p) in data.parsing.R
  - Enhancing the data parsing functions may solve it
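For the single-quote issue, one workaround worth trying is to escape quotes before the input is embedded in a single-quoted Cypher string literal. The sketch below (in Python, to stay dependency-free) assumes the query text is assembled by string interpolation; both helper names and the example display name are made up, and I have not verified that an escaped query passes through neo4r unchanged.

```python
def escape_cypher_string(value: str) -> str:
    """Escape backslashes and single quotes for a single-quoted Cypher literal."""
    # Escape backslashes first so the ones added for quotes are not doubled.
    return value.replace("\\", "\\\\").replace("'", "\\'")


def match_by_display_name(display_name: str) -> str:
    """Build a MATCH query that looks up a node by its displayName property."""
    literal = escape_cypher_string(display_name)
    return f"MATCH (n) WHERE n.displayName = '{literal}' RETURN n"


# A made-up display name containing a single quote.
print(match_by_display_name("Example 5'-end processing"))
```

If the driver supports query parameters, passing the value as a parameter instead of interpolating it would avoid the escaping problem altogether.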

One thing that I think is less elegant is the way row data results are processed in ReactomeGraph4R (issue #13). The true relationship data, with the relationship type, the start node, and the end node, are only available in the "graph" output from neo4r. Therefore, I designed the package to retrieve neo4r results in both "row" and "graph" formats when type = "row" is specified in any MATCH function in ReactomeGraph4R. The relationship information is then integrated from the "graph" data into the original "row" data.

The reason I didn't abandon the "row" format is that it divides the results into separate R objects according to the nodes specified in the Cypher RETURN clause, while the "graph" format mixes all nodes in one R object. A smarter solution might be to add more internal functions that handle only the "graph"-format JSON returned from neo4r, making the result suitable for both text display and graph visualization.
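The integration step described above can be sketched roughly as follows. This is an illustration in Python with simplified, assumed data shapes (plain dictionaries), not neo4r's actual output and not the code in ReactomeGraph4R; the function name `attach_relationships` is hypothetical.

```python
def attach_relationships(rows, graph):
    """Add a relationship table from "graph" output to per-alias "row" output.

    rows:  dict mapping each RETURN alias to a list of node dicts with an "id"
    graph: dict with a "relationships" list; each item has "type",
           "startNode", and "endNode" (node ids)
    """
    # Collect the ids of every node that appears in the row results.
    known_ids = {node["id"] for nodes in rows.values() for node in nodes}

    # Keep only relationships whose two ends are both among the returned nodes.
    relationships = [
        {"type": rel["type"], "startNode": rel["startNode"], "endNode": rel["endNode"]}
        for rel in graph["relationships"]
        if rel["startNode"] in known_ids and rel["endNode"] in known_ids
    ]

    merged = dict(rows)  # shallow copy, so the caller's dict is left untouched
    merged["relationships"] = relationships
    return merged


# Toy data: one reaction node, one entity node, and two relationships,
# only one of which connects nodes present in the row results.
rows = {
    "reaction": [{"id": 1, "displayName": "R1"}],
    "entity": [{"id": 2, "displayName": "E1"}],
}
graph = {
    "relationships": [
        {"id": 10, "type": "output", "startNode": 1, "endNode": 2},
        {"id": 11, "type": "input", "startNode": 1, "endNode": 3},
    ]
}
result = attach_relationships(rows, graph)
print(result["relationships"])
```

The per-alias objects from the "row" format stay as they are, and the relationship table is added alongside them.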

Future works

- More unit tests using testthat
- OS support tests for Linux, macOS, and Windows

I had to let go of these objectives due to time constraints. I ran multiple tests and R CMD check each time I added a new feature, and the functions have been further tested by my mentors and possibly others, but it would still be good to add more automated checks. I look forward to adding new features and fixing bugs in the packages in the future.

What I gained from GSoC

- An opportunity to contribute to Reactome: (Reactome)-[:gsoc]-(me)
- An opportunity to develop R packages, from 0 to 2!
- Experience in version control, building R packages, querying data in Neo4j, etc.
- A lot of fun 😸
- A most productive summer ☀️
- An updated understanding of open source: I used to think that open source means the source code is open for everyone to view and use (from the perspective of a biology student), but it's actually also open for everyone to contribute. Thanks also to those overwhelming social media posts shared by GSoC connections on LinkedIn...
