This Gist summarizes my work for the Reactome project "Query Reactome Data in R" during the Google Summer of Code 2020.
I would like to begin by thanking my mentors, Justin Cook, Joel Weiser, and Solomon Shorser, for their guidance and support this summer. It was a privilege to learn from such a talented and experienced team.
The main objective was to create an interface in R to query data from the Reactome knowledgebase.
Reactome data are modeled as an interconnected graph and stored in the Reactome Graph Database. Reactome has also developed the Content Service, an Application Programming Interface (API) that provides access to the Graph Database. My mentors decided that we should create two R packages, one for the Content Service and one for the Graph Database, to keep the codebase logically divided.
- R Interface for Reactome Content Service:
  - Repository: reactome/ReactomeContentService4R
  - Vignette: https://chilampoon.github.io/projects/ReactomeContentService4R.html
  - Submitted to Bioconductor: https://bioconductor.org/packages/devel/bioc/html/ReactomeContentService4R.html
- R Interface for Reactome Graph Database:
  - Repository: reactome/ReactomeGraph4R
  - Vignette: https://chilampoon.github.io/projects/ReactomeGraph4R.html
I have already explained the concepts and usage of the functions in detail in the vignettes, so here I only want to discuss some differences between the ReactomeContentService4R and ReactomeGraph4R packages.
The Content Service is a REST API of the Graph Database with predefined endpoints that determine what kind of curated data can be retrieved. ReactomeContentService4R is thus an API wrapper for the endpoints that my mentors considered necessary. It relies heavily on the httr package to construct HTTP requests and access responses, while ReactomeGraph4R is built on the R Neo4j driver neo4r to interact with a local Neo4j graph database and access the graph data.
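As a rough illustration of what wrapping a REST endpoint with httr involves, here is a minimal sketch. It is not the package's actual code: the helper names `build_endpoint_url` and `fetch_json` are hypothetical, and the `data/query` path and the example identifier are assumptions about the public Content Service that should be checked against its documentation.

```r
# Base URL of the public Reactome Content Service (assumed).
base_url <- "https://reactome.org/ContentService"

# Hypothetical helper: join the base URL, an endpoint path, and an identifier.
build_endpoint_url <- function(endpoint, id) {
  paste(base_url, endpoint, utils::URLencode(id, reserved = TRUE), sep = "/")
}

# Hypothetical helper: send a GET request with httr and parse the JSON body.
# It is only defined here, not called, so the sketch runs without network access.
fetch_json <- function(url) {
  res <- httr::GET(url)
  httr::stop_for_status(res)  # raise an R error on HTTP 4xx/5xx responses
  httr::content(res, as = "parsed", type = "application/json")
}

# "R-HSA-69620" is used purely as an example stable identifier.
url <- build_endpoint_url("data/query", "R-HSA-69620")
print(url)
# result <- fetch_json(url)  # uncomment to query the live service
```

Keeping URL construction separate from the request itself makes the string-building part easy to test offline.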
Because the two packages perform queries of a different nature, they have different focuses. ReactomeContentService4R is ideal for retrieving specific bits of information, for example getting the participants of a Reaction or fetching the orthologous instance in another species. ReactomeGraph4R, on the other hand, is suited to retrieving data together with Reactome's inherent network structure, answering questions such as which Reactions precede or follow a certain Reaction, or what Pathway>Reaction>PhysicalEntity>ReferenceEntity hierarchy a Protein has in Reactome.
Although using the Graph Database requires some extra setup, querying it is more flexible because the endpoints in the Content Service are fixed. The following are things that ReactomeGraph4R can do but ReactomeContentService4R can't:
- Fetch instances by displayName, the property of the node, and the relationship between nodes
- Find preceding and following Reactions of a specific Reaction
- Find referrals
- Visualize network graphs
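To give a flavor of the kind of query behind "preceding and following Reactions", here is a minimal sketch that only builds a Cypher string in R. The function name `build_preceding_query` is hypothetical, the identifier is a placeholder, and the `ReactionLikeEvent` label, `stId` property, and `precedingEvent` relationship are assumptions about the Reactome graph schema rather than something taken from the package source.

```r
# Hypothetical helper: build a Cypher query that looks for the Reactions
# linked to a given Reaction through a precedingEvent relationship.
# Labels, property names, and the relationship type are assumed.
build_preceding_query <- function(st_id) {
  sprintf(
    paste(
      "MATCH (r:ReactionLikeEvent {stId: '%s'})",
      "-[:precedingEvent]->(p:ReactionLikeEvent)",
      "RETURN p.stId, p.displayName"
    ),
    st_id
  )
}

# "R-HSA-1234567" is a placeholder identifier, not a real Reaction.
query <- build_preceding_query("R-HSA-1234567")
cat(query, "\n")

# With a running local Neo4j instance, the string could then be sent through
# neo4r, for example (assuming `con` is an existing neo4r connection object):
# neo4r::call_neo4j(query, con, type = "row")
```

Because the query is free-form text, any pattern the graph supports can be expressed this way, which is where the extra flexibility over fixed endpoints comes from.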
There are also things that only ReactomeContentService4R can do:
- Search for instances in a database-wide manner
- Export diagrams and files of Events (with analysis results overlaid)
To be honest, it was much harder to develop the Graph Database package than the Content Service one, probably because neo4r is still under development and I was unfamiliar with Cypher, the Neo4j query language, at the beginning. I ran into many issues when building ReactomeGraph4R. Most of them have been solved, but two remain:
- Cannot handle a single quote in character inputs
  - Issue #16; the problem comes from neo4r
- Cannot parse path data (e.g. MATCH p=()-[r:output]->() RETURN p) in data.parsing.R
  - Enhancing the data parsing functions may solve it
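For context on the single-quote problem: when a user-supplied string is pasted into a single-quoted Cypher literal, an unescaped quote ends the literal early. The sketch below shows one generic way to escape such input with a backslash before building the query. The function name `escape_cypher_string` and the sample input are hypothetical, and whether this sidesteps the neo4r behavior behind issue #16 is not something this sketch can confirm.

```r
# Hypothetical helper: escape backslashes and single quotes so the value
# can be placed inside a single-quoted Cypher string literal.
escape_cypher_string <- function(x) {
  x <- gsub("\\", "\\\\", x, fixed = TRUE)  # double existing backslashes first
  gsub("'", "\\'", x, fixed = TRUE)         # then put a backslash before each quote
}

# Made-up input containing a single quote.
name <- "5'-phosphate example"
query <- sprintf(
  "MATCH (n {displayName: '%s'}) RETURN n",
  escape_cypher_string(name)
)
cat(query, "\n")
```

Backslashes are handled first so that the backslashes added for the quotes are not doubled again.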
One thing that I think is less elegant is the way row data results are processed in ReactomeGraph4R (issue #13). The true relationship data, with the relationship type, the start node, and the end node, are only present in the "graph" output from neo4r. I therefore designed the package to retrieve neo4r results in both "row" and "graph" formats when type = "row" is specified in any MATCH function in ReactomeGraph4R. The relationship information is then integrated from the "graph" data into the original "row" data.
The reason I didn't abandon the "row" format is that it divides the results into different R objects according to the nodes specified in the RETURN clause of the Cypher query, while the "graph" format mixes all nodes in one R object. A smarter solution might be to add more internal functions that handle the JSON data returned by neo4r in "graph" format only, making the result suitable for both text display and graph visualization.
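To picture the step of integrating relationship information from the "graph" result into the "row" result, here is a minimal sketch on toy data frames. The column names and values are made up for illustration; real neo4r output is nested and carries more fields, and this is not the parsing code used in the package.

```r
# Toy stand-in for "row" output: one record per returned node.
row_nodes <- data.frame(
  id = c("1", "2"),
  displayName = c("Reaction A", "Entity B"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the relationships found in "graph" output.
graph_rels <- data.frame(
  startNode = "1",
  endNode = "2",
  type = "output",
  stringsAsFactors = FALSE
)

# Attach the display name of the start node to each relationship.
rels <- merge(graph_rels, row_nodes, by.x = "startNode", by.y = "id")
names(rels)[names(rels) == "displayName"] <- "startName"

# Attach the display name of the end node.
rels <- merge(rels, row_nodes, by.x = "endNode", by.y = "id")
names(rels)[names(rels) == "displayName"] <- "endName"

print(rels[, c("startName", "type", "endName")])
```

The join keys here are the node ids that both formats share, which is what makes it possible to combine the two results at all.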
- More unit tests using testthat
- OS support tests for Linux, macOS, and Windows
I had to let go of these objectives due to time constraints. Although I ran multiple tests and R CMD check each time I added a new feature, and the functions have been further tested by my mentors and perhaps others, it would be good to add more automated checks. I look forward to adding new features and fixing bugs in the packages in the future.
- An opportunity to contribute to Reactome: (Reactome)-[:gsoc]-(me)
- An opportunity to develop R packages, from 0 to 2!
- Experience in version control, building R packages, querying data in Neo4j, etc.
- A lot of fun 😸
- A most productive summer ☀️
- An updated understanding of open source: as a biology student, I used to think that open source meant the source code is open for everyone to view and use, but it is also open for everyone to contribute. Thanks also to those overwhelming social media posts shared by GSoC connections on LinkedIn...
Presented a poster at BOSC 2021: https://doi.org/10.7490/f1000research.1118690.1