On July 22, GitHub announced the 3rd Annual GitHub Data Challenge, presenting the multiple sources of data available.
This sounded like a good opportunity to take their available data and import it into Neo4j, in order to have fun analyzing data that fits naturally in a graph.
As I work mainly offline or behind military proxies that do not permit me to use the REST API, I decided to go for the GitHub Archive, available here; you can download JSON files representing GitHub events on a daily or hourly basis.
Be aware that each hourly file represents approximately 8,000 events!
Each file contains lines of GitHub events encoded in JSON.
There are a bunch of event types, like ForkEvent, IssueCommentEvent, ReleaseEvent, DeleteEvent, WatchEvent, and so on.
You can find the complete list of event types and payloads here.
For my part, I will use four event types: ForkEvent, PushEvent, PullRequestEvent, and IssueCommentEvent.
This may not sound like much, but if, like me, you import these four event types for a period of one month, you'll end up with about 2 million nodes and four times as many relationships.
I've created a simple library in PHP that parses these data files and imports the events into the database. Each event is handled by a dedicated EventTypeLoader that produces the Cypher query for the import.
The library is available here: https://github.com/kwattro/gh4j
I would like to make clear that the way events are imported reflects my own data-manipulation intentions; you can always modify the code to reflect your needs. Also, be aware that it is not compatible with the REST API events, as the payloads are totally different.
There is a common payload for all events: the user creating the event and the event type. Each event will create a node bearing the name of the event as a Neo4j label.
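For example, a loader could emit something along these lines (a minimal sketch; the property names are my assumption, while the User label, the event labels, and the DO relationship all appear in the queries below):

// One node per event, labeled with the event type, linked to its author
MERGE (u:User {name: 'some-user'})
CREATE (e:ForkEvent {time: 1406019600})
CREATE (u)-[:DO]->(e)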
By examining the PullRequestEvent payload, you can recreate the complete schema, starting from the user creating the pull request and ending with the owner of the repository on which the PR is opened. When the PR is closed, you can also tell whether it was a merge-and-close or a plain close by looking at the merged_at key.
The PullRequestEventLoader will import the payload into the following schema:
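Expressed as a Cypher pattern, reconstructed from the queries later in this post (a sketch, not the loader's literal output):

// Closing events later attach to the PR via PR_MERGE or PR_CLOSE relationships,
// and pr.state is set to 'merged' when merged_at is present
MATCH (u:User)-[:DO]->(pre:PullRequestEvent)-[:PR_OPEN]->(pr:PullRequest)-[:PR_ON_REPO]->(repo:Repository),
      (pr)-[:FROM_BRANCH]->(br:Branch)-[:BRANCH_OF]->(fork:Fork)-[:OWNED_BY]->(u),
      (repo)-[:OWNED_BY]->(owner:User)
RETURN *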
The same goes for the ForkEvent: its payload gives you some information, but not that much; for example, it does not give you the id of the newly created repository (a fork is also a repository). So, in order to have a unique reference for matching/creating/retrieving forks, I decided to use the HTML URL of the fork/repository as an identifier, as this information is available in the payload.
Loading the ForkEvent will produce the following schema:
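Again as a Cypher pattern, matching the relationship names used in the queries below:

MATCH (u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]->(repo:Repository),
      (fork)-[:OWNED_BY]->(u)
RETURN *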
Like the ForkEvent, the PushEvent does not provide very useful information, but you can still build a small schema.
This will produce the following Neo4j graph schema:
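Since I don't use the PushEvent in the queries below, treat this pattern as illustrative only; apart from DO, the relationship name here is a hypothetical placeholder:

// PUSH_ON is a placeholder name, not confirmed anywhere in this post
MATCH (u:User)-[:DO]->(pe:PushEvent)-[:PUSH_ON]->(repo:Repository)
RETURN *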
And the last one, the IssueCommentEvent. There is a check in the code of the IssueCommentEventLoader that detects whether the issue on which the comment is made is related to a pull request; when it is, a BOUND_TO_PR relationship is created from the Issue to the PullRequest. Giving the following schema:
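As a pattern (the BOUND_TO_PR part is only present when the issue belongs to a pull request):

MATCH (u:User)-[:DO]->(ice:IssueCommentEvent)-[:ISSUE_COMMENT]->(c:IssueComment)-[:COMMENT_ON]->(issue:Issue)
OPTIONAL MATCH (issue)-[:BOUND_TO_PR]->(pr:PullRequest)
RETURN *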
Okay, the boring part, explaining how the data is imported, is done. If you've loaded even one or two days of data, you already end up with a really interesting graph. Now it's time to leverage the Neo4j goodness called Cypher and start having fun (meaning: querying the graph) :)
// Which user has generated the most events?
MATCH (u:User)-[r:DO]->()
RETURN u.name, count(r) as events
ORDER BY events DESC
LIMIT 1
---------
u.name events
imclab 273
Returned 1 row in 2755 ms
// Which repository has been touched the most (any incoming relationship)?
MATCH (repo:Repository)<-[r]-()
RETURN repo.name, count(r) as touchs
ORDER BY touchs DESC
LIMIT 1
---------
repo.name touchs
dotfiles 624
Returned 1 row in 3105 ms
// Which repository has been the most forked?
MATCH (repo:Repository)<-[:FORK_OF]-(fork:Fork)<-[:FORK]-(event:ForkEvent)
RETURN repo.name, count(event) as forks
ORDER BY forks DESC
LIMIT 1
---------
repo.name forks
malloclab 81
Returned 1 row in 558 ms
// Which repository has the most merged pull requests?
MATCH (repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)<-[merge:PR_MERGE]-()
RETURN repo.name, count(merge) as merges
ORDER BY merges DESC
LIMIT 1
---------
repo.name merges
homebrew-cask 19
Returned 1 row in 329 ms
As you can see, it is simple with Cypher: you just need to draw the path you want to return on a piece of paper or a whiteboard.
Okay, we go a bit further now (at least for my level ;-)). You just need to analyse the nodes and connections between the ForkEvent and the PullRequestEvent and translate them into Cypher.
Here is a visual representation of the complete Fork -> PR_OPEN schema:
MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)
-[:FORK_OF]->(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)
-[:PR_OPEN]-(pre:PullRequestEvent)<-[:DO]-(u2:User)<-[:OWNED_BY]-
(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p)
----------
count(p)
1410
Returned 1 row in 1484 ms
There are not many changes from the previous query; we just use the avg function on the difference between the PR opening time and the fork time:
MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]
->(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)-[:PR_OPEN]-(pre:PullRequestEvent)
<-[:DO]-(u2:User)<-[:OWNED_BY]-(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p), avg(pre.time - fe.time) as offsetTime
---------
count(p) offsetTime
1410 3593.6758865248225
Returned 1 row in 1114 ms
We can extend the query to get the average time until the PR is merged:
MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]->
(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest {state:'merged'})-[:PR_OPEN]-(pre:PullRequestEvent)
<-[:DO]-(u2:User)<-[:OWNED_BY]-(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
<-[:PR_CLOSE]-(pr3:PullRequestEvent)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p) as matchedPaths, avg(pr3.time - fe.time) as offsetTime
---------
matchedPaths offsetTime
408 8177.544117647059
Returned 1 row in 928 ms
// Average number of comments on a pull request before it gets merged
MATCH p=(ice:IssueCommentEvent)-[:ISSUE_COMMENT]->(comment:IssueComment)
-[:COMMENT_ON]->(issue:Issue)-[:BOUND_TO_PR]->(pr:PullRequest)
<-[:PR_MERGE]-(pre:PullRequestEvent)
WHERE ice.time <= pre.time
WITH pr, count(comment) as comments
RETURN avg(comments)
---------
avg(comments)
2.0440251572327046
Returned 1 row in 713 ms
// Which user has the most merged pull requests on repositories they do not own?
MATCH (u:User)-[r:DO]->(fe:PullRequestEvent)-[:PR_OPEN]->(pr:PullRequest {state:'merged'})
-[:PR_ON_REPO]-(repo:Repository)-[:OWNED_BY]->(u2:User)
WHERE NOT u = u2
RETURN u.name, count(r) as prs
ORDER BY prs DESC
LIMIT 1
--------
u.name prs
adamralph 14
Returned 1 row in 365 ms
Now let's relate users who have merged PRs on the same repositories; this could serve as a follow-recommendation engine. This will also create a weight property on the relationship, incremented each time a relation between the two users is found:
MATCH p=(u:User)-[:DO]-(e:PullRequestEvent)-->(pr:PullRequest {state:'merged'})-[:PR_ON_REPO]->(r:Repository)
<-[:PR_ON_REPO]-(pr2:PullRequest {state:'merged'})--(e2:PullRequestEvent)<-[:DO]-(u2:User)
WHERE NOT u = u2
WITH nodes(p) as coll
WITH head(coll) as st, last(coll) as end
MERGE (st)-[r:HAVE_WORKED_ON_SAME_REPO]-(end)
ON MATCH SET r.w = (r.w) + 1
ON CREATE SET r.w = 1
---------
Created 1122 relationships, returned 0 rows in 2334 ms
Very impressive. Thanks for writing this up so thoroughly.
For this query:
Which repository has been the most forked
You can remove the middle identifier, assuming that the FORK_OF relationship type and the FORK relationship type only connect to nodes with the label Fork. Trivial feedback here, as the change doesn't provide a performance benefit, but it reads nicer.
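For example (a sketch of the suggestion, not the comment's original snippet):

// The anonymous node () replaces the unused fork identifier
MATCH (repo:Repository)<-[:FORK_OF]-()<-[:FORK]-(event:ForkEvent)
RETURN repo.name, count(event) as forks
ORDER BY forks DESC
LIMIT 1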
For the long fork-to-PR query: you can separate the long pattern into a set of anchored patterns like this:
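(A sketch of such a split, applied to the fork-to-PR query from the post.)

// Each MATCH anchors on identifiers bound in a previous MATCH, which also removes
// the need for the u = u2, fork = f2 and pr = pr2 equality checks
MATCH (u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]->(repo:Repository)
MATCH (repo)<-[:PR_ON_REPO]-(pr:PullRequest)-[:PR_OPEN]-(pre:PullRequestEvent)<-[:DO]-(u)
MATCH (pr)-[:FROM_BRANCH]->(br:Branch)-[:BRANCH_OF]->(fork)-[:OWNED_BY]->(u)
RETURN count(*)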
There are multiple ways to optimize the performance of this query, for example by first matching on users who both forked a repository and opened a pull request. This provides a limited starting set and prevents a global graph pattern matching operation. Check out my blog post on Neo4j Cypher query evaluation: http://www.kennybastani.com/2014/07/understanding-how-neo4j-cypher-queries.html
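A sketch of that idea (illustrative only, not an exact snippet):

// First bind only the users who have both forked something and opened a pull request...
MATCH (u:User)-[:DO]->(:ForkEvent)
MATCH (u)-[:DO]->(:PullRequestEvent)
WITH DISTINCT u
// ...then expand the full pattern from this smaller set of anchor nodes
// (remaining branch/ownership constraints as in the post's query)
MATCH (u)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]->(repo:Repository)
MATCH (repo)<-[:PR_ON_REPO]-(pr:PullRequest)-[:PR_OPEN]-(pre:PullRequestEvent)<-[:DO]-(u)
RETURN count(*)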
Amazing work here again. I'd love to see this put into a user interface. Take a look at a project I built to visualize behavioral analytics using charts and time series: http://meetup-analytics-dashboard.herokuapp.com/ and https://github.com/kbastani/meetup-analytics
Cheers!
Kenny