On July 22, GitHub announced the 3rd Annual GitHub Data Challenge, presenting the multiple sources of data available.
This sounded like a good opportunity to take their available data and import it into Neo4j, in order to have fun analyzing data that fits naturally in a graph.
As I work mainly offline or behind military proxies that do not permit me to use the REST API, I decided to go for the GitHub Archive, available here; you can download JSON files representing GitHub events on a daily or hourly basis.
Be aware that each hourly file represents approximately 8,000 events!
Each file contains lines of GitHub events encoded in JSON.
There are a bunch of event types, like ForkEvent, IssueCommentEvent, ReleaseEvent, DeleteEvent, WatchEvent, and so on.
You can find the complete list of event types and payloads here.
For my part, I will use four event types: ForkEvent, PushEvent, PullRequestEvent, and IssueCommentEvent.
This may not sound like much, but if, like me, you import these four event types for a period of one month, you'll end up with about 2 million nodes and four times as many relationships.
I've created a simple library in PHP that parses these data files and imports the events into the database. Each event is handled by a dedicated EventTypeLoader that produces the Cypher query for the import.
The library is available here: https://github.com/kwattro/gh4j
I would like to make clear that the way events are imported reflects my own data-manipulation intentions; you can always modify the code to reflect your needs. Also, be aware that it is not compatible with the REST API events, as the payloads are totally different.
There is a common payload for all events: the user creating the event and the event type. Each event will create a node bearing the name of the event as a Neo4j label.
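For example, a loader could emit something along these lines (a minimal sketch; the property names are my assumption, while the User label, the event labels, and the DO relationship all appear in the queries below):

// One node per event, labeled with the event type, linked to its author
MERGE (u:User {name: 'some-user'})
CREATE (e:ForkEvent {time: 1406019600})
CREATE (u)-[:DO]->(e)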
By examining the PullRequestEvent payload, you can recreate the complete schema, starting from the user creating the pull request and ending with the owner of the repository on which the PR is opened. When the PR is closed, you can also tell whether it was a merge-and-close or a plain close by looking at the merged_at key.
The PullRequestEventLoader will import the payload into the following schema:
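Expressed as a Cypher pattern, reconstructed from the queries later in this post (a sketch, not the loader's literal output):

// Closing events later attach to the PR via PR_MERGE or PR_CLOSE relationships,
// and pr.state is set to 'merged' when merged_at is present
MATCH (u:User)-[:DO]->(pre:PullRequestEvent)-[:PR_OPEN]->(pr:PullRequest)-[:PR_ON_REPO]->(repo:Repository),
      (pr)-[:FROM_BRANCH]->(br:Branch)-[:BRANCH_OF]->(fork:Fork)-[:OWNED_BY]->(u),
      (repo)-[:OWNED_BY]->(owner:User)
RETURN *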
The same goes for the ForkEvent: its payload gives you some information, but not that much; for example, it does not give you the id of the newly created repository (a fork is also a repository). So, in order to have a unique reference for matching/creating/retrieving forks, I decided to use the HTML URL of the fork/repository as an identifier, as this information is available in the payload.
Loading the ForkEvent will produce the following schema:
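Again as a Cypher pattern, matching the relationship names used in the queries below:

MATCH (u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]->(repo:Repository),
      (fork)-[:OWNED_BY]->(u)
RETURN *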
Like the ForkEvent, the PushEvent does not provide very useful information, but you can still build a small schema.
This will produce the following Neo4j graph schema:
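Since I don't use the PushEvent in the queries below, treat this pattern as illustrative only; apart from DO, the relationship name here is a hypothetical placeholder:

// PUSH_ON is a placeholder name, not confirmed anywhere in this post
MATCH (u:User)-[:DO]->(pe:PushEvent)-[:PUSH_ON]->(repo:Repository)
RETURN *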
And the last one, the IssueCommentEvent. There is a check in the code of the IssueCommentEventLoader that detects whether the issue on which the comment is made is related to a pull request; when it is, a BOUND_TO_PR relationship is created from the Issue to the PullRequest. Giving the following schema:
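As a pattern (the BOUND_TO_PR part is only present when the issue belongs to a pull request):

MATCH (u:User)-[:DO]->(ice:IssueCommentEvent)-[:ISSUE_COMMENT]->(c:IssueComment)-[:COMMENT_ON]->(issue:Issue)
OPTIONAL MATCH (issue)-[:BOUND_TO_PR]->(pr:PullRequest)
RETURN *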
Okay, the boring part, explaining how the data is imported, is done. If you've loaded even one or two days of data, you already end up with a really interesting graph. Now it's time to leverage the Neo4j goodness called Cypher and start having fun (meaning: querying the graph) :)
// Which user has generated the most events?
MATCH (u:User)-[r:DO]->()
RETURN u.name, count(r) as events
ORDER BY events DESC
LIMIT 1
---------
u.name events
imclab 273
Returned 1 row in 2755 ms
// Which repository has been touched the most (any incoming relationship)?
MATCH (repo:Repository)<-[r]-()
RETURN repo.name, count(r) as touchs
ORDER BY touchs DESC
LIMIT 1
---------
repo.name touchs
dotfiles 624
Returned 1 row in 3105 ms
// Which repository has been the most forked?
MATCH (repo:Repository)<-[:FORK_OF]-(fork:Fork)<-[:FORK]-(event:ForkEvent)
RETURN repo.name, count(event) as forks
ORDER BY forks DESC
LIMIT 1
---------
repo.name forks
malloclab 81
Returned 1 row in 558 ms
// Which repository has the most merged pull requests?
MATCH (repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)<-[merge:PR_MERGE]-()
RETURN repo.name, count(merge) as merges
ORDER BY merges DESC
LIMIT 1
---------
repo.name merges
homebrew-cask 19
Returned 1 row in 329 ms
As you can see, it is simple with Cypher: you just need to draw the path you want to return on a piece of paper or a whiteboard.
Okay, we go a bit further now (at least for my level ;-)). You just need to analyse the nodes and connections between the ForkEvent and the PullRequestEvent and translate them into Cypher.
Here is a visual representation of the complete Fork -> PR_OPEN schema:
MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)
-[:FORK_OF]->(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)
-[:PR_OPEN]-(pre:PullRequestEvent)<-[:DO]-(u2:User)<-[:OWNED_BY]-
(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p)
----------
count(p)
1410
Returned 1 row in 1484 ms
There are not many changes from the previous query; we just use the avg function on the difference between the PR opening time and the fork time:
MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]
->(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)-[:PR_OPEN]-(pre:PullRequestEvent)
<-[:DO]-(u2:User)<-[:OWNED_BY]-(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p), avg(pre.time - fe.time) as offsetTime
---------
count(p) offsetTime
1410 3593.6758865248225
Returned 1 row in 1114 ms
We can extend the query to get the average time until the PR is merged:
MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]->
(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest {state:'merged'})-[:PR_OPEN]-(pre:PullRequestEvent)
<-[:DO]-(u2:User)<-[:OWNED_BY]-(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
<-[:PR_CLOSE]-(pr3:PullRequestEvent)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p) as matchedPaths, avg(pr3.time - fe.time) as offsetTime
---------
matchedPaths offsetTime
408 8177.544117647059
Returned 1 row in 928 ms
// Average number of comments on a pull request before it gets merged
MATCH p=(ice:IssueCommentEvent)-[:ISSUE_COMMENT]->(comment:IssueComment)
-[:COMMENT_ON]->(issue:Issue)-[:BOUND_TO_PR]->(pr:PullRequest)
<-[:PR_MERGE]-(pre:PullRequestEvent)
WHERE ice.time <= pre.time
WITH pr, count(comment) as comments
RETURN avg(comments)
---------
avg(comments)
2.0440251572327046
Returned 1 row in 713 ms
// Which user has the most merged pull requests on repositories they do not own?
MATCH (u:User)-[r:DO]->(fe:PullRequestEvent)-[:PR_OPEN]->(pr:PullRequest {state:'merged'})
-[:PR_ON_REPO]-(repo:Repository)-[:OWNED_BY]->(u2:User)
WHERE NOT u = u2
RETURN u.name, count(r) as prs
ORDER BY prs DESC
LIMIT 1
--------
u.name prs
adamralph 14
Returned 1 row in 365 ms
Now let's relate users who have merged PRs on the same repositories; this could serve as a follow-recommendation engine. This will also create a weight property on the relationship, incremented each time a relation between the two users is found:
MATCH p=(u:User)-[:DO]-(e:PullRequestEvent)-->(pr:PullRequest {state:'merged'})-[:PR_ON_REPO]->(r:Repository)
<-[:PR_ON_REPO]-(pr2:PullRequest {state:'merged'})--(e2:PullRequestEvent)<-[:DO]-(u2:User)
WHERE NOT u = u2
WITH nodes(p) as coll
WITH head(coll) as st, last(coll) as end
MERGE (st)-[r:HAVE_WORKED_ON_SAME_REPO]-(end)
ON MATCH SET r.w = (r.w) + 1
ON CREATE SET r.w = 1
---------
Created 1122 relationships, returned 0 rows in 2334 ms
Very impressive. Thanks for writing this up so thoroughly.
For this query:
Which repository has been the most forked
You can remove the middle identifier, assuming that the FORK_OF relationship type and the FORK relationship type only connect to nodes with the label Fork. Trivial feedback here, as the change doesn't provide a performance benefit, but it reads nicer.
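For example (a sketch of the suggestion, not the comment's original snippet):

// The anonymous node () replaces the unused fork identifier
MATCH (repo:Repository)<-[:FORK_OF]-()<-[:FORK]-(event:ForkEvent)
RETURN repo.name, count(event) as forks
ORDER BY forks DESC
LIMIT 1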
For the long fork-to-PR query: you can separate the long pattern into a set of anchored patterns like this:
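(A sketch of such a split, applied to the fork-to-PR query from the post.)

// Each MATCH anchors on identifiers bound in a previous MATCH, which also removes
// the need for the u = u2, fork = f2 and pr = pr2 equality checks
MATCH (u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]->(repo:Repository)
MATCH (repo)<-[:PR_ON_REPO]-(pr:PullRequest)-[:PR_OPEN]-(pre:PullRequestEvent)<-[:DO]-(u)
MATCH (pr)-[:FROM_BRANCH]->(br:Branch)-[:BRANCH_OF]->(fork)-[:OWNED_BY]->(u)
RETURN count(*)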
There are multiple ways to optimize the performance of this query, for example by first matching on users who both forked a repository and opened a pull request. This provides a limited starting set and prevents a global graph pattern matching operation. Check out my blog post on Neo4j Cypher query evaluation: http://www.kennybastani.com/2014/07/understanding-how-neo4j-cypher-queries.html
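A sketch of that idea (illustrative only, not an exact snippet):

// First bind only the users who have both forked something and opened a pull request...
MATCH (u:User)-[:DO]->(:ForkEvent)
MATCH (u)-[:DO]->(:PullRequestEvent)
WITH DISTINCT u
// ...then expand the full pattern from this smaller set of anchor nodes
// (remaining branch/ownership constraints as in the post's query)
MATCH (u)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]->(repo:Repository)
MATCH (repo)<-[:PR_ON_REPO]-(pr:PullRequest)-[:PR_OPEN]-(pre:PullRequestEvent)<-[:DO]-(u)
RETURN count(*)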
Amazing work here again. I'd love to see this put into a user interface. Take a look at a project I built to visualize behavioral analytics using charts and time series: http://meetup-analytics-dashboard.herokuapp.com/ and https://github.com/kbastani/meetup-analytics
Cheers!
Kenny