Skip to content

Instantly share code, notes, and snippets.

@gluc
Last active June 19, 2023 02:32
Show Gist options
  • Save gluc/5f780246d57897b57c6b to your computer and use it in GitHub Desktop.
Save gluc/5f780246d57897b57c6b to your computer and use it in GitHub Desktop.
Convert a complex JSON to an R data.frame

This gist shows how to convert a nested JSON file to an R data.frame. To do this, it uses jsonlite and data.tree.

The gist contains two examples: one is a bit simpler, the second one a bit more advanced.

Example 1

In the first example, we download all the repos from Hadley Wickham's Github account from https://api.github.com/users/hadley/repos . This JSON contains a nested owner object. The code shows how to convert that in a flat data.frame in three statements:

  1. line 5: download
  2. line 8: convert to data.tree
  3. line 12: convert to data.frame

The basic idea is as follows:

  1. convert the JSON to a list of lists of lists, using jsonlite, avoiding simplification
  2. convert the list of lists to a data.tree. This structure is very similar to the semantic meaning of the JSON
  3. flatten the tree structure, using the various features of the data.tree package.

The main function to use in step 3 is the $ToDataFrameTable, which (conceptually) does two things:

  1. it traverses the leaves of the tree
  2. it then converts each leaf to a row in the data.frame. In more detail: a. fields of a node are mapped to columns in the data.frame b. if a field is not available in a leaf node, then ancestors are searched

Example 2

There is a few bells and whistles you can add to this. This is shown in the second example. It creates a data.frame containing all the contributors of repos for which hadley is the owner. We do this by extending the tree structure created in the above example: For each repo, we add a nested contributors node, such that the structure of our tree after executing line 36 will be:

root repo 1 owner contributors contributor 1 contributor 2 etc. repo 2 owner contributors etc.

Specifically, this example shows that:

  1. you can rename a field (see line 45)
  2. instead of mapping a field of a Node, you can execute a function (see line 47)
  3. you can filter which leaves you want to include in your data.frame (see line 54)
#devtools::install_github("gluc/data.tree")
library(data.tree)
library(jsonlite)
library(magrittr)
reposLoL <- fromJSON("https://api.github.com/users/hadley/repos", simplifyDataFrame = FALSE)
library(data.tree)
repos <- as.Node(reposLoL)
print(repos, "id", "login")
#convert this to a data.frame
reposdf <- repos$ToDataFrameTable(ownerId = "id",
"login",
repoName = function(x) x$parent$name, #relative to the leaf
fullName = "full_name", #unambiguous values are inherited from ancestors
repoId = function(x) x$parent$id,
"fork",
"type")
reposdf
#Now a somewhat more advanced example: In addition, let's download
#contributors to each repo and store them in the same tree
#NOTE: you can only call 50 unauthenticated api requests per hour,
#so you can run this only once
#Get can also call functions! Here, we are not so much interested in
#the result, but more as some sort of mapply
repos$Get(function(x) x$AddChild("contributors"), filterFun = function(x) x$level == 2)
getContribs <- function(x) {
contributors <- fromJSON(x$contributors_url, simplifyDataFrame = FALSE)
for(c in contributors) c %>% as.Node(nodeName = c$login) %>% x$Find("contributors")$AddChildNode()
return(length(contributors))
}
repos$Get(getContribs, filterFun = function(x) x$level == 2)
#optional: print some information about our structure
print(repos, "login", "id", "contributions")
repos$fieldsAll
#fields/attributes on contributors:
repos$Find("crantastic", "owner", "hadley")$fields
#convert it to a table (all attributes are relative to the leaves, i.e. contributors)
contributorsdf <- repos$ToDataFrameTable(contributorId = "id",
"login",
isOwner = function(x) x$login == x$parent$parent$Find("owner")$login,
ownerName = function(x) x$parent$parent$Find("owner")$login,
repoName = function(x) x$parent$parent$name,
repoId = function(x) x$parent$parent$id,
"fork",
"type",
"contributions",
filterFun = function(x) x$name != "owner"
)
@sensejoin
Copy link

Hello,

I am trying this example .. But i got error while executing the following statement. am i missing anything ? Please help.

rx=repos$DataframeTable(x="id")
Error: attempt to apply non-function

i am using following version.

version
_
platform x86_64-apple-darwin13.4.0
arch x86_64
os darwin13.4.0
system x86_64, darwin13.4.0
status
major 3
minor 2.3
year 2015
month 12
day 10
svn rev 69752
language R
version.string R version 3.2.3 (2015-12-10)
nickname Wooden Christmas-Tree

@gearoidobrien
Copy link

Hi sensejoin,

You need to run
reposdf <- repos %>% ToDataFrameTable(ownerId = "id", ...
instead of
reposdf <- repos$ToDataFrameTable(ownerId = "id", ...

@kubasiak
Copy link

kubasiak commented Dec 3, 2017

I like this one, it finally does what I want. It is amazing that in R which is desiged to handle data, there is no out-of-the box solution to this problem.
I have only one issue with this function- it is a bit slow. I have nearly a milion lines and several JSONs in each of them... and after a LONG calculations I get
Error in self[[child$name]] <- child :
attempt to use zero-length variable name

What can be a reason for it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment