Skip to content

Instantly share code, notes, and snippets.

@lancelot1969
Forked from gluc/Desc_JSON_to_df.md
Created March 17, 2018 13:29
Show Gist options
  • Save lancelot1969/3eedc6a7ada711308be6459e6dc58829 to your computer and use it in GitHub Desktop.
Save lancelot1969/3eedc6a7ada711308be6459e6dc58829 to your computer and use it in GitHub Desktop.
Convert a complex JSON to an R data.frame

This gist shows how to convert a nested JSON file to an R data.frame. To do this, it uses jsonlite and data.tree.

The gist contains two examples: one is a bit simpler, the second one a bit more advanced.

Example 1

In the first example, we download all the repos from Hadley Wickham's Github account from https://api.github.com/users/hadley/repos . This JSON contains a nested owner object. The code shows how to convert that in a flat data.frame in three statements:

  1. line 5: download
  2. line 8: convert to data.tree
  3. line 12: convert to data.frame

The basic idea is as follows:

  1. convert the JSON to a list of lists of lists, using jsonlite, avoiding simplification
  2. convert the list of lists to a data.tree. This structure is very similar to the semantic meaning of the JSON
  3. flatten the tree structure, using the various features of the data.tree package.

The main function to use in step 3 is the $ToDataFrameTable, which (conceptually) does two things:

  1. it traverses the leaves of the tree
  2. it then converts each leaf to a row in the data.frame. In more detail: a. fields of a node are mapped to columns in the data.frame b. if a field is not available in a leaf node, then ancestors are searched

Example 2

There is a few bells and whistles you can add to this. This is shown in the second example. It creates a data.frame containing all the contributors of repos for which hadley is the owner. We do this by extending the tree structure created in the above example: For each repo, we add a nested contributors node, such that the structure of our tree after executing line 36 will be:

root repo 1 owner contributors contributor 1 contributor 2 etc. repo 2 owner contributors etc.

Specifically, this example shows that:

  1. you can rename a field (see line 45)
  2. instead of mapping a field of a Node, you can execute a function (see line 47)
  3. you can filter which leaves you want to include in your data.frame (see line 54)
#devtools::install_github("gluc/data.tree")
library(data.tree)
library(jsonlite)
library(magrittr)
reposLoL <- fromJSON("https://api.github.com/users/hadley/repos", simplifyDataFrame = FALSE)
library(data.tree)
repos <- as.Node(reposLoL)
print(repos, "id", "login")
#convert this to a data.frame
reposdf <- repos$ToDataFrameTable(ownerId = "id",
"login",
repoName = function(x) x$parent$name, #relative to the leaf
fullName = "full_name", #unambiguous values are inherited from ancestors
repoId = function(x) x$parent$id,
"fork",
"type")
reposdf
#Now a somewhat more advanced example: In addition, let's download
#contributors to each repo and store them in the same tree
#NOTE: you can only call 50 unauthenticated api requests per hour,
#so you can run this only once
#Get can also call functions! Here, we are not so much interested in
#the result, but more as some sort of mapply
repos$Get(function(x) x$AddChild("contributors"), filterFun = function(x) x$level == 2)
getContribs <- function(x) {
contributors <- fromJSON(x$contributors_url, simplifyDataFrame = FALSE)
for(c in contributors) c %>% as.Node(nodeName = c$login) %>% x$Find("contributors")$AddChildNode()
return(length(contributors))
}
repos$Get(getContribs, filterFun = function(x) x$level == 2)
#optional: print some information about our structure
print(repos, "login", "id", "contributions")
repos$fieldsAll
#fields/attributes on contributors:
repos$Find("crantastic", "owner", "hadley")$fields
#convert it to a table (all attributes are relative to the leaves, i.e. contributors)
contributorsdf <- repos$ToDataFrameTable(contributorId = "id",
"login",
isOwner = function(x) x$login == x$parent$parent$Find("owner")$login,
ownerName = function(x) x$parent$parent$Find("owner")$login,
repoName = function(x) x$parent$parent$name,
repoId = function(x) x$parent$parent$id,
"fork",
"type",
"contributions",
filterFun = function(x) x$name != "owner"
)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment