My colleague Rik van Bruggen wrote a wonderful blog post about importing the Google Merchant Category Hierarchy into Neo4j.
I enjoyed reading it and then wondered if we can’t import it in a single run, when we treat the CSV file just as collection of columns which we can access using Cypher’s collection functions.
Having downloaded and installed and started, Neo4j, let’s download the CSV data, so we have it available locally.
curl https://gist.githubusercontent.com/rvanbruggen/adaac7415eb42912e950/raw/taxonomy.csv -o /tmp/taxonomy.csv
Then we can use the neo4j-shell
to import the data. Just start it with bin/neo4j-shell
and paste the statements as you go.
load csv with headers from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
return count(*);
load csv with headers from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
return row limit 3;
load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
// skip header
with row skip 1
return row limit 3;
We treat the row as collection of strings this time. First col is id, then names up to first null
load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
// skip header
with row skip 1
return toInt(row[0]) as id,
// keep non null columns
filter(name in row[1..] WHERE NOT (name is null or name = "")) as names
limit 3;
Let’s check for duplicate names by taking the last non-null category-name, aggregating on it, counting the occurrences and reporting those that have more than one.
load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
with row skip 1
with filter(name in row[1..] WHERE NOT (name is null or name = "")) as names
// split into two WITHs for readability
with names[-1] as name, count(*) as c
where c > 1
return name,c;
Now let’s set up a constraint on :Cat(id)
and an index on :Cat(name)
and create the "Root" category.
create constraint on (c:Cat) assert c.id is unique;
create index on :Cat(name);
create (:Cat {id:0,name:"Root"});
My colleague Mark Needham had the idea to convert he full denormalized tree into the minimal data we need to create the hiearchy.
We’re only interested in the leaf which we have to create with id and name. And then connect it to its direct parent, the rest of the hierarchy was already created in previous rows.
load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
with toInt(row[0]) as id,
filter(name in row[1..] WHERE NOT (name is null or name = "")) as names
skip 1 {import-limit}
WITH id,
coalesce(names[-2],"Root") as parentName,
names[-1] as childName
MERGE (child:Cat {id:id}) ON CREATE SET child.name = childName
WITH child,parentName
MATCH (parent:Cat {name:parentName})
CREATE (parent)<-[:PART_OF]-(child);