Importing a CSV of Categories in one Pass

My colleague Rik van Bruggen wrote a wonderful blog post about importing the Google Merchant Category Hierarchy into Neo4j.

I enjoyed reading it and then wondered if we can’t import it in a single run, when we treat the CSV file just as collection of columns which we can access using Cypher’s collection functions.

Having downloaded and installed and started, Neo4j, let’s download the CSV data, so we have it available locally.

curl https://gist.githubusercontent.com/rvanbruggen/adaac7415eb42912e950/raw/taxonomy.csv -o /tmp/taxonomy.csv

Then we can use the neo4j-shell to import the data. Just start it with bin/neo4j-shell and paste the statements as you go.

Data Checking

How many rows?

load csv with headers from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
return count(*);

Check the Data

load csv with headers from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
return row limit 3;

Look at the bare rows

load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
// skip header
with row skip 1
return row limit 3;

Convert CSV into simpler format

We treat the row as collection of strings this time. First col is id, then names up to first null

load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
// skip header
with row skip 1
return toInt(row[0]) as id,
// keep non null columns
     filter(name in row[1..] WHERE NOT (name is null or name = "")) as names
limit 3;

Duplicate Checking

Let’s check for duplicate names by taking the last non-null category-name, aggregating on it, counting the occurrences and reporting those that have more than one.

load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
with row skip 1
with filter(name in row[1..] WHERE NOT (name is null or name = "")) as names
// split into two WITHs for readability
with names[-1] as name, count(*) as c
where c > 1
return name,c;

Import

Now let’s set up a constraint on :Cat(id) and an index on :Cat(name) and create the "Root" category.

create constraint on (c:Cat) assert c.id is unique;
create index on :Cat(name);
create (:Cat {id:0,name:"Root"});

My colleague Mark Needham had the idea to convert he full denormalized tree into the minimal data we need to create the hiearchy.

We’re only interested in the leaf which we have to create with id and name. And then connect it to its direct parent, the rest of the hierarchy was already created in previous rows.

load csv from "file:/tmp/taxonomy.csv" as row fieldterminator ';'
with toInt(row[0]) as id,
     filter(name in row[1..] WHERE NOT (name is null or name = "")) as names
skip 1 {import-limit}
WITH id,
     coalesce(names[-2],"Root") as parentName,
	 names[-1] as childName
MERGE (child:Cat {id:id}) ON CREATE SET child.name = childName
WITH child,parentName
MATCH (parent:Cat {name:parentName})
CREATE (parent)<-[:PART_OF]-(child);

pbamotra/product_hierarchies.adoc

Importing a CSV of Categories in one Pass

Data Checking

Convert CSV into simpler format

Duplicate Checking

Import