Import Distinct Data

Import Distinct Data from a CSV file

Here are two methods for importing only distinct data from a CSV file, without having to pre-process the CSV for duplicates. Method 2 is more efficient than Method 1.

Graph Population Method 1

This method uses MERGE to ensure that duplicate Person nodes are not created.

It matches on just the id, making it a faster match than matching on all the person properties (id, name and age). If the node is created then name and age are set. If the node already exists then a property on the existing node, called duplicates, is set to true. This is just to illustrate that duplicates were found when using MERGE.

CREATE INDEX ON :Person(id);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/2900504/people.csv" AS line

WITH line
MERGE (p:Person {id: toInt(line.Id)})
ON CREATE
  SET p.name = line.Name, p.age = line.Age
ON MATCH
  SET p.duplicates = true

Resulting Person nodes

Note that there are two Person nodes with the same name and age, but different ids, therefore they are not duplicate Person nodes.

MATCH (p:Person)
RETURN p.id AS Id, p.name AS Name, p.age AS Age, p.duplicates AS Duplicates

MATCH (p)
DETACH DELETE p

Graph Population Method 2

This method is more efficient as it never tries to match any duplicates from the csv file as they are filtered out beforehand. It still uses MERGE to ensure that duplicate Person nodes are not created, but in this situation this would only be required if the csv file was loaded more than once.

CREATE INDEX ON :Person(id);

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/2900504/people.csv" AS line

WITH DISTINCT line
MERGE (p:Person {id: toInt(line.Id)})
ON CREATE
  SET p.name = line.Name, p.age = line.Age
ON MATCH
  SET p.duplicates = true

Resulting Person nodes

Looking at the table below will show you that MERGE never finds any duplicates, and never runs ON MATCH, as the duplicates property is never set to true. Note that there are two Person nodes with the same name and age, but different ids, therefore they are not duplicate Person nodes.

MATCH (p:Person)
RETURN p.id AS Id, p.name AS Name, p.age AS Age, p.duplicates AS Duplicates

Notes

CREATE INDEX ON :Person(id); provides for faster searching when matching on id.

USING PERIODIC COMMIT 1000 is used to ensure that memory is not filled up before the results of this load are committed to the database.

Content of the csv file

Id,Name,Age 1,Adam,21 2,Jane,65 3,Mary,17 1,Adam,21 2,Jane,65 4,Philip,40 5,Mary,17

tekiegirl/ImportDistinctData.adoc

Import Distinct Data