Here are two methods for importing only distinct data from a CSV file, without having to pre-process the CSV for duplicates. Method 2 is more efficient than Method 1.
This method uses MERGE to ensure that duplicate Person nodes are not created.
It matches on just the id, making it a faster match than matching on all the person properties (id, name and age). If the node is created then name and age are set. If the node already exists then a property on the existing node, called duplicates, is set to true. This is just to illustrate that duplicates were found when using MERGE.
CREATE INDEX ON :Person(id);
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/2900504/people.csv" AS line
WITH line
MERGE (p:Person {id: toInt(line.Id)})
ON CREATE
SET p.name = line.Name, p.age = line.Age
ON MATCH
SET p.duplicates = true
This method is more efficient as it never tries to match any duplicates from the csv file as they are filtered out beforehand. It still uses MERGE to ensure that duplicate Person nodes are not created, but in this situation this would only be required if the csv file was loaded more than once.
CREATE INDEX ON :Person(id);
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://dl.dropboxusercontent.com/u/2900504/people.csv" AS line
WITH DISTINCT line
MERGE (p:Person {id: toInt(line.Id)})
ON CREATE
SET p.name = line.Name, p.age = line.Age
ON MATCH
SET p.duplicates = true
Looking at the table below will show you that MERGE never finds any duplicates, and never runs ON MATCH, as the duplicates property is never set to true. Note that there are two Person nodes with the same name and age, but different ids, therefore they are not duplicate Person nodes.
MATCH (p:Person)
RETURN p.id AS Id, p.name AS Name, p.age AS Age, p.duplicates AS Duplicates
CREATE INDEX ON :Person(id);
provides for faster searching when matching on id.
USING PERIODIC COMMIT 1000
is used to ensure that memory is not filled up before the results of this load are committed to the database.