Graph databases and Neo4j particularly, are excellent for a different set of purposes, such as storing, traversing and presenting graphy data (Graphs are Everywhere!), but let’s face it: too much freedom and not enough discipline might bring bad data quality into the formula. A common problem results when creating nodes and properties without indexes or when improperly using MERGE instead of CREATE.
In this essay I’ll explore generic methods for identifying duplicate relationships. Feel free to use it and comment for improvements in the approach.
Let’s put together some data first. And then a few duplicates as well.
CREATE
(father:Person {name: 'Tarzan'}),
(mother:Person {name: 'Jane'}),
(son:Person {name: 'Boy'}),
(cousin: Person {name: 'Jimmy'}),
(uncle: Person {name: 'Teddy'})
CREATE
(father)-[:FATHER_OF]->(son),
(mother)-[:MOTHER_OF]->(son),
(father)-[:HUSBAND_OF]->(mother),
(son)-[:RELATIVE_OF {type: 'cousin'}]->(cousin),
(son)-[:RELATIVE_OF {type: 'uncle'}]->(uncle),
(son)-[:RELATIVE_OF {type: 'friend'}]->(cousin)
// Now for the 'unfortunate' duplicates
CREATE
(father)-[:FATHER_OF]->(son),
(father)-[:HUSBAND_OF]->(mother),
(son)-[:RELATIVE_OF {type: 'uncle'}]->(uncle)
;
We have created 6 relationships without duplicates. For the purposes of this exercise we consider a relationship to be a duplicate if it has the same type and the same property values. Notice that in the first batch there are 2 relationships than even if they don’t make sense, they are technically not duplicates as the value in the property is different.
After the first batch, we introduce deliberately 3 duplicates.
The basic approach is taken directly from a SQL-alike query. Notice how we use generic queries without labels or relationship types in order to explore the results.
MATCH
(n1)-[r]->(n2)
RETURN
str(n1), type(r) as type_r, str(n2), count(*) as count_r
We can’t clearly observe while inspecting the results that any count > 1 is a candidate for duplication. But we can do better:
MATCH
(n1)-[r]->(n2)
WITH
n1, type(r) as type_r, n2, count(*) as count_r
WHERE
count_r > 1
MATCH
(n1)-[r]->(n2)
WHERE
type(r) = type_r
RETURN
str(n1), str(r), str(n2)
When inspecting these results it is clear the first two ('Boy' and 'Jimmy') are no duplicates. Everything else is. However, how can we eliminate those 2 rows from the result?
Currently there is not a way of obtaining the values of a node (or relationship) by using cypher only. So we have to do some cheating here:
MATCH
(n1)-[r]->(n2)
WITH
n1, type(r) as type_r, n2, count(*) as count_r
WHERE
count_r > 1
MATCH
(n1)-[r]->(n2)
WHERE
type(r) = type_r
WITH
n1, n2, type_r, split(split(str(r),"{")[1],"}")[0] as prop_r, count(*) as count_r
WHERE
count_r > 1
RETURN
str(n1), type_r, str(n2), prop_r
Now we have reached our objective. The rows in the table are the true duplicates in the data set. However, a few caveats remain:
-
A relationship string is being parsed instead of reducing a map of keys and values.
-
The order of keys cannot be guaranteed but you get the idea. You can taylor for your particular scenario.
-
I’m using the split function however if your data contains the characters '{}' the query breaks.
-
It is too SQL-ish: only the syntax has been transformed from what could be a relational approach. Can we do better?
When looking back at the query the syntax looks identical to that of a SQL query: selecting, grouping and counting. But neo4j and cypher are meant to be more expressive than that. The whole concept of "pattern matching" brings us to the next level. Duplicate relationships are actually loops in a graph. Look back into the neo4j browser image to see it.
MATCH
(n1)-[r1]->(n2)<-[r2]-(n1)
WHERE
type(r1)=type(r2) AND
split(split(str(r1),"{")[1],"}")[0] = split(split(str(r2),"{")[1],"}")[0] // still cheating
RETURN // use DISTINCT when expecting lots of duplicates
str(n1), type(r1) as type_r, split(split(str(r1),"{")[1],"}")[0] as prop_r, str(n2)
Given that this is an expensive query, you can obtain a little improvement filtering out the 1st batch of duplicate candidates:
MATCH
(n1)-[r1]->(n2)<-[r2]-(n1)
WHERE
type(r1)=type(r2)
WITH
n1, r1, n2, r2
WHERE
split(split(str(r1),"{")[1],"}")[0] = split(split(str(r2),"{")[1],"}")[0] // cheating
RETURN // use DISTINCT when expecting lots of duplicates
str(n1), type(r1) as type_r, split(split(str(r1),"{")[1],"}")[0] as prop_r, str(n2)
I don’t recommend trying to do automatic cleaning unless you have properly tested the data quality queries. But if your scenario requires it, you can always refer to the max (or min) id within a duplicate set and delete. Hopefully this query will save you time when looking for duplicate relationships in your neo4j database.
The beauty of this approach is that you can use this query in your existing database and it will check for duplicate relationships that conform to our definition of duplicate. Try it and let me know what you encounter.
I’ll keep updating this post by either refactoring the queries or introducing newer cypher features. Please leave a comment if you think the queries can be improved and I’ll update the post.
I'm pretty happy with this one: