Finding Duplicate Relationships

Introduction to Problem

Graph databases and Neo4j particularly, are excellent for a different set of purposes, such as storing, traversing and presenting graphy data (Graphs are Everywhere!), but let’s face it: too much freedom and not enough discipline might bring bad data quality into the formula. A common problem results when creating nodes and properties without indexes or when improperly using MERGE instead of CREATE.

In this essay I’ll explore generic methods for identifying duplicate relationships. Feel free to use it and comment for improvements in the approach.

Load Sample Data

Let’s put together some data first. And then a few duplicates as well.

CREATE
  (father:Person {name: 'Tarzan'}),
  (mother:Person {name: 'Jane'}),
  (son:Person {name: 'Boy'}),
  (cousin: Person {name: 'Jimmy'}),
  (uncle: Person {name: 'Teddy'})

CREATE
  (father)-[:FATHER_OF]->(son),
  (mother)-[:MOTHER_OF]->(son),
  (father)-[:HUSBAND_OF]->(mother),
  (son)-[:RELATIVE_OF {type: 'cousin'}]->(cousin),
  (son)-[:RELATIVE_OF {type: 'uncle'}]->(uncle),
  (son)-[:RELATIVE_OF {type: 'friend'}]->(cousin)

// Now for the 'unfortunate' duplicates
CREATE
  (father)-[:FATHER_OF]->(son),
  (father)-[:HUSBAND_OF]->(mother),
  (son)-[:RELATIVE_OF {type: 'uncle'}]->(uncle)
;

We have created 6 relationships without duplicates. For the purposes of this exercise we consider a relationship to be a duplicate if it has the same type and the same property values. Notice that in the first batch there are 2 relationships than even if they don’t make sense, they are technically not duplicates as the value in the property is different.

After the first batch, we introduce deliberately 3 duplicates.

This is how it looks graphically

It's difficult to see duplicate relationship visually unless you happen to use the new neo4j browser (version 2.2 or higher)

However even this image doesn’t tell us the whole story unless we happen to inspect the relationship properties. Let’s elaborate.

1st attempt: these are potential duplicates

The basic approach is taken directly from a SQL-alike query. Notice how we use generic queries without labels or relationship types in order to explore the results.

MATCH
  (n1)-[r]->(n2)
RETURN
  str(n1), type(r) as type_r, str(n2), count(*) as count_r

We can’t clearly observe while inspecting the results that any count > 1 is a candidate for duplication. But we can do better:

2nd attempt

MATCH
  (n1)-[r]->(n2)
WITH
  n1, type(r) as type_r, n2, count(*) as count_r
WHERE
  count_r > 1
MATCH
  (n1)-[r]->(n2)
WHERE
  type(r) = type_r
RETURN
  str(n1), str(r), str(n2)

When inspecting these results it is clear the first two ('Boy' and 'Jimmy') are no duplicates. Everything else is. However, how can we eliminate those 2 rows from the result?

3rd attempt and a little cheating

Currently there is not a way of obtaining the values of a node (or relationship) by using cypher only. So we have to do some cheating here:

MATCH
  (n1)-[r]->(n2)
WITH
  n1, type(r) as type_r, n2, count(*) as count_r
WHERE
  count_r > 1
MATCH
  (n1)-[r]->(n2)
WHERE
  type(r) = type_r
WITH
  n1, n2, type_r, split(split(str(r),"{")[1],"}")[0] as prop_r, count(*) as count_r
WHERE
  count_r > 1
RETURN
  str(n1), type_r, str(n2), prop_r

Now we have reached our objective. The rows in the table are the true duplicates in the data set. However, a few caveats remain:

A relationship string is being parsed instead of reducing a map of keys and values.
The order of keys cannot be guaranteed but you get the idea. You can taylor for your particular scenario.
I’m using the split function however if your data contains the characters '{}' the query breaks.
It is too SQL-ish: only the syntax has been transformed from what could be a relational approach. Can we do better?

4th attempt (current) and still cheating

When looking back at the query the syntax looks identical to that of a SQL query: selecting, grouping and counting. But neo4j and cypher are meant to be more expressive than that. The whole concept of "pattern matching" brings us to the next level. Duplicate relationships are actually loops in a graph. Look back into the neo4j browser image to see it.

MATCH
  (n1)-[r1]->(n2)<-[r2]-(n1)
WHERE
  type(r1)=type(r2) AND
  split(split(str(r1),"{")[1],"}")[0] = split(split(str(r2),"{")[1],"}")[0] // still cheating
RETURN // use DISTINCT when expecting lots of duplicates
  str(n1), type(r1) as type_r, split(split(str(r1),"{")[1],"}")[0] as prop_r, str(n2)

Given that this is an expensive query, you can obtain a little improvement filtering out the 1st batch of duplicate candidates:

MATCH
  (n1)-[r1]->(n2)<-[r2]-(n1)
WHERE
  type(r1)=type(r2)
WITH
  n1, r1, n2, r2
WHERE
  split(split(str(r1),"{")[1],"}")[0] = split(split(str(r2),"{")[1],"}")[0] // cheating
RETURN // use DISTINCT when expecting lots of duplicates
  str(n1), type(r1) as type_r, split(split(str(r1),"{")[1],"}")[0] as prop_r, str(n2)

I don’t recommend trying to do automatic cleaning unless you have properly tested the data quality queries. But if your scenario requires it, you can always refer to the max (or min) id within a duplicate set and delete. Hopefully this query will save you time when looking for duplicate relationships in your neo4j database.

The beauty of this approach is that you can use this query in your existing database and it will check for duplicate relationships that conform to our definition of duplicate. Try it and let me know what you encounter.

I’ll keep updating this post by either refactoring the queries or introducing newer cypher features. Please leave a comment if you think the queries can be improved and I’ll update the post.

ernestoe/duplicate_relationships.adoc

Finding Duplicate Relationships

Introduction to Problem

Load Sample Data

This is how it looks graphically

1st attempt: these are potential duplicates

2nd attempt

3rd attempt and a little cheating

4th attempt (current) and still cheating

ogoldberg commented Mar 10, 2023 •

edited

Loading

ernestoe/duplicate_relationships.adoc

Finding Duplicate Relationships

Introduction to Problem

Load Sample Data

This is how it looks graphically

1st attempt: these are potential duplicates

2nd attempt

3rd attempt and a little cheating

4th attempt (current) and still cheating

ogoldberg commented Mar 10, 2023 • edited Loading

ogoldberg commented Mar 10, 2023 •

edited

Loading