Revisiting contact tracing with Neo4j 4.3's relationship indexes

The new release of Neo4j 4.3 has just come out. One of its key features is relationship property indexes - a really interesting addition.

Two main points of attention:

  1. Performance improvements: all of a sudden the Neo4j Cypher query planner is going to be able to use a lot more information, provided by these relationship indexes. The planner is becoming smarter - and therefore queries will become faster. We will explore this below.
  2. Modelling implications: the introduction of these indexes will have far-reaching implications with regards to how we model certain things. More options are good, of course!

Create a synthetic contact tracing graph - size of Antwerp

This is similar to the work I did last year on contact tracing. Take a look at http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html to see how that went.

  • Using the faker plugin.
  • Download it from its GitHub page. Installation is super easy - you just need to make sure the config is updated too, whitelisting fkr.* just like we do with gds.* and apoc.* (see the snippet below).
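
As a sketch of what that config change looks like - the exact setting names are an assumption on my part and depend on your Neo4j 4.x version, so check your neo4j.conf against the docs:

# neo4j.conf - assumes Neo4j 4.x style security settings; verify against your version's documentation
# let the faker, GDS and APOC procedures/functions run unsandboxed
dbms.security.procedures.unrestricted=apoc.*,gds.*,fkr.*
# and/or add them to the allowlist (called "whitelist" in older releases)
dbms.security.procedures.allowlist=apoc.*,gds.*,fkr.*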

Only difference: pushing the scale up to the size of my home city of Antwerp, Belgium.

Create 500000 persons

You need to have enough memory - but you should be able to do it in one transaction.

foreach (i in range(1,500000) |
    create (p:Person { id : i })
    set p += fkr.person('1940-01-01','2021-06-01')
    set p.healthstatus = fkr.stringElement("Sick,Healthy")
    set p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
    set p.birthDate = datetime(p.birthDate)
    set p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
    set p.name = p.fullName
    remove p.fullName
);
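
Not part of the original gist, but a quick sanity check after the load is to look at the healthstatus distribution of the freshly created persons:

match (p:Person)
return p.healthstatus, count(*) as persons;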

Create 10000 places

Adding the places is instantaneous:

foreach (i in range(1,10000) |
    create (p:Place { id: i, name: "Place nr "+i})
    set p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
    set p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
);

Put in place some indexes on the NODES

We don't really need them for this demo, but they could be useful for other queries.

CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
CREATE INDEX placenodename FOR (p:Place) ON (p.name);
CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
CREATE INDEX personnodename FOR (p:Person) ON (p.name);
CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);
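
These indexes populate in the background. To confirm they are online before querying, you can list them - a quick check, not part of the original gist:

SHOW INDEXES;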

Add 1500000 random visits to places

Using periodic committing of transactions with apoc.periodic.iterate. 89 seconds is not bad!

CALL apoc.periodic.iterate(
    'with range(1,1500000) as range
        unwind range as iteration return iteration', 
    'match (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id:toInteger(rand()*10000)+1 })
        create (p)-[:PERFORMS_VISIT]->(v:Visit { id: iteration})-[:LOCATED_AT]->(pl)
        create (p)-[virel:VISITS]->(pl)
        set v.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
        set virel.starttime = v.starttime
        set v.endtime = v.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
        set virel.endtime = v.endtime
        set v.visittime = duration.between(v.starttime,v.endtime)
        set v.visittimeinseconds = v.visittime.seconds
        set virel.visittime = v.visittime
        set virel.visittimeinseconds = v.visittimeinseconds', 
    {batchSize:25000, parallel:false});
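
Again as a sanity check (not in the original gist): every iteration creates exactly one (:Visit) node and one [:VISITS] relationship, so both of these counts should come back as 1500000:

match (v:Visit) return count(v);
match ()-[virel:VISITS]->() return count(virel);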

Some people will be unconnected

The randomisation means that some people stay unconnected. Not a problem - in real life that would also be the case, right? Some people just don't go out :) ...

match (p:Person)
where not ((p)--())
return count(p);

Querying for starttimes using the OLD model / node indexes

Index the visit nodes

CREATE INDEX visitnodestarttime FOR (v:Visit) ON (v.starttime);

Query on visit nodes

profile match (p:Person)-[:PERFORMS_VISIT]->(v:Visit)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;

The query plans before and after adding the index are very different:

  • Before: using NodeByLabelScan, causing lots of db hits.
  • After: using NodeIndexSeekByRange, making the performance fly - from 4403ms to 7ms.

Querying for starttimes using the NEW model

Now we can actually forget about the intermediate (:Visit) nodes, and just use the [:VISITS] relationships.

Index the VISITS relationships

Adding the index on the relationship property is very similar:

CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
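
Before profiling, it is worth waiting until the new index has finished populating, otherwise the planner may not use it yet - a small precaution, not part of the original gist:

CALL db.awaitIndexes();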

Now we can run the equivalent query on the new model.

Query on VISITS relationships

This is what that query looks like:

profile match (p:Person)-[v:VISITS]->(pl:Place)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;

  • Without the relationship index: using NodeByLabelScan, causing lots of db hits and 6 seconds of waiting.
  • With the relationship index: using DirectedRelationshipIndexSeekByRange - dropping the db hits and cutting the wait time to less than 8 milliseconds.
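
To illustrate what the simpler model buys us for actual contact tracing, here is a sketch - my own example, not part of the original gist - of an overlap query on the [:VISITS] relationships, finding healthy people who were in the same place at the same time as a sick person:

match (sick:Person {healthstatus: "Sick"})-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(other:Person {healthstatus: "Healthy"})
// two visits overlap when each one starts before the other one ends
where v1.starttime < v2.endtime
and v2.starttime < v1.endtime
return sick.name as sickperson, other.name as contact, pl.name as place
limit 10;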

Conclusion

Great performance, and a simpler model.

Rik Van Bruggen
