rvanbruggen/1-browser_guide-contacttracing_with_relationship_indexes.mdx

Last active June 22, 2021 09:36


Revisiting contact tracing with Neo4j 4.3's relationship indexes

Neo4j 4.3 has just been released. One of its key features is relationship property indexes - a really interesting addition.

Two main points of attention:

Performance improvements: the Neo4j Cypher query planner can now use a lot more information, provided by these relationship indexes. The planner is becoming smarter - and therefore queries that filter on relationship properties can become faster. We will explore this below.
Modelling implications: the introduction of these indexes will have far-reaching implications for how we model certain things. More options are good, of course!

Create a synthetic contact tracing graph - size of Antwerp

This is similar to the work I did last year on contact tracing. Take a look at http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html to see how that went.

Using the faker plugin
Download the plugin from its GitHub page. Installation is super easy: you just need to make sure the Neo4j config is updated too, with fkr.* whitelisted just like we do for gds.* and apoc.*.

The only difference from last year's work: I am pushing the scale up to the size of my home city of Antwerp, Belgium.

Create 500000 persons

You need to have enough memory configured - but then it should be possible to do this in one transaction.

foreach (i in range(1,500000) |
    create (p:Person { id : i })
    set p += fkr.person('1940-01-01','2021-06-01')
    set p.healthstatus = fkr.stringElement("Sick,Healthy")
    set p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
    set p.birthDate = datetime(p.birthDate)
    set p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
    set p.name = p.fullName
    remove p.fullName
);
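If the single transaction above runs out of memory, one possible workaround is to batch the same statements with apoc.periodic.iterate, the procedure this guide also uses for the visits. This is only a sketch: the batch size of 25000 is an assumption borrowed from the visits load, and the date strings switch to double quotes because the inner statement is wrapped in single quotes.

```cypher
// Hypothetical batched variant of the person load (not the method used in this guide).
// The first statement produces the ids, the second runs once per id, committed per batch.
CALL apoc.periodic.iterate(
    'unwind range(1,500000) as i return i',
    'create (p:Person { id : i })
        set p += fkr.person("1940-01-01","2021-06-01")
        set p.healthstatus = fkr.stringElement("Sick,Healthy")
        set p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
        set p.birthDate = datetime(p.birthDate)
        set p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
        set p.name = p.fullName
        remove p.fullName',
    {batchSize:25000, parallel:false});
```

Use either this variant or the foreach statement, not both, otherwise every id gets created twice.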

Create 10000 places

Adding the places is instantaneous:

foreach (i in range(1,10000) |
    create (p:Place { id: i, name: "Place nr "+i})
    set p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
    set p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
);

Put some indexes in place on the NODES

We don't really need them for this demo - but they could be useful for other queries.

CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
CREATE INDEX placenodename FOR (p:Place) ON (p.name);
CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
CREATE INDEX personnodename FOR (p:Person) ON (p.name);
CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);

Add 1500000 random visits to places

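This uses apoc.periodic.iterate to commit the transactions in batches. Each iteration creates both the old model (a :Visit node between person and place) and the new model (a direct :VISITS relationship carrying the same properties). It took 89 seconds, which is not bad!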

CALL apoc.periodic.iterate(
    'with range(1,1500000) as range
        unwind range as iteration return iteration', 
    'match (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id:toInteger(rand()*10000)+1 })
        create (p)-[:PERFORMS_VISIT]->(v:Visit { id: iteration})-[:LOCATED_AT]->(pl)
        create (p)-[virel:VISITS]->(pl)
        set v.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
        set virel.starttime = v.starttime
        set v.endtime = v.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
        set virel.endtime = v.endtime
        set v.visittime = duration.between(v.starttime,v.endtime)
        set v.visittimeinseconds = v.visittime.seconds
        set virel.visittime = v.visittime
        set virel.visittimeinseconds = v.visittimeinseconds', 
    {batchSize:25000, parallel:false});
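As a quick sanity check after the load, the two models can be compared by counting their elements. The query below is a minimal sketch; the aliases visitNodes and visitRelationships are just illustrative names. Since every iteration creates one :Visit node and one :VISITS relationship, the two counts should be identical.

```cypher
// Count the intermediate nodes of the old model...
match (:Person)-[:PERFORMS_VISIT]->(v:Visit)
with count(v) as visitNodes
// ...and the direct relationships of the new model.
match (:Person)-[r:VISITS]->(:Place)
return visitNodes, count(r) as visitRelationships;
```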

Some people will be unconnected

Because of the randomisation, some people stay unconnected. Not a problem - in real life that would also be the case, right? Some people just don't go out :) ...

match (p:Person)
where not ((p)--())
return count(p);

Querying for starttimes using OLD model / node indexes

Index the visit nodes

CREATE INDEX visitnodestarttime FOR (v:Visit) ON (v.starttime);

Query on visit nodes

profile match (p:Person)-[:PERFORMS_VISIT]->(v:Visit)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;

The query plans before and after creating the index are very different:

Without the index: using NodeByLabelScan, with lots of db hits.
With the index: using NodeIndexSeekByRange, making the performance fly - from 4403ms to 7ms.

Querying for starttimes using NEW model

Now we can actually forget about the intermediate (:Visit) nodes, and just use the [:VISITS] relationships.

Index the VISITS relationships

Adding the index on the relationship property is very similar:

CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
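Before profiling, it can help to confirm that the new index has finished populating. The sketch below assumes the SHOW INDEXES command with a YIELD clause as available in Neo4j 4.3; the entityType column is what distinguishes node indexes from relationship indexes.

```cypher
// Wait until all indexes are online (default timeout applies).
CALL db.awaitIndexes();

// visitnodestarttime should show up as a NODE index,
// visitrelstarttime as a RELATIONSHIP index, both with state ONLINE.
SHOW INDEXES YIELD name, entityType, labelsOrTypes, properties, state;
```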

Now we can run the equivalent query on the new model.

Query on VISITS relationships

This is what that query looks like:

profile match (p:Person)-[v:VISITS]->(pl:Place)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;

Without the relationship index: using NodeByLabelScan, causing lots of db hits and 6 seconds of waiting.
With the relationship index: using DirectedRelationshipIndexSeekByRange - dropping the db hits and cutting the wait time to less than 8 milliseconds.
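To reproduce this before/after comparison on your own machine, one option is to drop the relationship index, profile the query, and then recreate the index and profile again. This is a sketch of that workflow rather than the exact steps used for the numbers above; timings will vary with hardware and page cache state.

```cypher
// 1. Remove the relationship index (IF EXISTS makes this safe to rerun).
DROP INDEX visitrelstarttime IF EXISTS;

// 2. Profile without the index: expect a label scan and many db hits.
profile match (p:Person)-[v:VISITS]->(pl:Place)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;

// 3. Recreate the index and wait for it to come online.
CREATE INDEX visitrelstarttime IF NOT EXISTS FOR ()-[v:VISITS]->() ON (v.starttime);
CALL db.awaitIndexes();

// 4. Rerun the profile query from step 2 and compare the operators and db hits.
```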

Conclusion:

Great performance, and a simpler model.

Rik Van Bruggen
