Neo4j 4.3 has been released. One of its key features is relationship property indexes - a really interesting addition.
Two main points of attention:
- Performance improvements: all of a sudden the Neo4j Cypher query planner is going to be able to use a lot more information, provided by these relationship indexes. The planner is becoming smarter - and therefore queries will become faster. We will explore this below.
- Modelling implications: the introduction of these indexes will have far-reaching implications with regard to how we model certain things. More options are good, of course!
The setup is similar to the work I did last year on contact tracing. Take a look at http://blog.bruggen.com/2020/06/what-recommender-systems-and-contact.html to see how that went.
- Using the faker plugin
- Download it from its GitHub page. Installation is super easy. Just make sure the config is updated too, so that fkr.* is whitelisted, just like we do with gds.* and apoc.*.
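As a rough sketch of that config change: the line below assumes a Neo4j 4.3 neo4j.conf, where the setting is called allowlist (older versions call it whitelist). The faker plugin's own README is the authority on what it needs, so treat the setting name and value as assumptions to verify.

```properties
# neo4j.conf - assumed setting name for Neo4j 4.3; adjust to your installation
dbms.security.procedures.allowlist=apoc.*,gds.*,fkr.*
```

Restart the database after changing the config so that the plugin's functions get loaded.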
The only difference: I am pushing the scale up to the size of my home city of Antwerp, Belgium.
Creating the people needs enough memory, but it should be possible to do it in one transaction.
foreach (i in range(1,500000) |
create (p:Person { id : i })
set p += fkr.person('1940-01-01','2021-06-01')
set p.healthstatus = fkr.stringElement("Sick,Healthy")
set p.confirmedtime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
set p.birthDate = datetime(p.birthDate)
set p.addresslocation = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
set p.name = p.fullName
remove p.fullName
);
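If memory is tight and the single transaction does not fit, one possible alternative is to batch the creation with apoc.periodic.iterate, the same procedure used for the visits further down. This is a minimal sketch, assuming APOC is installed. The batch size is an arbitrary choice, and only a few of the property assignments are repeated here.

```cypher
// Sketch: create the Person nodes in batches instead of one big transaction.
// The remaining SET clauses from the foreach version can be added in the same way.
CALL apoc.periodic.iterate(
  'UNWIND range(1,500000) AS i RETURN i',
  'CREATE (p:Person { id: i })
   SET p += fkr.person("1940-01-01","2021-06-01")
   SET p.healthstatus = fkr.stringElement("Sick,Healthy")
   SET p.name = p.fullName
   REMOVE p.fullName',
  {batchSize: 25000, parallel: false});
```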
Adding the places is instantaneous:
foreach (i in range(1,10000) |
create (p:Place { id: i, name: "Place nr "+i})
set p.type = fkr.stringElement("Grocery shop,Theater,Restaurant,School,Hospital,Mall,Bar,Park")
set p.location = point({x: toFloat(51.210197+rand()/100), y: toFloat(4.402771+rand()/100)})
);
We don't really need all of these node indexes for this demo - but they could be useful for other queries.
CREATE INDEX placenodeid FOR (p:Place) ON (p.id);
CREATE INDEX placenodelocation FOR (p:Place) ON (p.location);
CREATE INDEX placenodename FOR (p:Place) ON (p.name);
CREATE INDEX personnodeid FOR (p:Person) ON (p.id);
CREATE INDEX personnodename FOR (p:Person) ON (p.name);
CREATE INDEX personnodehealthstatus FOR (p:Person) ON (p.healthstatus);
CREATE INDEX personnodeconfirmedtime FOR (p:Person) ON (p.confirmedtime);
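Indexes are populated in the background, so it can help to confirm that they are online before loading more data or profiling queries. A small sketch, assuming Neo4j 4.3 (where SHOW INDEXES accepts YIELD); the 300-second timeout is an arbitrary value:

```cypher
// Block until all indexes are populated (timeout in seconds).
CALL db.awaitIndexes(300);

// List the indexes with their state; ONLINE means the planner can use them.
SHOW INDEXES YIELD name, state, entityType, labelsOrTypes, properties;
```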
The visits are added using periodic committing of transactions, through apoc.periodic.iterate. 89 seconds is not bad!
CALL apoc.periodic.iterate(
'with range(1,1500000) as range
unwind range as iteration return iteration',
'match (p:Person {id: toInteger(rand()*500000)+1}), (pl:Place {id:toInteger(rand()*10000)+1 })
create (p)-[:PERFORMS_VISIT]->(v:Visit { id: iteration})-[:LOCATED_AT]->(pl)
create (p)-[virel:VISITS]->(pl)
set v.starttime = datetime()-duration("P"+toInteger(round(rand()*100))+"DT"+toInteger(round(rand()*10))+"H")
set virel.starttime = v.starttime
set v.endtime = v.starttime + duration("PT"+toInteger(round(rand()*10))+"H"+toInteger(round(rand()*60))+"M")
set virel.endtime = v.endtime
set v.visittime = duration.between(v.starttime,v.endtime)
set v.visittimeinseconds = v.visittime.seconds
set virel.visittime = v.visittime
set virel.visittimeinseconds = v.visittimeinseconds',
{batchSize:25000, parallel:false});
The randomisation means that some people stay unconnected. Not a problem - in real life that would also be the case, right? Some people just don't go out :) ...
match (p:Person)
where not ((p)--())
return count(p);
CREATE INDEX visitnodestarttime FOR (v:Visit) ON (v.starttime);
profile match (p:Person)-[:PERFORMS_VISIT]->(v:Visit)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;
The profile / query plan before and after adding the index are very different:
- Before: using NodeByLabelScan: lots of db hits.
- After: using NodeIndexSeekByRange: making the performance fly.
From 4403ms to 7ms.
Now we can actually forget about the intermediate (:Visit) nodes, and just use the [:VISITS] relationships.
Adding the index to the relationship property is very similar:
CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
Now we can run the equivalent query on the new model.
This is what that query looks like:
profile match (p:Person)-[v:VISITS]->(pl:Place)
where v.starttime > datetime()-duration("P20DT17H")
and v.starttime < datetime()-duration("P20DT10H")
return p.name, sum(v.visittime) as totalvisittime, sum(v.visittimeinseconds) as totalvisittimeinseconds
order by totalvisittime desc
limit 10;
- Without the relationship index: using NodeByLabelScan, causing lots of db hits and 6 seconds of waiting.
- With the relationship index: using DirectedRelationshipIndexSeekByRange, dropping the db hits and decimating the wait time to less than 8 milliseconds.
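To reproduce the before/after comparison yourself, one option is to drop the relationship index, profile the query, and then recreate the index. A minimal sketch, reusing the index name from above; the timeout value is an assumption:

```cypher
// Drop the relationship index, then re-run the PROFILE query to see the scan-based plan.
DROP INDEX visitrelstarttime IF EXISTS;

// Recreate the index and wait for it to be populated before profiling again.
CREATE INDEX visitrelstarttime FOR ()-[v:VISITS]->() ON (v.starttime);
CALL db.awaitIndexes(300);
```

Timings will vary with hardware and page cache warm-up, so it is worth running each profile more than once.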
Great performance, and a simpler model.
Rik Van Bruggen