I wrote a series of 4 blogposts about this topic. For more detail please refer to:
-
Part 1: how I go about creating a synthetic dataset, and import that into Neo4j
-
Part 2: how I can start running some interesting queries on the dataset, making me understand some of the interesting data points in there and questions that one might ask
-
Part 3: how I can use graph data science on this dataset, and understand some of the predictive metrics like pagerank, betweenness and use community detection to direct policies
-
Part 4: a number of loose ends that I touched on during my exploration - but surely did not exhaust.
There’s so much potential in this dataset, and in this problem domain in general. I feel like I have gone into the rabbit hole and have just resurfaced for some air. But who knows, maybe I will dive back in and do some more digging - after all, this is interesting stuff, and I love working on interesting topics.
In this guide I will show you the statements in an orderly sequence. Let’s start.
Here’s the full spreadsheet with synthetic data. The csv files are to be found here:
Now we can use these scripts to load the data.
All of these scripts are also on also on github: github
System requirements: these queries have been tested on Neo4j Enterprise 3.5.17 and 4.0.3, apoc 3.5.0.9 and 4.0.0.6 respectively, and gds 1.1.
Import the persons from the Person sheet:
load csv with headers from
"https://docs.google.com/spreadsheets/u/0/d/1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE/export?format=csv&id=1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE&gid=0" as csv
create (p:Person {id: csv.PersonId, name:csv.PersonName, healthstatus:csv.Healthstatus, confirmedtime:datetime(csv.ConfirmedTime), addresslocation:point({x: toFloat(csv.AddressLat), y: toFloat(csv.AddressLong)})});
Import the places from the places worksheet:
load csv with headers from
"https://docs.google.com/spreadsheets/u/0/d/1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE/export?format=csv&id=1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE&gid=205425553" as csv
create (p:Place {id: csv.PlaceId, name:csv.PlaceName, type:csv.PlaceType, location:point({x: toFloat(csv.Lat), y: toFloat(csv.Long)})});
Create a couple of indexes to make easier/faster to create the Visit nodes and relationships:
create index on :Place(id);
create index on :Place(location);
create index on :Place(name);
create index on :Person(id);
create index on :Person(name);
create index on :Person(healthstatus);
create index on :Person(confirmedtime);
Import the VISITS from the Visits sheet. Note that we are loading duplicate info here: both with a VISIT node, and a VISITS relationship. They can both be useful.
load csv with headers from
"https://docs.google.com/spreadsheets/d/1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE/export?format=csv&id=1R-XVuynPsOWcXSderLpq3DacZdk10PZ8v6FiYGTncIE&gid=1261126668" as csv
match (p:Person {id:csv.PersonId}), (pl:Place {id:csv.PlaceId})
create (p)-[:PERFORMS_VISIT]->(v:Visit {id:csv.VisitId, starttime:datetime(csv.StartTime), endtime:datetime(csv.EndTime)})-[:LOCATED_AT]->(pl)
create (p)-[vi:VISITS {id:csv.VisitId, starttime:datetime(csv.StartTime), endtime:datetime(csv.EndTime)}]->(pl)
set v.duration=duration.inSeconds(v.starttime,v.endtime)
set vi.duration=duration.inSeconds(vi.starttime,vi.endtime);
OPTIONAL: connect places to a Region, Country, Continent
create (r:Region {name:"Antwerp"})-[:PART_OF]->(c:Country {name:"Belgium"})-[:PART_OF]->(co:Continent {name:"Europe"});
match (r:Region {name:"Antwerp"}), (pl:Place)
create (pl)-[:PART_OF]->(r);
System requirements: these queries have been tested on Neo4j Enterprise 3.5.17 and 4.0.3, apoc 3.5.0.9 and 4.0.0.6 respectively, and gds 1.1.
The queries are also on github.
match (p:Person {healthstatus:"Sick"})
with p
limit 1
match (p)--(v1:Visit)--(pl:Place)--(v2:Visit)--(p2:Person {healthstatus:"Healthy"})
return p.name as Spreader, v1.starttime as SpreaderStarttime, v2.endtime as SpreaderEndtime, pl.name as PlaceVisited, p2.name as Target, v2.starttime as TargetStarttime, v2.endtime as TargetEndttime;
Who has a sick person potentially infected - VISUAL
match (p:Person {healthstatus:"Sick"})
with p
limit 1
match path = (p)-->(v1:Visit)-->(pl:Place)<--(v2:Visit)<--(p2:Person {healthstatus:"Healthy"})
return path;
match (p:Person {healthstatus:"Sick"})
with p
limit 1
match path = (p)-[:VISITS]->(pl:Place)<-[:VISITS]-(p2:Person {healthstatus:"Healthy"})
return path;
The latest of start times must occur before (or at the same time) as the earliest of the end times for the ranges to overlap.
Note that at the time or writing, apoc.coll.min and apoc.coll.max do not work on apoc 4.0.0.7 or later. Please use apoc version 4.0.0.6 which you can find over here
match (p:Person {healthstatus:"Sick"})-->(v1:Visit)-->(pl:Place)
with p,v1,pl
limit 10
match path = (p)-->(v1)-->(pl)<--(v2:Visit)<--(p2:Person {healthstatus:"Healthy"})
WITH path, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
return path;
Who has a sick person infected - with time overlap AND SIMPLIFIED with the VISITS relationship. The latest of start times must occur before (or at the same time) as the earliest of the end times for the ranges to overlap.
match (p:Person {healthstatus:"Sick"})-[v1:VISITS]->(pl:Place)
with p,v1,pl
limit 10
match path = (p)-[v1]->(pl)<-[v2:VISITS]-(p2:Person {healthstatus:"Healthy"})
WITH path, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
return path;
Who has a sick person infected - with time overlap +/- 2hrs. The latest of start times must occur before (or at the same time) as the earliest of the end times for the ranges to overlap.
match (p:Person {healthstatus:"Sick"})-->(s1:Stay)-->(pl:Place)
with p,s1,pl
limit 10
match path = (p)-->(s1)-->(pl)<--(s2:Stay)<--(p2:Person {healthstatus:"Healthy"})
WITH path, apoc.coll.max([s1.starttime.epochMillis, s2.starttime.epochMillis]) as maxStart,
apoc.coll.min([s1.endtime.epochMillis, s2.endtime.epochMillis]) as minEnd
where maxStart-720000 <= minEnd+720000
return path;
match (p:Person {healthstatus:"Sick"})-[visited]->(pl:Place)
where p.confirmedtime < visited.starttime
return p, visited, pl
limit 10;
match (p1:Person {healthstatus:"Sick"}),(p2:Person {healthstatus:"Sick"})
where id(p1)<id(p2)
with p1, p2
match path = allshortestpaths ((p1)-[*]-(p2))
return path
limit 10;
match (hp:Person {healthstatus:"Healthy"})-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(sp:Person {healthstatus:"Sick"})
with hp, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
return hp.name, hp.healthstatus, sum(minEnd-maxStart) as overlaptime
order by overlaptime desc;
Which healthy person has the highest risk - based on amount over overlaptime with sick people - VISUAL
match (hp:Person {healthstatus:"Healthy"})-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(sp:Person {healthstatus:"Sick"})
with hp, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
with hp, sum(minEnd-maxStart) as overlaptime
order by overlaptime desc
limit 10
match (hp)-[v]-(pl:Place)
return hp,v,pl;
match (p:Person {healthstatus:"Sick"})-[v:VISITS]->(pl:Place)
with distinct pl.name as placename, count(v) as nrofsickvisits, apoc.node.degree.in(pl,'VISITS') as totalnrofvisits
order by nrofsickvisits desc
limit 10
return placename, nrofsickvisits, totalnrofvisits, round(toFloat(nrofsickvisits)/toFloat(totalnrofvisits)*10000)/100 as percentageofsickvisits;
Places with most sick visits - VISUAL
match (p:Person {healthstatus:"Sick"})-[v:VISITS]->(pl:Place)
with distinct pl.name as placename, count(v) as nrofsickvisits, pl
order by nrofsickvisits desc
limit 10
match (pl)<-[v]-(p:Person)
return pl,p,v;
Note that at the time of writing, these queries have been tested on Neo4j 3.5.17. Neo4j 4.0.3 currently not yet supports the GDS plugin.
All the scripts are of course also on github
This is a relationship between two PERSON nodes that we will need for our graph data science exercises.
match (p1:Person)-[v1:VISITS]->(pl:Place)<-[v2:VISITS]-(p2:Person)
where id(p1)<id(p2)
with p1, p2, apoc.coll.max([v1.starttime.epochMillis, v2.starttime.epochMillis]) as maxStart,
apoc.coll.min([v1.endtime.epochMillis, v2.endtime.epochMillis]) as minEnd
where maxStart <= minEnd
with p1, p2, sum(minEnd-maxStart) as meetTime
create (p1)-[:MEETS {meettime: duration({seconds: meetTime/1000})}]->(p2);
:param limit => (10);
:param config => ({
nodeProjection: 'Person',
relationshipProjection: {
relType: {
type: 'MEETS',
orientation: 'NATURAL',
properties: {}
}
},
relationshipWeightProperty: null,
dampingFactor: 0.85,
maxIterations: 20,
writeProperty: 'pagerank'
});
CALL gds.pageRank.write($config);
Look at the Person pagerank table results:
MATCH (node)
WHERE not(node[$config.writeProperty] is null)
RETURN node.name as name, node[$config.writeProperty] AS pagerank, node.betweenness as betweenness
ORDER BY pagerank DESC
LIMIT 10;
Look at the Person pagerank graph results VISUALLY:
MATCH (node)
WHERE not(node[$config.writeProperty] is null)
with node, node[$config.writeProperty] AS score
ORDER BY score DESC
LIMIT 10
match (node)-[r]-(conn)
return node, r, conn
:param limit => (20);
:param config => ({
nodeProjection: 'Person',
relationshipProjection: {
relType: {
type: 'MEETS',
orientation: 'NATURAL',
properties: {}
}
},
writeProperty: 'betweenness'
});
CALL gds.alpha.betweenness.write($config);
Look at the Person betweenness results table:
MATCH (node)
WHERE not(node[$config.writeProperty] is null)
RETURN node.name as name, node.pagerank as pagerank, node[$config.writeProperty] AS betweenness
ORDER BY betweenness DESC
LIMIT 10;
Look at the Person betweenness results VISUALLY:
MATCH (node)
WHERE not(node[$config.writeProperty] is null)
with node, node[$config.writeProperty] AS score
ORDER BY score DESC
LIMIT 10
match (node)-[r]-(conn)
return node, r, conn;
Preparation for relationship weight property: needs integer, is currently set up as a duration!
MATCH p=()-[r:MEETS]->()
set r.meettimeinseconds=r.meettime.seconds;
Now we can calculate communities using Louvain:
:param limit => ( 50);
:param config => ({
nodeProjection: 'Person',
relationshipProjection: {
relType: {
type: 'MEETS',
orientation: 'NATURAL',
properties: {
meettimeinseconds: {
property: 'meettimeinseconds',
defaultValue: 1
}
}
}
},
relationshipWeightProperty: 'meettimeinseconds',
includeIntermediateCommunities: false,
seedProperty: '',
writeProperty: 'louvain'
});
CALL gds.louvain.write($config);
What are the different communities?
match (p:Person)
return distinct p.louvain, count(p)
order by count(p) desc;
Explore community 489:
match (p1:Person {louvain: 489})-[v:VISITS]->(pl:Place), (p1)-[m:MEETS]->(p2:Person)
return p1, p2, pl, v, m;