Skip to content

Instantly share code, notes, and snippets.

@rvanbruggen
Last active March 30, 2021 20:01
Show Gist options
  • Save rvanbruggen/fe0d2f0f6769a563755f0e6ab3870b60 to your computer and use it in GitHub Desktop.
Save rvanbruggen/fe0d2f0f6769a563755f0e6ab3870b60 to your computer and use it in GitHub Desktop.
# Clickstream Analysis with Neo4j

1. Loading the data

start time

Return time();```
 
### create index before merging
```cypher
CREATE CONSTRAINT ON (n:Page) ASSERT n.title IS UNIQUE;

load from the tsv (note that the " have been replaced by "")

Call apoc.periodic.iterate('
  Call apoc.load.csv("file:///clickstream-enwiki-2021-02.tsv", {sep:"TAB", header:false}) yield list as line return line',
    'MERGE (p1:Page {title: line[0]})
    MERGE (p2:Page {title: line[1]})
    CREATE (p1)-[:LINKS_TO {type: line[2], quantity: toInteger(line[3])}]->(p2)
    ',
  {batchSize:25000, iterateList:true, parallel:false});

remove the "other" links

match ()-[r:LINKS_TO {type:"other"}]->()
delete r;

store the "other" link quantities in separate properties, before deleting them

match (source:Page {title:"other-empty"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-empty` = r.quantity;
 
match (source:Page {title:"other-external"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-external` = r.quantity;
 
match (source:Page {title:"other-internal"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-internal` = r.quantity;
 
match (source:Page {title:"other-other"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-other` = r.quantity;
 
match (source:Page {title:"other-search"})-[r:LINKS_TO]->(target:Page)
set target.`additional_other-search` = r.quantity;

delete all the other-relationships

Call apoc.periodic.iterate('
  match (source:Page)-[r]-()
  where source.title starts with "other-"
  return r, source',
  'Delete r',
  {batchSize:10000, parallel:false});

delete the other nodes (just a handful)

match (source:Page)
where source.title starts with "other-"
delete source;

2. Querying the data

data profiling: how many links

match (n)-[r:LINKS_TO]->(m)
return distinct r.type, count(r)

exploring pages linking to Neo4j

match (source:Page)-[sourcelink:LINKS_TO]->(neo:Page {title:"Neo4j"}) return source, sourcelink, neo

![[Pasted image 20210326074745.png]]

what pages are linking to Neo4j

match path = (sourcepage:Page)-[:LINKS_TO*..2]->(neopage:Page {title: "Neo4j"})
return path
Limit 100

![[Pasted image 20210326075048.png]]

what pages are linking to Neo4j with titles and quantities

order by nrofclicks desc;

what pages are linking to the pages linking to Neo4j

match (source:Page)-[sourcelink:LINKS_TO*..2]->(neo:Page {title:"Neo4j"}) 
return source, sourcelink, neo

total quantity of links 2 hops away

match (source:Page)-[sourcelink:LINKS_TO*..2]->(neo:Page {title:"Neo4j"}) 
return source.title, REDUCE(sumofq = 0, r IN sourcelink | sumofq + r.quantity) AS total
order by total desc
limit 10;

total quantity of links 2 hops away with path

match (sourceofsource:Page)-\[sourceofsourcelink:LINKS\_TO\]->(source:Page)-\[sourcelink:LINKS\_TO\]->(neo:Page {title:"Neo4j"})

return sourceofsource.title+"==>"+source.title+"==> Neo4j" as pages, sourceofsourcelink.quantity+sourcelink.quantity as total

order by total desc

limit 10;

total quantity of links 2 hops away with variable length path

We can play with the length of the path!

match path = (source:Page)-[sourcelink:LINKS_TO*..2]->(neo:Page {title:"Neo4j"}) 
RETURN [node IN nodes(path) | node.title] as titles, 
reduce(sumofq = 0, r in relationships(path) | sumofq + r.quantity) as total
order by total desc
limit 10

4 hop example

![[Pasted image 20210326125550.png]]

total quantity of outgoing hops with variable length path

match path = (target:Page)<-[targetlink:LINKS_TO*..4]-(neo:Page {title:"Neo4j"}) 
return reverse([node IN nodes(path) | node.title]) as titles, 
reduce(sumofq = 0, r in relationships(path) | sumofq + r.quantity) as total
order by total desc
limit 10

![[Pasted image 20210326125318.png]]

About case sensitiveness:

Case sensitive query

match (p:Page)
where p.title contains "beer"
return count(p)

Case insensitive query

match (p:Page)
where p.title =~ "(?i).*beer.*"
    return count(p)

using a fulltext (non-schema) index

Create the index

CALL db.index.fulltext.createNodeIndex("pagetitleindex", ["Page"], ["title"]);

Use the index to return counts

CALL db.index.fulltext.queryNodes("pagetitleindex", "*beer*") YIELD node return count(node);```

#### Use the index to return titles
```cypher
CALL db.index.fulltext.queryNodes("pagetitleindex", "*beer*") YIELD node, score
RETURN node.title;

3. Create the indexes for GDS properties

Create index on :Page(pagerank);
create index on :Page(louvain);
create index on :Page(approxBetweenness);
create index on :Page(lpa);
create index on :Page(HITS_auth);
create index on :Page(HITS_hub);

4-clickstream bloom queries

Path between $page1 and $page2

Match (p1:Page {title: $page1}), (p2:Page {title: $page2}), path = allshortestpaths ((p1)-[*]-(p2))
return path
limit 100

Louvain Community of $param

Match (p:Page)
where p.title = $param
with p.louvain as neolouvain
match (p:Page {louvain: neolouvain})
return p
limit 100;
{"name":"Clickstream perspective","id":"d6d23920-8bf2-11eb-b75f-9b97a48daad6","categories":[{"id":0,"name":"Other","color":"#6B6B6B","size":1,"icon":"no-icon","labels":[],"properties":[],"caption":[""]},{"id":1,"name":"Page","color":"#FFC454","size":1,"icon":"no-icon","labels":["Page"],"properties":[{"name":"HITS_hub","exclude":false,"isCaption":false,"dataType":"number"},{"name":"lpa","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"approxBetweenness","exclude":false,"isCaption":false,"dataType":"number"},{"name":"HITS_auth","exclude":false,"isCaption":false,"dataType":"number"},{"name":"degree","exclude":false,"isCaption":false,"dataType":"number"},{"name":"pagerank","exclude":false,"isCaption":false,"dataType":"number"},{"name":"additional_other-external","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"additional_other-internal","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"additional_other-search","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"louvain","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"title","exclude":false,"isCaption":true,"dataType":"string"},{"name":"additional_other-empty","exclude":false,"isCaption":false,"dataType":"bigint"}],"caption":[""],"createdAt":1616516066776,"lastEditedAt":1616516912117,"styleRules":[{"type":"range","size":1.151817982321778,"minSize":1,"maxSize":4,"minColor":"#D5EEE2","midColor":"#81CCA8","maxColor":"#428C6A","minPoint":"0.15000000000000002","maxPoint":"259.77450078576805","color":"#FFE081","basedOn":"pagerank_number","valuesMapper":[],"existingValues":[],"maxSizeValue":"259.77450078576805","maxColorValue":"259.77450078576805","minSizeValue":"0.15000000000000002","minColorValue":"0.15000000000000002","applyColor":false,"applySize":true}]}],"labels":{"Page":[{"propertyKey":"HITS_hub","type":"Page","dataType":"number"},{"propertyKey":"lpa","type":"Page","dataType":"bigint"},{"propertyKey":"approxBetweenness","type":"Page","dataType":"number"},{"propertyKey":"HITS_auth","type":"Page","dataType":"number"},{"propertyKey":"degree","type":"Page","dataType":"number"},{"propertyKey":"pagerank","type":"Page","dataType":"number"},{"propertyKey":"additional_other-external","type":"Page","dataType":"bigint"},{"propertyKey":"additional_other-internal","type":"Page","dataType":"bigint"},{"propertyKey":"additional_other-search","type":"Page","dataType":"bigint"},{"propertyKey":"louvain","type":"Page","dataType":"bigint"},{"propertyKey":"title","type":"Page","dataType":"string"},{"propertyKey":"additional_other-empty","type":"Page","dataType":"bigint"}]},"relationshipTypes":[{"properties":[{"propertyKey":"quantity","type":"LINKS_TO","dataType":"bigint"},{"propertyKey":"type","type":"LINKS_TO","dataType":"string"}],"name":"LINKS_TO","id":"LINKS_TO","styleRules":[{"type":"range","size":1.1216216216216217,"minSize":1,"maxSize":4,"minColor":"#D5EEE2","midColor":"#81CCA8","maxColor":"#428C6A","minPoint":"11","maxPoint":"603","color":"#FFE081","basedOn":"quantity_bigint","valuesMapper":[],"existingValues":[],"maxSizeValue":"603","maxColorValue":"603","minSizeValue":"11","minColorValue":"11","applyColor":false,"applySize":true}],"color":"#8DCC93"}],"palette":{"colors":["#FFE081","#C990C0","#F79767","#57C7E3","#F16667","#D9C8AE","#8DCC93","#ECB5C9","#4C8EDA","#FFC454","#DA7194","#569480","#848484","#D9D9D9"],"currentIndex":1},"createdAt":1616516066738,"lastEditedAt":1616516066738,"templates":[{"name":"Path between pages","id":"tmpl:1616518071376","createdAt":1616518071376,"text":"Path between $page1 and $page2","cypher":"Match (p1:Page {title: $page1}), (p2:Page {title: $page2}), path = allshortestpaths ((p1)-[*]-(p2))\nreturn path\nlimit 100","params":[{"name":"$page1","dataType":"String","suggestionLabel":"Page","suggestionProp":"title","cypher":null},{"name":"$page2","dataType":"String","suggestionLabel":"Page","suggestionProp":"title","cypher":null}],"hasCypherErrors":false},{"name":"Louvain Community of a node","id":"tmpl:1616516386622","createdAt":1616516386622,"text":"Louvain Community of $param","cypher":"Match (p:Page)\nwhere p.title = $param\nwith p.louvain as neolouvain\nmatch (p:Page {louvain: neolouvain})\nreturn p\nlimit 100;","params":[{"name":"$param","dataType":"String","suggestionLabel":"Page","suggestionProp":"title","cypher":null}],"hasCypherErrors":false}],"hiddenRelationshipTypes":[],"hiddenCategories":[],"hideUncategorisedData":false,"parentPerspectiveId":null,"metadata":{"pathSegments":[{"source":"Page","relationshipType":"LINKS_TO","target":"Page"}],"indexes":[{"label":"Page","type":"native","propertyKeys":["title"]}],"stats":{"labels":{},"relationshipTypes":{"LINKS_TO":21247069}}},"version":"1.5.1"}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment