I tried to copy your data model as I understood it. I created two User nodes, each with five View nodes. User 1’s sequence of views touches a distinct set of Page nodes; User 2’s does not, because User 2 visits the analytics page twice.
CREATE
// Users.
(u1:User {username:'User 1'}),
(u2:User {username:'User 2'}),
// Pages.
(p1:Page {id:'snowplowanalytics.com/'}),
(p2:Page {id:'snowplowanalytics.com/about/index.html'}),
(p3:Page {id:'snowplowanalytics.com/technology/index.html'}),
(p4:Page {id:'snowplowanalytics.com/product/index.html'}),
(p5:Page {id:'snowplowanalytics.com/analytics/index.html'}),
(p6:Page {id:'snowplowanalytics.com/blog/index.html'}),
// Views for User 1.
(v1:View {time:1, page:'snowplowanalytics.com/'}),
(v2:View {time:2, page:'snowplowanalytics.com/about/index.html'}),
(v3:View {time:3, page:'snowplowanalytics.com/technology/index.html'}),
(v4:View {time:4, page:'snowplowanalytics.com/product/index.html'}),
(v5:View {time:5, page:'snowplowanalytics.com/blog/index.html'}),
(v1)-[:OBJECT]->(p1),
(v2)-[:OBJECT]->(p2),
(v3)-[:OBJECT]->(p3),
(v4)-[:OBJECT]->(p4),
(v5)-[:OBJECT]->(p6),
(v1)<-[:PREV]-(v2)<-[:PREV]-(v3)<-[:PREV]-(v4)<-[:PREV]-(v5),
(u1)-[:VERB]->(v1),
(u1)-[:VERB]->(v2),
(u1)-[:VERB]->(v3),
(u1)-[:VERB]->(v4),
(u1)-[:VERB]->(v5),
// Views for User 2.
(v6:View {time:1, page:'snowplowanalytics.com/'}),
(v7:View {time:2, page:'snowplowanalytics.com/analytics/index.html'}),
(v8:View {time:3, page:'snowplowanalytics.com/blog/index.html'}),
(v9:View {time:4, page:'snowplowanalytics.com/analytics/index.html'}),
(v10:View {time:5, page:'snowplowanalytics.com/product/index.html'}),
(v6)-[:OBJECT]->(p1),
(v7)-[:OBJECT]->(p5),
(v8)-[:OBJECT]->(p6),
(v9)-[:OBJECT]->(p5),
(v10)-[:OBJECT]->(p4),
(v6)<-[:PREV]-(v7)<-[:PREV]-(v8)<-[:PREV]-(v9)<-[:PREV]-(v10),
(u2)-[:VERB]->(v6),
(u2)-[:VERB]->(v7),
(u2)-[:VERB]->(v8),
(u2)-[:VERB]->(v9),
(u2)-[:VERB]->(v10);
MATCH (n) RETURN n;
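As a quick sanity check on the sample data, a query along these lines lists each user's views in time order next to the Page node each view points at. It is only a sketch and assumes the CREATE statement above was run against an otherwise empty database; the column aliases are my own.

```cypher
// Follow User -> View -> Page and list every view in time order per user.
MATCH (u:User)-[:VERB]->(v:View)-[:OBJECT]->(p:Page)
RETURN u.username AS User, v.time AS Time, p.id AS Page
ORDER BY u.username, v.time;
```

With the data above this should give ten rows, five per user, with User 2's views at times 2 and 4 both pointing at the analytics page.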
Since you store the URL on the View nodes as a page property, this problem is pretty approachable. Let’s first look at the two users' sequences of page views:
MATCH p = (:View)<-[:PREV*4]-(:View)
RETURN EXTRACT(v IN NODES(p) | v.page) AS `Pages Visited`
As noted above, User 1 visited a distinct set of URLs while User 2 visited 'snowplowanalytics.com/analytics/index.html' twice. Because we specified [:PREV*4] between two View nodes, we are matching paths of length four, which contain five View nodes:
MATCH p = (:View)<-[:PREV*4]-(:View)
RETURN LENGTH(NODES(p)) AS `Number of View Nodes`
To match only paths where the page property of each View node is distinct across the path, we can COUNT(DISTINCT ...) the page values on each path and compare that count to the number of nodes on the path. COUNT is an aggregate over rows, so we first need the magical UNWIND to turn each path's list of pages into rows:
MATCH p = (:View)<-[:PREV*4]-(:View)
WITH p, EXTRACT(v IN NODES(p) | v.page) AS pages
UNWIND pages AS views
WITH p, COUNT(DISTINCT views) AS distinct_views
RETURN LENGTH(NODES(p)) AS `Number of View Nodes`, distinct_views AS `Number of Distinct Views`
From here, we’ll want to filter out the pathways whose count of distinct views is not equal to the number of view nodes. A visual result:
MATCH p = (:View)<-[:PREV*4]-(:View)
WITH p, EXTRACT(v IN NODES(p) | v.page) AS pages
UNWIND pages AS views
WITH p, COUNT(DISTINCT views) AS distinct_views
WHERE distinct_views = LENGTH(NODES(p))
RETURN p;
A tabular result:
MATCH p = (:View)<-[:PREV*4]-(:View)
WITH p, EXTRACT(v IN NODES(p) | v.page) AS pages
UNWIND pages AS views
WITH p, COUNT(DISTINCT views) AS distinct_views
WHERE distinct_views = LENGTH(NODES(p))
RETURN EXTRACT(v IN NODES(p) | v.page) AS `Pages Visited`, COUNT(*) AS Count
And we see that only the pathways with a distinct set of page visits are returned. Please leave comments with any questions.
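The queries above use EXTRACT and LENGTH on a list, both of which were removed in Neo4j 4.0. If you are on a newer version, the tabular query can be sketched roughly as follows, with a list comprehension in place of EXTRACT and SIZE in place of LENGTH. I am assuming the same sample data and the same [:PREV*4] path length; the pages and distinct_pages names are just my own aliases.

```cypher
// Collect the page property of every View on each five-node path.
MATCH p = (:View)<-[:PREV*4]-(:View)
WITH p, [v IN nodes(p) | v.page] AS pages
// Turn the list into rows so the distinct values can be counted per path.
UNWIND pages AS page
WITH p, pages, count(DISTINCT page) AS distinct_pages
// Keep only paths on which no page repeats.
WHERE distinct_pages = size(pages)
RETURN pages AS `Pages Visited`, count(*) AS Count;
```

Carrying pages through the WITH keeps it as a grouping key alongside p, so the list does not have to be rebuilt in the RETURN.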