For Nick

Sample Data

I tried to copy your data model as I understood it. I created two User nodes, each with five View nodes. User 1’s sequence of views touches a distinct set of Page nodes; User 2’s does not, because User 2 visits the analytics page twice.

CREATE

// Users.
(u1:User {username:'User 1'}),
(u2:User {username:'User 2'}),

// Pages.
(p1:Page {id:'snowplowanalytics.com/'}),
(p2:Page {id:'snowplowanalytics.com/about/index.html'}),
(p3:Page {id:'snowplowanalytics.com/technology/index.html'}),
(p4:Page {id:'snowplowanalytics.com/product/index.html'}),
(p5:Page {id:'snowplowanalytics.com/analytics/index.html'}),
(p6:Page {id:'snowplowanalytics.com/blog/index.html'}),

// Views for User 1.
(v1:View {time:1, page:'snowplowanalytics.com/'}),
(v2:View {time:2, page:'snowplowanalytics.com/about/index.html'}),
(v3:View {time:3, page:'snowplowanalytics.com/technology/index.html'}),
(v4:View {time:4, page:'snowplowanalytics.com/product/index.html'}),
(v5:View {time:5, page:'snowplowanalytics.com/blog/index.html'}),

(v1)-[:OBJECT]->(p1),
(v2)-[:OBJECT]->(p2),
(v3)-[:OBJECT]->(p3),
(v4)-[:OBJECT]->(p4),
(v5)-[:OBJECT]->(p6),

(v1)<-[:PREV]-(v2)<-[:PREV]-(v3)<-[:PREV]-(v4)<-[:PREV]-(v5),

(u1)-[:VERB]->(v1),
(u1)-[:VERB]->(v2),
(u1)-[:VERB]->(v3),
(u1)-[:VERB]->(v4),
(u1)-[:VERB]->(v5),

// Views for User 2.

(v6:View {time:1, page:'snowplowanalytics.com/'}),
(v7:View {time:2, page:'snowplowanalytics.com/analytics/index.html'}),
(v8:View {time:3, page:'snowplowanalytics.com/blog/index.html'}),
(v9:View {time:4, page:'snowplowanalytics.com/analytics/index.html'}),
(v10:View {time:5, page:'snowplowanalytics.com/product/index.html'}),

(v6)-[:OBJECT]->(p1),
(v7)-[:OBJECT]->(p5),
(v8)-[:OBJECT]->(p6),
(v9)-[:OBJECT]->(p5),
(v10)-[:OBJECT]->(p4),

(v6)<-[:PREV]-(v7)<-[:PREV]-(v8)<-[:PREV]-(v9)<-[:PREV]-(v10),

(u2)-[:VERB]->(v6),
(u2)-[:VERB]->(v7),
(u2)-[:VERB]->(v8),
(u2)-[:VERB]->(v9),
(u2)-[:VERB]->(v10);

MATCH (n) RETURN n;
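
As a quick sanity check on the sample data, the following sketch counts each user's views and distinct page values. It assumes the CREATE statement above was run against an empty database; the column aliases are only illustrative.

MATCH (u:User)-[:VERB]->(v:View)
RETURN u.username AS Username, COUNT(v) AS Views, COUNT(DISTINCT v.page) AS `Distinct Pages`
ORDER BY Username

User 1 should show five views over five distinct pages, and User 2 five views over four distinct pages.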

An Incremental Solution

Since you store the URL on the View nodes as a page property, this problem is pretty approachable. Let’s first look at each user's sequence of page views:

MATCH p = (:View)<-[:PREV*4]-(:View)
RETURN EXTRACT(v IN NODES(p) | v.page) AS `Pages Visited`

As I said earlier, we see that User 1 has visited a distinct set of URLs while User 2 has visited 'snowplowanalytics.com/analytics/index.html' twice. Because we specified [:PREV*4] between two View nodes, we are matching paths of exactly four PREV relationships, and each such path contains five View nodes:

MATCH p = (:View)<-[:PREV*4]-(:View)
RETURN LENGTH(NODES(p)) AS `Number of View Nodes`

To ensure that we only match pathways where the page property is distinct across every View node on the path, we can count the distinct page values with COUNT(DISTINCT ...) and compare that count to the number of nodes on the path. For this we’ll need the magical UNWIND:

MATCH p = (:View)<-[:PREV*4]-(:View)
WITH p, EXTRACT(v IN NODES(p) | v.page) AS pages
UNWIND pages AS views
WITH p, COUNT(DISTINCT views) AS distinct_views
RETURN LENGTH(NODES(p)) AS `Number of View Nodes`, distinct_views AS `Number of Distinct Views`
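
To see what UNWIND contributes in isolation, here is a small sketch on a literal list rather than on the graph. The list values and the aliases page, total and distinct_total are made up for illustration.

UNWIND ['/', '/analytics', '/blog', '/analytics'] AS page
RETURN COUNT(page) AS total, COUNT(DISTINCT page) AS distinct_total

UNWIND turns the four-element list into four rows, so the aggregates see one row per element: total is 4 and distinct_total is 3. The queries here do the same thing per path, which is why p is carried along in the WITH.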

From here, we’ll want to filter out the pathways whose count of distinct views is not equal to the number of view nodes. A visual result:

MATCH p = (:View)<-[:PREV*4]-(:View)
WITH p, EXTRACT(v IN NODES(p) | v.page) AS pages
UNWIND pages AS views
WITH p, COUNT(DISTINCT views) AS distinct_views
WHERE distinct_views = LENGTH(NODES(p))
RETURN p;

A tabular result:

MATCH p = (:View)<-[:PREV*4]-(:View)
WITH p, EXTRACT(v IN NODES(p) | v.page) AS pages
UNWIND pages AS views
WITH p, COUNT(DISTINCT views) AS distinct_views
WHERE distinct_views = LENGTH(NODES(p))
RETURN EXTRACT(v IN NODES(p) | v.page) AS `Pages Visited`, COUNT(*) AS Count
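
If you are running Neo4j 4.0 or later, note that EXTRACT has been removed and LENGTH no longer accepts a list. Here is a sketch of the same tabular query for those versions, using a list comprehension in place of EXTRACT and SIZE in place of LENGTH. I have not run this variant against your data, so treat it as a starting point.

MATCH p = (:View)<-[:PREV*4]-(:View)
WITH p, [v IN NODES(p) | v.page] AS pages
UNWIND pages AS page
WITH p, pages, COUNT(DISTINCT page) AS distinct_views
WHERE distinct_views = SIZE(pages)
RETURN pages AS `Pages Visited`, COUNT(*) AS Count

Carrying pages through the second WITH also avoids rebuilding the list in the RETURN.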

And we see that only the pathways with a distinct set of page visits are returned. Please leave comments with any questions.
