A few thoughts on creating a "universal user ID":
Unfortuantely, this isn't a trival task. There are some general ways that you can get close, but Segment's data is rarely clean enough that you can create a perfect mapping.
Ultimately, the problem stems from the fact that Segment's two basic ways of tracking identity - anonymous ids and user ids - overlap in both directions. Anonymous ids, which are basically browser cookies, can map to multiple user ids if multiple people log in from the same browser. And user ids can map to multiple anonymous ids if a user uses multiple browers.
This means that there are some cases when you can't conclusively say which user an anonymous id represents. There are ways to make best guesses, but it's not certain.
First, I create a big table of all the user_ids and anonymous_ids that we've tracked. How you do this exactly depends a little bit on your implementation fo Segment, but we do something like this:
WITH events AS (
SELECT user_id, anonymous_id, sent_at FROM identifies
UNION ALL
SELECT user_id, anonymous_id, sent_at FROM tracks
UNION ALL
SELECT user_id, anonymous_id, sent_at FROM pages
)
This creates a big list of every event we've ever recorded and the anonymous id and user id from that event. Some folks use tracks and not identifies, or pages and identifies and not tracks; which tables you include here depends on this setup. But generally, I believe in including more events rather than being discrimating about which tables are included.
Once you have this table, I find all the combinations of anonymous ids and user ids:
SELECT user_id,
anonymous_id,
COUNT(*) AS events,
MIN(sent_at) AS first_event
FROM events
WHERE anonymous_id IS NOT NULL
AND anonymous_id != ''
AND user_id IS NOT NULL
AND user_id != ''
GROUP BY 1,2
(Note that you have to account for both nulls and empty strings - I think that might be an issue with Segment.) With this mapping you can now find which anonymous ids map to which user ids.
Next, you have to do a couple things to clear out duplicates and figure out which user id an anonymous id maps to if that anonymous id has multiple matches to a single user id. My method here is pretty crude - I just take the one with the most events.
To do that, I wrap this query in a query that adds some window functions. I also typically find the first event associated with that user id, which tells me roughly when that user first signed up:
mapping AS (
SELECT anonymous_id,
user_id,
MIN(first_event) OVER (PARTITION BY user_id) AS first_event_as_user,
RANK() OVER (PARTITION BY anonymous_id ORDER BY events, first_event) AS user_rank
FROM (
SELECT user_id,
anonymous_id,
COUNT(*) AS events,
MIN(sent_at) AS first_event
FROM events
WHERE anonymous_id IS NOT NULL
AND anonymous_id != ''
AND user_id IS NOT NULL
AND user_id != ''
GROUP BY 1,2
) a
ORDER BY 3 DESC
)
From here, you can now join this mapping onto your list of events (like a tracks table). Remember that the mapping above only maps to anonymous ids that have an associated user id, so you have to use a left join here. You also have to filter out the mappings where the user_rank
is greater than 1. Then, the logic in the select part of the query determines for each event if the user's identity is eventually known, and creates a universal id based on that.
mapped_events AS (
SELECT CASE WHEN (m.user_id IS NULL OR m.user_id = '')
AND (t.user_id IS NULL OR t.user_id = '')
THEN 'unknown' ELSE 'known' END AS identity_known,
CASE WHEN (t.user_id IS NOT NULL AND t.user_id != '') THEN t.user_id
WHEN (m.user_id IS NOT NULL AND m.user_id != '') THEN m.user_id
ELSE t.anonymous_id END AS universal_id,
t.*
FROM tracks t
LEFT JOIN mapping m
ON m.anonymous_id = t.anonymous_id
)
Finally, you can then use the same session query from the blog post to find sessions by universal user id. The query is exactly the same; just replace user_id
with universal_user_id
.
SELECT *,
SUM(is_new_session) OVER (ORDER BY universal_id, sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
SUM(is_new_session) OVER (PARTITION BY universal_id ORDER BY sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS user_session_id
FROM (
SELECT *,
CASE WHEN EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
OR last_event IS NULL
THEN 1 ELSE 0 END AS is_new_session
FROM (
SELECT *,
LAG(sent_at,1) OVER (PARTITION BY universal_id ORDER BY sent_at) AS last_event
FROM mapped_events
) last
) final
You can put this all together to get something like the query at the bottom.
This isn't perfect, because it doesn't quite handle all of the possible edge cases you can have with messy data. But it should cover the overwhelming majority of cases. It also translates pretty well into mobile tracking; if your logging is fairly standard, you can replace anonymous ids with device ids and all the other concepts are the same.
WITH events AS (
SELECT user_id, anonymous_id, sent_at FROM identifies
UNION ALL
SELECT user_id, anonymous_id, sent_at FROM tracks
UNION ALL
SELECT user_id, anonymous_id, sent_at FROM pages
),
mapping AS (
SELECT anonymous_id,
user_id,
MIN(first_event) OVER (PARTITION BY user_id) AS first_event_as_user,
RANK() OVER (PARTITION BY anonymous_id ORDER BY events, first_event) AS user_rank
FROM (
SELECT user_id,
anonymous_id,
COUNT(*) AS events,
MIN(sent_at) AS first_event
FROM events
WHERE anonymous_id IS NOT NULL
AND anonymous_id != ''
AND user_id IS NOT NULL
AND user_id != ''
GROUP BY 1,2
) a
ORDER BY 3 DESC
),
mapped_events AS (
SELECT CASE WHEN (m.user_id IS NULL OR m.user_id = '')
AND (t.user_id IS NULL OR t.user_id = '')
THEN 'unknown' ELSE 'known' END AS identity_known,
CASE WHEN (t.user_id IS NOT NULL AND t.user_id != '') THEN t.user_id
WHEN (m.user_id IS NOT NULL AND m.user_id != '') THEN m.user_id
ELSE t.anonymous_id END AS universal_id,
t.*
FROM tracks t
LEFT JOIN mapping m
ON m.anonymous_id = t.anonymous_id
ORDER BY t.anonymous_id
)
SELECT *,
SUM(is_new_session) OVER (ORDER BY universal_id, sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
SUM(is_new_session) OVER (PARTITION BY universal_id ORDER BY sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS user_session_id
FROM (
SELECT *,
CASE WHEN EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
OR last_event IS NULL
THEN 1 ELSE 0 END AS is_new_session
FROM (
SELECT *,
LAG(sent_at,1) OVER (PARTITION BY universal_id ORDER BY sent_at) AS last_event
FROM mapped_events
) last
) final