Created
November 13, 2019 08:28
-
-
Save arun057/d8dee65ece54de42b574f287ed9e601d to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
-- Imported data into postgresql running locally. For larger data sets I would go the map reduce route, seemed overkill here. | |
-- All of the questions are answered with sql - I did end up double checking some of the results with python - happy to provide that code on request. | |
-- Consider only the rows with country_id = "BDV" (there are 844 such rows). | |
-- For each site_id, we can compute the number of unique user_id's found in these 844 rows. | |
-- Which site_id has the largest number of unique users? And what's the number? | |
SELECT x.site_id, COUNT(x.site_id) as site_count | |
FROM (SELECT DISTINCT user_id, site_id FROM visits WHERE country_id = 'BDV') AS x | |
GROUP BY x.site_id | |
ORDER BY site_count DESC | |
LIMIT 1 | |
-- site_id site_count | |
-- "5NPAU" "544" | |
-- Between 2019-02-03 00:00:00 and 2019-02-04 23:59:59, | |
-- there are four users who visited a certain site more than 10 times. | |
-- Find these four users & which sites they (each) visited more than 10 times. | |
-- (Simply provides four triples in the form (user_id, site_id, number of visits) | |
-- in the box below.) | |
SELECT user_id, site_id, (COUNT(site_id)) AS site_count | |
FROM visits | |
WHERE ts > '2019-02-03 00:00:00' | |
and ts < '2019-02-04 23:59:59' | |
GROUP BY user_id, site_id | |
HAVING COUNT(site_id) > 10 | |
ORDER BY site_count DESC; | |
-- user_id site_id site_count | |
-- "LC3A59" "N0OTG" "26" | |
-- "LC06C3" "N0OTG" "25" | |
-- "LC3C9D" "N0OTG" "17" | |
-- "LC3C7E" "3POLC" "15" | |
-- For each site, compute the unique number of users whose last visit | |
-- (found in the original data set) was to that site. | |
-- For instance, user "LC3561"'s last visit is to "N0OTG" based on | |
-- timestamp data. Based on this measure, what are top three sites? | |
-- (hint: site "3POLC" is ranked at 5th with 28 users whose last | |
-- visit in the data set was to 3POLC; | |
-- simply provide three pairs in the form (site_id, number of users).) | |
SELECT v.site_id, COUNT(v.user_id) as site_count | |
FROM visits AS v | |
INNER JOIN ( | |
SELECT MAX(ts) as latest_ts, user_id | |
FROM visits | |
GROUP BY user_id | |
ORDER BY user_id | |
) AS x | |
ON v.user_id=x.user_id AND v.ts=x.latest_ts | |
GROUP BY v.site_id | |
ORDER BY site_count DESC | |
LIMIT 3 | |
-- site_id site_count | |
-- "5NPAU" "992" | |
-- "N0OTG" "561" | |
-- "QGO3G" "289" | |
-- For each user, determine the first site he/she visited and | |
-- the last site he/she visited based on the timestamp data. | |
-- Compute the number of users whose first/last visits | |
-- are to the same website. What is the number? | |
SELECT COUNT(*) FROM ( | |
SELECT a.user_id, a.site_id, a.ts | |
FROM visits AS a | |
INNER JOIN | |
( | |
SELECT MIN(c.ts) as first_visit, c.user_id | |
FROM visits AS c | |
GROUP BY c.user_id | |
) AS b | |
ON b.first_visit=a.ts AND b.user_id=a.user_id | |
) AS low | |
INNER JOIN ( | |
SELECT a.user_id, a.site_id, a.ts | |
FROM visits AS a | |
INNER JOIN | |
( | |
SELECT MAX(c.ts) as first_visit, c.user_id | |
FROM visits AS c | |
GROUP BY c.user_id | |
) AS b | |
ON b.first_visit=a.ts AND b.user_id=a.user_id | |
) AS high | |
ON low.site_id=high.site_id AND low.user_id=high.user_id | |
-- count | |
-- "1670" |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment