Created
January 17, 2020 10:34
import pandas as pd
df = pd.read_csv("sample_data_3.csv")
df['ts'] = pd.to_datetime(df['ts'])
# Consider only the rows with country_id = "BDV" (there are 844 such rows).
# For each site_id, compute the number of unique user_ids found in these 844 rows.
# Which site_id has the largest number of unique users, and what is that number?
is_BDV = df['country_id'] == 'BDV'
df_is_BDV = df[is_BDV]
df_is_BDV_by_site = df_is_BDV.groupby('site_id')
# One-row Series: the index is the top site_id, the value is its unique-user count.
answer_1 = df_is_BDV_by_site['user_id'].nunique().sort_values().tail(1)
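# A minimal sketch of the same idea on a tiny made-up frame (the rows and the
# demo_q1_* names are hypothetical, not taken from sample_data_3.csv):
# count distinct users per site, then read the winner off with idxmax()/max()
# instead of sorting.
import pandas as pd

demo_q1 = pd.DataFrame({
    'country_id': ['BDV', 'BDV', 'BDV', 'BDV', 'XYZ'],
    'site_id':    ['A',   'A',   'A',   'B',   'B'],
    'user_id':    ['u1',  'u2',  'u1',  'u1',  'u3'],
})
# Distinct user_id values per site_id, restricted to the BDV rows.
demo_q1_users_per_site = (
    demo_q1[demo_q1['country_id'] == 'BDV']
    .groupby('site_id')['user_id']
    .nunique()
)
demo_q1_top_site = demo_q1_users_per_site.idxmax()        # site label with the most unique users
demo_q1_top_count = int(demo_q1_users_per_site.max())     # that site's unique-user count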
# Between 2019-02-03 00:00:00 and 2019-02-04 23:59:59
# there are four users who visited a certain site more than 10 times.
# Find these four users and the sites each of them visited more than 10 times.
# (Simply provide four triples in the form (user_id, site_id, number of visits) in the box below.)
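# Both endpoints are inclusive; .copy() avoids writing to a slice of df.
df_time = df[(df['ts'] >= '2019-02-03 00:00:00') & (df['ts'] <= '2019-02-04 23:59:59')].copy()
df_time_by_user_site = df_time.groupby(['user_id', 'site_id'])
# Number of visits for each (user_id, site_id) pair, repeated on every row of that pair.
df_time['views'] = df_time_by_user_site['site_id'].transform('count')
users_views_site_over_10 = df_time[df_time['views'] > 10]
# Drop the columns that are not part of the requested triples.
users_views_site_over_10 = users_views_site_over_10.drop(columns=['ts', 'country_id'])
answer_2 = users_views_site_over_10.drop_duplicates()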
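# An alternative sketch for the question above, assuming the same column names
# (ts, user_id, site_id). The rows and the demo_q2_* names are made up for
# illustration. groupby(...).size() counts visits per (user_id, site_id) pair
# directly, so no helper column or drop_duplicates() is needed.
import pandas as pd

demo_q2 = pd.DataFrame({
    'ts': pd.to_datetime(
        ['2019-02-03 00:00:00'] * 11 + ['2019-02-04 12:00:00'] * 3 + ['2019-02-05 00:00:00']
    ),
    'user_id': ['u1'] * 11 + ['u2'] * 3 + ['u1'],
    'site_id': ['A'] * 15,
})
# Series.between() includes both endpoints by default.
demo_q2_window = demo_q2[demo_q2['ts'].between('2019-02-03 00:00:00', '2019-02-04 23:59:59')]
demo_q2_visits = (
    demo_q2_window.groupby(['user_id', 'site_id'])
    .size()
    .reset_index(name='views')
)
# (user_id, site_id, number of visits) triples with more than 10 visits.
demo_q2_triples = [
    tuple(row)
    for row in demo_q2_visits[demo_q2_visits['views'] > 10].itertuples(index=False)
]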
# For each site, compute the number of unique users whose last visit was to that site.
# For instance, user "LC3561"'s last visit is to "N0OTG" based on timestamp data.
# Based on this measure, what are the top three sites?
# (Hint: site "3POLC" is ranked 5th, with 28 users whose last visit in the data set was to 3POLC.)
# Simply provide three pairs in the form (site_id, number of users).
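# Sort by timestamp so that "last" means the latest visit in time, not the last row in the file.
users_last_visit = df.sort_values('ts').groupby('user_id').last()
# value_counts() sorts in descending order, so the first three entries are the top three sites.
answer_3 = users_last_visit['site_id'].value_counts().head(3)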
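# A small sketch of why timestamp order matters here, on made-up rows that are
# deliberately not sorted by ts (the demo_q3_* names are hypothetical).
# idxmax() on ts picks each user's latest row regardless of file order.
import pandas as pd

demo_q3 = pd.DataFrame({
    'ts': pd.to_datetime(['2019-02-02', '2019-02-01', '2019-02-01', '2019-02-03', '2019-02-01']),
    'user_id': ['u1', 'u1', 'u2', 'u2', 'u3'],
    'site_id': ['A',  'B',  'B',  'A',  'C'],
})
# Row label of the latest visit for each user.
demo_q3_last_rows = demo_q3.loc[demo_q3.groupby('user_id')['ts'].idxmax()]
# Number of users whose latest visit was to each site, largest first.
demo_q3_counts = demo_q3_last_rows['site_id'].value_counts()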
# For each user, determine the first site he/she visited and the last site he/she visited.
# Compute the number of users whose first/last visits are to the same website. What is the number?
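# Sort by timestamp so first()/last() follow visit order rather than file order.
df_by_time = df.sort_values('ts')
users_first_visit = df_by_time.groupby('user_id').first()
users_last_visit = df_by_time.groupby('user_id').last()
# Stack each user's first and last site; a (user_id, site_id) pair that appears twice means first == last.
combined_first_last_visit = pd.concat((users_first_visit['site_id'], users_last_visit['site_id']))
reset_combined_first_last_visit = combined_first_last_visit.reset_index()
users_with_same_first_last = reset_combined_first_last_visit[reset_combined_first_last_visit.duplicated()]
# len() gives a single number; count() would return one count per column.
answer_4 = len(users_with_same_first_last)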
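# A more direct sketch of the first-versus-last comparison, on made-up rows
# (the demo_q4_* names are hypothetical): aggregate both values per user in one
# pass and compare the two columns.
import pandas as pd

demo_q4 = pd.DataFrame({
    'ts': pd.to_datetime(['2019-02-01', '2019-02-02', '2019-02-03', '2019-02-01', '2019-02-02']),
    'user_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'site_id': ['A',  'B',  'A',  'A',  'B'],
})
# One row per user with the chronologically first and last site_id.
demo_q4_first_last = (
    demo_q4.sort_values('ts')
    .groupby('user_id')['site_id']
    .agg(['first', 'last'])
)
# Number of users whose first and last visits were to the same site.
demo_q4_same = int((demo_q4_first_last['first'] == demo_q4_first_last['last']).sum())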