Created April 7, 2021 01:36
Attributed Github Dataset
-- find the organizations each user is a member of
select user_id, group_concat(org_id order by org_id separator ',')
from organization_members
group by user_id;
select * from followers;
-- find the programming languages a user has been using (only as an owner)
-- contains 19M rows; 10M of them are not NULL
select owner_id, group_concat(distinct language order by language separator ',')
from projects
where owner_id != -1
group by owner_id;
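To make the shape of this query's result concrete, here is a minimal sketch that runs an equivalent aggregation on a toy table. It swaps in SQLite (Python's built-in `sqlite3`) rather than MySQL, so the `order by ... separator` clause is dropped and the languages are sorted in Python instead. The table contents are made up for illustration and are not taken from the dataset.

```python
import sqlite3

# Toy stand-in for the projects table (hypothetical rows, not real data).
conn = sqlite3.connect(":memory:")
conn.execute("create table projects (owner_id integer, language text)")
conn.executemany(
    "insert into projects values (?, ?)",
    [(1, "Python"), (1, "C"), (1, "Python"), (2, None), (-1, "Go")],
)

# SQLite variant of the query: distinct languages per owner, comma-joined.
query = """
select owner_id, group_concat(distinct language)
from projects
where owner_id != -1
group by owner_id
"""

languages_by_owner = {}
for owner_id, languages in conn.execute(query):
    # An owner whose projects all have a NULL language yields NULL here.
    languages_by_owner[owner_id] = sorted(languages.split(",")) if languages else []

print(languages_by_owner)
```

Owners whose projects have no recorded language come back as NULL, which is one plausible reading of the "not NULL" count in the comment above.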
import sys

from sklearn.preprocessing import MultiLabelBinarizer as MLB

# Read tab-separated "user<TAB>lang1,lang2,..." lines from stdin
mlb = MLB()
lines = sys.stdin.read().splitlines()
feats = []
users = []
for line in lines:
    user, languages = line.split('\t')
    languages = set(languages.split(','))
    feats.append(languages)
    users.append(user)

# Encode each user's language set as a binary vector, one column per language
feats = mlb.fit_transform(feats)
for feat, user in zip(feats, users):
    print('{} {}'.format(user, ' '.join(feat.astype(str))))
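As a rough illustration of what the script above prints, the following sketch reproduces the multi-hot encoding in plain Python, without scikit-learn. The helper `multi_hot` and the three sample lines are hypothetical and exist only for this example; the column order assumes labels are sorted alphabetically, which is how `MultiLabelBinarizer` orders string classes.

```python
def multi_hot(label_sets):
    """Encode each set of labels as a 0/1 row over the sorted vocabulary."""
    vocab = sorted(set().union(*label_sets))
    index = {label: i for i, label in enumerate(vocab)}
    rows = []
    for labels in label_sets:
        row = [0] * len(vocab)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return vocab, rows


# Hypothetical input in the same "user<TAB>lang1,lang2" format the script reads.
sample = ["1\tC,Python", "2\tGo", "3\tC,Go,Python"]

users, label_sets = [], []
for line in sample:
    user, languages = line.split("\t")
    users.append(user)
    label_sets.append(set(languages.split(",")))

vocab, rows = multi_hot(label_sets)
for user, row in zip(users, rows):
    print(user, " ".join(map(str, row)))
```

Each output line is a user id followed by one 0/1 flag per language, so every user ends up with a feature vector of the same length.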