Gist jaredwinick/7522356, created November 18, 2013 04:12
Calculate the number of unique inbound and outbound links between subdomains.
Store the top 25 of each.
Top 25 subdomains by inbound link count (subdomain, count):
wordpress.org 2335856
youtube.com 2073535
gmpg.org 1784793
en.wikipedia.org 1545864
tumblr.com 1158767
twitter.com 1036611
google.com 798348
flickr.com 715872
rtalabel.org 657414
wordpress.com 646766
mp3shake.com 549122
w3schools.com 507184
domains.lycos.com 479711
staff.tumblr.com 478081
club.tripod.com 474609
creativecommons.org 469421
vimeo.com 433196
miibeian.gov.cn 432644
facebook.com 409346
phpbb.com 402859
livejournal.com 338299
deviantart.com 323751
forum.bluepink.ro 307289
bluepink.ro 307233
dictionarweb.com 307187
/*
Data from Subdomain Graph at http://webdatacommons.org/hyperlinkgraph/
Index: http://web.informatik.uni-mannheim.de/wdc/graph/2012/sd-index.gz
Arcs: http://web.informatik.uni-mannheim.de/wdc/graph/2012/sd-arc.gz
The GZIP files were split into multiple parts to increase parallelism
(gzip is not splittable, so each file is read by a single map task).
Calculate the number of unique inbound and outbound links between subdomains.
Store the top 25 of each.
*/
-- INDEX maps each subdomain name to its numeric node id; ARCS holds one directed link per row.
INDEX = LOAD '/data/commoncrawl/split/sd-index*' USING PigStorage('\t') AS (domain:chararray,id:int);
ARCS = LOAD '/data/commoncrawl/split/sd-arc*' USING PigStorage('\t') AS (origin:int,target:int);
-- Outbound: count arcs per origin node and keep the 25 largest counts.
OUTBOUND_LINKS = GROUP ARCS BY origin;
OUTBOUND_LINKS_COUNT = FOREACH OUTBOUND_LINKS GENERATE group, COUNT(ARCS) AS count:long;
OUTBOUND_LINKS_COUNT_SORTED = ORDER OUTBOUND_LINKS_COUNT BY count DESC;
TOP_OUTBOUND_LINKS = LIMIT OUTBOUND_LINKS_COUNT_SORTED 25;
-- Replicated join: the 25-row relation is listed last so it is the one held in memory.
TOP_OUTBOUND_LINKS_NAMED = JOIN INDEX BY id, TOP_OUTBOUND_LINKS BY group USING 'replicated';
-- Joined schema is (domain, id, group, count), so $0 is the name and $3 the count.
TOP_OUTBOUND_LINKS_NAMED_PROJ = FOREACH TOP_OUTBOUND_LINKS_NAMED GENERATE $0 AS domain:chararray,$3 AS count:long;
-- The join does not preserve order, so sort again before storing.
TOP_OUTBOUND_LINKS_NAMED_SORTED = ORDER TOP_OUTBOUND_LINKS_NAMED_PROJ BY count DESC;
STORE TOP_OUTBOUND_LINKS_NAMED_SORTED INTO '/data/commoncrawl/top_outbound_links' USING PigStorage('\t');
-- Inbound: the same pipeline, grouping by target node instead of origin.
INBOUND_LINKS = GROUP ARCS BY target;
INBOUND_LINKS_COUNT = FOREACH INBOUND_LINKS GENERATE group, COUNT(ARCS) AS count:long;
INBOUND_LINKS_COUNT_SORTED = ORDER INBOUND_LINKS_COUNT BY count DESC;
TOP_INBOUND_LINKS = LIMIT INBOUND_LINKS_COUNT_SORTED 25;
-- Replicated join: the 25-row relation is listed last so it is the one held in memory.
TOP_INBOUND_LINKS_NAMED = JOIN INDEX BY id, TOP_INBOUND_LINKS BY group USING 'replicated';
-- Joined schema is (domain, id, group, count), so $0 is the name and $3 the count.
TOP_INBOUND_LINKS_NAMED_PROJ = FOREACH TOP_INBOUND_LINKS_NAMED GENERATE $0 AS domain:chararray,$3 AS count:long;
-- The join does not preserve order, so sort again before storing.
TOP_INBOUND_LINKS_NAMED_SORTED = ORDER TOP_INBOUND_LINKS_NAMED_PROJ BY count DESC;
STORE TOP_INBOUND_LINKS_NAMED_SORTED INTO '/data/commoncrawl/top_inbound_links' USING PigStorage('\t');
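The group, count, top-N, and join steps of the script can be checked on a small sample without a Pig cluster. The following is a minimal Python sketch of the same logic, not part of the original gist. The sample `index` and `arcs` data, the `*.example` names, and the `top_links` helper are all hypothetical, and the sketch assumes every node id in the arcs also appears in the index.

```python
from collections import Counter

# Hypothetical sample data shaped like the sd-index and sd-arc files.
index = {0: "a.example", 1: "b.example", 2: "c.example", 3: "d.example"}  # id -> subdomain
arcs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]                   # (origin, target)


def top_links(arcs, index, column, n=25):
    """Count arcs per node on `column` (0 = origin, 1 = target),
    keep the n largest counts, and replace node ids with names."""
    counts = Counter(arc[column] for arc in arcs)  # GROUP ... BY + COUNT
    top = counts.most_common(n)                    # ORDER ... DESC + LIMIT
    return [(index[node], count) for node, count in top]  # JOIN with the index


if __name__ == "__main__":
    for name, count in top_links(arcs, index, column=0):
        print("outbound", name, count, sep="\t")
    for name, count in top_links(arcs, index, column=1):
        print("inbound", name, count, sep="\t")
```

Unlike the Pig script, this keeps everything in memory, so it is only suitable for spot-checking the logic on a sample of the graph.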
Top 25 subdomains by outbound link count (subdomain, count):
youtube.com 1317316
tripod.lycos.com 999651
serebella.com 758959
top20directory.com 691324
en.wikipedia.org 573991
botw.org 532654
dmoz.org 508522
refertus.info 479524
jcsearch.com 453339
tatu.us 398944
flickr.com 323416
blau-webkatalog.com 322286
freeseek.org 294543
actipages.net 255125
factbites.com 247715
search.refertus.info 241544
twitter.com 239051
sites.google.com 233946
rozhled.net 218232
free-press-release.com 199854
dir.yahoo.com 196882
twitaholic.com 181153
searchpixie.com 181117
linkagogo.com 179510
topiasearch.com 177709