@zuphilip
Created June 10, 2016 11:23
Analysis of URLs referenced in enwiki
240517 http://books.google.com
148920 https://books.google.com
143683 http://news.bbc.co.uk
104078 http://www.nytimes.com
100249 http://www.census.gov
85375 http://www.bbc.co.uk
62074 http://factfinder2.census.gov
51794 http://www.stat.gov.pl
47973 http://www.guardian.co.uk
43483 http://news.google.com
40215 http://www.billboard.com
39750 http://www.allmusic.com
38465 http://www.baseball-reference.com
38318 http://www.telegraph.co.uk
28273 http://query.nytimes.com
28050 http://www.washingtonpost.com
27916 http://www.imdb.com
26415 http://www.independent.co.uk
25375 http://www.theguardian.com
25088 https://news.google.com
23352 http://articles.latimes.com
22174 http://geonames.usgs.gov
21098 http://books.google.co.uk
20822 http://www.youtube.com
20467 http://www.ncbi.nlm.nih.gov
20171 http://www.amazon.com
18984 http://www.dailymail.co.uk
18178 http://www.mtv.com
17789 https://archive.org
17718 http://www.usatoday.com
17249 https://www.youtube.com
16969 http://www.cbc.ca
16967 http://www.abc.net.au
16642 http://espn.go.com
16615 http://sports.espn.go.com
16437 http://www.soccerbase.com
16359 http://www.reuters.com
16223 http://www.time.com
16059 http://www.cricketarchive.com
15399 http://www.smh.com.au
15383 http://www.metacritic.com
14971 http://www.rollingstone.com
14856 http://www.discogs.com
14732 http://www.archive.org
14471 http://www.portal.state.pa.us
14370 http://www.sports-reference.com
13950 http://pqasb.pqarchiver.com
13805 http://nla.gov.au
13754 http://www.huffingtonpost.com
13737 http://www.animenewsnetwork.com
13603 http://www.highbeam.com
13432 http://tvbythenumbers.zap2it.com
13154 http://www.gamespot.com
13048 http://www.digitalspy.co.uk
12259 http://www.cnn.com
12084 http://www.espncricinfo.com
12037 http://www.wwe.com
11453 http://www.hollywoodreporter.com
11113 http://www.forbes.com
11102 http://www.thehindu.com
11029 http://www.rsssf.com
10927 http://www.hindu.com
10917 http://select.nytimes.com
10910 http://www.ew.com
10760 http://books.google.ca
10371 https://itunes.apple.com
10256 http://www.variety.com
10228 http://pwtorch.com
10226 http://www.bloomberg.com
10217 http://timesofindia.indiatimes.com
10193 http://www.nba.com
10156 http://factfinder.census.gov
9888 http://www.latimes.com
9784 http://articles.timesofindia.indiatimes.com
9759 http://www.sfgate.com
9485 http://www.theage.com.au
9468 http://www.nzherald.co.nz
9362 http://www.boston.com
9360 http://www.ign.com
9358 http://www.uefa.com
9175 http://online.wsj.com
9112 http://www.rte.ie
9000 http://www.collectionscanada.gc.ca
8988 http://www.pro-football-reference.com
8657 http://www.timesonline.co.uk
8605 http://www.basketball-reference.com
8539 http://www.allmovie.com
8498 http://nl.newsbank.com
8453 http://www.nydailynews.com
8436 http://www.bizjournals.com
8328 http://www.independent.ie
8197 http://slam.canoe.ca
8168 http://www.officialcharts.com
8102 http://www.jstor.org
8070 http://www.npr.org
8054 http://www.nps.gov
7941 http://www.rottentomatoes.com
7794 http://www.nhc.noaa.gov
7700 http://www.fifa.com
7659 http://www.flightglobal.com
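# Download and unpack the TSV file from
# https://zenodo.org/record/55004#
# Extract the first scheme-and-host prefix (e.g. http://example.org) found on each line.
perl -wnE 'say $1 if /(https?:\/\/[^\/"]+)/' enwiki_2016-06-01_CS1_citations.tsv > enwiki-baseurls.txt
# Count occurrences of each prefix and list them in descending order of frequency.
sort enwiki-baseurls.txt | uniq -c | sort -n -r > enwiki-output.txt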
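The listing counts http:// and https:// (and www. and non-www.) variants of the same site separately, for example books.google.com and archive.org. A possible follow-up step, sketched here and not part of the original script, is to merge those variants before ranking. The function name merge_hosts is hypothetical, and treating a leading "www." as equivalent to the bare host is an assumption that may not hold for every site.

```shell
# Hypothetical helper (not in the original gist): reads "count url" lines,
# as written to enwiki-output.txt, and sums the counts per host.
merge_hosts() {
  awk '{
    host = $2
    sub(/^https?:\/\//, "", host)   # drop the scheme
    sub(/^www\./, "", host)         # assumption: www.X and X are the same site
    total[host] += $1
  }
  END { for (h in total) print total[h], h }' | sort -k1,1nr -k2,2
}

# Small demo using four lines taken from the listing above.
printf '%s\n' \
  '240517 http://books.google.com' \
  '148920 https://books.google.com' \
  '17789 https://archive.org' \
  '14732 http://www.archive.org' | merge_hosts
# prints:
# 389437 books.google.com
# 32521 archive.org
```

On the full data this would be run as: merge_hosts < enwiki-output.txt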