Created
January 19, 2011 14:57
-
-
Save edsu/786261 to your computer and use it in GitHub Desktop.
redis db of wikipedia externallink stats
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| # note, the links themselves are not in the db, just the stats about them | |
| # I didn't have enough memory, even w/ redis vm, to load the actual links too | |
| wget http://inkdroid.org/data/wikipedia-extlinks.rdb | |
| sudo aptitude install redis-server | |
| sudo mv wikipedia-extlinks.rdb /var/lib/redis/dump.rdb | |
| sudo chown redis:redis /var/lib/redis/dump.rdb | |
| sudo /etc/init.d/redis-server restart | |
| sudo pip install redis # or easy_install (version in apt is kinda old) | |
| ed@rorty:~$ python | |
| Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) | |
| [GCC 4.4.3] on linux2 | |
| Type "help", "copyright", "credits" or "license" for more information. | |
| >>> import redis | |
| >>> r = redis.Redis() | |
| >>> r.zrevrange("hosts", 0, 25, True) | |
| [('toolserver.org', 2360809.0), ('www.ncbi.nlm.nih.gov', 508702.0), ('dx.doi.org', 410293.0), ('commons.wikimedia.org', 408986.0), ('www.imdb.com', 398877.0), ('www.nsesoftware.nl', 390636.0), ('maps.google.com', 346997.0), ('books.google.com', 323111.0), ('news.bbc.co.uk', 214738.0), ('tools.wikimedia.de', 181215.0), ('edwardbetts.com', 168102.0), ('dispatch.opac.d-nb.de', 166322.0), ('web.archive.org', 165665.0), ('www.insee.fr', 160797.0), ('www.iucnredlist.org', 155620.0), ('stable.toolserver.org', 155335.0), ('www.openstreetmap.org', 154127.0), ('d-nb.info', 141504.0), ('ssd.jpl.nasa.gov', 137200.0), ('www.youtube.com', 133827.0), ('www.google.com', 131011.0), ('www.census.gov', 124182.0), ('www.allmusic.com', 117602.0), ('maps.yandex.ru', 114978.0), ('news.google.com', 102111.0), ('amigo.geneontology.org', 95972.0)] | |
| >>> r.zrevrange("hosts:edu", 0, 25, True) | |
| [('nedwww.ipac.caltech.edu', 28642.0), ('adsabs.harvard.edu', 25699.0), ('animaldiversity.ummz.umich.edu', 21747.0), ('www.perseus.tufts.edu', 20438.0), ('genome.ucsc.edu', 20290.0), ('cfa-www.harvard.edu', 14234.0), ('penelope.uchicago.edu', 9806.0), ('www.bucknell.edu', 8627.0), ('www.law.cornell.edu', 7530.0), ('biopl-a-181.plantbio.cornell.edu', 5747.0), ('ucjeps.berkeley.edu', 5452.0), ('plato.stanford.edu', 5243.0), ('www.fiu.edu', 5004.0), ('www.volcano.si.edu', 4507.0), ('calphotos.berkeley.edu', 4446.0), ('www.usc.edu', 4345.0), ('ftp.met.fsu.edu', 3941.0), ('web.mit.edu', 3548.0), ('www.lpi.usra.edu', 3497.0), ('insects.tamu.edu', 3479.0), ('www.cfa.harvard.edu', 3447.0), ('www.columbia.edu', 3260.0), ('www.yale.edu', 3122.0), ('www.fordham.edu', 2963.0), ('www.people.fas.harvard.edu', 2908.0), ('genealogy.math.ndsu.nodak.edu', 2726.0)] | |
| >>> r.zrevrange("tlds", 0, 25, True) | |
| [('com', 11368104.0), ('org', 7785866.0), ('de', 1857158.0), ('gov', 1767137.0), ('uk', 1489505.0), ('fr', 1173624.0), ('ru', 897413.0), ('net', 868337.0), ('edu', 793838.0), ('jp', 733995.0), ('nl', 707177.0), ('pl', 590058.0), ('it', 486441.0), ('ca', 408163.0), ('au', 387764.0), ('info', 296508.0), ('br', 276599.0), ('es', 242767.0), ('ch', 224692.0), ('us', 179223.0), ('at', 163397.0), ('be', 132395.0), ('cz', 92683.0), ('eu', 91671.0), ('ar', 89856.0), ('mil', 87788.0)] | |
| >>> r.zscore("hosts", "www.bbc.co.uk") | |
| Basically there are a few sorted sets in there: | |
| "hosts": all the hosts sorted by the number of externallinks | |
| "hosts:%s": where %s is a top level domain ("com", "uk", etc) | |
| "tlds": all the tlds sorted by the number of externallinks | |
| "wikipedia": the wikipedia langauges sorted by total number of externallinks | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment