@dpwiz
Created March 27, 2013 06:06
http://eax.me/scala-regular-expr/ -- RE and HTML just don't match!
>>> scrape("eax.me")[:5]
[['books-issue-2/', '33', '9.2%', '148', '8.5%', '0', '0.0%', '102', '6.4%'],
['/', '24', '6.7%', '121', '6.9%', '108', '6.2%', '108', '6.8%'],
['goodbye-freebsd/', '23', '6.4%', '115', '6.6%', '25', '1.4%', '136', '8.6%'],
['books-issue-1/', '23', '6.4%', '78', '4.5%', '58', '3.3%', '84', '5.3%'],
['scala-regular-expr/', '16', '4.5%', '3', '0.2%', '0', '0.0%', '1.6', '0.1%']]
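import requests
from pyquery import PyQuery

def scrape(domain):
    # Fetch the LiveInternet per-page statistics for the given domain.
    url = "http://www.liveinternet.ru/stat/%s/pages.html?per_page=100" % domain
    doc = PyQuery(requests.get(url).content)
    # a -> label -> td -> tr: climb from each page link to its table row, then take the row's cells.
    rows = (PyQuery(e).children() for e in doc("td > label > a").parent().parent().parent())
    # Drop the first cell of every row and keep the text of the remaining ones.
    return [[PyQuery(e).text() for e in row[1:]] for row in rows]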
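
The rows returned by scrape() are lists of strings. A minimal sketch of turning one such row into numbers is below. The helper name parse_row is hypothetical, and the layout (a path followed by alternating value and percentage cells) is an assumption read off the sample output above, not something the stats page documents here.

```python
def parse_row(row):
    """Convert one scraped row into (path, [(value, percent), ...])."""
    # Assumption: the first cell is the page path, the rest come in pairs.
    path, cells = row[0], row[1:]
    pairs = []
    # Even positions hold the absolute value, odd positions its share in percent.
    for value, percent in zip(cells[0::2], cells[1::2]):
        pairs.append((float(value), float(percent.rstrip("%"))))
    return path, pairs


# Last row of the sample output shown earlier.
sample = ['scala-regular-expr/', '16', '4.5%', '3', '0.2%', '0', '0.0%', '1.6', '0.1%']
print(parse_row(sample))
```

Values are parsed as float because the sample contains non-integer cells such as '1.6'. Cells in other formats (thousands separators, empty strings) would need extra handling.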