@paultag
Created December 31, 2012 20:24
Scrape the Debian PTS with hython
#!/usr/bin/env hython
; not actually clojure ...

(import "urllib2")
(import-from "lxml" "html")

(defn form-pts-url [developer]
  "Return the PTS URL for a given developer."
  (+ "http://qa.debian.org/developer.php?login=" developer))

(defn digest-page [url]
  "Take a URL and return an lxml.html page ready for scraping."
  (.fromstring html (.read (.urlopen urllib2 url))))

(defn scrape-page [page]
  "Scrape the PTS page into (package, bug count) pairs."
  (do
    (def xpath-base "//table[contains(@class, 'packagetable sortable')]")
    (zip
      (.xpath page (+ xpath-base "//a[@name]/text()"))
      (.xpath page (+ xpath-base "//td[2]//a[contains(@href, 'bugs.debian.org')]/text()")))))

(print (scrape-page (digest-page (form-pts-url "paultag"))))
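
For readers more used to plain Python, here is a rough sketch of the same idea: pair each package name with the text of its bugs link. It is not a translation of the script above. It uses the standard library's `xml.etree.ElementTree` instead of lxml, walks the table row by row instead of zipping two XPath result lists, and runs against `SAMPLE`, a made-up, well-formed stand-in for the page markup. The real page is ordinary HTML and would most likely need lxml, as in the original. `scrape_rows` and `SAMPLE` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page markup (an assumption).
SAMPLE = """
<html><body>
<table class="packagetable sortable">
  <tr><td><a name="fluxbox">fluxbox</a></td>
      <td><a href="http://bugs.debian.org/src:fluxbox">51</a></td></tr>
  <tr><td><a name="tox">tox</a></td>
      <td><a href="http://bugs.debian.org/src:tox">-</a></td></tr>
</table>
</body></html>
"""


def scrape_rows(markup):
    """Return (package, bug count text) pairs, one per table row."""
    root = ET.fromstring(markup.strip())
    pairs = []
    for table in root.iter("table"):
        # Only look at the package table, identified by its class attribute.
        if "packagetable" not in table.get("class", ""):
            continue
        for row in table.iter("tr"):
            cells = row.findall("td")
            if len(cells) < 2:
                continue
            # Package name: the named anchor in the first cell.
            name = cells[0].find(".//a[@name]")
            # Bug count: the link in the second cell that points at the BTS.
            bugs = next(
                (a for a in cells[1].iter("a")
                 if "bugs.debian.org" in a.get("href", "")),
                None,
            )
            if name is None or bugs is None:
                continue
            pairs.append((name.text, bugs.text))
    return pairs


print(scrape_rows(SAMPLE))
```

Going row by row keeps each name next to its own count even if some row were missing one of the two links, whereas zipping two independent lists would silently shift the pairs.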

paultag commented Dec 31, 2012

Returns:

[('desktop-base', '32'), ('dput-ng', '1'), ('fbautostart', '-'), ('fluxbox', '51'), ('liblicense', '2'), ('python-sunlight', '-'), ('python-validictory', '1'), ('clojurepy', '-'), ('gcc-python-plugin', '-'), ('datapacker', '1'), ('syslinux-themes-debian', '5'), ('alot', '-'), ('clearlooks-phenix-theme', '-'), ('feedgnuplot', '-'), ('gbemol', '1'), ('kic', '2'), ('liblastfm', '-'), ('njam', '2'), ('pavumeter', '-'), ('python-gnupg', '1'), ('sauerbraten-wake6', '-'), ('tcc', '9'), ('textedit.app', '1'), ('tmfs', '-'), ('tox', '-'), ('vifm', '-'), ('zgv', '12'), ('lastfm', '8')]
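
The counts come back as strings. A small Python sketch of one way to turn them into numbers follows. It assumes, without confirmation from the page itself, that `'-'` means no open bugs; `to_counts` is a hypothetical helper and `sample` is just the first four pairs from the output above.

```python
def to_counts(pairs):
    """Map package name -> integer bug count, treating '-' as zero (assumption)."""
    return {name: 0 if count == "-" else int(count) for name, count in pairs}


# First four pairs from the output above.
sample = [("desktop-base", "32"), ("dput-ng", "1"),
          ("fbautostart", "-"), ("fluxbox", "51")]

counts = to_counts(sample)
print(counts)
print(sum(counts.values()))  # 32 + 1 + 0 + 51 = 84
```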
