@lrvick
Created January 28, 2015 09:17
Simple Python RSS crawler using lxml
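from urllib.parse import urljoin

import lxml.html

# Sites to scan for an advertised RSS feed.
urls = (
    'http://lrvick.net',
    'http://zenhabits.net/',
    'http://theoatmeal.com/',
    'http://botd.wordpress.com/',
    'http://informantpodcast.com/',
    'http://moo.com',
)

for url in urls:
    # lxml fetches and parses the page directly from the URL.
    html = lxml.html.parse(url)
    feed = None
    title = None
    try:
        # Take the first <link type="application/rss+xml"> href, if any.
        feed = html.xpath('//link[@type="application/rss+xml"]/@href')[0]
        # Resolve relative hrefs against the page URL; absolute ones pass through.
        feed = urljoin(url, feed)
    except IndexError:
        # The page does not advertise an RSS feed.
        pass
    # Some pages have no <title> element, so guard against None.
    title_element = html.find('.//title')
    if title_element is not None:
        title = title_element.text
    print("%s | %s | %s" % (url, title, feed))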
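
The feed discovery step can also be tried without network access. The following is a minimal sketch, not part of the original gist: `find_feed` is a hypothetical helper name, and the sample page and `example.com` base URL are made up for illustration. It assumes Python 3 and that the feed is advertised through a `<link type="application/rss+xml">` element, as in the script above.

```python
from urllib.parse import urljoin

import lxml.html


def find_feed(page_html, base_url):
    """Return (title, feed_url) for an HTML string; either may be None.

    Hypothetical helper for illustration, not part of the original script.
    """
    doc = lxml.html.fromstring(page_html)
    # Collect every advertised RSS link and keep the first one, if present.
    hrefs = doc.xpath('//link[@type="application/rss+xml"]/@href')
    feed = urljoin(base_url, hrefs[0]) if hrefs else None
    # findtext returns None when the page has no <title> element.
    title = doc.findtext('.//title')
    return title, feed


# Made-up sample page with a root-relative feed link.
sample = (
    '<html><head><title>Example</title>'
    '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
    '</head><body></body></html>'
)
print(find_feed(sample, 'http://example.com/blog/'))
```

Using `urljoin` rather than string concatenation keeps relative and root-relative hrefs valid regardless of whether the page URL ends in a slash.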