@ncouture
Created November 6, 2016 13:34
This is a re-implementation of the spider from the Scrapy tutorial, using PyQuery instead of XPath selectors. The original tutorial is here: http://doc.scrapy.org/en/latest/intro/tutorial.html#our-first-spider
from scrapy.spider import BaseSpider
# Requires this patch:
# https://github.com/joehillen/scrapy/commit/6301adcfe9933b91b3918a93387e669165a215c9
from scrapy.selector import PyQuerySelector


class DmozSpiderPyQuery(BaseSpider):
    name = "pyquery"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        pq = PyQuerySelector(response)
        sites = pq('ul li')
        for site in sites:
            title = pq(site).find('a').text()
            link = pq(site).find('a').attr.href
            desc = pq(site).text()
            print title, link, desc
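The spider above only runs against a patched Scrapy, so here is a rough stand-in that shows what its `parse` loop extracts from each `ul li` element: the anchor text as the title, the first `href` as the link, and the full item text as the description. This sketch deliberately does not use PyQuery or Scrapy. It swaps in Python 3's standard-library `html.parser` so it can be run anywhere. The names `ListItemParser`, `extract_sites`, and `SAMPLE_HTML` are hypothetical, the sample markup is made up, and nested lists are not handled.

```python
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Collect (title, link, desc) for each <li> found inside a <ul>."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0    # > 0 while inside at least one <ul>
        self.in_li = False   # True while inside an <li> under a <ul>
        self.in_a = False    # True while inside an <a> within that <li>
        self.items = []      # finished (title, link, desc) tuples
        self._reset_item()

    def _reset_item(self):
        self.title_parts = []
        self.desc_parts = []
        self.link = None

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1
        elif tag == "li" and self.ul_depth > 0:
            self.in_li = True
            self._reset_item()
        elif tag == "a" and self.in_li:
            self.in_a = True
            if self.link is None:
                # Keep only the first link in the item.
                self.link = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.ul_depth = max(0, self.ul_depth - 1)
        elif tag == "a":
            self.in_a = False
        elif tag == "li" and self.in_li:
            self.in_li = False
            title = " ".join(self.title_parts)
            desc = " ".join(self.desc_parts)
            self.items.append((title, self.link, desc))

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or not self.in_li:
            return
        # The description is the whole item text, link text included.
        self.desc_parts.append(text)
        if self.in_a:
            self.title_parts.append(text)


def extract_sites(html):
    """Return a list of (title, link, desc) tuples, one per list item."""
    parser = ListItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up markup standing in for a fetched page.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/book">Example Book</a> - A made-up entry.</li>
  <li><a href="http://example.com/guide">Example Guide</a> - Another made-up entry.</li>
</ul>
"""

for title, link, desc in extract_sites(SAMPLE_HTML):
    print(title, link, desc)
```

As in the spider, the description repeats the title because it is taken from the text of the whole list item.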