@orangain
Last active March 16, 2016 08:28
Testing non-ASCII URLs with simple spiders.

In my environment, these spiders work well.

Environment:

(venv) $ python -V
Python 3.4.2
(venv) $ scrapy version
Scrapy 1.1.0rc3

A spider extracting non-ASCII URLs using a CSS selector (wikipedia_spider.py, listed at the bottom):

(venv) $ scrapy runspider wikipedia_spider.py
2016-03-16 16:52:21 [scrapy] INFO: Scrapy 1.1.0rc3 started (bot: scrapybot)
2016-03-16 16:52:21 [scrapy] INFO: Overridden settings: {}
2016-03-16 16:52:21 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2016-03-16 16:52:21 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-03-16 16:52:21 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-03-16 16:52:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-16 16:52:21 [scrapy] INFO: Spider opened
2016-03-16 16:52:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-16 16:52:22 [scrapy] DEBUG: Crawled (200) <GET https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8> (referer: None)
2016-03-16 16:52:22 [scrapy] DEBUG: Filtered duplicate request: <GET https://ja.wikipedia.org/wiki/%E5%B0%91%E5%B9%B4%E4%BF%9D%E8%AD%B7%E6%89%8B%E7%B6%9A#.E9.9D.9E.E8.A1.8C.E5.B0.91.E5.B9.B4> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-03-16 16:52:22 [scrapy] DEBUG: Crawled (200) <GET https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6> (referer: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8)
2016-03-16 16:52:22 [scrapy] DEBUG: Scraped from <200 https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6>
{'title': 'Wikipedia:ウィキペディアについて'}
2016-03-16 16:52:23 [scrapy] DEBUG: Crawled (200) <GET https://ja.wikipedia.org/wiki/%E3%82%B9%E3%82%AB%E3%83%BC%E3%82%BA%E3%83%87%E3%83%BC%E3%83%AB%E7%94%B7%E7%88%B5> (referer: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8)
2016-03-16 16:52:24 [scrapy] DEBUG: Scraped from <200 https://ja.wikipedia.org/wiki/%E3%82%B9%E3%82%AB%E3%83%BC%E3%82%BA%E3%83%87%E3%83%BC%E3%83%AB%E7%94%B7%E7%88%B5>
{'title': 'スカーズデール子爵'}
2016-03-16 16:52:25 [scrapy] DEBUG: Crawled (200) <GET https://ja.wikipedia.org/wiki/%E9%80%A3%E5%90%88%E7%8E%8B%E5%9B%BD%E8%B2%B4%E6%97%8F> (referer: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8)
2016-03-16 16:52:25 [scrapy] DEBUG: Scraped from <200 https://ja.wikipedia.org/wiki/%E9%80%A3%E5%90%88%E7%8E%8B%E5%9B%BD%E8%B2%B4%E6%97%8F>
{'title': '連合王国貴族'}
...
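
As a quick sanity check that nothing is being mangled, the percent-encoded URLs in the log above decode back to the original Japanese titles. A minimal standard-library snippet (not part of the spiders):

from urllib.parse import unquote

url = 'https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8'
print(unquote(url))
# -> https://ja.wikipedia.org/wiki/メインページ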

A spider extracting non-ASCII URLs with LinkExtractor (wikipedia_crawl_spider.py, listed at the bottom):

(venv) $ scrapy runspider wikipedia_crawl_spider.py
2016-03-16 16:55:03 [scrapy] INFO: Scrapy 1.1.0rc3 started (bot: scrapybot)
2016-03-16 16:55:03 [scrapy] INFO: Overridden settings: {}
2016-03-16 16:55:03 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2016-03-16 16:55:03 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-03-16 16:55:03 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-03-16 16:55:03 [scrapy] INFO: Enabled item pipelines:
[]
2016-03-16 16:55:03 [scrapy] INFO: Spider opened
2016-03-16 16:55:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-16 16:55:04 [scrapy] DEBUG: Crawled (200) <GET https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8> (referer: None)
2016-03-16 16:55:04 [scrapy] DEBUG: Filtered offsite request to 'en.wikipedia.org': <GET https://en.wikipedia.org/wiki/Cilaos>
[...]
2016-03-16 16:55:04 [scrapy] DEBUG: Filtered offsite request to 'tr.wikipedia.org': <GET https://tr.wikipedia.org/wiki/>
2016-03-16 16:55:04 [scrapy] DEBUG: Filtered offsite request to 'www.mediawiki.org': <GET https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute>
2016-03-16 16:55:04 [scrapy] DEBUG: Crawled (200) <GET https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6> (referer: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8)
2016-03-16 16:55:05 [scrapy] DEBUG: Scraped from <200 https://ja.wikipedia.org/wiki/Wikipedia:%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6>
{'title': 'Wikipedia:ウィキペディアについて'}
2016-03-16 16:55:06 [scrapy] DEBUG: Crawled (200) <GET https://ja.wikipedia.org/wiki/1935%E5%B9%B4> (referer: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8)
2016-03-16 16:55:06 [scrapy] DEBUG: Scraped from <200 https://ja.wikipedia.org/wiki/1935%E5%B9%B4>
{'title': '1935年'}
2016-03-16 16:55:07 [scrapy] DEBUG: Crawled (200) <GET https://ja.wikipedia.org/wiki/%E3%83%B4%E3%82%A1%E3%82%A4%E3%83%9E%E3%83%AB%E5%85%B1%E5%92%8C%E5%9B%BD%E8%BB%8D> (referer: https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8)
2016-03-16 16:55:07 [scrapy] DEBUG: Scraped from <200 https://ja.wikipedia.org/wiki/%E3%83%B4%E3%82%A1%E3%82%A4%E3%83%9E%E3%83%AB%E5%85%B1%E5%92%8C%E5%9B%BD%E8%BB%8D>
{'title': 'ヴァイマル共和国軍'}
...
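
To look at LinkExtractor's handling of a non-ASCII href in isolation, here is a small standalone sketch. The HTML snippet is made up for illustration and is not taken from the crawl above; the extracted link shows whether the non-ASCII path is kept as-is or percent-encoded before it becomes a request:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Hypothetical page containing one non-ASCII internal link.
body = '<html><body><a href="/wiki/メインページ">main page</a></body></html>'
response = HtmlResponse(
    url='https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8',
    body=body.encode('utf-8'),
    encoding='utf-8',
)
for link in LinkExtractor(allow=r'/wiki/').extract_links(response):
    print(link.url)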

NOTE: These spiders are samples that use a well-known website for the sake of discussion. Please do not abuse them.

wikipedia_crawl_spider.py:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class WikipediaCrawlSpider(CrawlSpider):
    name = 'wikipedia_crawl'
    allowed_domains = ['ja.wikipedia.org']
    start_urls = [
        'https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8']
    # Follow internal /wiki/ links, skipping URLs that contain a fragment.
    rules = (
        Rule(LinkExtractor(allow=r'/wiki/', deny=r'#'), callback='parse_article'),
    )
    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # be polite: wait one second between requests
    }

    def parse_article(self, response):
        yield {'title': response.css('h1::text').extract_first()}
wikipedia_spider.py:

from scrapy import Spider, Request


class WikipediaSpider(Spider):
    name = 'wikipedia'
    allowed_domains = ['ja.wikipedia.org']
    start_urls = [
        'https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8']
    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # be polite: wait one second between requests
    }

    def parse(self, response):
        # Follow every internal /wiki/ link found via a CSS selector.
        for url in response.css('a::attr("href")').extract():
            if url.startswith('/wiki/'):
                yield Request(response.urljoin(url), callback=self.parse_article)

    def parse_article(self, response):
        yield {'title': response.css('h1::text').extract_first()}
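
For completeness, the same behaviour can be checked without any crawling. The sketch below assumes (as I understand Scrapy 1.1's behaviour) that Request normalizes its URL, so a raw non-ASCII URL should come out percent-encoded; verify against your own installation:

from scrapy import Request

req = Request('https://ja.wikipedia.org/wiki/メインページ')
print(req.url)
# Expected (if the normalization assumption holds):
# https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8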