So you've stumbled upon this bug, or gotten a message similar to the following?
```
2018-09-11 17:57:04 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: mac_scraper)
2018-09-11 17:57:04 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0dev0, Python 3.7.0 (default, Jun 29 2018, 20:13:13) - [Clang 9.1.0 (clang-902.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-09-11 17:57:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'mac_scraper', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'mac_scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mac_scraper.spiders']}
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-11 17:57:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-11 17:57:04 [scrapy.core.engine] INFO: Spider opened
2018-09-11 17:57:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/robots.txt> from <GET https://macupdate.com/robots.txt>
2018-09-11 17:57:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/> from <GET https://macupdate.com>
2018-09-11 17:57:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 1 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 2 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.macupdate.com/> (failed 3 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 114, in fetch
    result = threads.blockingCallFromThread(reactor, self._schedule, request, spider)
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 467, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
```
The issue is triggered by a server sending overly long header values, which Twisted's HTTP client fails to parse. This gist helps you work around the issue: all requests are proxied through `mitmproxy`, and a custom script strips the overly long headers from responses before they reach Scrapy, so the modified responses can be processed normally.
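For the curious, the `ValueError` in the log is a plain tuple-unpacking failure: HTTP header lines are parsed by splitting on the first colon, and a chunk that arrives without a colon yields one value where two are expected. A minimal illustration (simplified stand-in code, not Twisted's actual parser):

```python
# A well-formed header line splits into exactly two parts:
line = "Server: nginx"
name, value = line.split(":", 1)      # OK: ('Server', ' nginx')

# An overly long header value can leave the parser holding a chunk with
# no colon in it, and the same unpack then fails:
leftover = "A" * 100                  # stand-in for the header overflow
name, value = leftover.split(":", 1)  # ValueError: not enough values to
                                      # unpack (expected 2, got 1)
```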
To set up the workaround:

- Install `mitmproxy`.
- Configure `mitmproxy` to be able to proxy TLS connections. Refer to the `mitmproxy` documentation for this.
- Modify the `middlewares.py` file in your Scrapy project to include the following snippet (a configurable variant is sketched after this list):

  ```python
  class ProxyMiddleware(object):
      def process_request(self, request, spider):
          # Route every request through the local mitmproxy instance.
          request.meta['proxy'] = 'http://localhost:8080'
  ```

- Modify the `settings.py` file in your Scrapy project as follows:

  ```python
  DOWNLOADER_MIDDLEWARES = {
      # 350 < 400, so ProxyMiddleware sets request.meta['proxy'] before
      # HttpProxyMiddleware processes the request.
      'your_scraper.middlewares.ProxyMiddleware': 350,
      'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400
  }
  ```

- Start `mitmproxy` with the `header_remover.py` script (provided below): `mitmproxy -s header_remover.py`
- Execute `scrapy crawl your_scraper` as you would normally.
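The `middlewares.py` snippet above hard-codes the proxy address. If you'd rather keep it in `settings.py`, a configurable variant could look like the sketch below; `MITMPROXY_URL` is a made-up custom setting, not a built-in Scrapy one:

```python
class ProxyMiddleware(object):
    """Route every request through a local mitmproxy instance."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # Read the proxy address from settings, defaulting to
        # mitmproxy's standard listen address.
        return cls(crawler.settings.get('MITMPROXY_URL', 'http://localhost:8080'))

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url
```

With this variant you would add `MITMPROXY_URL = 'http://localhost:8080'` to `settings.py` and register the middleware in `DOWNLOADER_MIDDLEWARES` exactly as shown above.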