# Workaround for Scrapy issue #355 (Scrapy failure due to overly long headers)

## The issue

So you've stumbled upon this bug? Or you've gotten a message similar to the following?

```
2018-09-11 17:57:04 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: mac_scraper)
2018-09-11 17:57:04 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0dev0, Python 3.7.0 (default, Jun 29 2018, 20:13:13) - [Clang 9.1.0 (clang-902.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-09-11 17:57:04 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'mac_scraper', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'mac_scraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['mac_scraper.spiders']}
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-11 17:57:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-11 17:57:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-11 17:57:04 [scrapy.core.engine] INFO: Spider opened
2018-09-11 17:57:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/robots.txt> from <GET https://macupdate.com/robots.txt>
2018-09-11 17:57:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.macupdate.com/> from <GET https://macupdate.com>
2018-09-11 17:57:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.macupdate.com/robots.txt> (referer: None)
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 1 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:06 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.macupdate.com/> (failed 2 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
2018-09-11 17:57:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.macupdate.com/> (failed 3 times): [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/commands/shell.py", line 73, in run
    shell.start(url=url, redirect=not opts.no_redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 48, in start
    self.fetch(url, spider, redirect=redirect)
  File "/usr/local/lib/python3.7/site-packages/scrapy/shell.py", line 114, in fetch
    result = threads.blockingCallFromThread(reactor, self._schedule, request, spider)
  File "/usr/local/lib/python3.7/site-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
    result.raiseException()
  File "/usr/local/lib/python3.7/site-packages/twisted/python/failure.py", line 467, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.web._newclient.ResponseFailed: [<twisted.python.failure.Failure builtins.ValueError: not enough values to unpack (expected 2, got 1)>]
```

The issue is triggered by a server sending overly long header values. This gist helps you work around the issue.
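The error message in the log hints at the mechanism: a header parser that unpacks each `Name: value` line with a two-way split chokes on any line fragment that contains no colon, which is what it ends up seeing when a header value exceeds the parser's line-length limit. Here is a minimal sketch of the failure mode (illustrative only, not Twisted's actual code):

```python
def parse_header_line(line):
    # Naive "Name: value" parsing: raises ValueError on a colon-less line
    name, value = line.split(":", 1)
    return name.strip(), value.strip()

print(parse_header_line("Content-Type: text/html"))  # works fine

try:
    # A colon-less fragment, as left over from an overly long header value
    parse_header_line("A" * 20000)
except ValueError as exc:
    print(exc)  # not enough values to unpack (expected 2, got 1)
```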

## The workaround

The workaround simply proxies all requests through mitmproxy and uses a custom script to remove overly long headers from responses. The modified responses can then be processed by Scrapy without issue.

1. Install mitmproxy.
2. Configure mitmproxy to be able to proxy TLS connections. Refer to the mitmproxy documentation for this.
3. Modify the `middlewares.py` file in your Scrapy project to include the following snippet (a configurable variant is sketched after this list):

   ```python
   class ProxyMiddleware(object):
       def process_request(self, request, spider):
           # Route every request through the local mitmproxy instance
           request.meta['proxy'] = 'http://localhost:8080'
   ```

4. Modify the `settings.py` file in your Scrapy project as follows:

   ```python
   DOWNLOADER_MIDDLEWARES = {
       'your_scraper.middlewares.ProxyMiddleware': 350,
       'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
   }
   ```

5. Start mitmproxy with the `header_remover.py` script provided below: `mitmproxy -s header_remover.py`.
6. Execute `scrapy crawl your_scraper` as you would normally.
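If you'd rather not hard-code the proxy address, a variant of the middleware can read it from your project settings instead. This is only a sketch, not part of the original gist: the `PROXY_URL` setting name is an invented example, while `from_crawler` and `crawler.settings.get` are standard Scrapy APIs.

```python
class ProxyMiddleware(object):
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_URL is a made-up setting name for this example;
        # fall back to mitmproxy's default listen address if unset.
        return cls(crawler.settings.get('PROXY_URL', 'http://localhost:8080'))

    def process_request(self, request, spider):
        if self.proxy_url:
            request.meta['proxy'] = self.proxy_url
```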
The contents of `header_remover.py`:

```python
class RemoveOverlyLongHeaders:
    def __init__(self, max_size=16384):
        self.max_size = max_size

    def response(self, flow):
        # Collect the offending header names first, so we don't delete
        # entries from the header collection while iterating over it.
        too_long = [
            name for name, value in flow.response.headers.items()
            if len(value) > self.max_size
        ]
        for name in too_long:
            del flow.response.headers[name]


addons = [
    RemoveOverlyLongHeaders()
]
```
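A quick way to sanity-check the addon logic without running mitmproxy is to drive it with stand-in objects. The classes below are hypothetical test doubles, not mitmproxy's real flow types; they model only the attributes `RemoveOverlyLongHeaders` touches:

```python
class FakeResponse:
    def __init__(self, headers):
        self.headers = dict(headers)

class FakeFlow:
    def __init__(self, headers):
        self.response = FakeResponse(headers)

flow = FakeFlow({'X-Huge': 'A' * 20000, 'Content-Type': 'text/html'})
RemoveOverlyLongHeaders().response(flow)
assert 'X-Huge' not in flow.response.headers    # oversized header dropped
assert 'Content-Type' in flow.response.headers  # normal header kept
```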