Created June 15, 2014 03:26
DOWNLOAD_HANDLERS = {
    'http': 'myspider.socks5_http.Socks5DownloadHandler',
    'https': 'myspider.socks5_http.Socks5DownloadHandler',
}
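Both schemes are mapped because Scrapy chooses a download handler by the request URL's scheme, so mapping only 'http' would leave HTTPS requests on the default handler. A stdlib-only sketch of that lookup (`handler_for` is a hypothetical helper for illustration, not part of Scrapy's API):

```python
from urllib.parse import urlsplit

# The same mapping as in the settings snippet above.
DOWNLOAD_HANDLERS = {
    'http': 'myspider.socks5_http.Socks5DownloadHandler',
    'https': 'myspider.socks5_http.Socks5DownloadHandler',
}

def handler_for(url):
    # Hypothetical helper mimicking the scheme -> handler lookup that
    # Scrapy's downloader performs internally.
    scheme = urlsplit(url).scheme
    return DOWNLOAD_HANDLERS[scheme]
```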
from txsocksx.http import SOCKS5Agent
from twisted.internet import reactor
from scrapy.xlib.tx import TCP4ClientEndpoint
from scrapy.core.downloader.webclient import _parse
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent


class Socks5DownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = ScrapySocks5Agent(contextFactory=self._contextFactory,
                                  pool=self._pool)
        return agent.download_request(request)


class ScrapySocks5Agent(ScrapyAgent):

    def _get_agent(self, request, timeout):
        bindAddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            # Route the connection through the SOCKS5 proxy given in
            # request.meta['proxy'].
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                                               timeout=timeout,
                                               bindAddress=bindAddress)
            return SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
        # No proxy set: fall back to Scrapy's default agent.
        return self._Agent(reactor, contextFactory=self._contextFactory,
                           connectTimeout=timeout, bindAddress=bindAddress,
                           pool=self._pool)
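The SOCKS5 proxy address comes from `request.meta['proxy']`, from which `_parse` extracts the host and port used to build the `TCP4ClientEndpoint`. A stdlib-only sketch of that split (`split_proxy` is a hypothetical stand-in for `_parse`, and 1080 is simply the conventional SOCKS default port):

```python
from urllib.parse import urlsplit

def split_proxy(proxy_url, default_port=1080):
    # Hypothetical stand-in for scrapy's _parse: pull the proxy host and
    # port out of the URL stored in request.meta['proxy'].
    parts = urlsplit(proxy_url)
    return parts.hostname, parts.port or default_port
```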
Thanks for your code. It works for HTTP requests, but not for HTTPS. I traced the steps Scrapy takes when it fetches an HTTPS site, and found the code can be fixed by passing contextFactory=ScrapyClientContextFactory() to SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint). With that change, HTTPS requests work as well.
In socks5_http.py, add the import:

from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

and change line 24 to:

agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint, contextFactory=ScrapyClientContextFactory())
By the way, this does not work for SOCKS4: avoid SOCKS4Agent until some bugs in txsocksx are fixed.
I used your code and a problem came up. Scrapy's output is:

Do you have any idea how to fix it? Thank you very much!