This gist lists the use cases and details of using asyncio support in Scrapy.

- Use Python 3.7.
- Install asyncio using `pip install asyncio`.
- If you plan to use any asyncio framework, install it as well. For example, to use `aiohttp`, run `pip install aiohttp`.
- Due to a bug in Twisted, you have to use the development version of Twisted. Use this GitHub page to clone Twisted into another folder.
- Clone the new Scrapy repository from this GitHub repo.
- Create a virtual environment to install the new version of Scrapy locally: `virtualenv --python=python3.7 venv`.
- Run `. venv/bin/activate` to start the virtual environment.
- Move to the folder where you cloned Twisted and run `python3 setup.py install`.
- Move to the Scrapy folder and run `python3 setup.py install` to install Scrapy.
- After installing Scrapy, you are good to go with asyncio-based Scrapy.
Scrapy has traditionally used callback-based programming, so this GSoC project aimed to add support for async/await-based programming. This lets users await a Request and get the response on the same line, rather than assigning a callback for dealing with the response. The project has ensured that users are still able to use callback-based programming, so Scrapy now supports both syntaxes.
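For example, an entirely traditional, callback-only spider keeps working unchanged; a minimal sketch using only standard Scrapy features:

```python
import scrapy


class ClassicQuotesSpider(scrapy.Spider):
    """Plain callback-based spider; still valid alongside the new async/await syntax."""
    name = "quotes_classic"

    def start_requests(self):
        yield scrapy.Request('http://quotes.toscrape.com/page/1/',
                             callback=self.parse)

    def parse(self, response):
        # The response is delivered to this callback, the traditional way.
        for text in response.xpath('//span[@class="text"]/text()').extract():
            yield {'text': text}
```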
In Scrapy, we create a Spider class and start scraping by yielding the initial URLs from the spider method `start_requests`. We now define it as `async def start_requests(self)`, which in Python terms is an asynchronous generator. Just as previous versions of Python used plain generators to yield URLs here, this `start_requests` is an asynchronous generator. The list of URLs can itself be awaitable, coming from an external source, but in the simple case we iterate over predefined URLs and yield them through `scrapy.Request(url, callback=self.parse)`. Here we could yield the `scrapy.Request` and get the response there itself, but callbacks are still supported, so we assigned one.
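For instance, since `start_requests` is now an asynchronous generator, the URL list itself can be awaited before any Request is yielded. The sketch below assumes a hypothetical coroutine `get_start_urls()` standing in for an external source such as a database or an HTTP API:

```python
import scrapy


async def get_start_urls():
    # Hypothetical external source of URLs; in practice this could await
    # a database query, an HTTP call, a message queue, etc.
    return ['http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/']


class AwaitedUrlsSpider(scrapy.Spider):
    name = "awaited_urls"

    async def start_requests(self):
        # Await the URL list first, then yield Requests as usual.
        for url in await get_start_urls():
            yield scrapy.Request(url=url, callback=self.parse)

    async def parse(self, response):
        self.logger.info("visited %s", response.url)
```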
We can then create a callback method, in this case `async def parse(self, response)`. It accepts the response, and we can deal with it accordingly.
Scrapy provides three ways to make a request and receive the response (a sketch of all three appears after this list):

- The old callback-based request is still supported, so one can assign a callback and receive the response there.
- Yielding `scrapy.Request` in an `async def` method. One can simply yield the request and get its result, like `response = yield scrapy.Request(...)`. Note that this is supported only in methods using async/await syntax.
- Awaiting the `scrapy.Fetch.Fetch` method. It accepts a request and awaits the response as and when it is available. One can write `response = await Fetch(scrapy.Request(...))`.
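Put side by side, a spider on this asyncio-enabled fork could mix the three forms roughly as follows (a sketch; `Fetch` and receiving a response from a yielded Request are features of this fork, not of upstream Scrapy):

```python
import scrapy
from scrapy.Fetch import Fetch  # provided by the asyncio-enabled fork


class ThreeWaysSpider(scrapy.Spider):
    name = "three_ways"

    def start_requests(self):
        # 1. Classic: assign a callback that will receive the response later.
        yield scrapy.Request('http://quotes.toscrape.com/page/1/',
                             callback=self.parse)

    async def parse(self, response):
        # 2. Yield the Request inside an `async def` method and receive
        #    the response on the same line.
        next_page = yield scrapy.Request('http://quotes.toscrape.com/page/2/')

        # 3. Await scrapy.Fetch.Fetch to get the response when it is ready.
        humor = await Fetch(scrapy.Request('http://quotes.toscrape.com/tag/humor/'))
        self.logger.info("got %s and %s", next_page.url, humor.url)
```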
You can also create your own awaitable methods and objects and await them; refer to the Python documentation for more details.
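As a plain-Python illustration of this (independent of Scrapy), any coroutine, or any object that defines `__await__`, can be awaited:

```python
import asyncio


class Delay:
    """A minimal awaitable object: awaiting it sleeps, then returns a value."""

    def __init__(self, seconds, value):
        self.seconds = seconds
        self.value = value

    def __await__(self):
        # Delegate to a real coroutine so `await Delay(...)` works.
        return self._wait().__await__()

    async def _wait(self):
        await asyncio.sleep(self.seconds)
        return self.value


async def main():
    result = await Delay(0.1, "done")
    print(result)  # prints "done" after roughly 0.1 s


asyncio.run(main())  # Python 3.7+
```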
Users can easily use asyncio-based frameworks in a Scrapy spider. The following example demonstrates all of the above:
```python
import scrapy
from scrapy.Fetch import Fetch
import asyncio
import aiohttp


class QuotesSpider(scrapy.Spider):
    name = "quotes1"

    async def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    async def parse(self, response):
        links = [response.xpath('//@href').extract()[-1],
                 response.xpath('//@href').extract()[-2]]
        for link in links:
            request = scrapy.Request(url=link)
            res = await Fetch(request)  # Can use `yield request`
            await asyncio.sleep(2)

        print("Started the aiohttp module!!")
        conn = aiohttp.TCPConnector(verify_ssl=False)
        async with aiohttp.ClientSession(connector=conn) as session:
            html = await self.fetch(session, 'https://en.wikipedia.com')
            print(html)
        print("Completed the aiohttp module!!")

    async def fetch(self, session, url):
        async with session.get(url) as response:
            return await response.text()
```
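Assuming the spider is saved inside a regular Scrapy project (for example under the project's `spiders/` directory), it can be run in the usual way with `scrapy crawl quotes1`.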