This notebook is mainly inspired by Bo's tweet about using Jina Reader to follow a website's `sitemap.xml` for grounding checks.
Fetching a single webpage is easy, but what about fetching multiple pages, or even an entire site? A simple for-loop is straightforward but hardly the most efficient method. Firing off a full batch of parallel requests to the Reader API instead? Well, you'll hit the rate limit in no time.
This notebook demonstrates how to use `asyncio` and `asyncio.Semaphore` to query the Reader API efficiently: you get what you need without hitting the rate limit.
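Before diving into the real code, here is a minimal, self-contained sketch of the core pattern (the task count and limit are just placeholders): an `asyncio.Semaphore` caps how many coroutines run the guarded section at once, no matter how many tasks you schedule. Run it as a plain script, or `await main()` in a cell after the `nest_asyncio` setup below.

```python
import asyncio

async def limited_task(i, semaphore):
    # Only `limit` tasks may hold the semaphore at once;
    # the rest wait here until a slot frees up.
    async with semaphore:
        print(f"task {i} started")
        await asyncio.sleep(1)  # stand-in for a network request
        print(f"task {i} finished")

async def main(n_tasks=10, limit=3):
    semaphore = asyncio.Semaphore(limit)
    await asyncio.gather(*(limited_task(i, semaphore) for i in range(n_tasks)))

asyncio.run(main())
```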
!pip install aiofiles
!pip install nest_asyncio
We will use async inside the notebook, which requires the small tweak below: Jupyter already runs its own event loop, and `nest_asyncio` lets asyncio code run inside it.
import aiohttp
import asyncio
import aiofiles
import xml.etree.ElementTree as ET
import json
import nest_asyncio

nest_asyncio.apply()
progress_lock = asyncio.Lock()
Next we define three auxiliary functions:

- `fetch(session, url, headers, semaphore)`: fetches a URL with the given headers within the semaphore limit, ensuring the response has the expected content type.
- `fetch_content(session, url, semaphore, progress, total, api_key=None)`: fetches content from a URL in HTML, markdown, and the default format, updating the progress counter accordingly.
- `fetch_sitemap_urls(sitemap_url)`: retrieves all URLs listed in the sitemap XML at the given sitemap URL.
Finally, `fetch_all_content(sitemap_url, api_key)` is our main entry point. It fetches and processes content from all URLs in a sitemap, saving the results to a JSON file in the format below:

[
    {
        "url": ...,
        "html": ...,
        "markdown": ...,
        "default": ...
    }
]
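Once the run finishes, a quick way to inspect the saved file looks like this (a minimal sketch; `website.json` is the filename hard-coded in `fetch_all_content` below):

```python
import json

with open('website.json') as f:
    results = json.load(f)  # one record per URL in the sitemap

print(len(results), "pages fetched")
print(results[0]['url'])
print((results[0]['markdown'] or '')[:200])  # first 200 chars; values are None for skipped pages
```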
async def fetch(session, url, headers, semaphore):
    async with semaphore:
        async with session.get(url, headers=headers) as response:
            response.raise_for_status()  # Ensure we raise an error for bad responses
            content_type = response.headers.get('Content-Type', '')  # default to '' so a missing header doesn't crash the check
            if 'application/json' in content_type:
                return await response.json()
            else:
                raise aiohttp.ContentTypeError(
                    request_info=response.request_info,
                    history=response.history,
                    message=f"Unexpected content type: {content_type}"
                )
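One thing `fetch` deliberately does not handle is the rate limit itself: once the Reader API starts returning HTTP 429, `raise_for_status()` simply raises. If you'd rather retry than fail, a hypothetical wrapper like the one below (not part of the original notebook; the retry count and backoff schedule are arbitrary choices) is one way to do it:

```python
async def fetch_with_retry(session, url, headers, semaphore, max_retries=3):
    # Hypothetical helper: retry on 429 (Too Many Requests) with exponential backoff.
    for attempt in range(max_retries + 1):
        try:
            return await fetch(session, url, headers, semaphore)
        except aiohttp.ClientResponseError as e:
            if e.status == 429 and attempt < max_retries:
                await asyncio.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
                continue
            raise
```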
async def fetch_content(session, url, semaphore, progress, total, api_key=None):
headers_common = {
"Accept": "application/json",
}
if api_key:
headers_common["Authorization"] = f"Bearer {api_key}"
headers_html = headers_common.copy()
headers_html["X-Return-Format"] = "html"
headers_markdown = headers_common.copy()
headers_markdown["X-Return-Format"] = "markdown"
    try:
        url1 = f"https://r.jina.ai/{url}"
        # Full HTML before the filtering pipeline; consumes the MOST tokens!
        # Comment this out if you don't need it!
        response_html = fetch(session, url1, headers_html, semaphore)
        # HTML converted to markdown, but without smart filtering.
        # Comment this out if you don't need it!
        response_markdown = fetch(session, url1, headers_markdown, semaphore)
        # Default content behavior, as if you accessed https://r.jina.ai/url directly.
        # Comment this out if you don't need it!
        response_default = fetch(session, url1, headers_common, semaphore)
        html, markdown, default = await asyncio.gather(response_html, response_markdown, response_default)
        result = {
            'url': url,
            'default': default.get('data', {}).get('content'),
            'html': html.get('data', {}).get('html'),
            'markdown': markdown.get('data', {}).get('content'),
        }
except aiohttp.ContentTypeError as e:
print(f"Skipping URL due to content type error: {url}")
result = {
'url': url,
'default': None,
'html': None,
'markdown': None,
}
async with progress_lock:
progress['completed'] += 1
print(f"Completed {progress['completed']} out of {total} requests")
return result
async def fetch_sitemap_urls(sitemap_url):
async with aiohttp.ClientSession() as session:
async with session.get(sitemap_url) as response:
response.raise_for_status()
sitemap_xml = await response.text()
root = ET.fromstring(sitemap_xml)
urls = [elem.text for elem in root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}loc")]
return urls
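If you want to sanity-check the sitemap before fetching everything, you can call `fetch_sitemap_urls` on its own; thanks to `nest_asyncio`, a bare `await` works directly in a cell:

```python
urls = await fetch_sitemap_urls("https://jina.ai/sitemap.xml")
print(f"{len(urls)} URLs in the sitemap")
print(urls[:5])  # peek at the first few entries
```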
async def fetch_all_content(sitemap_url, api_key=None, max_concurrency=5):
urls = await fetch_sitemap_urls(sitemap_url)
total_urls = len(urls)
progress = {'completed': 0}
    semaphore = asyncio.Semaphore(max_concurrency)  # Limit the number of concurrent tasks to max_concurrency
async with aiohttp.ClientSession() as session:
tasks = [fetch_content(session, url, semaphore, progress, total_urls, api_key) for url in urls]
results = await asyncio.gather(*tasks)
async with aiofiles.open('website.json', 'w') as f:
await f.write(json.dumps(results, indent=4))
# Run w/ & w/o API Key
The Reader API works with or without an API key. However, the rate limit with a key is **200 requests per minute (RPM)**, whereas without one you only get **20 RPM**. Since an API key is free to get from the website and every new key comes with 1 million free tokens, it makes more sense to use one! Please check https://jina.ai/reader#apiform to get your API key.
## Without API Key
Let's first run the code without an API key, using this random innocent website `jina.ai` and its sitemap. Hopefully their webmaster won't complain about it.
sitemap_url = "https://jina.ai/sitemap.xml"
await fetch_all_content(sitemap_url, max_concurrency=3)
Hmm, we only ran 7 requests and already hit the rate limit? I just mentioned that without an API key, the rate limit is 20 RPM. What's going on here?
If you look into our `fetch_content` function, you'll find it internally calls the Reader API three times:
```python
html, markdown, default = await asyncio.gather(response_html, response_markdown, response_default)
```

So each page costs 3 requests: 3 × 7 = 21, and that's why we hit the 20 RPM rate limit.
We can either remove the parallel calls we don't need or add an API key.
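For the first option, here's a minimal sketch of a slimmed-down variant (hypothetical name `fetch_default_only`) that makes a single Reader call per URL instead of three, so a 20 RPM budget covers 20 pages per minute rather than ~6:

```python
async def fetch_default_only(session, url, semaphore, progress, total, api_key=None):
    # Hypothetical slimmed-down fetch_content: only the default format, one call per URL.
    headers = {"Accept": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    try:
        default = await fetch(session, f"https://r.jina.ai/{url}", headers, semaphore)
        result = {'url': url, 'default': default.get('data', {}).get('content')}
    except aiohttp.ContentTypeError:
        print(f"Skipping URL due to content type error: {url}")
        result = {'url': url, 'default': None}
    async with progress_lock:
        progress['completed'] += 1
        print(f"Completed {progress['completed']} out of {total} requests")
    return result
```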
I just applied for a free 1M-token key from the Jina AI website and copied it below. However, by the time you run this notebook, all 1M tokens on this key have probably been consumed already, so you may need to get a new one.
Now that we have an API key, we can go bullish and set `max_concurrency=10`!
sitemap_url = "https://jina.ai/sitemap.xml"
await fetch_all_content(sitemap_url, api_key='jina_fd455547319d4057809186abfa89d22975L7a1mgzYgAXTcuHkfyYC433GTP', max_concurrency=10)
Hope this helps you use Jina Reader for better grounding!