You load a "base" URL, and if there are more than 100 results, you load subsequent pages until you have ALL the results.
You can accomplish this a number of ways, but two stick out for me:
- with recursion: scrape_it() sees there are more pages to scrape and calls itself with the next page
- with a while loop: you assume you might need multiple fetches and do all the work in a loop that keeps running as long as there is a next page
For either method, I mocked up some "sample" pages. I hope it's not too abstract; there's the data you really care about, reports, and some metadata that tells you there are more reports to be had, the next URL under the key 'next':
source = {
    'base_url': {'reports': [1, 2, 3, 4], 'next': 'next_url_1'},
    'next_url_1': {'reports': [5, 6, 7, 8], 'next': 'next_url_2'},
    'next_url_2': {'reports': [9, 10]},
}
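Each "fetch" here is just a dict lookup, and the last page simply has no 'next' key:

page = source['base_url']
print(page['reports'])                    # [1, 2, 3, 4]
print(page.get('next'))                   # 'next_url_1'
print(source['next_url_2'].get('next'))   # None -- the signal that you're done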
A successful ("complete") scrape of base_url will result in data being [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
Typically, recursion has something called a base case, which is the signal to not recurse any further. In your case, no "next" page is the base case.
def scrape_it(url):
    data = []
    page = source[url]
    next_url = page.get('next')
    for report in page['reports']:
        data.append(report)
    if not next_url:  # the "base case", don't recurse
        return data
    return data + scrape_it(next_url)
data = scrape_it('base_url')
print(data)
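To see how the recursion unrolls with the mock data above:

# scrape_it('base_url')     -> [1, 2, 3, 4] + scrape_it('next_url_1')
# scrape_it('next_url_1')   -> [5, 6, 7, 8] + scrape_it('next_url_2')
# scrape_it('next_url_2')   -> [9, 10]  (no 'next', the base case)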
Use a while True loop to keep trying for a next page. When no next page is found, you break out of the loop:
def scrape_it(url):
    data = []
    while True:
        page = source[url]
        for report in page['reports']:
            data.append(report)
        url = page.get('next')
        if not url:
            break  # no "next url", break out of the while-loop, and...
    return data  # ...you're done
data = scrape_it('base_url')
print(data)
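Both versions print [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Either works here, but with a very large number of pages the loop is the safer bet, since deep recursion will eventually hit Python's recursion limit. Against a real endpoint, the while-loop keeps its shape almost unchanged. A minimal sketch, assuming your API returns JSON with 'reports' and 'next' keys like the mock above; the URL and field names are placeholders, so adjust them to your actual pages:

import requests

def scrape_it(url):
    data = []
    while True:
        response = requests.get(url)   # real fetch instead of a dict lookup
        response.raise_for_status()    # fail loudly on a bad page
        page = response.json()         # assumes each page is JSON shaped like the mock
        data.extend(page['reports'])
        url = page.get('next')         # URL of the next page, or missing on the last one
        if not url:
            break
    return data

data = scrape_it('https://example.com/api/reports')  # hypothetical base URL
print(data)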