Two ways to get all (sub)pages

You load a "base" URL, and if there are more results than 100, you load subsequent pages till you have ALL results.

You can accomplish this a number of ways, but two stick out for me:

  • with recursion: scrape_it() sees there are more pages to scrape and calls itself with the next page
  • with a while loop: you assume you might need multiple fetches and do all the work in a loop that continues to run as long as there is a next page

For either method, I mocked up some "sample" pages. I hope it's not too abstract, and that you can see there's some data you really care about, reports, and some metadata that tells you there are more reports to be had, next_url:

source = {
    'base_url':   {'reports': [1, 2, 3, 4], 'next': 'next_url_1'},
    'next_url_1': {'reports': [5, 6, 7, 8], 'next': 'next_url_2'},
    'next_url_2': {'reports': [9, 10]},
}

A successful ("complete") scrape of base_url will result in data being [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
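Note that only the last page lacks a 'next' key. Both versions below rely on dict.get() returning None for a missing key; that None is the stop signal. A quick check against the mocked source above:

print(source['base_url'].get('next'))    # 'next_url_1' -- more pages to fetch
print(source['next_url_2'].get('next'))  # None -- no 'next' key, so we stop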

Recursion

Typically, recursion has something called a base case: the signal to not recurse any further. In your case, no "Next page" is the base case.

def scrape_it(url):
    data = []

    page = source[url]           # "fetch" the page
    next_url = page.get('next')  # None if this is the last page

    for report in page['reports']:
        data.append(report)

    if not next_url:  # the "base case", don't recurse
        return data

    return data + scrape_it(next_url)  # recurse: append the remaining pages' reports


data = scrape_it('base_url')
print(data)
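One design consideration, in case a site ever has a very deep chain of pages: CPython caps recursion depth (1,000 frames by default), so thousands of "next" pages would eventually raise RecursionError; the while-loop version below has no such limit. You can check the cap with:

import sys
print(sys.getrecursionlimit())  # typically 1000 by default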

While (there are "next" pages) loop

Use a while True loop to keep trying for a next page. When no next page is found, you break out of the loop:

def scrape_it(url):
    data = []

    while True:
        page = source[url]  # "fetch" the current page

        for report in page['reports']:
            data.append(report)

        url = page.get('next')  # None if this is the last page
        if not url:
            break  # no "next url", break out of while-loop, and...

    return data    # ...you're done


data = scrape_it('base_url') 
print(data)
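In a real scraper, the page = source[url] lookup becomes an HTTP request. Here's a minimal sketch of the while-loop version, assuming (hypothetically) that the API returns JSON with the same 'reports' and 'next' keys as the mock above:

import requests

def scrape_it(url):
    data = []

    while url:
        resp = requests.get(url)
        resp.raise_for_status()       # fail loudly on HTTP errors
        page = resp.json()            # assumes a JSON body shaped like the mock

        data.extend(page['reports'])  # collect this page's reports

        url = page.get('next')        # None/missing means this was the last page

    return data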