
@Smerity
Created June 23, 2015 01:05
Collect all URLs for NYTimes in the Common Crawl URL Index
import requests

# Common Crawl URL Index endpoints for the CC-MAIN-2015-18 crawl
show_pages = 'http://index.commoncrawl.org/CC-MAIN-2015-18-index?url={query}&output=json&showNumPages=true'
get_page = 'http://index.commoncrawl.org/CC-MAIN-2015-18-index?url={query}&output=json&page={page}'

query = 'nytimes.com/*'

# Ask the index how many pages of results exist for this query
show = requests.get(show_pages.format(query=query))
pages = show.json()['pages']

# Fetch each page and collect its lines (one JSON record per line),
# skipping empty lines so they don't inflate the unique count
results = set()
for i in range(pages):
    print('Getting page {} of {}'.format(i, pages))
    resp = requests.get(get_page.format(query=query, page=i))
    for line in resp.text.split('\n'):
        if line:
            results.add(line)

print('Total results for {query} is {num} unique lines'.format(query=query, num=len(results)))
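Each line the index endpoint returns is a standalone JSON record, so the collected lines can be turned into dicts for further processing. A minimal sketch, assuming newline-delimited JSON input; the `parse_index_lines` helper and the sample records are illustrative, not part of the gist:

```python
import json

def parse_index_lines(text):
    """Parse newline-delimited JSON records from an index response body.

    Hypothetical helper: each non-empty line is decoded as one JSON object.
    """
    return [json.loads(line) for line in text.split('\n') if line.strip()]

# Illustrative two-record sample in the shape the index returns
sample = (
    '{"urlkey": "com,nytimes)/", "timestamp": "20150427051941", '
    '"url": "http://www.nytimes.com/"}\n'
    '{"urlkey": "com,nytimes)/pages/world", "timestamp": "20150427052015", '
    '"url": "http://www.nytimes.com/pages/world/"}\n'
)

records = parse_index_lines(sample)
print([r['url'] for r in records])
# → ['http://www.nytimes.com/', 'http://www.nytimes.com/pages/world/']
```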