@olopsman
Created October 13, 2019 09:41
Sample Python regex
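import os
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

MAIN_URL = "http://books.toscrape.com/"
CATALOGUE = "catalogue/"
# Raw string so that "\." is a literal dot rather than an invalid escape.
PAGE_LINK = re.compile(r"page-[0-9]+\.html")


def crawl_url(page_url):
    """Save the page at MAIN_URL + page_url, then follow its "next" link."""
    with urlopen(MAIN_URL + page_url) as response:
        soup = BeautifulSoup(response, "html.parser")

    # Save the page under its own name; the site root is stored as index.html.
    file_name = page_url or "index.html"
    os.makedirs(os.path.dirname(file_name) or ".", exist_ok=True)
    with open(file_name, "w", encoding="utf-8") as html_file:
        html_file.write(str(soup))

    # Assumes the site wraps its forward link in <li class="next">. Searching
    # the whole page would hit the "previous" link first from page 2 onwards.
    pager = soup.find("li", class_="next")
    new_url = pager.find("a", href=PAGE_LINK) if pager else None
    if new_url is None:
        # A missing link is tested explicitly: subscripting None raises
        # TypeError, which the original "except AttributeError" never caught.
        print("Crawling finished")
        return

    href = new_url["href"]
    print(href)
    # Links on the front page include "catalogue/"; later pages omit it.
    crawl_url(href if CATALOGUE in href else CATALOGUE + href)


crawl_url("")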
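
To see what the regex does without touching the network, here is a minimal sketch that runs the pattern the crawler passes to soup.find against a few href strings. The pattern is written as a raw string here, and the sample hrefs are hypothetical values made up for illustration, not taken from the site. Beautiful Soup applies a compiled pattern with search(), so the sketch does the same.

```python
import re

# Same pattern the crawler uses, written as a raw string.
PAGE_LINK = re.compile(r"page-[0-9]+\.html")

# Hypothetical href values, chosen only to illustrate the matching behaviour.
samples = [
    "catalogue/page-2.html",  # match: search() finds the pattern mid-string
    "page-10.html",           # match: [0-9]+ accepts several digits
    "page-.html",             # no match: at least one digit is required
    "page-3xhtml",            # no match: "\." only matches a literal dot
    "index.html",             # no match: no "page-" prefix
]

matches = [href for href in samples if PAGE_LINK.search(href)]
print(matches)
```

Because search() is unanchored, the "catalogue/" prefix does not prevent a match, which is why the crawler checks for that prefix separately.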