@calthoff
Created December 4, 2019 03:29
import urllib.request

from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        # Download the page and parse it with the built-in HTML parser.
        response = urllib.request.urlopen(self.site)
        html = response.read()
        sp = BeautifulSoup(html, 'html.parser')
        # Print every link whose href contains "html" and save it to output.txt.
        with open("output.txt", "w") as f:
            for tag in sp.find_all('a'):
                url = tag.get('href')
                if url and 'html' in url:
                    print("\n" + url)
                    f.write(url + "\n")


Scraper('https://news.google.com/').scrape()