@Everfighting
Created July 18, 2019 03:10
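import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url, timeout=10)
res.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

# Every text node in the document, including ones a browser never renders.
# (string=True is the current name for the older text=True argument.)
text = soup.find_all(string=True)

# Text whose direct parent is one of these is dropped from the output.
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    # there may be more elements you don't want, such as "style", etc.
]

output = ''
for t in text:
    # Only the immediate parent is checked, not every ancestor.
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)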
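
The script above needs network access and two third-party packages. As a rough offline sketch of the same idea (keep text, drop what sits inside unwanted tags), the block below swaps in the standard library's `html.parser` instead of BeautifulSoup. The names `SKIPPED_TAGS`, `VisibleTextParser` and `visible_text` are made up for this illustration and are not part of the gist. Unlike the gist, which checks only a text node's direct parent, this version skips the whole subtree of a listed tag. It also assumes the listed tags always have a closing tag, so void elements such as `meta` or `input` should not be added to the set.

```python
from html.parser import HTMLParser

# Hypothetical default set: tags whose entire contents are dropped.
# Only list tags that have a closing tag (not void elements like <meta>).
SKIPPED_TAGS = {"script", "style", "noscript", "head"}


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside any skipped tag."""

    def __init__(self, skipped=SKIPPED_TAGS):
        super().__init__()
        self.skipped = set(skipped)
        self.depth = 0    # greater than 0 while inside a skipped tag
        self.chunks = []  # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.skipped:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skipped and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep the fragment only when outside every skipped tag.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html, skipped=SKIPPED_TAGS):
    """Return the kept text fragments joined by single spaces."""
    parser = VisibleTextParser(skipped)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `visible_text(res.text)` on the response fetched above would be one way to try it on the same page. Real-world HTML with unclosed tags may behave differently from the BeautifulSoup version.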