@bobquest33
Created June 4, 2017 06:55
Extracting email IDs from an HTML page using Beautiful Soup, html2text, and a regular expression (example page: https://bigdatacv.com/currentjobs/)
import requests
from bs4 import BeautifulSoup
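# Fetch the page; fail early on HTTP errors instead of parsing an error page.
r = requests.get("https://bigdatacv.com/currentjobs/", timeout=10)
r.raise_for_status()
content = r.text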
soup = BeautifulSoup(content, 'html.parser')
print(soup.prettify())
# Drop tags whose contents are not wanted in the text dump.
# Note: removing <a> also discards any address that appears only inside a link.
for tag in soup(['script', 'style', 'img', 'a']):
    tag.extract()
print(soup.text)
import html2text
text = html2text.html2text(soup.prettify())
print(text)
import re
# Code from https://github.com/fredericpierron/extract-email-from-text-python-3/
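# Raw strings keep the regex backslashes (\/, \., \s) intact and avoid
# invalid-escape warnings on recent Python versions.
regex = re.compile(
    r"([a-z0-9!#$%&*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"
)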
def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Skipping matches that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://user@example.com' as '//user@example.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))
# The pattern only lists lowercase letters, so lowercase the text before matching.
for email in get_emails(text.lower()):
    print(email)
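
A minimal offline sketch of the regex step, assuming the same pattern as above applied to a made-up sample string rather than the scraped page. The names `EMAIL_RE` and `find_emails` and the sample addresses are illustrative and not part of the original gist. It lowercases the input because the pattern lists only `a-z`, and it shows both the " at " / " dot " obfuscated form and the `//` filter described in the comment above.

```python
import re

# Same pattern as in the gist, written as raw strings so the backslashes
# reach the regex engine unchanged.
EMAIL_RE = re.compile(
    r"([a-z0-9!#$%&*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"
)


def find_emails(s):
    """Return matched addresses in s as a list (hypothetical helper)."""
    # findall yields one tuple per match; element 0 is the whole address.
    matches = EMAIL_RE.findall(s.lower())
    # A URL such as http://user@example.com matches as '//user@example.com',
    # so matches starting with '//' are discarded entirely.
    return [m[0] for m in matches if not m[0].startswith('//')]


# Made-up sample text; no network access is needed.
sample = ("Contact Jobs@Example.com or hr at example dot org. "
          "See http://user@example.com/path")
found = find_emails(sample)
print(found)  # ['jobs@example.com', 'hr at example dot org']
```

Note that the `//` filter drops the whole match instead of recovering the address embedded in the URL, and that obfuscated matches are returned as written ("hr at example dot org"), not normalized.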