@sengupta
Created January 16, 2012 18:46
Email Scraper

Simple Email Scraper

This gist contains two files, scraper.sh and scraper.py.

scraper.sh is useful for web pages where email addresses are visible on the rendered page. scraper.py is useful for web pages where email addresses appear anywhere in the HTML source (and is accordingly more expensive to run). Both scripts write the addresses they find to emails.csv.

Usage

./scraper.sh http://example.com

Or

./scraper.py http://example.com
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. You just DO WHAT THE FUCK YOU WANT TO.
#!/usr/bin/python
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, a copy of which is provided in the
# file LICENSE.txt.
# Enclose the line below in a loop to have it scrape over multiple pages of a site; a sketch of such a loop follows this script.
# This line currently scrapes one page to pull out emails.
import re
import sys
import urllib
url = urllib.urlopen(sys.argv[1])
response = url.read()
regex = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
emails = regex.findall(response)
with open('emails.csv', 'w+') as email_file:
    email_file.write('\n'.join(set(emails)))
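
The comment in scraper.py suggests enclosing the fetch in a loop to scrape multiple pages. A minimal sketch of that idea, assuming Python 2 (urllib.urlopen) like the script above and a hypothetical urls.txt listing one URL per line (the file name is illustrative, not part of the gist):

#!/usr/bin/python
# Sketch of the multi-page loop suggested in the comment above.
# Assumes a hypothetical urls.txt with one page URL per line.
import re
import urllib

regex = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
emails = set()

with open('urls.txt') as url_list:
    for line in url_list:
        page = line.strip()
        if not page:
            continue
        # Fetch each page and collect every address it contains.
        response = urllib.urlopen(page).read()
        emails.update(regex.findall(response))

with open('emails.csv', 'w+') as email_file:
    email_file.write('\n'.join(emails))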
#!/bin/bash
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, a copy of which is provided in the
# file LICENSE.txt.
# Enclose the line below in a loop to have it scrape over multiple pages of a site; a loop sketch follows this script.
# This line currently scrapes one page to pull out emails.
curl -s "$1" | sed 's/<[^>]*>//g' | sed -e 's/^[ \t]*//' | sed 's/&nbsp;//g' | grep -so "[[:alnum:]_.-]\+@[[:alnum:]_.-]\+" >> emails.csv
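
The same comment applies to scraper.sh. A minimal shell sketch, again assuming a hypothetical urls.txt with one URL per line:

#!/bin/bash
# Sketch: run scraper.sh once per URL in a hypothetical urls.txt,
# skipping blank lines.
while read -r url; do
    [ -n "$url" ] && ./scraper.sh "$url"
done < urls.txt

Because the pipeline appends (>>) to emails.csv, results accumulate across pages; sort -u emails.csv will dedupe the file afterwards if needed.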
@letronje

Does it work as expected for http://www.thoughtworks.com/contact-us ?

@sengupta (Author)

@letronje: Should work now.

@antoniotrento

Epic license!! Can I ask some questions? I'm looking to create a Python bot that reads links from a list such as a CSV file, finds email addresses and other keywords on each page, and writes a row with the scraped data to another CSV file. Are there Python libraries that can do this in my case?

I'm new to Python, so building something like this could take me a lot of time...

How much would a job like this cost a dev, in your opinion?

Thanks for your attention.

Antonio

@stefanpejcic

stefanpejcic commented Dec 19, 2019

# For Python 3

import re
import sys
import urllib.request

with urllib.request.urlopen(sys.argv[1]) as url:
    # read() returns bytes in Python 3; decode so the str pattern below matches
    response = url.read().decode('utf-8', errors='ignore')

regex = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

emails = regex.findall(response)
with open('emails.csv', 'w+') as email_file:
    email_file.write('\n'.join(set(emails)))
