@baileywickham
Last active May 6, 2021 05:59
wasup kelso
# https://architizer.com/sitemap-firms.xml
# Step 1: Download the link above as 'sitemap-firms.xml' and put it in the same directory as this Python script.
# Step 2: Run the script. It is hacked together and only handles the basics: it prints the first mailto: address
# found on each firm page. You can filter the list further if you want.
import re
import xml.etree.ElementTree as ET

import requests

# Capture the address after "mailto:", stopping at quotes, whitespace, '?', '<' or '>'.
email_regex = re.compile(r'mailto:([^"\'\s?<>]+@[^"\'\s?<>]+\.[^"\'\s?<>]+)')

tree = ET.parse('sitemap-firms.xml')
root = tree.getroot()

# Collect the text of every <loc> element. The tag carries a namespace prefix, so match on the suffix.
urls = []
for url in root:
    for item in url:
        if item.tag.endswith('loc'):
            urls.append(item.text)

for url in urls:
    rsp = requests.get(url)
    m = email_regex.findall(rsp.text)
    if not m:
        # no mailto: link on this page, so skip it instead of crashing on m[0]
        continue
    # the first match is probably the firm's contact email
    print(m[0])
baileywickham commented May 6, 2021

The link at the top is the first of a couple of "firms" pages. You can find them all at https://architizer.com/sitemap.xml

Just search for "firms" on that page and download each of the firms pages into the directory before running the script.
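
Since the script only opens 'sitemap-firms.xml', reading several downloaded files needs a small change. Here is a minimal sketch, assuming the extra files are saved with names matching `sitemap-firms*.xml` (that naming pattern and the helper name `collect_locs` are made up for illustration, not something the site or the script defines):

```python
import glob
import xml.etree.ElementTree as ET


def collect_locs(pattern="sitemap-firms*.xml"):
    """Return the text of every <loc> element in all sitemap files matching pattern.

    The default pattern is an assumed file-naming scheme, adjust it to your downloads.
    """
    urls = []
    for path in sorted(glob.glob(pattern)):
        root = ET.parse(path).getroot()
        # iter() walks the whole tree; tags are namespace-qualified, so match on the suffix
        for elem in root.iter():
            if elem.tag.endswith("loc") and elem.text:
                urls.append(elem.text.strip())
    return urls
```

The list returned by `collect_locs()` could then stand in for the `urls` list in the script above.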
