Last active
May 6, 2021 05:59
wasup kelso
# https://architizer.com/sitemap-firms.xml
# Step 1: Download the above link to the file 'sitemap-firms.xml' and put it in the same directory as this Python script.
# Step 2: Run the script. It's hacked together and only half works, but it covers the basics. You can filter
# the list further if you want.
import re
import xml.etree.ElementTree as ET

import requests

# raw string so the backslashes reach the regex engine unchanged
email_regex = r'mailto:\S+@\S+\.\S+'

tree = ET.parse('sitemap-firms.xml')
root = tree.getroot()

# collect every <loc> entry; the tags carry an XML namespace prefix, so compare the suffix only
urls = []
for url in root:
    for item in url:
        if item.tag[-3:] == 'loc':
            urls.append(item.text)

for url in urls:
    rsp = requests.get(url)
    m = re.findall(email_regex, rsp.text)
    if not m:
        # no mailto link on this page, skip it
        continue
    # the first match is probably the firm's contact address
    s = m[0]
    # cut the "mailto:" prefix and everything from the closing quote onward
    end = s.find('"')
    print(s[s.find(":") + 1:end if end != -1 else None])
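If you want to filter the list more, one option is to pull the address out with a capture group instead of slicing the match by hand. The following is only a sketch, not part of the original script: the `MAILTO_RE` pattern and the `extract_emails` helper are hypothetical names, and the pattern assumes addresses appear in `mailto:` links and end at a quote, `?`, angle bracket, or whitespace.

```python
import re

# Assumption: an address ends at the first quote, '?', angle bracket, or whitespace.
MAILTO_RE = re.compile(r'mailto:([^\s"\'?<>]+@[^\s"\'?<>]+\.[^\s"\'?<>]+)')


def extract_emails(html):
    """Return the unique mailto addresses found in html, in order of first appearance."""
    seen = []
    for address in MAILTO_RE.findall(html):
        if address not in seen:
            seen.append(address)
    return seen
```

Inside the download loop this could replace the slicing step, for example `for address in extract_emails(rsp.text): print(address)`.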
The link at the top is the first of several "firms" sitemap pages. You can find them all at https://architizer.com/sitemap.xml.
Search for "firms" on that page and download each of the firms pages into the directory before running the script. Note that the script only reads 'sitemap-firms.xml', so it has to be pointed at each downloaded file in turn.
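A rough sketch of how the script could read every downloaded firms page in one run. The helper names and the `sitemap-firms*.xml` filename pattern are assumptions for illustration; adjust the pattern to whatever you named the downloaded files.

```python
import glob
import xml.etree.ElementTree as ET


def locs_in(root):
    """Return the text of every <loc> element under root, ignoring XML namespaces."""
    urls = []
    for el in root.iter():
        # Tags look like '{namespace}loc', so keep only the part after '}'.
        if el.tag.rsplit('}', 1)[-1] == 'loc' and el.text:
            urls.append(el.text.strip())
    return urls


def urls_from_directory(pattern='sitemap-firms*.xml'):
    """Collect <loc> URLs from every sitemap file matching pattern (assumed naming)."""
    urls = []
    for path in sorted(glob.glob(pattern)):
        urls.extend(locs_in(ET.parse(path).getroot()))
    return urls
```

The list returned by `urls_from_directory()` can then stand in for the `urls` list that the script builds from the single file.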