Last active
November 20, 2020 06:57
dirty web scraping szfdc huarun4
Usage: install Anaconda, then run
python szfdc.py > hr4.txt
Each run scrapes one building. To scrape a different building, edit the urlz link in the script yourself.
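Since switching buildings means editing urlz by hand, a small helper can assemble the link from its two query parameters instead. This is only a sketch: `build_building_url` is a hypothetical name that does not appear in the script, and it assumes that other buildings use the same `building.aspx?id=...&presellid=...` pattern as the one hard-coded in szfdc.py.

```python
from urllib.parse import urlencode

# Base path taken from the URL hard-coded in szfdc.py.
BUILDING_PAGE = "http://zjj.sz.gov.cn/ris/bol/szfdc/building.aspx"


def build_building_url(building_id, presell_id):
    """Return the building page URL for the given ids (hypothetical helper).

    Assumes every building page takes the same two query parameters,
    'id' and 'presellid', as the example URL in the script.
    """
    query = urlencode({"id": building_id, "presellid": presell_id})
    return BUILDING_PAGE + "?" + query


# Reproduces the URL used in the script.
urlz = build_building_url(38063, 49373)
```

The resulting urlz can be dropped into the script in place of the literal string; the ids for another building still have to be looked up on the site by hand.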
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Building page to scrape; change this URL to scrape a different building.
urlz = "http://zjj.sz.gov.cn/ris/bol/szfdc/building.aspx?id=38063&presellid=49373"
htmlz = urlopen(urlz)
soupz = BeautifulSoup(htmlz, 'lxml')
# Collect every link whose href points at a 'housedetail' page.
all_links = soupz.find_all("a", href=re.compile('housedetail'))

# By default only the first 12 links are scraped.
# To scrape all of them, use: for link in all_links[:]:
for link in all_links[:12]:
    theurl = "http://zjj.sz.gov.cn/ris/bol/szfdc/" + link.get("href")
    # print(theurl)
    thehtml = urlopen(theurl)
    thesoup = BeautifulSoup(thehtml, 'lxml')
    tds = thesoup.find_all('td')
    # Pick cells by position and print them as one tab-separated row.
    mystr = "\t".join([tds[1].text.strip(), tds[11].text.strip(), tds[7].text.strip(), tds[15].text.strip(), tds[17].text.strip(), tds[19].text.strip(), tds[13].text.strip()])
    print(mystr)

# Finish by printing the source URL of the building page.
print(urlz)
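The script picks table cells by fixed position, so a detail page with fewer cells than expected would raise an IndexError and stop the run. The sketch below shows one possible way to make that step more tolerant. `detail_url` and `format_row` are hypothetical helpers, not part of the original script. The example assumes that the hrefs are relative paths such as `housedetail.aspx?id=...`, and it chooses, as a design decision, to fill missing cells with an empty string.

```python
from urllib.parse import urljoin

# Base path and cell positions copied from szfdc.py.
BASE_URL = "http://zjj.sz.gov.cn/ris/bol/szfdc/"
FIELD_INDICES = (1, 11, 7, 15, 17, 19, 13)


def detail_url(href, base=BASE_URL):
    """Resolve a link's href against the base path (hypothetical helper)."""
    return urljoin(base, href)


def format_row(cell_texts, indices=FIELD_INDICES, missing=""):
    """Join the selected cells with tabs (hypothetical helper).

    cell_texts is a list of strings, e.g. [td.text for td in tds].
    Positions past the end of the list are replaced by `missing`
    instead of raising IndexError.
    """
    picked = []
    for i in indices:
        text = cell_texts[i].strip() if i < len(cell_texts) else missing
        picked.append(text)
    return "\t".join(picked)
```

Inside the loop this would be called as `print(format_row([td.text for td in tds]))`, which yields the same row as the original join whenever all the indexed cells are present.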