Created January 28, 2015 13:56
spider
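import csv

import bs4
import requests

# The listing URL is split around the page number: <prefix><page><suffix>
url = []
url.append('http://www.ntust.edu.tw/files/40-1000-167-')
url.append('.php')

data = []
tmp = 'title', 'date', 'content'  # CSV header row
data.append(tmp)

for page in range(7):
    # str(page).join(url) places the page number between the prefix and the suffix
    response = requests.get(str(page).join(url))
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = bs4.BeautifulSoup(response.content, 'html.parser')
    newstitle = soup.select('.M39 .module-ptlist .h5 a')
    newsdate = soup.select('.M39 .module-ptlist .h5 .date')
    newscontent = soup.select('.message p')
    # Assumes the three lists are parallel; zip stops at the shortest one
    # instead of raising IndexError when the counts differ
    for title, date, content in zip(newstitle, newsdate, newscontent):
        # Dates are wrapped in square brackets on the page; drop them and trim whitespace
        tmp = title.text, date.text.replace('[', ' ').replace(']', ' ').strip(), content.text
        data.append(tmp)

# Python 3: newline='' is what the csv module expects; utf-8 keeps non-ASCII text intact
with open("news.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerows(data)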