Created December 14, 2011 00:04
imgscraper
# a simple image scraper by tlehman
# this code is too basic for me to care what you do with it, so have at it.
#
# note: these imports are for Python 2 and BeautifulSoup 3
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

# usage: getimg(url, filetype)
# returns a list of the src attributes of the img tags in the page
# referred to by url, keeping only those that end with filetype
def getimg(url, filetype):
    # get html source from url
    text = urlopen(url).read()
    # parse the html source using BeautifulSoup
    soup = BeautifulSoup(text)
    # set of image urls to be returned (a set drops duplicates)
    imgurls = set()
    for img in soup.findAll('img'):
        # skip img tags that have no src attribute
        s = img.get('src')
        # endswith also handles extensions that are not 3 characters long
        if s and s.endswith(filetype):
            imgurls.add(s)
    return list(imgurls)
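
The same idea can be sketched for Python 3 using only the standard library. This is not part of the original gist: it swaps BeautifulSoup for `html.parser.HTMLParser`, and the names `ImgSrcCollector` and `getimg_from_html` are made up for illustration. It also assumes the page decodes as UTF-8 and returns the matches sorted, which the original does not do.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects src values of <img> tags."""

    def __init__(self, filetype):
        super().__init__()
        self.filetype = filetype
        self.srcs = set()  # a set drops duplicate urls

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # skip img tags without a src, keep only the wanted extension
        if src and src.endswith(self.filetype):
            self.srcs.add(src)


def getimg_from_html(html, filetype):
    # parsing is kept separate from fetching so it can be tried offline
    collector = ImgSrcCollector(filetype)
    collector.feed(html)
    collector.close()
    return sorted(collector.srcs)


def getimg(url, filetype):
    # assumption: the page is UTF-8; undecodable bytes are replaced
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return getimg_from_html(html, filetype)


# offline usage example on an inline snippet of html
sample = '<img src="a.png"><img src="b.jpg"><img alt="no src"><img src="a.png">'
print(getimg_from_html(sample, "png"))  # ['a.png']
```

As in the original, the src values are returned exactly as written in the page, so relative urls are not resolved against the page url.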
This used to be a GitHub repository, but it was too small, so I made it into a gist.