We'll use Requests and Beautiful Soup, following the most recent Beautiful Soup 4 docs.
Install our tools (preferably in a new virtualenv):

    pip install beautifulsoup4
    pip install requests
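If you haven't set up a virtualenv before, here's a minimal sketch (the environment name scraping-env is just an example):

    pip install virtualenv
    virtualenv scraping-env
    source scraping-env/bin/activate   # on Windows: scraping-env\Scripts\activate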
Let's grab the Free Book Samplers from O'Reilly: http://oreilly.com/store/samplers.html.
    >>> import requests
    >>>
    >>> result = requests.get("http://oreilly.com/store/samplers.html")
Make sure we got a result.
    >>> result.status_code
    200
    >>> result.headers
    ...
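If you'd rather have your script fail loudly than eyeball the status code, Requests can do the check for you: raise_for_status() does nothing on a 200 and raises an HTTPError for a 4xx/5xx response.

    >>> result.raise_for_status()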
Store your content in an easy-to-type variable!
>>> c = result.content
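For the curious: result.content is the raw bytes of the response body, while result.text is the same body decoded to a string. Beautiful Soup will parse either. A quick comparison:

    >>> raw = result.content   # bytes, exactly as received
    >>> text = result.text     # the same body, decoded using the detected encoding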
Start parsing with Beautiful Soup. NOTE: If you installed with pip, you'll need to import from bs4. If you downloaded the source, you'll need to import from BeautifulSoup (which is what they do in the online docs).
If you create the soup without specifying a parser, e.g. soup = BeautifulSoup(c), Beautiful Soup will warn you:

    UserWarning: No parser was explicitly specified, so I'm using the best
    available HTML parser for this system ("html.parser"). This usually isn't a
    problem, but if you run this code on another system, or in a different
    virtual environment, it may use a different parser and behave differently.

    The code that caused this warning is on line 1 of the file <stdin>. To get
    rid of this warning, change code that looks like this:

     BeautifulSoup(YOUR_MARKUP})

    to this:

     BeautifulSoup(YOUR_MARKUP, "html.parser")

      markup_type=markup_type))

Passing the parser name explicitly, as we do below, silences the warning.
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(c, "html.parser")
    >>> samples = soup.find_all("a", "item-title")
    >>> samples[0]
    <a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
    Programming Perl
    </a>
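That second argument to find_all() is shorthand for matching the class attribute; if you prefer to be explicit, the keyword form returns the same list:

    >>> samples = soup.find_all("a", class_="item-title")
    >>> len(samples)   # number of sampler links matched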
Now, pick apart individual links.
    >>> data = {}
    >>> for a in samples:
    ...     title = a.string.strip()
    ...     data[title] = a.attrs['href']
    ...
Check out the keys/values in the data dict. Rejoice!
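For example, one quick way to eyeball what you collected:

    >>> for title, href in data.items():
    ...     print("%s: %s" % (title, href))
    ...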
Now go scrape some stuff!