Using Requests and Beautiful Soup, with the most recent Beautiful Soup 4 docs.
Install our tools (preferably in a new virtualenv):
pip install beautifulsoup4
pip install requests
Let's grab the Free Book Samplers from O'Reilly: http://oreilly.com/store/samplers.html.
>>> import requests
>>>
>>> result = requests.get("http://oreilly.com/store/samplers.html")
Make sure we got a result.
>>> result.status_code
200
>>> result.headers
...
Store your content in an easy-to-type variable!
>>> c = result.content
Start parsing with Beautiful Soup. NOTE: If you installed with pip, you'll import from bs4. If you downloaded the source, you'll import from BeautifulSoup (which is what the online docs do).
If you don't tell Beautiful Soup which parser to use, you'll see a warning like this:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file <stdin>. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP)

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")
>>> samples = soup.find_all("a", "item-title")
>>> samples[0]
<a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf"> Programming Perl </a>
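If the live page is down or has changed, you can still try out find_all offline. Here's a minimal sketch that parses a made-up snippet of markup (the tags and URLs below are hypothetical, just mimicking the samplers page structure); note that a string passed as the second argument to find_all filters on CSS class:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the live samplers page
html = """
<ul>
  <li><a class="item-title" href="http://example.com/a.pdf"> Book A </a></li>
  <li><a class="item-title" href="http://example.com/b.pdf"> Book B </a></li>
  <li><a class="other" href="http://example.com/c.pdf"> Not a sample </a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A string second argument is shorthand for filtering on class
samples = soup.find_all("a", "item-title")
print(len(samples))        # 2
print(samples[0]["href"])  # http://example.com/a.pdf
```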
Now, pick apart individual links.
>>> data = {}
>>> for a in samples:
... title = a.string.strip()
... data[title] = a.attrs['href']
Check out the keys/values in the data dict. Rejoice!
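The same title-to-URL loop can be tried offline too. This sketch uses a hypothetical one-link snippet: .string grabs the link text (which .strip() cleans up), and .attrs['href'] pulls the URL:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one sampler link on the live page
html = '<a class="item-title" href="http://example.com/perl.pdf"> Programming Perl </a>'
soup = BeautifulSoup(html, "html.parser")

data = {}
for a in soup.find_all("a", "item-title"):
    title = a.string.strip()       # link text, minus surrounding whitespace
    data[title] = a.attrs["href"]  # the href attribute holds the PDF URL

print(data)  # {'Programming Perl': 'http://example.com/perl.pdf'}
```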
Now go scrape some stuff!