bradmontgomery/ShortIntroToScraping.rst

Created February 21, 2012 02:00


Really short intro to scraping with Beautiful Soup and Requests


Web Scraping Workshop

Using Requests and Beautiful Soup, with the most recent Beautiful Soup 4 docs.

Getting Started

Install our tools (preferably in a new virtualenv):

pip install beautifulsoup4
pip install requests

Start Scraping!

Let's grab the Free Book Samplers from O'Reilly: http://oreilly.com/store/samplers.html.

>>> import requests
>>>
>>> result = requests.get("http://oreilly.com/store/samplers.html")

Make sure we got a result.

>>> result.status_code
200
>>> result.headers
...
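
The status check above can also be folded into a small helper. This is only a sketch: the name fetch_html and the 10-second timeout are assumptions made for illustration, not part of the workshop. It relies on raise_for_status(), which Requests provides for turning 4xx and 5xx responses into exceptions.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw response body for url, or raise if the request failed.

    fetch_html and the default timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so a failed download is not silently passed on to the parser.
    response.raise_for_status()
    return response.content
```

With this in place, c = fetch_html("http://oreilly.com/store/samplers.html") would either return the page bytes or fail loudly.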

Store your content in an easy-to-type variable!

>>> c = result.content

Start parsing with Beautiful Soup. NOTE: If you installed beautifulsoup4 with pip, you'll need to import from bs4. Code written for the older Beautiful Soup 3 imports from BeautifulSoup instead (you may still see this in older online docs).

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(c, "html.parser")
>>> samples = soup.find_all("a", "item-title")
>>> samples[0]
<a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
Programming Perl
</a>
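
To see what the second argument to find_all is doing, here is a minimal sketch that runs without a network connection. The HTML snippet and its paths are made up for illustration; only the item-title class comes from the example above. A bare string in that position is treated as a CSS class, so it should behave the same as passing class_ explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the samplers page.
html = """
<ul>
  <li><a class="item-title" href="/samplers/perl.pdf">Programming Perl</a></li>
  <li><a class="item-title featured" href="/samplers/python.pdf">Learning Python</a></li>
  <li><a class="nav" href="/store">Store</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# A string as the second positional argument filters on CSS class.
by_position = soup.find_all("a", "item-title")
# The same search written with the class_ keyword.
by_keyword = soup.find_all("a", class_="item-title")

# A tag matches if any one of its classes is "item-title".
titles = [a.get_text(strip=True) for a in by_position]
print(titles)
```

The second link carries two classes and is still matched, while the nav link is left out.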

Now, pick apart individual links.

>>> data = {}
>>> for a in samples:
...     title = a.string.strip()
...     data[title] = a.attrs['href']

Check out the keys/values in the data dict. Rejoice!
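
Here is a small offline sketch of the same loop, with a look at the resulting dict. The markup is invented for illustration. Each tag's attrs is a plain dict of its HTML attributes, so a.attrs['href'], a['href'] and a.get('href') all read the link target; only the last returns None when the attribute is missing. The sketch also swaps a.string for get_text, since a.string is None when a link contains nested tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one plain link, one with nested tags, one without href.
html = """
<a class="item-title" href="/samplers/perl.pdf">
Programming Perl
</a>
<a class="item-title" href="/samplers/python.pdf"><b>Learning</b> Python</a>
<a class="item-title">No Link Yet</a>
"""

soup = BeautifulSoup(html, "html.parser")

data = {}
for a in soup.find_all("a", "item-title"):
    # get_text joins every piece of text inside the tag, stripped of whitespace.
    title = a.get_text(" ", strip=True)
    # .get() returns None instead of raising KeyError when href is absent.
    href = a.get("href")
    if href is None:
        continue
    data[title] = href

# Walk the keys and values of the finished dict.
for title, href in sorted(data.items()):
    print(f"{title}: {href}")
```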

Now go scrape some stuff!

fedale commented Apr 9, 2013

Very nice introduction, thanks!

qhuang872 commented Dec 14, 2016

nice short intro

whitecat commented Dec 14, 2016

One question. What if the request is from a website that is loading something in the request? How can I get the request to get the loaded content?

For example request "https://github.com/aptana/Pydev"
and use lines = soup.findAll("span", { "class" : "num text-emphasized" })
The problem is contributor shows: "fetching contributors"

jasminecjc commented Apr 6, 2017

Good job! It helps me

redfast00 commented May 14, 2017

For some reason, using result.content is way slower when parsing in BeautifulSoup than using result.text. Any idea why?

trey commented Jun 24, 2017

Thank you, that helped me!

ebartan commented Jul 7, 2017

thx for share very useful start for soup

danhamill commented Jul 27, 2017

Thanks

Mutungi commented Aug 10, 2017

Great introduction thanks

michaelfangyao commented Sep 18, 2017

Good job bro

Renzo1 commented Oct 30, 2017

pls wat is the function of the 'attrs[]' in the last line of the above code

kevinprakasa commented Nov 9, 2017

thanks dude,

hMutzner commented Dec 15, 2017

Very good introduction. Thank you.
Sample site: http://oreilly.com/store/samplers.html does not exist any more

saif017 commented Jun 8, 2018

Tnx big bro

LeeJobs4Med commented Oct 24, 2018

The link is down. you can see a previous version at https://web.archive.org/web/20130209050253/http://oreilly.com/store/samplers.html .

davidxbuck commented Nov 24, 2018

Thanks. It works as intended if you change to the current sampler page: https://www.oreilly.com/free/

hsheikha1429 commented Feb 10, 2020

Well explained.
Thank you.

Jogwums commented Mar 6, 2020

Thanks Nice!
Found a way to export the array created to a csv file.

```python
import csv

import requests
from bs4 import BeautifulSoup

results = requests.get("https://www.oreilly.com/free/")

# check that the link is functional
print(results.status_code)

# view headers
print(results.headers)

c = results.content
# apply the web scraping library
soup = BeautifulSoup(c, "html.parser")

# find the HTML elements that hold each title
samples = soup.find_all("a", "item-title")

# loop over each link and store title -> URL in a dict
data = {}
for a in samples:
    title = a.string.strip()
    data[title] = a.attrs['href']

print(data)

# export to csv; csv.writer quotes titles that contain commas
with open('books.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for key, value in data.items():
        writer.writerow([key, value])
```

kannankumar commented May 6, 2020

great short intro with just the required pieces. 👍
