@jmw1040
Last active January 8, 2018 17:41
CMSC 122 - WebScraping
### CMSC 122
#### January 8, 2018
----------
### Beautiful Soup
#### Lecture 1 (Week 1 Fri, 2018-01-05)
##### urllib2:
We will use urllib2 (a package to import) to scrape websites. `urllib2.urlparse.urlparse(url)` splits a URL into its components:

    ParseResult(scheme='http',
                netloc='www.classes.cs.uchicago.edu',
                path=fullpath,
                params='',
                query='',
                fragment='')
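
The notes use Python 2's urllib2. As a minimal sketch for readers on Python 3, where the same function lives in `urllib.parse`, the snippet below parses a URL and reads each field of the result. Only the host name comes from the notes; the path, query, and fragment are made up for illustration.

```python
from urllib.parse import urlparse  # Python 3 home of urlparse

# Hypothetical URL: only the host name appears in the lecture notes.
url = "http://www.classes.cs.uchicago.edu/some/path/index.html?term=winter#top"

parts = urlparse(url)

print(parts.scheme)    # http
print(parts.netloc)    # www.classes.cs.uchicago.edu
print(parts.path)      # /some/path/index.html
print(parts.params)    # empty string: this URL has no params
print(parts.query)     # term=winter
print(parts.fragment)  # top
```

The result is a named tuple, so the fields can be read by name as above or unpacked positionally.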
##### HTML:
- `<h1> headers </h1>`: headers
- `<p>`: paragraph
- `class="courseblocktitle"`: class attribute, used for style
- `<a href="url"> The college </a>`: links
- `<img style="height: 120px;" alt="" src="images/freelunch.png">`: images
##### HTML Tables:

    <table>
      <tr>              <!-- Table Row -->
        <th>…</th>      <!-- Header -->
        <td>…</td>      <!-- Table Data -->
      </tr>
    </table>

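
To show how the row and cell nesting maps onto data, here is a small sketch that turns a table into a list of rows. It uses Beautiful Soup, which the next section introduces. The table contents are invented for illustration, and the choice of the built-in `"html.parser"` is an assumption, not something the lecture specified.

```python
import bs4

# Hypothetical table, made up for illustration.
html = """
<table>
  <tr><th>Course</th><th>Quarter</th></tr>
  <tr><td>CMSC 121</td><td>Autumn</td></tr>
  <tr><td>CMSC 122</td><td>Winter</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Each row holds header cells (<th>) or data cells (<td>).
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, data = rows[0], rows[1:]
print(header)  # ['Course', 'Quarter']
for row in data:
    print(row)
```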
##### Beautiful Soup Intro

    import bs4
    import urllib2
    # Read the HTML from a local file...
    html_string = open("courses.html").read()
    # ...or fetch it from the net
    or_from_net = urllib2.urlopen("http://www.cs.uchicago.edu/").read()
    soup = bs4.BeautifulSoup(html_string)
    soup.title          # returns the HTML tag
    soup.title.text     # returns the unicode text
    links = soup.find_all("a")

> Note: `links` is a list of all the links on the page. `len(links)` gives the number of links, and `links[100]` returns the link at index 100 (the 101st link, since indexing starts at 0).
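
The steps above can be tried without a file or a network connection by parsing an inline string. This is a sketch under assumptions: the HTML is invented for illustration, and passing `"html.parser"` explicitly is a choice made here (recent bs4 releases warn when no parser is named), not something taken from the lecture.

```python
import bs4

# Hypothetical page, made up for illustration.
html_string = """
<html>
  <head><title>Course Catalog</title></head>
  <body>
    <a href="http://www.cs.uchicago.edu/">Computer Science</a>
    <a href="/courses.html">Courses</a>
    <a>No href here</a>
  </body>
</html>
"""

soup = bs4.BeautifulSoup(html_string, "html.parser")

print(soup.title)        # the whole <title> tag
print(soup.title.text)   # just the text inside it

links = soup.find_all("a")
print(len(links))        # number of <a> tags found
print(links[0])          # first link (indexing starts at 0)

# Tag.get returns None instead of raising when an attribute is missing,
# so links without an href can be filtered out.
hrefs = [a.get("href") for a in links if a.get("href") is not None]
print(hrefs)
```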
#### Lecture 2 (Week 2 Mon, 2018-01-08)

    def find_sequence(tag):
        '''
        If tag is the header for a sequence, then find the tags
        for the courses in the sequence.
        '''
        rv = []
        sib_tag = tag.next_sibling
        # Walk forward through the siblings, skipping the newline strings between tags
        while is_subsequence(sib_tag) or sib_tag == u'\n':
            if sib_tag != u'\n':
                rv.append(sib_tag)
            sib_tag = sib_tag.next_sibling
        return rv

Example (sometimes there is nothing to find, and the result is an empty list):

    >>> find_sequence(course_divs[0])
    []
    >>> find_sequence(course_divs[1])
    []
    >>> find_sequence(course_divs[2])
    [
    <div class="courseblock subsequence">
    <p class='courseblocktitle'>
    <strong>CMSC10500.

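
The notes do not show `is_subsequence` or how `course_divs` is built, so the sketch below fills both in with assumptions: `is_subsequence` is a hypothetical helper that checks for the `subsequence` class seen in the output above, and `course_divs` is taken to be the `div` tags carrying a `main` class. The HTML and course names are invented for illustration.

```python
import bs4

# Hypothetical catalog fragment. Tags sit on separate unindented lines so
# the strings between sibling divs are exactly "\n".
html = """
<div class="courseblock main"><p class="courseblocktitle">SEQ 10000. Standalone.</p></div>
<div class="courseblock main"><p class="courseblocktitle">SEQ 10500-10600. Sequence.</p></div>
<div class="courseblock subsequence"><p class="courseblocktitle">SEQ 10500. Part one.</p></div>
<div class="courseblock subsequence"><p class="courseblocktitle">SEQ 10600. Part two.</p></div>
<div class="courseblock main"><p class="courseblocktitle">SEQ 11000. Standalone.</p></div>
"""


def is_subsequence(tag):
    """Assumed helper: True for a tag whose class list includes 'subsequence'."""
    return isinstance(tag, bs4.element.Tag) and "subsequence" in tag.get("class", [])


def find_sequence(tag):
    """If tag is the header for a sequence, find the tags for its courses."""
    rv = []
    sib_tag = tag.next_sibling
    # The loop stops at the first sibling that is neither a newline string
    # nor a subsequence tag, including None at the end of the document.
    while is_subsequence(sib_tag) or sib_tag == "\n":
        if sib_tag != "\n":
            rv.append(sib_tag)
        sib_tag = sib_tag.next_sibling
    return rv


soup = bs4.BeautifulSoup(html, "html.parser")
course_divs = soup.find_all("div", class_="main")  # assumed definition

counts = [len(find_sequence(d)) for d in course_divs]
print(counts)  # [0, 2, 0]
```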
--------
### Project
- The project should be data-oriented. We can gather the data from the web via web scraping.
- There should be a graphical component.