CMSC 122 - WebScraping
### CMSC 122
#### January 8, 2018
----------
### Beautiful Soup
#### Lecture 1 (Week 1 Fri, 2018-01-05)
##### urllib2:
We will use `urllib2` (a package we import) to fetch the web pages we scrape. Calling `urllib2.urlparse.urlparse(url)` splits a URL into its components and returns a `ParseResult`:

    ParseResult(scheme='http',
                netloc='www.classes.cs.uchicago.edu',
                path=fullpath,
                params='',
                query='',
                fragment='')
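
To make the fields of `ParseResult` concrete, here is a minimal sketch. It assumes Python 3, where the same function lives in `urllib.parse` rather than under `urllib2`. The path, query, and fragment in the URL are made up for illustration; only the host name comes from the notes.

```python
from urllib.parse import urlparse  # Python 3 home of urlparse (assumption: not Python 2)

# Hypothetical URL: only the host name appears in the lecture notes.
url = "http://www.classes.cs.uchicago.edu/some/path/index.html?term=winter#top"

parts = urlparse(url)

# Each component is available as a named attribute of the result.
print(parts.scheme)    # http
print(parts.netloc)    # www.classes.cs.uchicago.edu
print(parts.path)      # /some/path/index.html
print(parts.query)     # term=winter
print(parts.fragment)  # top
```

The `params` field is the empty string here, as it is for most URLs.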
##### HTML:
- `<h1> headers </h1>`
- `<p> paragraph </p>`
- `class="courseblocktitle"`: the `class` attribute, used for styling
- links: `<a href="url"> The college </a>`
- images: `<img style="height: 120px;" alt="" src="images/freelunch.png">`
##### HTML Tables:

    <table>
      <tr>            <!-- table row -->
        <th>…</th>    <!-- header cell -->
        <td>…</td>    <!-- table data cell -->
      </tr>
    </table>
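
As a rough sketch of how this table structure maps onto scraping code, the snippet below walks each `<tr>` and collects the text of its `<th>` and `<td>` cells. It assumes Python 3 and the built-in `html.parser` backend. The table contents are invented placeholders, not data from the course.

```python
import bs4

# Made-up table for illustration only.
html_string = """
<table>
  <tr><th>Code</th><th>Title</th></tr>
  <tr><td>A101</td><td>Example title A</td></tr>
  <tr><td>B202</td><td>Example title B</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html_string, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # A row holds header cells (<th>) or data cells (<td>); collect both kinds.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, data = rows[0], rows[1:]
print(header)
print(data)
```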
##### Beautiful Soup Intro

    import bs4
    import urllib2

    # Read HTML from a local file ...
    html_string = open("courses.html").read()
    # ... or fetch it from the network
    or_from_net = urllib2.urlopen("http://www.cs.uchicago.edu/").read()

    soup = bs4.BeautifulSoup(html_string)
    soup.title       # returns the HTML tag
    soup.title.text  # returns the unicode text
    links = soup.find_all("a")

> Note: `links` is a list of all links, so `len(links)` gives you the number of links found, and `links[100]` returns the link at index 100 (the 101st link).
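
The calls above can be tried without a file or network access by parsing an inline string. This is a minimal sketch assuming Python 3 and the built-in `html.parser` backend. The HTML is a made-up page that reuses the `courseblocktitle` class from the notes, and the second link's path is hypothetical.

```python
import bs4

# Hypothetical page for illustration.
html_string = """
<html>
  <head><title>Course Catalog</title></head>
  <body>
    <h1>Courses</h1>
    <p class="courseblocktitle">First course</p>
    <a href="http://www.cs.uchicago.edu/">Computer Science</a>
    <a href="/thecollege/">The College</a>
  </body>
</html>
"""

soup = bs4.BeautifulSoup(html_string, "html.parser")

title_tag = soup.title        # the <title> tag itself
title_text = soup.title.text  # just the text inside the tag

links = soup.find_all("a")               # list of every <a> tag
hrefs = [a.get("href") for a in links]   # attribute lookup on each tag

# class_ (with a trailing underscore) filters by the HTML class attribute.
titles = soup.find_all("p", class_="courseblocktitle")

print(title_text, len(links), hrefs, titles[0].text)
```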
#### Lecture 2 (Week 2 Mon, 2018-01-08)

    def find_sequence(tag):
        '''
        If tag is the header for a sequence, then
        find the tags for the courses in the sequence.
        '''
        rv = []
        sib_tag = tag.next_sibling
        # Walk the following siblings, skipping newline strings, until we
        # reach one that is not part of the sequence.
        # (is_subsequence is a helper that is not shown in these notes.)
        while is_subsequence(sib_tag) or sib_tag == u'\n':
            if sib_tag != u'\n':
                rv.append(sib_tag)
            sib_tag = sib_tag.next_sibling
        return rv
Example (sometimes you don't find anything):

    >>> find_sequence(course_divs[0])
    []
    >>> find_sequence(course_divs[1])
    []
    >>> find_sequence(course_divs[2])
    [
    <div class="courseblock subsequence">
    <p class='courseblocktitle'>
    <strong>CMSC10500.
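
To see the sibling walk end to end, here is a self-contained sketch assuming Python 3 and the built-in `html.parser` backend. The notes do not show `is_subsequence`, so the version below is a hypothetical stand-in that treats any tag whose classes include `subsequence` as part of a sequence. The HTML, including the `main` class, is invented for illustration.

```python
import bs4


def is_subsequence(tag):
    # Hypothetical helper (not from the notes): true for tags whose
    # class list contains "subsequence"; false for strings and None.
    return isinstance(tag, bs4.element.Tag) and "subsequence" in tag.get("class", [])


def find_sequence(tag):
    """Collect the subsequence tags that immediately follow tag."""
    rv = []
    sib_tag = tag.next_sibling
    while is_subsequence(sib_tag) or sib_tag == "\n":
        if sib_tag != "\n":
            rv.append(sib_tag)
        sib_tag = sib_tag.next_sibling
    return rv


# Made-up markup: one sequence header, two parts, then an unrelated course.
html_string = (
    '<div class="courseblock main"><p class="courseblocktitle">Sequence header</p></div>\n'
    '<div class="courseblock subsequence"><p class="courseblocktitle">Part 1</p></div>\n'
    '<div class="courseblock subsequence"><p class="courseblocktitle">Part 2</p></div>\n'
    '<div class="courseblock main"><p class="courseblocktitle">Standalone course</p></div>\n'
)

soup = bs4.BeautifulSoup(html_string, "html.parser")
course_divs = soup.find_all("div", class_="courseblock")

found = find_sequence(course_divs[0])    # the two subsequence divs
nothing = find_sequence(course_divs[3])  # no siblings follow, so empty

print([d.p.text for d in found])
print(nothing)
```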
--------
### Project
- The project should be data-oriented. We can gather the data from the web via web scraping.
- There should be a graphical component.