jmw1040 · January 8, 2018 17:41
diff --git a/CMSC 122 - Webscraping b/CMSC 122 - Webscraping

 ### CMSC 122
 #### January 8, 2018

 ----------
 ### Beautiful Soup
 #### Lecture 1 (Week 1 Fri, 2018-01-05)

 ##### urllib2:
 We will use urllib2 (a package to import) to fetch web pages for scraping.

 `urllib2.urlparse.urlparse(url)` returns:

 	ParseResult(scheme='http',
 		netloc='www.classes.cs.uchicago.edu',
 		path=fullpath,
 		params='',
 		query='',
 		fragment='')
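
To see what each field of the result holds, here is a minimal sketch. It assumes Python 3, where the `urlparse` function that these notes reach through `urllib2` lives in `urllib.parse`. The URL is a made-up example, not one from lecture.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse

# Hypothetical URL, chosen only so that every field is non-empty.
url = "http://www.classes.cs.uchicago.edu/archive/index.html;v=1?term=winter#top"
parts = urlparse(url)

print(parts.scheme)    # protocol, before "://"
print(parts.netloc)    # host name
print(parts.path)      # path on the server
print(parts.params)    # text after ";" in the last path segment
print(parts.query)     # text after "?"
print(parts.fragment)  # text after "#"
```

The result is a named tuple, so `parts[1]` and `parts.netloc` refer to the same value.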

 ##### HTML:
 	<h1> headers </h1>
 	<p> paragraph </p>
 	class="courseblocktitle"  (attribute, used for styling)
 	
 	links: <a href="url"> The college </a>
 	<img style="height: 120px;" alt="" src="images/freelunch.png">

 ##### HTML Tables:
 	<table>
 	<tr> Table Row
 	<th>…</th>  Header
 	<td>…</td> Table Data
 	</tr>
 	</table>
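
As a rough illustration of how this row and cell structure maps onto Python lists, the sketch below parses a small table with Beautiful Soup (introduced in the next section). The table contents are made up, and passing `"html.parser"` as the parser is a choice made here, not something stated in lecture.

```python
import bs4

# Hypothetical table, made up for illustration.
html = """
<table>
  <tr><th>Course</th><th>Quarter</th></tr>
  <tr><td>CMSC 121</td><td>Autumn</td></tr>
  <tr><td>CMSC 122</td><td>Winter</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Header cells are <th> and data cells are <td>; collect both in order.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, data = rows[0], rows[1:]
print(header)
print(data)
```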


 ##### Beautiful Soup Intro

 	import bs4
 	import urllib2
 	html_string = open("courses.html").read()
 	or_from_net = urllib2.urlopen("http://www.cs.uchicago.edu/").read()
 	soup = bs4.BeautifulSoup(html_string)
 	
 	soup.title # returns the HTML tag
 	soup.title.text # returns unicode text
 	links = soup.find_all("a")

 > Note: `len(links)` gives the number of links found, and `links[100]` returns the link at index 100 (the 101st link, since indexing starts at 0).
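
The calls above can be tried without a network connection by parsing an inline string. This is a minimal sketch: the page content and `example.com` links are invented, and the explicit `"html.parser"` argument is an assumption added here.

```python
import bs4

# Hypothetical page, made up for illustration.
html_string = """
<html><head><title>Courses</title></head>
<body>
  <p class="courseblocktitle">Intro</p>
  <a href="http://example.com/a">First</a>
  <a href="http://example.com/b">Second</a>
  <a>No href here</a>
</body></html>
"""

soup = bs4.BeautifulSoup(html_string, "html.parser")

title_text = soup.title.text      # text inside <title>
links = soup.find_all("a")        # list-like collection of <a> tags
n_links = len(links)              # how many links were found

# .get returns None instead of raising when the attribute is missing.
hrefs = [a.get("href") for a in links]

# class is a reserved word in Python, so bs4 uses the keyword class_.
titles = soup.find_all("p", class_="courseblocktitle")

print(title_text, n_links, hrefs, len(titles))
```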

 #### Lecture 2 (Week 2 Mon, 2018-01-08)

 	def find_sequence(tag):
 		'''
 		If tag is the header for a sequence, then find the tags for
 		the courses in the sequence.
 		'''
 		rv = []
 		sib_tag = tag.next_sibling
 		# Walk forward while siblings are subsequence tags or bare newlines.
 		while is_subsequence(sib_tag) or sib_tag == u'\n':
 			if sib_tag != u'\n':
 				rv.append(sib_tag)
 			sib_tag = sib_tag.next_sibling
 		return rv

 Example (sometimes the result is empty):

 	>>> find_sequence(course_divs[0])
 	[]
 	>>> find_sequence(course_divs[1])
 	[]
 	>>> find_sequence(course_divs[2])
 	[
 	<div class="courseblock subsequence">
 	<p class='courseblocktitle'>
 	<strong>CMSC10500.
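
The notes call `is_subsequence` without showing its body, so the sketch below supplies a hypothetical version that checks for the `subsequence` class. The HTML, the `id` values, and the `main` class are likewise invented for illustration. The divs are separated by a single `"\n"` because `find_sequence` compares siblings against exactly that string.

```python
import bs4

def is_subsequence(tag):
    # Hypothetical helper: true only for tags carrying the "subsequence" class.
    return isinstance(tag, bs4.Tag) and "subsequence" in tag.get("class", [])

def find_sequence(tag):
    """Collect the subsequence tags that directly follow a sequence header."""
    rv = []
    sib_tag = tag.next_sibling
    while is_subsequence(sib_tag) or sib_tag == "\n":
        if sib_tag != "\n":
            rv.append(sib_tag)
        sib_tag = sib_tag.next_sibling
    return rv

# Hypothetical markup: one header with two courses, then a standalone course.
html = (
    '<div class="courseblock main" id="seq">Sequence header</div>\n'
    '<div class="courseblock subsequence" id="a">Part A</div>\n'
    '<div class="courseblock subsequence" id="b">Part B</div>\n'
    '<div class="courseblock main" id="solo">Standalone course</div>\n'
)

soup = bs4.BeautifulSoup(html, "html.parser")
course_divs = soup.find_all("div", class_="main")

with_courses = [d["id"] for d in find_sequence(course_divs[0])]
without_courses = find_sequence(course_divs[1])
print(with_courses, without_courses)
```

The loop stops at the first sibling that is neither a newline nor a subsequence tag, or at `None` when the siblings run out, which is why a standalone course yields an empty list.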

 --------
 ### Project
 - The project should be data-oriented. We can gather the data from the web via web scraping.
 - There should be a graphical component.
