leonardreidy/5863337, created June 25, 2013 23:14
Python function that takes an input file containing HTML (in .html or .txt format) and the name of an output file. It uses the BeautifulSoup library to extract the name of the institution, stored in the first <h2> tag, and the contents of the <td> tags that hold profile information from a Higher Education Directory, and writes them to the output file as comma-separated values.
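
To illustrate the extraction described above, here is a minimal sketch run against a small inline sample instead of a file. The sample markup, the `SAMPLE_HTML` name, and the `extract_fields` helper are assumptions made for illustration; they are not part of the gist, and real directory pages will be messier (whitespace between tags, nested markup).

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real directory markup is not shown in the gist.
SAMPLE_HTML = (
    "<html><body>"
    "<h2>Example College</h2>"
    "<table>"
    "<tr><td>City</td><td>Springfield</td></tr>"
    "<tr><td></td><td>Public</td></tr>"
    "</table>"
    "</body></html>"
)


def extract_fields(markup):
    """Return the first <h2> string followed by every non-empty row child."""
    soup = BeautifulSoup(markup, "html.parser")
    fields = [soup("h2")[0].string]
    for row in soup("tr"):
        # iterating a tag walks its direct children (here, the <td> tags)
        for cell in row:
            # .string is None for an empty tag, so empty cells are skipped
            if cell.string is not None:
                fields.append(cell.string)
    # convert NavigableString objects to plain str
    return [str(field) for field in fields]


fields = extract_fields(SAMPLE_HTML)
print(",".join(fields) + ",")
```

The empty `<td>` in the second row is dropped, which is the case the `None` check in the function guards against.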
from bs4 import BeautifulSoup

def preproc(infile, outfile):
    # open the input in binary mode so BeautifulSoup can detect its encoding;
    # 'with' closes the file once the soup has been built
    with open(infile, 'rb') as src:
        # name the parser explicitly (html.parser ships with Python) so the
        # result does not depend on which parsers are installed
        soup = BeautifulSoup(src, 'html.parser')
    # use 'with' to open the outfile so the interpreter takes care of
    # closing/flushing it afterwards; text is written as UTF-8
    # (Python 3: strings are written directly rather than encoded by hand)
    with open(outfile, 'w', encoding='utf-8') as out:
        # find the first h2 (the school title) and write it
        out.write(soup('h2')[0].string + ",")
        # iterate through the <tr> tags
        for row in soup('tr'):
            # drill down to the children of each row
            for cell in row:
                # to avoid errors with NoneTypes, write only items that
                # hold a string; empty tags have .string set to None
                if cell.string is not None:
                    out.write(cell.string + ",")
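
The function joins fields with bare commas, so a value that itself contains a comma (an address, for instance) would be split in two when the output is read back. The following is a minimal sketch of one possible alternative using the standard library's `csv` module, which quotes such values. It is not what the gist does; the `write_row` helper and the sample values are assumptions for illustration only.

```python
import csv
import io


def write_row(fields, stream):
    """Write one record with csv.writer so embedded commas are quoted."""
    csv.writer(stream).writerow(fields)


# Hypothetical record; the second field contains a comma.
sample = ["Example College", "Springfield, IL", "Public"]

buffer = io.StringIO()
write_row(sample, buffer)
line = buffer.getvalue()

# Reading the line back with csv.reader recovers the original three fields,
# whereas a naive split on "," would produce four.
row = next(csv.reader(io.StringIO(line)))
naive = line.strip().split(",")
print(row)
```

When writing to a real file with `csv.writer`, the Python documentation recommends opening it with `newline=''`.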