Created November 20, 2013 22:23
Python BS4 fragments for extracting content from a recent web-based JAM.
from bs4 import BeautifulSoup

# hypothetical path; replace with the location of your saved HTML page
infile = 'jam.html'

# open the infile for reading (assumes the page is UTF-8 encoded) and
# convert its contents to a Beautiful Soup object
with open(infile, 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

# create lists, a list of bs4.element.Tag items generated with the
# .select() syntax - the texts and their author names are contained in
# li elements that nest divs containing the stuff of interest
lists = soup.select('li.message-container-li')

# extract the message text for the first Tag in the lists list
# (under Python 3, .encode("utf-8") returns bytes)
lists[0].div.find('div', {"class": "message-text cf"}).get_text().encode("utf-8")

# extract the message author name
lists[0].a.get_text().encode("utf-8")
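
The fragments above only touch the first item in the list. A minimal sketch of how they might be extended to every message follows. The inline `SAMPLE_HTML` and the `extract_messages` helper are hypothetical and not part of the original gist, the markup is only a guess at the real page's structure, and a CSS selector (`select_one`) is swapped in for the original `.div.find(...)` lookup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real JAM page's structure may differ.
SAMPLE_HTML = """
<ul>
  <li class="message-container-li">
    <a href="#">Alice</a>
    <div><div class="message-text cf">First message</div></div>
  </li>
  <li class="message-container-li">
    <a href="#">Bob</a>
    <div><div class="message-text cf">Second message</div></div>
  </li>
  <li class="message-container-li"><a href="#">Carol</a></li>
</ul>
"""


def extract_messages(html):
    """Return a list of (author, text) tuples, one per complete message item."""
    soup = BeautifulSoup(html, "html.parser")
    messages = []
    for item in soup.select("li.message-container-li"):
        author_tag = item.a  # first <a> inside the li
        text_tag = item.select_one("div.message-text")  # nested message div
        # Skip items missing either piece instead of raising AttributeError.
        if author_tag is None or text_tag is None:
            continue
        messages.append(
            (author_tag.get_text(strip=True), text_tag.get_text(strip=True))
        )
    return messages


messages = extract_messages(SAMPLE_HTML)
for author, text in messages:
    print(f"{author}: {text}")
```

Returning plain `str` tuples leaves any encoding decision to the point where the data is written out, which tends to be simpler under Python 3 than calling `.encode("utf-8")` on each value.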