Created September 19, 2014 19:53
Some utility functions for working with the Python module lxml when parsing HTML
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

import lxml.html


def clean_xpath(path):
    """
    When copying XPath from the developer console of Firefox or Chrome,
    the browser inserts 'tbody' steps in table paths, even though the
    <tbody> tags are generally not there in the original source. These
    need to be removed in order for lxml to work with said XPath expression.
    """
    return path.replace('tbody/', '')


def links_to_dicts(links, base_url=None):
    """
    Takes a list of lxml.html.HtmlElement elements representing <a> links,
    and returns a list of dictionaries representing those links, with each
    dict containing the name and url of the corresponding <a> link.
    """
    # urljoin returns the href unchanged when base_url is None.
    return [{'name': link.text, 'url': urljoin(base_url, link.attrib['href'])}
            for link in links]
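
A minimal usage sketch of the two helpers together, assuming Python 3 and an installed lxml. The `PAGE` markup, the `nav` table id, and the `http://example.com/site/` base URL are made up for illustration; the helper bodies are repeated so the snippet runs on its own.

```python
from urllib.parse import urljoin

import lxml.html


def clean_xpath(path):
    # Drop the 'tbody/' steps that browser dev tools add to copied XPaths.
    return path.replace('tbody/', '')


def links_to_dicts(links, base_url=None):
    # Resolve each href against base_url and pair it with the link text.
    return [{'name': link.text, 'url': urljoin(base_url, link.attrib['href'])}
            for link in links]


# Hypothetical page source: the table has no <tbody> element.
PAGE = """
<html><body>
<table id="nav">
  <tr><td><a href="/about">About</a></td></tr>
  <tr><td><a href="docs/index.html">Docs</a></td></tr>
</table>
</body></html>
"""

doc = lxml.html.fromstring(PAGE)

# XPath as a browser console might report it, including the tbody step.
browser_xpath = '//table[@id="nav"]/tbody/tr/td/a'

# Used as-is, the expression matches nothing in the parsed source.
raw_matches = doc.xpath(browser_xpath)

# After cleaning, both links are found and resolved against the base URL.
links = doc.xpath(clean_xpath(browser_xpath))
result = links_to_dicts(links, base_url='http://example.com/site/')

print(raw_matches)
for entry in result:
    print(entry['name'], entry['url'])
```

Note that `link.text` only holds the text that precedes the first child element, so a link such as `<a><b>About</b></a>` would get a name of `None`; lxml's `text_content()` may be a better fit for such markup. A link without an `href` attribute would raise a `KeyError`.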