Skip to content

Instantly share code, notes, and snippets.

@nrrb
Created September 19, 2014 19:53
Show Gist options
  • Select an option

  • Save nrrb/42a3143a01207c6ebc6f to your computer and use it in GitHub Desktop.

Select an option

Save nrrb/42a3143a01207c6ebc6f to your computer and use it in GitHub Desktop.
Some utility functions for working with the Python module lxml when parsing HTML
from urlparse import urljoin
import lxml.html
def clean_xpath(path):
"""
When copying XPath from the developer console of Firefox or Chrome,
the browser inserts 'tbody' tags in table declarations which are
generally not there in the original source. These need to be removed
in order for lxml to work with said XPath expression.
"""
return path.replace('tbody/', '')
def links_to_dicts(links, base_url=None):
"""
Takes a list of lxml.html.HtmlElement elements representing <a> links,
and returns a list of dictionaries representing those links with each
dict containing the name and url of the corresponding <a> link.
"""
return [{'name': link.text, 'url': urljoin(base_url, link.attrib['href'])} for link in links]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment