Skip to content

Instantly share code, notes, and snippets.

@mark5280
Last active July 21, 2023 13:03
Show Gist options
  • Save mark5280/4b3fd0e608d1c9c31db3d28ceee014f8 to your computer and use it in GitHub Desktop.
Save mark5280/4b3fd0e608d1c9c31db3d28ceee014f8 to your computer and use it in GitHub Desktop.
A Python script to replace relative paths in src/href attributes in a HTML string with absolute path equivalents. Inspiration was provided by https://gist.github.com/sprintingdev/8843526. See additional notes below.
# In development of this script I had four goals:
# 1) Operate on a string fragment of HTML and assume it could be poorly formed (ie - might not parse well with lxml, etc.)
# 2) Ability to call the function with base_url since it would be used in the processing of multiple sites and HTML fragments
# from them
# 3) Desire to use urlparse.urljoin to insure that the resulting absolute url was well formed
# 4) Even if the HTML fragment contains absolute URLs I could pass it through this routine without issue
# Starting with https://gist.github.com/sprintingdev/8843526 I found that certain
# relative URLs would, when made absolute, were not always well formed. Often there
# could be multiple slashes in the URL path. To combat this urlparse.urljoin was introduced.
#
# Additionally, due to my needs, I needed the flexibilty to pass base_url into the function called by
# re.sub function. To do this I used partials.
import re
from functools import partial
from urlparse import urljoin
def srcrepl(base_url, match):
absolute_link = urljoin(base_url, match.group(3))
return "<" + match.group(1) + match.group(2) + "=" + "\"" + absolute_link + "\"" + match.group(4) + ">"
def relative_to_absolute_urls(fragment, base_url):
p = re.compile(r"<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>")
absolute_fragment = p.sub(partial(srcrepl, base_url), fragment)
return absolute_fragment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment