Last active
July 21, 2023 13:03
A Python script to replace relative paths in src/href attributes in an HTML string with their absolute URL equivalents. Inspiration was provided by https://gist.github.com/sprintingdev/8843526. See additional notes below.
# In developing this script I had four goals:
# 1) Operate on a string fragment of HTML and assume it could be poorly formed (i.e., it might not parse well with lxml, etc.)
# 2) Accept base_url as an argument, since the function would be used to process HTML fragments
#    from multiple sites
# 3) Use urljoin (urlparse.urljoin in Python 2) to ensure that the resulting absolute URL is well formed
# 4) Pass HTML fragments that already contain absolute URLs through this routine without issue
#
# Starting with https://gist.github.com/sprintingdev/8843526, I found that certain
# relative URLs, when made absolute, were not always well formed. Often there
# were multiple slashes in the URL path. To combat this, urljoin was introduced.
#
# Additionally, I needed the flexibility to pass base_url into the function called by
# re.sub. To do this I used functools.partial.
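import re
from functools import partial

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

def srcrepl(base_url, match):
    # Rebuild the tag, replacing the URL (group 3) with its absolute form.
    absolute_link = urljoin(base_url, match.group(3))
    return "<" + match.group(1) + match.group(2) + "=" + "\"" + absolute_link + "\"" + match.group(4) + ">"

def relative_to_absolute_urls(fragment, base_url):
    # Match double-quoted src/href values that do not start with "http".
    p = re.compile(r"<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>")
    # partial binds base_url so re.sub can call srcrepl with only the match.
    absolute_fragment = p.sub(partial(srcrepl, base_url), fragment)
    return absolute_fragment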
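
The notes above mention that plain concatenation can leave multiple slashes in the path and that urljoin avoids this. The following is a minimal usage sketch of that behaviour, assuming Python 3 (where urljoin lives in urllib.parse). The base URL, the HTML fragment, and the `naive` variable are hypothetical and only there for illustration. The two functions are repeated so the example runs on its own.

```python
import re
from functools import partial
from urllib.parse import urljoin  # Python 3 location of urljoin


def srcrepl(base_url, match):
    # Rebuild the matched tag with the URL (group 3) made absolute.
    absolute_link = urljoin(base_url, match.group(3))
    return "<" + match.group(1) + match.group(2) + '="' + absolute_link + '"' + match.group(4) + ">"


def relative_to_absolute_urls(fragment, base_url):
    # Only double-quoted src/href values that do not start with "http" are rewritten.
    p = re.compile(r"<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>")
    return p.sub(partial(srcrepl, base_url), fragment)


# Hypothetical base URL and fragment, chosen only for this example.
base_url = "http://example.com/blog/"
fragment = (
    '<p><a href="/about">About</a> '
    '<img src="images/logo.png" alt="logo"> '
    '<a href="http://other.example/x">x</a></p>'
)

# Plain concatenation produces a doubled slash; urljoin resolves the path.
naive = base_url + "/about"
joined = urljoin(base_url, "/about")
print(naive)
print(joined)

# The root-relative href and the relative src are rewritten;
# the link that is already absolute is left untouched.
result = relative_to_absolute_urls(fragment, base_url)
print(result)
```

Note that the pattern only covers double-quoted attribute values, and the `(?!http)` check also skips any relative path that happens to begin with "http".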