When trying to clone or archive a modern web design, standard scraping tools often fail to capture images. This happens because modern frontend frameworks use lazy-loading techniques, storing image URLs in custom attributes like data-background or data-src instead of standard <img src="..."> tags.
Tools like wget only look for standard src attributes and will skip these lazy-loaded assets entirely.
This guide outlines a workaround: use mitmproxy to intercept each HTML response and inject hidden <img> tags, tricking wget into downloading the lazy-loaded assets so the mirrored site can be viewed offline.
You will need three tools:
- wget: Installed and available in your terminal.
- mitmproxy: A free and open-source interactive HTTPS proxy.
- Python 3: To run the mitmproxy interceptor script.
Create a Python file named src.py. This script acts as middleware. It scans incoming HTML traffic for common lazy-load attributes, extracts the URLs, and appends them as hidden <img> tags at the bottom of the <body>.
import re

from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    # Only intercept and modify HTML responses
    if flow.response and flow.response.content and "text/html" in flow.response.headers.get("Content-Type", ""):
        html = flow.response.text
        # Regex to catch a wide variety of lazy-loading attributes.
        # Add or remove attribute names inside the (?: ) block as needed
        # based on the target framework.
        pattern = r'data-(?:background|bg|lazy|src|image|bg-image|lazy-src|lazy-bg|highres)=["\']([^"\']+)["\']'
        # Find all URLs matching the pattern
        lazy_images = re.findall(pattern, html)
        # Use a set to remove duplicates and ignore any empty strings
        unique_images = set(img for img in lazy_images if img.strip())
        if unique_images:
            # Build all hidden tags into a single string for one injection
            hidden_tags = "\n"
            for img_url in unique_images:
                hidden_tags += f'<img src="{img_url}" style="display:none;" alt="wget-trap">\n'
            # Inject the block right before the closing body tag
            html = html.replace("</body>", f"{hidden_tags}</body>")
            # Send the modified HTML back to wget
            flow.response.text = html

Run the Python script through mitmdump (the command-line version of mitmproxy). This will start a local proxy server, typically on port 8080.
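To sanity-check the regex and injection logic before wiring it into the proxy, you can run the same transformation against a small HTML snippet. The markup and URLs below are illustrative placeholders, not from any real site:

```python
import re

# Sample HTML using two common lazy-load attributes (placeholder URLs)
html = """<html><body>
<div class="hero" data-background="https://example.com/hero.jpg"></div>
<img data-src="https://example.com/photo.png" src="placeholder.gif">
</body></html>"""

# Same pattern as in src.py
pattern = r'data-(?:background|bg|lazy|src|image|bg-image|lazy-src|lazy-bg|highres)=["\']([^"\']+)["\']'
lazy_images = re.findall(pattern, html)
unique_images = set(img for img in lazy_images if img.strip())

# Build the hidden tags and inject them before </body>
hidden_tags = "\n"
for img_url in unique_images:
    hidden_tags += f'<img src="{img_url}" style="display:none;" alt="wget-trap">\n'
html = html.replace("</body>", f"{hidden_tags}</body>")
# html now contains both lazy-loaded URLs as plain <img src> tags
```

Both `data-background` and `data-src` URLs end up as ordinary <img> tags that wget can see.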
mitmdump -s src.py

Leave this terminal window open and running.
Open a new terminal window and execute your wget command. We pass -e options (wgetrc-style commands, not environment variables) to force wget to route its traffic through the local mitmproxy instance.
wget -e use_proxy=yes \
-e http_proxy=localhost:8080 \
-e https_proxy=localhost:8080 \
--no-check-certificate \
--mirror \
--convert-links \
--adjust-extension \
--page-requisites \
--no-parent \
--progress=dot \
--recursive \
--level=6 \
  https://TARGET-WEBSITE.COM/

- Proxy config (-e ...): Routes HTTP/HTTPS traffic through the mitmproxy script running on port 8080.
- --no-check-certificate: Prevents wget from failing on the self-signed certificates mitmproxy generates when intercepting HTTPS traffic.
- --mirror: Turns on options suitable for mirroring (recursion, time-stamping, etc.).
- --convert-links: After the download is complete, converts links in the documents to make them suitable for local, offline viewing.
- --adjust-extension: Adds proper extensions (like .html) to files lacking them.
- --page-requisites: Downloads all files necessary to properly display a given HTML page (CSS, images, sounds).
- --no-parent: Prevents wget from ascending to the parent directory, keeping the download constrained to the target path.
- --progress=dot: Uses the dot-based progress display, which is easier to read in logs than the default bar.
- --level=6: Sets the maximum recursion depth. Adjust this based on how deep the site structure goes.
Disclaimer: Ensure you have the right to download and clone the target website. This script should be used for educational purposes, archiving your own work, or analyzing public design structures.