@anburocky3
Created March 24, 2026 12:07
How to Deep-Clone a Website (Including Lazy-Loaded Images)

When you try to clone or archive a modern website, standard scraping tools often fail to capture its images. Modern frontend frameworks lazy-load images, storing the real URLs in custom attributes like data-background or data-src instead of the standard <img src="..."> attribute.

Tools like wget only look for standard src attributes and will skip these lazy-loaded assets entirely.
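For example, a hero section rendered by such a framework might look like this (illustrative markup, not taken from any specific site):

```html
<!-- The real image URL lives in a data-* attribute; client-side JS swaps it in at runtime -->
<div class="hero" data-background="/assets/img/hero-banner.jpg"></div>
<img data-src="/assets/img/team-photo.jpg" src="placeholder.gif" alt="Team">
```

wget sees only placeholder.gif and the page's standard links, so hero-banner.jpg and team-photo.jpg are never downloaded.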

This guide outlines a workaround using mitmproxy to intercept the HTML response and inject hidden image tags, tricking wget into downloading all lazy-loaded assets so the site works perfectly offline.

Prerequisites

  1. wget: Installed and available in your terminal.
  2. mitmproxy: A free and open-source interactive HTTPS proxy.
  3. Python 3: To run the mitmproxy interceptor script.

Step 1: Create the Interceptor Script

Create a Python file named src.py. This script acts as middleware. It scans incoming HTML traffic for common lazy-load attributes, extracts the URLs, and appends them as hidden <img> tags at the bottom of the <body>.

import re
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Only intercept and modify HTML responses
    if flow.response and flow.response.content and "text/html" in flow.response.headers.get("Content-Type", ""):
        html = flow.response.text

        # Regex to catch a wide variety of lazy-loading attributes.
        # Add or remove attribute names inside the (?: ) block as needed based on the target framework.
        pattern = r'data-(?:background|bg|lazy|src|image|bg-image|lazy-src|lazy-bg|highres)=["\']([^"\']+)["\']'

        # Find all URLs matching the pattern
        lazy_images = re.findall(pattern, html)

        # Use a set to remove duplicates and ignore any empty strings
        unique_images = {img for img in lazy_images if img.strip()}

        if unique_images:
            # Build all hidden tags into a single block so the HTML is only modified once
            hidden_tags = "\n"
            for img_url in unique_images:
                hidden_tags += f'<img src="{img_url}" style="display:none;" alt="wget-trap">\n'

            # Inject the block right before the closing body tag
            html = html.replace("</body>", f"{hidden_tags}</body>")

            # Return the modified HTML to wget
            flow.response.text = html
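Before wiring the script into the proxy, you can sanity-check the regex on a small HTML sample (the URLs below are made up for illustration):

```python
import re

# The same pattern used in src.py
pattern = r'data-(?:background|bg|lazy|src|image|bg-image|lazy-src|lazy-bg|highres)=["\']([^"\']+)["\']'

sample = (
    '<div data-background="/img/hero.jpg"></div>'
    '<img data-src="https://cdn.example.com/a.png" alt="">'
    '<img src="/img/logo.svg">'  # plain src: wget already handles this one
)

found = re.findall(pattern, sample)
print(found)  # ['/img/hero.jpg', 'https://cdn.example.com/a.png']
```

Only the data-* URLs are captured; the standard src attribute is intentionally left to wget.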

Step 2: Start the Proxy Server

Run the Python script through mitmdump (the command-line version of mitmproxy). This starts a local proxy server, listening on port 8080 by default.

mitmdump -s src.py

Leave this terminal window open and running.


Step 3: Run Wget with Proxy Flags

Open a new terminal window and run your wget command. The -e flags pass wgetrc-style settings that force wget to route its traffic through the local mitmproxy instance.

wget -e use_proxy=yes \
     -e http_proxy=http://localhost:8080/ \
     -e https_proxy=http://localhost:8080/ \
     --no-check-certificate \
     --mirror \
     --convert-links \
     --adjust-extension \
     --page-requisites \
     --no-parent \
     --progress=dot \
     --recursive \
     --level=6 \
     https://TARGET-WEBSITE.COM/

Understanding the Wget Flags:

  • Proxy config (-e ...): -e executes wgetrc-style commands; here they route all HTTP/HTTPS traffic through the mitmproxy script on port 8080.
  • --no-check-certificate: Prevents wget from rejecting mitmproxy's self-signed certificate when HTTPS traffic passes through the local interceptor.
  • --mirror: Turns on options suitable for mirroring (recursion, time-stamping, infinite recursion depth, etc.).
  • --convert-links: After the download completes, rewrites links in the documents so they work for local, offline viewing.
  • --adjust-extension: Adds proper extensions (like .html) to files lacking them.
  • --page-requisites: Downloads all files needed to properly display each HTML page (CSS, images, fonts), including the injected hidden images.
  • --no-parent: Prevents wget from ascending to the parent directory, keeping the download constrained to the target path.
  • --progress=dot: Uses the dot progress indicator, which is easier to follow in logs than the default bar.
  • --level=6: Caps the recursion depth, overriding the infinite depth implied by --mirror. Adjust this based on how deep the site structure goes.
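Optional: Clean Up the Injected Tags

The hidden wget-trap images end up saved inside the mirrored HTML files. A short script like the following sketch can strip them out afterwards (the TARGET-WEBSITE.COM folder name is a placeholder; wget names the output folder after the target host):

```python
import re
from pathlib import Path

# Matches the hidden tags injected by src.py
TRAP = re.compile(r'<img src="[^"]*" style="display:none;" alt="wget-trap">\n?')

# Adjust the directory to match your mirrored site
for page in Path("TARGET-WEBSITE.COM").rglob("*.html"):
    text = page.read_text(encoding="utf-8", errors="ignore")
    cleaned = TRAP.sub("", text)
    if cleaned != text:
        page.write_text(cleaned, encoding="utf-8")
```

This only removes the trap tags; the downloaded image files themselves stay in place for the site's own lazy-loading scripts to use offline.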

Disclaimer: Ensure you have the right to download and clone the target website. This script should be used for educational purposes, archiving your own work, or analyzing public design structures.
