@anburocky3
Created March 24, 2026 12:07
How to Deep-Clone a Website (Including Lazy-Loaded Images)

When you try to clone or archive a modern website, standard scraping tools often fail to capture its images. Modern frontend frameworks lazy-load images, storing the real URLs in custom attributes like data-background or data-src instead of the standard <img src="..."> attribute.

Tools like wget only look for standard src attributes and will skip these lazy-loaded assets entirely.
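For example, a hero section rendered by such a framework might look like this (illustrative markup, not taken from any specific site):

```html
<!-- The real image URL lives in a data-* attribute; client-side JS swaps it in at runtime -->
<div class="hero" data-background="/assets/img/hero-banner.jpg"></div>
<img data-src="/assets/img/team-photo.jpg" src="placeholder.gif" alt="Team">
```

wget sees only placeholder.gif and the page's standard links, so hero-banner.jpg and team-photo.jpg are never downloaded.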

This guide outlines a workaround using mitmproxy to intercept the HTML response and inject hidden image tags, tricking wget into downloading all lazy-loaded assets so the site works perfectly offline.

Prerequisites

  1. wget: Installed and available in your terminal.
  2. mitmproxy: A free and open-source interactive HTTPS proxy.
  3. Python 3: To run the mitmproxy interceptor script.

Step 1: Create the Interceptor Script

Create a Python file named src.py. This script acts as middleware. It scans incoming HTML traffic for common lazy-load attributes, extracts the URLs, and appends them as hidden <img> tags at the bottom of the <body>.

import re
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Only intercept and modify HTML responses
    if flow.response and flow.response.content and "text/html" in flow.response.headers.get("Content-Type", ""):
        html = flow.response.text

        # Regex to catch a wide variety of lazy-loading attributes.
        # Add or remove attribute names inside the (?: ) block as needed based on the target framework.
        pattern = r'data-(?:background|bg|lazy|src|image|bg-image|lazy-src|lazy-bg|highres)=["\']([^"\']+)["\']'

        # Find all URLs matching the pattern
        lazy_images = re.findall(pattern, html)

        # Use a set to remove duplicates and ignore any empty strings
        unique_images = {img for img in lazy_images if img.strip()}

        if unique_images:
            # Build all hidden tags into a single block so the HTML is only modified once
            hidden_tags = "\n"
            for img_url in unique_images:
                hidden_tags += f'<img src="{img_url}" style="display:none;" alt="wget-trap">\n'

            # Inject the block right before the closing body tag
            html = html.replace("</body>", f"{hidden_tags}</body>")

            # Return the modified HTML to wget
            flow.response.text = html
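Before wiring the script into the proxy, you can sanity-check the regex on a small HTML sample (the URLs below are made up for illustration):

```python
import re

# The same pattern used in src.py
pattern = r'data-(?:background|bg|lazy|src|image|bg-image|lazy-src|lazy-bg|highres)=["\']([^"\']+)["\']'

sample = (
    '<div data-background="/img/hero.jpg"></div>'
    '<img data-src="https://cdn.example.com/a.png" alt="">'
    '<img src="/img/logo.svg">'  # plain src: wget already handles this one
)

found = re.findall(pattern, sample)
print(found)  # ['/img/hero.jpg', 'https://cdn.example.com/a.png']
```

Only the data-* URLs are captured; the standard src attribute is intentionally left to wget.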

Step 2: Start the Proxy Server

Run the Python script through mitmdump (the command-line version of mitmproxy). This starts a local proxy server, listening on port 8080 by default.

mitmdump -s src.py

Leave this terminal window open and running.


Step 3: Run Wget with Proxy Flags

Open a new terminal window and run your wget command. The -e flags pass wgetrc-style settings that force wget to route its traffic through the local mitmproxy instance.

wget -e use_proxy=yes \
     -e http_proxy=http://localhost:8080/ \
     -e https_proxy=http://localhost:8080/ \
     --no-check-certificate \
     --mirror \
     --convert-links \
     --adjust-extension \
     --page-requisites \
     --no-parent \
     --progress=dot \
     --recursive \
     --level=6 \
     https://TARGET-WEBSITE.COM/

Understanding the Wget Flags:

  • Proxy config (-e ...): -e executes wgetrc-style commands; here they route all HTTP/HTTPS traffic through the mitmproxy script on port 8080.
  • --no-check-certificate: Prevents wget from rejecting mitmproxy's self-signed certificate when HTTPS traffic passes through the local interceptor.
  • --mirror: Turns on options suitable for mirroring (recursion, time-stamping, infinite recursion depth, etc.).
  • --convert-links: After the download completes, rewrites links in the documents so they work for local, offline viewing.
  • --adjust-extension: Adds proper extensions (like .html) to files lacking them.
  • --page-requisites: Downloads all files needed to properly display each HTML page (CSS, images, fonts), including the injected hidden images.
  • --no-parent: Prevents wget from ascending to the parent directory, keeping the download constrained to the target path.
  • --progress=dot: Uses the dot progress indicator, which is easier to follow in logs than the default bar.
  • --level=6: Caps the recursion depth, overriding the infinite depth implied by --mirror. Adjust this based on how deep the site structure goes.
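Optional: Clean Up the Injected Tags

The hidden wget-trap images end up saved inside the mirrored HTML files. A short script like the following sketch can strip them out afterwards (the TARGET-WEBSITE.COM folder name is a placeholder; wget names the output folder after the target host):

```python
import re
from pathlib import Path

# Matches the hidden tags injected by src.py
TRAP = re.compile(r'<img src="[^"]*" style="display:none;" alt="wget-trap">\n?')

# Adjust the directory to match your mirrored site
for page in Path("TARGET-WEBSITE.COM").rglob("*.html"):
    text = page.read_text(encoding="utf-8", errors="ignore")
    cleaned = TRAP.sub("", text)
    if cleaned != text:
        page.write_text(cleaned, encoding="utf-8")
```

This only removes the trap tags; the downloaded image files themselves stay in place for the site's own lazy-loading scripts to use offline.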

Disclaimer: Ensure you have the right to download and clone the target website. This script should be used for educational purposes, archiving your own work, or analyzing public design structures.
