mullnerz/archive-website.md

Last active July 29, 2025 13:28

Archiving a website with wget

The command I use to archive a single website

wget -mpck --html-extension --user-agent="" -e robots=off --wait 1 -P . www.foo.com

Explanation of the parameters used

-m (Mirror) Turns on mirror-friendly settings: recursion with infinite depth, timestamping, etc.
-c (Continue) Resumes a partially downloaded transfer.
-p (Page requisites) Downloads page dependencies such as images and style sheets.
-k (Convert links) After all files have been retrieved, converts links to files that were downloaded into relative links, and links to files that weren't downloaded into absolute, external links. In a nutshell: makes your website archive work locally.
--html-extension Appends .html to the names of downloaded HTML files that lack it, so the archive plays nicely on whatever system you're going to view it on. (Newer Wget versions name this option --adjust-extension.)
--user-agent="" Some websites block certain agents, such as web crawlers (e.g. GoogleBot) and Wget, based on the User-Agent header. An empty value tells Wget not to send a User-Agent header at all, preventing identification. You could alternatively pass a web browser's user-agent to make Wget look like a browser, but it probably doesn't matter.
-e robots=off Sometimes you'll run into a site with a robots.txt that blocks everything. In these cases, this setting tells Wget to ignore it. Like the user-agent, I usually leave this on for the sake of convenience.
--wait 1 Tells Wget to wait 1 second between retrievals. This makes it a bit less taxing on the servers.
-P . Sets the download directory. I left it at the default "." (which means "here"), but this is where you could pass a directory path to tell Wget where to save the archived site. Handy if you're doing this on a regular basis (say, as a cron job).
http://url-to-site The full URL of the site to download (www.foo.com in the command above). You'll likely want to change this.
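
For the cron-job case mentioned under -P, here is a minimal sketch that wraps the command above in a function and saves each run under a dated directory. The function name archive_site, the DRY_RUN switch, and the /tmp/archives path are assumptions made for illustration, not part of Wget. By default it only prints the command so you can review it before running anything.

```shell
#!/bin/sh
# archive_site and DRY_RUN are hypothetical names for this sketch, not Wget features.
archive_site() (
    url="$1"
    root="${2:-.}"
    dest="$root/$(date +%Y-%m-%d)"   # one dated directory per run

    # Same flags as the command above; only the -P target changes.
    set -- wget -mpck --html-extension --user-agent="" -e robots=off \
        --wait 1 -P "$dest" "$url"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"           # print the command for review
    else
        "$@"                         # actually run wget
    fi
)

# Dry run: shows the command that would be executed.
archive_site "www.foo.com" "/tmp/archives"
```

Setting DRY_RUN=0 before the call runs the download for real; a crontab entry could then invoke a script containing this function.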

Sources

[Archiving a (WordPress) website with wget | D'Arcy Norman dot net](http://darcynorman.net/2011/12/24/archiving-a-wordpress-website-with-wget/)
[Archiving a Website With Wget](http://www.dheinemann.com/2011/archiving-with-wget/)

BradKML commented Jun 6, 2022

Question: what is the purpose of the limit rate and clobber options in https://gist.github.com/stvhwrd/985dedbe1d3329e68d70 ?
Also, these are good ones: https://gist.github.com/ziadoz/6873582 and https://gist.github.com/jasperf/6395911

encassion commented Nov 18, 2022

This is the code I use to archive a website and be able to feed it into Yacy as a warc file to add to my search results:

wget \
    --mirror \
    --warc-file=example.com \
    --no-verbose \
    --warc-cdx \
    --page-requisites \
    --adjust-extension \
    --convert-links \
    --no-warc-compression \
    --no-warc-keep-log \
    --append-output="example.com" \
    --execute robots=off \
    https://example.com/

The main changes are to build a file database and simultaneously create a warc archive of the downloaded site structure, mainly so I can check the content in my file browser before committing it to Yacy Search. If it isn't worth browsing, I'll know immediately and can cancel it.
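
As a rough sketch of how to check the result: with --warc-file=example.com and --no-warc-compression, Wget should write example.com.warc, and --warc-cdx should add an example.com.cdx index next to it. The helper below, check_warc_outputs (a hypothetical name, not a Wget feature), only reports whether those two files exist for a given prefix.

```shell
#!/bin/sh
# check_warc_outputs is a hypothetical helper for this sketch.
# It looks for PREFIX.warc and PREFIX.cdx, assuming the command above was
# run with --no-warc-compression (otherwise the archive ends in .warc.gz).
check_warc_outputs() (
    prefix="$1"
    missing=0
    for f in "$prefix.warc" "$prefix.cdx"; do
        if [ -f "$f" ]; then
            printf 'found   %s\n' "$f"
        else
            printf 'missing %s\n' "$f"
            missing=1
        fi
    done
    exit "$missing"   # function body is a subshell, so this sets its status
)

check_warc_outputs "example.com" || echo "run the wget command first"
```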

oscar230 commented Dec 23, 2022

Thanks. :)

deron-dev commented Apr 10, 2023

replying to @Mayank-1234-cmd

-m includes the -r flag, which enables recursion.

deron-dev commented Apr 10, 2023

Here is a more explicit/readable version:

wget \
    --user-agent="<user-agent>" \
    --header "cookie: <name>=<value>; <name>=<value>; <etc...>" \
    --adjust-extension \
    --continue \
    --convert-links \
    -e robots=off \
    --no-parent \
    --page-requisites \
    --mirror \
    "<url>"

I use --no-parent because I usually do not want anything above the provided URL.

If you want to use something generic, this is (I believe) the user-agent for Edge on Windows 10:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246

If you have to be authenticated to the site, you can use the --header flag to include cookies. If not, it can be removed.

To get necessary cookies, you can use dev tools in your browser. Refresh a page you are already authenticated on while tracking requests and get the cookies from the page request.
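
A small sketch of assembling the value for --header from the cookies copied out of dev tools. The helper name cookie_header and the cookie names and values are placeholders invented for illustration; substitute whatever your browser shows for the site.

```shell
#!/bin/sh
# cookie_header is a hypothetical helper, not part of wget.
# It joins name=value pairs into a single "cookie: a=1; b=2" header string.
cookie_header() (
    header="cookie: "
    sep=""
    for pair in "$@"; do
        header="$header$sep$pair"
        sep="; "
    done
    printf '%s\n' "$header"
)

# Placeholder cookie names and values; copy the real ones from dev tools.
cookie_header "sessionid=abc123" "csrftoken=xyz789"
```

The result can then be passed along as --header "$(cookie_header "sessionid=abc123" "csrftoken=xyz789")". Wget also has a --load-cookies option that reads a cookies.txt file, which may be more convenient if your browser can export one.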

BradKML commented Apr 11, 2023

@terminal-root are there ways of dealing with the conflict between recursion and mirror? https://gist.github.com/crittermike/fe02c59fed1aeebd0a9697cf7e9f5c0c https://gist.github.com/stvhwrd/985dedbe1d3329e68d70

deron-dev commented Apr 25, 2023

replying to @BrandonKMLee

It does not conflict, to my knowledge; it just includes the recursion flag.

  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing

mariano-daniel commented Jul 15, 2023

I'm having trouble downloading only this URL, https://forum.spacehey.com/topic?id=3959, plus the links that branch out of it, at most 1 level away.

When I run wget -e robots=off --recursive -np -k --html-extension "https://forum.spacehey.com/topic?id=3959"

it starts downloading all of the forums and all the users. I just want topic?id=3959 to be downloaded, plus the user profiles or other links that branch out from it, but no deeper than 1 level. Not sure if I'm explaining myself correctly.
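
One possible approach, not tested against this forum: with --recursive and no --level, Wget uses its default depth of 5, which may be why so much gets pulled in. The sketch below adds --level=1 to the same command, written with long option names. The function name fetch_one_level and the DRY_RUN switch are assumptions for illustration; by default it only prints the command.

```shell
#!/bin/sh
# fetch_one_level and DRY_RUN are hypothetical names for this sketch.
fetch_one_level() (
    url="$1"

    # --level=1 follows links found on the start page and stops there;
    # without it, --recursive uses Wget's default depth of 5.
    set -- wget -e robots=off --recursive --level=1 --no-parent \
        --convert-links --html-extension "$url"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"   # print the command for review
    else
        "$@"                 # actually run wget
    fi
)

fetch_one_level "https://forum.spacehey.com/topic?id=3959"
```

Note that links pointing to a different host are not followed unless --span-hosts is also given, so this may still miss user profiles if they live on another domain.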
