Created
January 14, 2015 15:32
wget
##
# Here are a couple of recipes to download and archive an entire Web site, starting with the given page and recursing down.
#
# Pitfalls
# As of 2008, Wget didn't follow @import links in CSS. Wget 1.12 and later parse CSS, so check `wget --version` if stylesheets go missing.
#
# Credit to http://lifehacker.com/software/top/geek-to-live--mastering-wget-161202.php
# And http://www.veen.com/jeff/archives/000573.html
# Get the page at the given URL and each page it links to, as well as linked assets like images and CSS (-p).
# Rewrite hyperlinks to point to the locally downloaded pages (-k).
# Adjust how many levels deep to recurse (-r) by changing the number after -l.
wget -pkr -l 1 http://site
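# A minimal sketch of the recipe above as a reusable function, assuming a POSIX shell.
# archive_site and the WGET override are hypothetical names, not part of wget;
# the long options are the spelled-out forms of -p, -k, -r and -l.

```shell
# Hypothetical helper: archive URL ($1) to the given depth ($2, default 1).
archive_site() {
    url=$1
    depth=${2:-1}
    # ${WGET:-wget} is left unquoted on purpose so that an override such as
    # WGET="echo wget" splits into two words and prints the command (dry run).
    ${WGET:-wget} --page-requisites --convert-links --recursive \
        --level="$depth" "$url"
}

# Dry run: prints the command that would be executed, downloads nothing.
WGET="echo wget" archive_site http://site 2
```

# Dropping the WGET="echo wget" prefix runs the real download, so it is worth checking the depth with a dry run before a long crawl.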
# Same as above, but also follow links to other domains.
wget -Hpkr -l 1 http://site
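# Same as the first example, but send a session cookie.
# --no-cookies turns off wget's own cookie handling so the Cookie header below is sent as written.
wget -pkr -l 1 --no-cookies --header "Cookie: JSESSIONID=12345" https://securesite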
# Mirror an HTML site.
# Compare time-stamps (-N) and re-download only files that are newer than the local copy.
# Wait about 10 seconds between requests; --random-wait varies each pause between 0.5 and 1.5 times that.
wget -m -N -w10 --random-wait http://site
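# A sketch of what the mirror recipe expands to, assuming the wget manual's note that
# -m is currently equivalent to -r -N -l inf --no-remove-listing (so the extra -N above is harmless but redundant).
# mirror_site and the WGET override are hypothetical names used only for illustration.

```shell
# Hypothetical helper: mirror URL ($1), pausing about $2 seconds (default 10)
# between requests.
mirror_site() {
    url=$1
    wait_seconds=${2:-10}
    # Unquoted ${WGET:-wget} allows WGET="echo wget" as a dry run.
    ${WGET:-wget} --recursive --timestamping --level=inf --no-remove-listing \
        --wait="$wait_seconds" --random-wait "$url"
}

# Dry run: prints the expanded command instead of starting the mirror.
WGET="echo wget" mirror_site http://site
```

# Spelling the options out makes it easier to swap --level=inf for a finite depth when a full mirror is too much.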
# Behave very badly by ignoring robots.txt (-e robots=off).
# And spoof Mozilla.
# Also, output is appended to site.com.log (-a).
wget -m -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009090214" http://site.com/
# The same options applied to the one-level archive recipe instead of a full mirror.
wget -pkr -l 1 -N -w10 --random-wait -erobots=off -a site.com.log --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009090214" http://site.com/
# Then of course you can see the current output from wget with
tail -f site.com.log
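# Once the crawl is done, a rough sketch for counting failed requests in the log.
# It assumes wget's usual "ERROR 404: Not Found." style of log line; count_http_errors is a hypothetical helper name.

```shell
# Hypothetical helper: print how many lines of log file $1 report an HTTP error.
count_http_errors() {
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c 'ERROR [0-9][0-9][0-9]' "$1"
}

# Only report when the log from the recipes above actually exists.
if [ -f site.com.log ]; then
    count_http_errors site.com.log
fi
```

# To list the failing lines themselves, the same pattern works with plain grep instead of grep -c.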