bash: wget a remote URL, then extract the URLs from the anchor tags on that page
get from a remote file to STDOUT:

wget -qO- https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-3/ |
grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' |
grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' |
uniq
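Note that uniq only removes adjacent duplicate lines; if the same link appears in non-adjacent places on the page, piping through sort -u instead gives a fully de-duplicated list (a variant sketch, not part of the original gist):

wget -qO- https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-3/ |
grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' |
grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' |
sort -u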
get from a remote file to xargs and download each URL:

wget -qO- https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-1/ | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' | uniq | xargs -n 1 -P 24 curl -LO
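Here xargs -n 1 passes one URL per invocation, -P 24 runs up to 24 downloads in parallel, and curl -LO follows redirects and saves each file under its remote name. A gentler variant (an assumption, not part of the original gist) uses wget -nc to skip files that already exist locally and caps parallelism at 4:

wget -qO- https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-1/ | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' | uniq | xargs -n 1 -P 4 wget -nc -q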
local file version:

grep -Eoi '<a [^>]+>' file.htm | grep -Eo 'href="[^\"]+"' | grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' | uniq
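Note the final grep keeps only absolute http(s) URLs, so relative hrefs are discarded. To list every href value, relative or absolute, stop after the second grep and strip the attribute wrapper; a sketch using sed:

grep -Eoi '<a [^>]+>' file.htm | grep -Eo 'href="[^\"]+"' | sed -E 's/^href="//; s/"$//' | sort -u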
Sources
https://unix.stackexchange.com/questions/181254/how-to-use-grep-and-cut-in-script-to-obtain-website-urls-from-an-html-file |
Another easy way to get a list of URLs from a page is to use lynx: https://unix.stackexchange.com/a/684704
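A minimal sketch of that approach (assuming lynx is installed; -dump writes the rendered page to STDOUT, -listonly prints just the links, and -nonumbers, available in recent lynx versions, drops the numeric prefixes):

lynx -dump -listonly -nonumbers https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-3/ | sort -u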