bash: wget a remote URL, then extract the URLs from the anchor tags on that page
get from a remote file to STDOUT:

```
wget -qO- https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-3/ |
  grep -Eoi '<a [^>]+>' |
  grep -Eo 'href="[^\"]+"' |
  grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' |
  uniq
```
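One caveat: `uniq` only removes adjacent duplicates, so the same URL appearing in different parts of the page will survive. Swapping in `sort -u` (my variation, not the original command) deduplicates the whole list:

```
wget -qO- https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-3/ |
  grep -Eoi '<a [^>]+>' |
  grep -Eo 'href="[^\"]+"' |
  grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' |
  sort -u
```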
get from a remote file to xargs and download each URL:

```
wget -qO- https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-1/ |
  grep -Eoi '<a [^>]+>' |
  grep -Eo 'href="[^\"]+"' |
  grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' |
  uniq |
  xargs -n 1 -P 24 curl -LO
```

Here `xargs -n 1` passes one URL per invocation, `-P 24` runs up to 24 downloads in parallel, and `curl -LO` follows redirects and saves each file under its remote name.
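If curl isn't installed, a minimal variation using wget for the download stage instead (the parallelism value of 8 is arbitrary; wget follows redirects and saves under the remote file name by default):

```
wget -qO- https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-1/ |
  grep -Eoi '<a [^>]+>' |
  grep -Eo 'href="[^\"]+"' |
  grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' |
  uniq |
  xargs -n 1 -P 8 wget -q
```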
local file version:

```
grep -Eoi '<a [^>]+>' file.htm |
  grep -Eo 'href="[^\"]+"' |
  grep -Eo '(http|https)://[a-zA-Z0-9./?=_-]*' |
  uniq
```
Sources

https://unix.stackexchange.com/questions/181254/how-to-use-grep-and-cut-in-script-to-obtain-website-urls-from-an-html-file
Thanks for the info, much appreciated. I'm curious, how did you come across my gist?
Another easy way to get a list of URLs from a page is to use lynx: https://unix.stackexchange.com/a/684704
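For example, a minimal sketch of that approach (assumes lynx is installed; `-nonumbers` needs a reasonably recent lynx build):

```
# dump only the list of links, without the numeric prefixes lynx normally adds
lynx -dump -listonly -nonumbers https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-3/ |
  sort -u
```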
According to that link, the script isn't strictly correct: it may work in 98% of cases, but that doesn't make it reliable.
I would suggest using ElementTree from Python, or any other XML parser, for this. In the shell you can also find nice XPath-compatible parsers.
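As a sketch of that last suggestion, `xmllint` from libxml2 is one such XPath-capable shell tool (assuming it's installed; the grep/sort post-processing is my own addition, not from the comment):

```
# parse the HTML with a real parser, then pull href attributes via XPath
wget -qO- https://www.membrane-australasia.org/gallery/imstec-2016-adelaide-part-3/ |
  xmllint --html --nowarning --xpath '//a/@href' - 2>/dev/null |
  grep -Eo 'https?://[^"]+' |
  sort -u
```

Like the original pipeline, this keeps only absolute http(s) links and drops relative ones.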