@greg-randall
Last active March 9, 2022 14:40
Generates a list of all the URLs on a website, skipping common asset files such as jpg, pdf, js, and css.
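#!/bin/bash
# Example: ./site_map.sh asdf.com
# Stop early when no site is given, so nothing below runs with an empty argument.
[ -n "$1" ] || { echo "Usage: $0 <site>" >&2; exit 1; }
SITE="$1"
URLS="${1}_urls.txt"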
echo -e "Starting Sitemap Generation===============================================\n"
echo "" > site_url_list_raw.txt
# Crawl the whole site without saving pages; wget's log goes to site_url_list_raw.txt.
wget --no-check-certificate --spider --recursive --level=inf -e robots=off --no-verbose --show-progress --reject jpg,jpeg,png,gif,svg,webp,css,js,woff,ttf,eot,pdf --output-file=site_url_list_raw.txt "$SITE"
# Keep only the URL from each log line, drop wp-json endpoints, and de-duplicate.
grep -i 'url:' site_url_list_raw.txt | perl -pe 's/^\d.+URL:\s?//i' | perl -pe 's/ .+//i' | grep -vi 'wp-json' | sort -u > "$URLS"
PAGES=$(wc -l "$URLS" | awk '{print $1}')
echo -e "\nSitemap Generation Complete\nFound $PAGES pages\nFull report $URLS\n"
# Clean up the raw log and the directory named after the site argument.
rm site_url_list_raw.txt
rm -r -- "$1"