@PanJarda
Created November 25, 2016 18:25
#!/bin/sh
DOMAIN='www.olomouc.eu'
date +%T
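# Seed the crawl queue with the start domain.
echo "$DOMAIN" > sitemap.txt
# sitemap.txt doubles as the queue: URLs appended below are read by later iterations.
while read -r url
do
echo "loading URL $url"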
wget \
--spider \
--no-parent \
--recursive \
--force-html \
--level=1 \
--no-verbose \
-o urls_found.txt \
--reject '*.jpg' \
--reject '*.jpeg' \
--reject '*.gif' \
--reject '*.webp' \
--reject '*.css' \
--reject '*.js' \
--reject '*.xml' \
--reject '*.pdf' \
--reject '*.fla' \
--reject '*.zip' \
--reject '*.bz2' \
--reject '*.txt' \
--no-check-certificate \
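"$url"
echo "loaded"
# Keep only the URL field of each log line.
sed -ni 's|.\+ URL:\([^ ]\+\) .\+|\1|p' urls_found.txt
# Strip the scheme so entries match the bare-host form used in sitemap.txt.
sed -i 's|^https\?://||' urls_found.txt
# Append only URLs not yet queued; -F matches them literally, not as regexes.
sort -u urls_found.txt | grep -Fvxf sitemap.txt >> sitemap.txt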
done < sitemap.txt
printf '%s done\n' "$(date +%T)"
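
The de-duplication step at the end of the loop (append only URLs that sitemap.txt does not list yet) can be tried on its own, without running a crawl. The following is a minimal sketch, not part of the original script: the helper name `append_new_urls` and the sample URLs are made up for illustration, and it assumes a grep that supports `-F`, `-x`, and `-f` as specified by POSIX.

```shell
#!/bin/sh
# append_new_urls SITEMAP FOUND   (hypothetical helper, for illustration only)
# Appends to SITEMAP every line of FOUND that SITEMAP does not already contain.
append_new_urls() {
    # sort -u removes repeats inside FOUND; grep -F -v -x keeps only whole
    # lines that equal none of the fixed strings listed in SITEMAP.
    sort -u "$2" | grep -Fvxf "$1" >> "$1"
}

# Made-up sample data, not taken from a real crawl.
tmp=$(mktemp -d)
printf '%s\n' 'www.example.com' 'www.example.com/a' > "$tmp/sitemap.txt"
printf '%s\n' 'www.example.com/a' 'www.example.com/b' 'www.example.com/b' > "$tmp/urls_found.txt"

append_new_urls "$tmp/sitemap.txt" "$tmp/urls_found.txt"
cat "$tmp/sitemap.txt"
rm -r "$tmp"
```

With this sample data only `www.example.com/b` is new, so it should be appended once. Without `-F`, each line of sitemap.txt would be treated as a regular expression, and the dots in a host name would match any character.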