@dublado
Created May 3, 2023 16:34
wget spider crawler
#!/bin/bash
# Replace with the URL of the sitemap index
SITEMAP_INDEX_URL="https://example.com/sitemap_index.xml"
# Download the sitemap index; stop if the download fails
wget -q -O sitemap_index.xml "$SITEMAP_INDEX_URL" || { echo "Could not download $SITEMAP_INDEX_URL" >&2; exit 1; }
# Extract the sitemap URLs (the -P option requires GNU grep with PCRE support)
SITEMAP_URLS=$(grep -oP '<loc>\K[^<]+' sitemap_index.xml)
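# Loop through each sitemap URL and spider it using wget.
# --spider checks URLs without saving them; --no-check-certificate skips TLS certificate validation.
for url in $SITEMAP_URLS; do
    echo "Spidering sitemap: $url" >> spider.log
    # Keep each "HTTP request sent..." status line plus the line before it
    wget --recursive --level=1 --no-directories --no-check-certificate --spider -o - "$url" 2>&1 | grep -B 1 "^HTTP" | tee -a spider.log
done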
# Remove the downloaded sitemap index
rm sitemap_index.xml
echo "Spidering completed." >> spider.log
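
A possible extension, offered as a sketch rather than as part of the original script: as far as I know, wget's recursion follows links found in HTML and CSS, so it may not follow the <loc> entries inside an XML sitemap. In that case the loop above would check only the sitemap files themselves. The helpers below (extract_locs, last_status, and spider_sitemap are hypothetical names) pull the page URLs out of each child sitemap and check them one at a time. The sketch assumes uncompressed sitemaps, so .xml.gz files are not handled, and it uses grep -o instead of grep -P so it does not depend on PCRE support.

```shell
#!/bin/bash
# Sketch only: extract_locs, last_status and spider_sitemap are illustrative
# helper names, not part of the original script.

# Print every <loc> value found in sitemap XML read from stdin, one per line.
extract_locs() {
    # Join the input into one line so a <loc> element split across lines still matches,
    # then strip the tags and any surrounding whitespace.
    tr -d '\r\n' \
        | grep -o '<loc>[^<]*</loc>' \
        | sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Read "wget --server-response" output from stdin and print the last HTTP
# status code seen (the final one after any redirects), or ERR if none.
last_status() {
    awk '/^ *HTTP\//{code=$2} END{print (code == "" ? "ERR" : code)}'
}

# Fetch one child sitemap and print "<status> <url>" for every page it lists.
spider_sitemap() {
    local sitemap_url=$1 page
    wget -q -O - "$sitemap_url" | extract_locs | while IFS= read -r page; do
        echo "$(wget --spider --server-response --tries=1 --timeout=10 "$page" 2>&1 | last_status) $page"
    done
}
```

Inside the loop of the script, a call such as spider_sitemap "$url" | tee -a spider.log would then log one status line per page instead of one per sitemap.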