@jatrost
Created February 13, 2020 01:00
Downloads the historical Alexa top-1M domain lists for the last 31 days from the Wayback Machine (Internet Archive)
#!/bin/bash
# Prefer GNU date (gdate, from coreutils) so this also works on macOS,
# where the stock BSD date does not support -d.
DATE_CMD=$(command -v gdate || command -v date)
for DAY_AGO in {0..30}
do
    DATE_FILE=$("${DATE_CMD}" -d "$DAY_AGO days ago" +%F)
    # Wayback Machine timestamps have no dashes (YYYYMMDD)
    DATE_URL=${DATE_FILE//-/}
    if [ -e "${DATE_FILE}-top-1m.csv" ]
    then
        echo "Skipping $DATE_FILE since file exists ..."
    else
        URL="https://web.archive.org/web/${DATE_URL}/http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"
        # --fail stops curl from saving an HTTP error page as the zip
        curl -L --fail "$URL" --output "${DATE_FILE}-top-1m.csv.zip" && \
        unzip -o "${DATE_FILE}-top-1m.csv.zip" && \
        mv top-1m.csv "${DATE_FILE}-top-1m.csv" && \
        rm "${DATE_FILE}-top-1m.csv.zip"
    fi
done
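
To check which snapshot URL a given day maps to without downloading anything, the URL construction from the loop can be pulled out on its own. This is only a sketch: `wayback_url` is a hypothetical helper name, not part of the original script, and it assumes the date is passed as YYYY-MM-DD. As far as I know, the Wayback Machine redirects a date-only timestamp to the closest capture it has, so the file you get back may not be from exactly that day.

```shell
#!/bin/bash
# Hypothetical helper (not in the original gist): print the Wayback Machine
# URL the script would request for a date given as YYYY-MM-DD.
wayback_url() {
    # Strip the dashes, e.g. 2020-02-13 becomes 20200213
    stamp=$(printf '%s' "$1" | sed 's/-//g')
    printf '%s\n' "https://web.archive.org/web/${stamp}/http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"
}

wayback_url "2020-02-13"
```

Pasting the printed URL into a browser is a quick way to see which capture date the archive actually serves for that day.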