Created November 20, 2013 15:39
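#!/bin/sh
# Download the Alexa top sites list for one country, one result page at a time.
DATE=$(date +%Y%m%d)
ALEXA_URL="http://www.alexa.com/topsites/countries%3B"
COUNTRY_CODE=$1
if [ -z "$COUNTRY_CODE" ]; then
  echo "Usage: $0 COUNTRY_CODE" >&2
  exit 1
fi
OUTPUT_FILE="ALEXA_${COUNTRY_CODE}-${DATE}.txt"
echo "Downloading Alexa top site data for $COUNTRY_CODE"
touch "$OUTPUT_FILE"
i=0
# Pages are numbered 0..19. {0..19} is a bash extension and is not expanded by
# every /bin/sh, so count with a POSIX while loop instead.
while [ "$i" -lt 20 ]
do
  # Keep only the site-label spans and rewrite each one as a quoted site name.
  curl "${ALEXA_URL}${i}/${COUNTRY_CODE}" | grep '<span class="small topsites-label">' | sed 's/<span class="small topsites-label">\(.*\)<\/span>/"\1"/' >> "$OUTPUT_FILE"
  i=$((i + 1))
done
wc -l "$OUTPUT_FILE"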
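The grep and sed stages can be checked offline against a saved line of markup before running the full download. The following is a minimal sketch, assuming the page markup that the script targets; the function name extract_sites and the sample line are hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: applies the same grep/sed filter as the script to stdin.
extract_sites() {
  grep '<span class="small topsites-label">' |
    sed 's/<span class="small topsites-label">\(.*\)<\/span>/"\1"/'
}

# Sample line modeled on the markup the script expects (an assumption).
printf '%s\n' '<span class="small topsites-label">example.com</span>' | extract_sites
# prints "example.com" (including the double quotes)
```

If the site markup changes, feeding a freshly saved line through the same function shows quickly whether the pattern still matches.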
The regexp needs a small update because the site markup changed; see https://gist.github.com/hallvors/9931106.
It will probably need more updates shortly.
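Retrieves the Alexa top ~500 sites for a specific country by fetching 20 result pages and writing the site names to ALEXA_<COUNTRY_CODE>-<YYYYMMDD>.txt.
Execute as:
./getAlexaCountryData.sh COUNTRY_CODE
For example: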
./getAlexaCountryData.sh JP
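To see what the script requests for a given country code, the 20 page URLs can be listed without downloading anything. This is only a sketch: page_urls is a hypothetical helper name, and the URL pattern is copied from the script rather than verified against the live site.

```shell
#!/bin/sh
ALEXA_URL="http://www.alexa.com/topsites/countries%3B"

# Hypothetical helper: print the URL of each of the 20 result pages (0..19)
# that the script fetches for the given country code.
page_urls() {
  country=$1
  page=0
  while [ "$page" -lt 20 ]; do
    # The URL is passed as an argument, so its literal %3B is not treated
    # as a printf format directive.
    printf '%s%s/%s\n' "$ALEXA_URL" "$page" "$country"
    page=$((page + 1))
  done
}

# Show the first three page URLs for Japan.
page_urls JP | head -n 3
```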