@lmandel
Created November 20, 2013 15:39
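#!/bin/bash
# Download the Alexa top sites list for one country (20 result pages).
# bash is required: the {0..19} brace expansion below is not POSIX sh.
if [ -z "$1" ]; then
  echo "Usage: $0 COUNTRY_CODE" >&2
  exit 1
fi
DATE=$(date +%Y%m%d)
ALEXA_URL="http://www.alexa.com/topsites/countries%3B"
COUNTRY_CODE=$1
OUTPUT_FILE="ALEXA_${COUNTRY_CODE}-${DATE}.txt"
echo "Downloading Alexa top site data for $COUNTRY_CODE"
touch "$OUTPUT_FILE"
for i in {0..19}
do
  # Keep only the site-name spans and rewrite each one as a quoted name.
  curl "${ALEXA_URL}${i}/${COUNTRY_CODE}" | grep '<span class="small topsites-label">' | sed 's/<span class="small topsites-label">\(.*\)<\/span>/"\1"/' >> "$OUTPUT_FILE"
done
wc -l "$OUTPUT_FILE"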

lmandel commented Jan 8, 2014

Retrieves the Alexa top ~500 sites for a specific country.

Run it as:

./getAlexaCountryData.sh COUNTRY_CODE

For example:

./getAlexaCountryData.sh JP
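
To see what the grep/sed step of the script does without touching the network, here is a minimal sketch that runs the same filter over a made-up line of markup. The helper name `extract_sites` and the sample HTML are assumptions for illustration only, not captured Alexa output, and the real page markup may differ.

```shell
# Hypothetical helper: the same grep/sed filter the script applies to each page.
extract_sites() {
  # Keep only lines carrying the site-name span, then turn each span into "name".
  grep '<span class="small topsites-label">' |
    sed 's/<span class="small topsites-label">\(.*\)<\/span>/"\1"/'
}

# Made-up input: one unrelated line and one site-name span.
printf '%s\n' \
  '<li>noise</li>' \
  '<span class="small topsites-label">example.jp</span>' | extract_sites
# prints "example.jp" (including the double quotes)
```

Each matching line becomes one quoted site name, which is why `wc -l` on the output file reports the number of sites collected.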

hallvors commented Apr 2, 2014

The regexp needs a small update because the site markup changed; see https://gist.github.com/hallvors/9931106.

It will probably need more updates shortly.
