@tomconte
Created July 9, 2014 06:43
Complete sample scripts for the article "Analyzing page view logs using Pig on Windows Azure HDInsight"
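# Download the 24 hourly Wikimedia page count dumps for 2013-04-01 (hours 00 to 23).
$wc = New-Object System.Net.WebClient
foreach ($i in 0..23) {
    $n = "{0:D2}" -f $i   # zero-padded hour, e.g. "07"
    echo $n               # show progress
    $url = "http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-04/pagecounts-20130401-" + $n + "0000.gz"
    # Save under the file name at the end of the URL, in the current PowerShell directory.
    # WebClient resolves a bare file name against the .NET process directory, which can differ.
    $file = Join-Path $PWD.Path $url.Substring($url.LastIndexOf("/") + 1)
    $wc.DownloadFile($url, $file)
}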
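-- Pig script: total daily hits for French Wikipedia pages whose titles match a list of city names.
-- Load the hourly page count files: space-separated project, page title, request count, bytes returned.
records = load '/user/admin/test/temp/pagecounts-20130401-*.gz' using PigStorage(' ') as (project:chararray, page:chararray, requests:int, size:int);
-- Keep only the French Wikipedia project.
filtered_records = filter records by project == 'fr';
-- Load the city names, one per line; they must be lowercase to match LOWER(page) below.
villes = load '/user/admin/test/villes.txt' as (ville:chararray);
-- Inner join: keep the page rows whose lowercased title equals a city name.
villes_records = join villes by ville, filtered_records by LOWER(page);
-- Sum the hourly request counts per page title.
records_by_page = group villes_records by page;
sum_pages = foreach records_by_page generate group, SUM(villes_records.requests);
-- Sort by total requests, highest first, and store the result.
result = order sum_pages by $1 desc;
store result into 'wp_villes_daily_hits';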
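To check the logic of the Pig script without a cluster, the same filter, join, group and sum steps can be sketched in plain Python on a few in-memory rows. This is only an illustration and not part of the original scripts: the function name `daily_hits` and the sample rows are hypothetical, and the sketch assumes each city name appears once in the list (Pig's inner join would duplicate rows otherwise).

```python
from collections import defaultdict


def daily_hits(lines, villes, project="fr"):
    """Mirror the Pig script: filter by project, match lowercased page
    titles against the city list, sum requests per page, sort descending."""
    wanted = set(villes)  # city names, expected in lowercase
    totals = defaultdict(int)
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip rows that do not fit the four-field layout
        proj, page, requests, _size = parts
        # Same test as the filter plus the join on LOWER(page).
        if proj == project and page.lower() in wanted:
            totals[page] += int(requests)
    # Same ordering as "order sum_pages by $1 desc".
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample rows in the pagecounts layout:
# project, page title, requests, bytes returned.
sample = [
    "fr Paris 120 4000",
    "fr paris 5 300",
    "en Paris 900 9000",
    "fr Lyon 40 2000",
    "fr Tour_Eiffel 70 3500",
    "fr Paris 30 1000",
]

for page, hits in daily_hits(sample, ["paris", "lyon"]):
    print(f"{page}\t{hits}")
```

As in the Pig script, grouping uses the original page title, so `Paris` and `paris` are counted as separate pages even though both match the lowercase city name.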