@aligusnet
Last active December 11, 2021 19:56
Download a weather dataset from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/) and prepare it for the examples in the book "Hadoop: The Definitive Guide" by Tom White. http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
Usage: ./ncdc.sh 1901 1930 # download weather datasets for the period from 1901 to 1930. With no arguments, the script downloads 1901 to 1920.
#!/usr/bin/env bash

# global parameters
g_tmp_folder="ncdc_tmp"
g_output_folder="ncdc_data"
g_remote_host="ftp.ncdc.noaa.gov"
g_remote_path="pub/data/noaa"

# $1: folder_path
# Recreates the folder from scratch, deleting any existing contents.
function create_folder {
    if [ -d "$1" ]; then
        rm -rf "$1"
    fi
    mkdir "$1"
}

# $1: year to download
function download_data {
    local source_url="ftp://$g_remote_host/$g_remote_path/$1"
    wget -r -c -q --no-parent -P "$g_tmp_folder" "$source_url"
}

# $1: year to process
# Concatenates all files downloaded for the year into one gzipped file.
function process_data {
    local year="$1"
    local local_path="$g_tmp_folder/$g_remote_host/$g_remote_path/$year"
    local tmp_output_file="$g_tmp_folder/$year"
    local zipped_file="$g_output_folder/$year.gz"
    local file
    for file in "$local_path"/*; do
        gunzip -c "$file" >> "$tmp_output_file"
    done
    gzip -c "$tmp_output_file" >> "$zipped_file"
    echo "created file: $zipped_file"
    rm -rf "$local_path"
    rm "$tmp_output_file"
}

# $1 - start year
# $2 - finish year
function main {
    local start_year=1901
    local finish_year=1920
    local year
    if [ -n "$1" ]; then
        start_year=$1
    fi
    if [ -n "$2" ]; then
        finish_year=$2
    fi
    create_folder "$g_tmp_folder"
    create_folder "$g_output_folder"
    for year in $(seq "$start_year" "$finish_year"); do
        download_data "$year"
        process_data "$year"
    done
    rm -rf "$g_tmp_folder"
}

main "$1" "$2"
@smitakl

smitakl commented Apr 3, 2014

Thanks for the script. BTW, pub/data/noaa path doesn't seem to be valid.

@crush-157

FTP location has changed.

It is now ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

So the g_remote_host line should be:

g_remote_host="ftp.ncdc.noaa.gov";

@tomasdelvechio

Thanks! The script with the change from crush-157 works fine!

@tirru

tirru commented Aug 26, 2014

Worked perfectly.

sudo bash ncdc.sh

user@ubuntuvm:~/climateData/ncdc_data$ ls
1901.gz 1904.gz 1907.gz 1910.gz 1913.gz 1916.gz 1919.gz
1902.gz 1905.gz 1908.gz 1911.gz 1914.gz 1917.gz 1920.gz
1903.gz 1906.gz 1909.gz 1912.gz 1915.gz 1918.gz
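
A listing like the one above shows that the files exist, but not that they are intact. As a rough sketch (the function name `verify_archives` is made up for illustration and is not part of the gist), each archive could be checked with `gzip -t`, which tests integrity without extracting anything:

```shell
# Hypothetical helper, not part of the original script.
# $1: folder holding the per-year .gz files (assumed default: ncdc_data)
# Prints "ok: <file>" for intact archives, reports corrupt ones on stderr,
# and returns non-zero if at least one archive failed the check.
verify_archives() {
    local dir="${1:-ncdc_data}"
    local bad=0
    local f
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue   # the glob matched nothing
        if gzip -t "$f" 2>/dev/null; then
            echo "ok: $f"
        else
            echo "corrupt: $f" >&2
            bad=$((bad + 1))
        fi
    done
    [ "$bad" -eq 0 ]
}
```

Calling `verify_archives ncdc_data` after the script finishes would then flag any year whose archive was truncated or is otherwise unreadable.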

@rehevkor5

Download location has changed again. Also, I have introduced changes so the script does not try to run process_data on files that have not been downloaded, and prints information about failed downloads to stderr. https://gist.github.com/rehevkor5/2e407950ca687b36fc54
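
The idea described here (skip processing when a download fails, and report the failure on stderr) can be sketched roughly as follows. This is not the code from the linked gist; `handle_year` is a hypothetical wrapper, and it assumes that `download_data` and `process_data` from the script above are already defined and that `download_data` returns wget's exit status:

```shell
# Hypothetical wrapper, not taken from the linked gist.
# $1: year to download and process
# Assumes download_data and process_data from the original script are defined.
handle_year() {
    local year="$1"
    # wget exits non-zero on failure, and download_data returns that status
    # because wget is its last command.
    if ! download_data "$year"; then
        echo "download failed for year $year, skipping" >&2
        return 1
    fi
    process_data "$year"
}
```

In `main`, the two calls inside the loop would then become a single `handle_year "$year"`, so a failed year no longer produces an empty archive.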

@jithinodattu

Thank you

@sasikirankarri

Thanks for the valuable script and valuable edit by crush-157 😄

@BhavaniCP

thank you so much

@forisg

forisg commented Sep 3, 2015

Changed again to:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

@tushar-chandra-030389

Thanks

@lohithn4

lohithn4 commented Apr 7, 2016

cool great for this work

@zhounanshu

Thank you very much! :)

@ichigeki

ichigeki commented Jun 2, 2016

Thanks Alexander. This is great. 👍

@huidi7

huidi7 commented Aug 16, 2016

Thanks. Cool.

@holphi

holphi commented Jan 9, 2017

That's really helpful! Thank you guys!

@aligusnet
Author

Thanks for your comments, and special thanks to @crush-157 for the fix.

@AnayBhowmik

Thanks a lot
worked perfectly

@danieldai

Thanks, it still works

@binshi

binshi commented Nov 22, 2017

I am running on a Mac and I get:
gzip: ncdc_tmp/ftp.ncdc.noaa.gov/pub/data/noaa/1921/*.gz: No such file or directory
created file: ncdc_data/1921.gz

The above was due to having no internet connectivity. My bad.
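
The error above appears because the glob in `process_data` matches nothing when the year's folder is empty or missing, so the literal pattern is passed on to the decompression command. One possible guard, shown only as a sketch (the helper name `has_files` is invented here and is not part of the gist), is to check for downloaded files before processing:

```shell
# Hypothetical helper, not part of the original script.
# $1: folder to check
# Succeeds only if the folder exists and contains at least one regular file.
has_files() {
    [ -d "$1" ] && [ -n "$(find "$1" -type f | head -n 1)" ]
}
```

A call such as `has_files "$g_tmp_folder/$g_remote_host/$g_remote_path/$year" || return 1` at the top of `process_data` would then stop the script from writing an empty `$year.gz` when nothing was downloaded.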

@danieldai

Thanks, it works

@BinitaBharati

Great script, works beautifully. The FTP server location has also been updated in the script, so nothing needs to be edited; the script works as it is.

@yogirain

yogirain commented Apr 7, 2018

Works excellently without a change, thanks.

@engkimbs

Awesome! Thanks
