@aligusnet
Last active December 11, 2021 19:56
Download a weather dataset from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/) and prepare it for the examples in the book "Hadoop: The Definitive Guide" by Tom White (http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520).
Usage: ./ncdc.sh 1901 1930 # download weather datasets for the period from 1901 to 1930
#!/usr/bin/env bash

# global parameters
g_tmp_folder="ncdc_tmp"
g_output_folder="ncdc_data"
g_remote_host="ftp.ncdc.noaa.gov"
g_remote_path="pub/data/noaa"

# Recreate a folder from scratch; any existing contents are deleted.
# $1: folder_path
function create_folder {
    if [ -d "$1" ]; then
        rm -rf "$1"
    fi
    mkdir "$1"
}

# Recursively fetch one year's directory over FTP into the tmp folder.
# $1: year to download
function download_data {
    local source_url="ftp://$g_remote_host/$g_remote_path/$1"
    wget -r -c -q --no-parent -P "$g_tmp_folder" "$source_url"
}

# Concatenate all of one year's downloaded files into a single gzipped file.
# $1: year to process
function process_data {
    local year="$1"
    local local_path="$g_tmp_folder/$g_remote_host/$g_remote_path/$year"
    local tmp_output_file="$g_tmp_folder/$year"
    local zipped_file="$g_output_folder/$year.gz"
    local file
    for file in "$local_path"/*; do
        gunzip -c "$file" >> "$tmp_output_file"
    done
    gzip -c "$tmp_output_file" > "$zipped_file"
    echo "created file: $zipped_file"
    rm -rf "$local_path"
    rm "$tmp_output_file"
}

# $1 - start year (default 1901)
# $2 - finish year (default 1920)
function main {
    local start_year=1901
    local finish_year=1920
    if [ -n "$1" ]; then
        start_year="$1"
    fi
    if [ -n "$2" ]; then
        finish_year="$2"
    fi
    create_folder "$g_tmp_folder"
    create_folder "$g_output_folder"
    local year
    for year in $(seq "$start_year" "$finish_year"); do
        download_data "$year"
        process_data "$year"
    done
    rm -rf "$g_tmp_folder"
}

main "$@"
@smitakl

smitakl commented Apr 3, 2014

Thanks for the script. BTW, pub/data/noaa path doesn't seem to be valid.

@crush-157

FTP location has changed.

It is now ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

So line 7 should be:

g_remote_host="ftp.ncdc.noaa.gov";
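
Since the download location has moved more than once, one option is to let the host and path be overridden from the environment instead of editing the script each time. The following is only a sketch: the variable names NCDC_HOST and NCDC_PATH and the helper build_source_url are hypothetical and do not appear in the gist.

```bash
#!/usr/bin/env bash
# Hypothetical variant of the global parameters. NCDC_HOST and NCDC_PATH are
# assumed environment variable names, not something the original script reads.
g_remote_host="${NCDC_HOST:-ftp.ncdc.noaa.gov}"
g_remote_path="${NCDC_PATH:-pub/data/noaa}"

# Hypothetical helper: print the URL that download_data would fetch.
# $1: year
function build_source_url {
    printf 'ftp://%s/%s/%s\n' "$g_remote_host" "$g_remote_path" "$1"
}
```

With this in place, a changed location could be tried with something like `NCDC_HOST=some.other.host ./ncdc.sh 1901 1905`, where the host name is a placeholder.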

@tomasdelvechio

Thanks! The script with the change from crush-157 works fine!

@tirru

tirru commented Aug 26, 2014

Worked perfectly.

sudo bash ncdc.sh

user@ubuntuvm:~/climateData/ncdc_data$ ls
1901.gz 1904.gz 1907.gz 1910.gz 1913.gz 1916.gz 1919.gz
1902.gz 1905.gz 1908.gz 1911.gz 1914.gz 1917.gz 1920.gz
1903.gz 1906.gz 1909.gz 1912.gz 1915.gz 1918.gz

@rehevkor5

Download location has changed again. Also, I have introduced changes so the script does not try to run process_data on files that have not been downloaded, and prints information about failed downloads to stderr. https://gist.github.com/rehevkor5/2e407950ca687b36fc54
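
For readers who do not want to follow the link, the idea described above (skip processing for years that were not downloaded, and report the failure on stderr) could look roughly like this. This is a sketch and not the code from the linked gist; the function names year_was_downloaded and process_if_downloaded are made up for illustration, and the global parameters are copied from the original script.

```bash
#!/usr/bin/env bash
# Sketch only: not taken from the linked gist.
g_tmp_folder="ncdc_tmp"
g_remote_host="ftp.ncdc.noaa.gov"
g_remote_path="pub/data/noaa"

# Hypothetical helper: succeed only if the year's folder exists and is not empty.
# $1: year
function year_was_downloaded {
    local local_path="$g_tmp_folder/$g_remote_host/$g_remote_path/$1"
    [ -d "$local_path" ] && [ -n "$(ls -A "$local_path" 2>/dev/null)" ]
}

# Hypothetical wrapper: run a command for the year, or report the failure.
# $1: year
# $2...: command to run with the year appended (process_data in the script)
function process_if_downloaded {
    local year="$1"
    shift
    if year_was_downloaded "$year"; then
        "$@" "$year"
    else
        echo "download failed for year $year, skipping" >&2
        return 1
    fi
}
```

In the main loop of the original script, `process_data $year` would then become `process_if_downloaded "$year" process_data`.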

@jithinodattu

Thank you

@sasikirankarri

Thanks for the valuable script and valuable edit by crush-157 😄

@BhavaniCP

thank you so much

@forisg

forisg commented Sep 3, 2015

Changed again to:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/

@tushar-chandra-030389

Thanks

@lohithn4

lohithn4 commented Apr 7, 2016

cool great for this work

@zhounanshu

Thank you very much! :)

@ichigeki

ichigeki commented Jun 2, 2016

Thanks Alexander. This is great. 👍

@huidi7

huidi7 commented Aug 16, 2016

Thanks. Cool.

@holphi

holphi commented Jan 9, 2017

That's really helpful! Thank you guys!

@aligusnet

Author

thanks for your comments and special thanks to @crush-157 for the fix.

@AnayBhowmik

Thanks a lot
worked perfectly

@danieldai

Thanks, it still works.

@binshi

binshi commented Nov 22, 2017

I am running on a Mac and I get:
gzip: ncdc_tmp/ftp.ncdc.noaa.gov/pub/data/noaa/1921/*.gz: No such file or directory
created file: ncdc_data/1921.gz

The above was due to having no internet connection. My bad.
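
The literal `*.gz` in that error message suggests the glob matched nothing, so the unexpanded pattern was handed to gzip and an output file was still "created". A possible guard, sketched here with a hypothetical function name (concat_year_files is not part of the gist), is to expand the glob with `nullglob` and stop early when no files are found:

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the original script.
# $1: directory expected to hold the downloaded .gz files for one year
# $2: file to append the decompressed records to
function concat_year_files {
    local dir="$1" out="$2" file
    local files=()
    shopt -s nullglob      # an unmatched glob now expands to nothing
    files=("$dir"/*.gz)
    shopt -u nullglob      # note: this turns nullglob off even if it was on before
    if [ "${#files[@]}" -eq 0 ]; then
        echo "no .gz files found in $dir" >&2
        return 1
    fi
    for file in "${files[@]}"; do
        gunzip -c "$file" >> "$out"
    done
}
```

A caller can check the return status and skip the final gzip step for that year, so no misleading "created file" line is printed when nothing was downloaded.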

@danieldai

Thanks, it works

@BinitaBharati

Great script, works beautifully. The FTP server location has also been updated in the script, so nothing needs to be edited; the script works as it is.

@yogirain

yogirain commented Apr 7, 2018

Works excellently without a change. Thanks.

@engkimbs

Awesome! Thanks
