Last active
October 12, 2015 07:08
Moving the giant GBIF species occurrence dataset to S3 and prepping it for Hadoop.
# NB: these are two separate recipes - one for working from the
# dev machine, the other from an EC2 instance

########################
# from the dev machine #
########################

# split the gzipped occurrence data into 250 MiB chunks, upload chunks to S3
# this takes about 10 hours
split -b 250MiB occurrence_20120802.txt.gz occ.gz_
s3cmd put occ.gz_* s3://gbifsource/occurrence/

########################
# from an EC2 instance #
########################

# setup: make /mnt writable, then work in /mnt/data
cd /mnt/
sudo chmod 777 .
mkdir data; cd data

# download, reconstruct and decompress gzip file
s3cmd get s3://gbifsource/occurrence/occ.gz*
# match only the chunks (occ.gz_aa, ...) so the rebuilt occ.gz is not deleted
cat occ.gz_* > occ.gz
rm occ.gz_*
gunzip occ.gz

# split text file into 250 MiB chunks on line boundaries (-C), upload chunks to S3
split -C 250MiB occ occ_
s3cmd put occ_* s3://gbifsource/occurrence-text/

# all this stuff on EC2 takes 3-4 hours total on a large instance
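
The EC2 recipe above writes the reassembled archive and then the full decompressed text to disk before splitting. As a possible alternative, here is a minimal sketch that streams the chunks through `gunzip` straight into `split`, plus a checksum helper for confirming that the chunks still add up to the original archive. The function names `stream_split` and `chunks_md5` and the size argument are illustrative and not part of the original recipe; the sketch assumes GNU coreutils (`split -C`, `md5sum`) and has not been timed against the real dataset.

```shell
# Sketch only: helper names and arguments are hypothetical.
# Assumes GNU coreutils (split -C, md5sum) and gzip are installed.

# Concatenate the gzip chunks, decompress on the fly, and split the text
# into line-aligned pieces without writing the intermediate files to disk.
#   $1 = prefix of the gzip chunks    (e.g. occ.gz_)
#   $2 = prefix for the text chunks   (e.g. occ_)
#   $3 = maximum size per text chunk  (default 250M, i.e. 250 MiB)
stream_split() {
    chunk_prefix=$1
    out_prefix=$2
    chunk_size=${3:-250M}
    # The glob expands in sorted order, matching split's aa, ab, ... suffixes.
    cat "$chunk_prefix"* | gunzip -c | split -C "$chunk_size" - "$out_prefix"
}

# Print the MD5 of the concatenated chunks. Compare it with the MD5 of the
# original archive taken before splitting to confirm nothing was lost.
#   $1 = prefix of the gzip chunks (e.g. occ.gz_)
chunks_md5() {
    cat "$1"* | md5sum | cut -d' ' -f1
}

# Example (not run here):
#   md5sum occurrence_20120802.txt.gz      # on the dev machine, before split
#   chunks_md5 occ.gz_                     # on EC2, after s3cmd get
#   stream_split occ.gz_ occ_ 250M
#   s3cmd put occ_* s3://gbifsource/occurrence-text/
```

The trade-off is that streaming leaves no decompressed `occ` file behind, so any later step that needs the whole text file would have to rebuild it from the `occ_` chunks.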