@robinkraft
robinkraft / gist:4155947
Created November 27, 2012 18:09
generic defbufferop proof of concept for fossa partitioning
(use 'cascalog.api)
;; 20 sample tuples of the form [1 n m]: n counts down from 20, m counts down from 50
(def a (mapv vector (repeat 20 1) (range 20 0 -1) (range 50 30 -1)))
;; [[1 20 50] [1 19 49] [1 18 48] [1 17 47] [1 16 46] [1 15 45] [1 14 44] [1 13 43] [1 12 42] [1 11 41] [1 10 40] [1 9 39] [1 8 38] [1 7 37] [1 6 36] [1 5 35] [1 4 34] [1 3 33] [1 2 32] [1 1 31]]
(defn tester-func
  "Concatenates every element of every tuple in v into a single string."
  [v]
  (apply str (map (partial apply str) v)))
(defbufferop tester
robinkraft / gist:4081972
Created November 15, 2012 22:39
top 100 species in GBIF database
Sturnus vulgaris 1693733
Passer domesticus 1604252
Zenaida macroura 1497027
Junco hyemalis 1266927
Turdus merula 1254455
Picoides pubescens 1181249
Cardinalis cardinalis 1154519
Erithacus rubecula 1101981
Poecile atricapillus 1086453
Parus major 1085017
robinkraft / gist:4074703
Created November 14, 2012 20:48
query to get data for David
s3n://pailbucket/all-static-seq/all
s3n://pailbucket/all-prob-series
(<- [?mod-h ?mod-v ?sample ?line ?lat ?lon ?gadm ?vcf ?hansen ?clean-series]
    (src ?s-res ?mod-h ?mod-v ?s ?l ?prob-series)
    (static-src ?s-res ?mod-h ?mod-v ?sample ?line ?vcf ?gadm _ ?hansen _)
    (gadm->iso ?gadm :> ?iso)
    (o/clean-probs ?prob-series nodata :> ?clean-series)
robinkraft / gist:4048363
Created November 9, 2012 21:25
recipe for running fossa on EMR
# launch a 10-instance cluster (roughly $6-7/hr with spot instances)
lein emr -s 10 -t high-memory -b 0.75 -bs bsaconfig.xml
# login to cluster
ssh -i ~/.ssh/MoL-hosts.pem hadoop@<insert public DNS>
# get lein
cd bin
wget https://raw.github.com/technomancy/leiningen/stable/bin/lein
robinkraft / gist:4046470
Created November 9, 2012 15:56
problem latlons for MapofLife/fossa
5°52.5'N
7°14'35"N
7°15'36"N
7°15'27"N
7°14'47"N
7°15'27"N
7°15'27"N
12d 40m s W
12d 40m s W
1°05'43"N
robinkraft / gist:4027857
Created November 6, 2012 21:56
combine part files from multiple directories into one directory - for forma trends
(let [a (hfs-seqfile "s3n://pailbucket/all-trends1")
      b (hfs-seqfile "s3n://pailbucket/all-trends2")
      c (hfs-seqfile "s3n://pailbucket/all-trends3")
      d (hfs-seqfile "s3n://pailbucket/all-trends4")
      e (hfs-seqfile "s3n://pailbucket/all-trends-redo2")
      out-loc (hfs-seqfile "s3n://pailbucket/all-trends" :sinkmode :replace)]
  (?- out-loc (union a b c d e)))
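For comparison, here is a rough local-filesystem sketch of the same goal (getting part files from several directories into one), written in plain shell. It is not what the gist does: the query above reads the tuples and rewrites them, and as far as I know Cascalog's `union` also drops duplicate tuples, which a plain file copy does not. The directory names (`trends1`, `trends2`, `all-trends`) and the renaming scheme are made up for illustration.

```shell
#!/bin/sh
# Sketch only: gather part files from several local directories into one.
# All directory and file names below are hypothetical stand-ins.
set -eu

work=$(mktemp -d)

# Fake two job outputs, each with identically named part files.
for dir in trends1 trends2; do
  mkdir -p "$work/$dir"
  printf '%s-0\n' "$dir" > "$work/$dir/part-00000"
  printf '%s-1\n' "$dir" > "$work/$dir/part-00001"
done

# Copy every part file into one directory, prefixing each name with its
# source directory so files with the same name do not overwrite each other.
mkdir -p "$work/all-trends"
for dir in trends1 trends2; do
  for f in "$work/$dir"/part-*; do
    cp "$f" "$work/all-trends/$dir-$(basename "$f")"
  done
done

combined=$(ls "$work/all-trends" | wc -l | tr -d ' ')
echo "combined part files: $combined"
```

The prefixing step is the part that matters: every job output names its files `part-00000`, `part-00001`, and so on, so a naive copy into one directory would silently lose data.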
robinkraft / gist:3990535
Created October 31, 2012 23:08
convert raw rain timeseries from pail to stretched version in sequence file
(use 'forma.hadoop.jobs.scatter)
(require '[forma.source.rain :as rain])
(require '[forma.utils :as u])
(let [nodata -9999.0
      src (pail-tap "/mnt/hgfs/robin/delete/timeseries/" "precl" "500" "32")
      out-loc (hfs-seqfile "/mnt/hgfs/robin/delete/rain-redo")]
  (<- [?s-res ?mod-h ?mod-v ?sample ?line ?new-start ?new-rain]
      (src _ ?dc)
      (thrift/unpack ?dc :> _ ?loc ?data ?t-res _)
robinkraft / gist:3989478
Last active October 12, 2015 07:08
moving giant GBIF species occurrence dataset to S3, prepping for Hadoop.
# NB: these are two separate recipes - one for working from the
# dev machine, the other from an EC2 instance
########################
# from the dev machine #
########################
# split the gzipped occurrence data into 250 MiB chunks, then upload the chunks to S3
# (this takes about 10 hours)
split -b 250MiB occurrence_20120802.txt.gz occ.gz_
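A small sanity check of the split step, as a sketch: byte-level chunks of a gzip file are not valid gzip files on their own, but concatenating them in suffix order restores the original archive. The example below uses a tiny generated file and 1024-byte chunks instead of the real dump; `sample.txt` and the temp directory are stand-ins, and only the `occ.gz_` prefix comes from the recipe.

```shell
#!/bin/sh
# Sketch only: split a gzipped file into fixed-size chunks, then verify
# that concatenating the chunks and decompressing gives back the input.
set -eu

tmp=$(mktemp -d)

# Small stand-in for the occurrence dump.
seq 1 20000 > "$tmp/sample.txt"
gzip -c "$tmp/sample.txt" > "$tmp/sample.txt.gz"

# Same command shape as the recipe, with a much smaller chunk size.
split -b 1024 "$tmp/sample.txt.gz" "$tmp/occ.gz_"
chunks=$(ls "$tmp"/occ.gz_* | wc -l | tr -d ' ')

# The glob expands in suffix order (aa, ab, ...), which is the order
# split wrote the chunks in.
cat "$tmp"/occ.gz_* | gunzip -c > "$tmp/restored.txt"

if cmp -s "$tmp/sample.txt" "$tmp/restored.txt"; then
  roundtrip=ok
else
  roundtrip=mismatch
fi
echo "chunks: $chunks, round trip: $roundtrip"
```

Anything that consumes the uploaded chunks therefore has to reassemble them before decompressing; a single chunk other than the first cannot be gunzipped by itself.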
robinkraft / gist:3982728
Created October 30, 2012 20:18
aggregating GBIF occurrence data by scientific name
;; see https://github.com/MapofLife/gbifer
;; use with https://s3.amazonaws.com/gbifsource/occ.txt
(use 'gulo.gbif)
(in-ns 'gulo.gbif)
(defn read-occurrences
([]
robinkraft / gist:3976462
Created October 29, 2012 20:51
count rain pixels by tile
(use 'forma.hadoop.jobs.scatter)
(in-ns 'forma.hadoop.jobs.scatter)
(defn rain-count-by-tile
  [in-pail run-key]
  (let [rain-src (split-chunk-tap in-pail ["precl" run-key])]
    (<- [?mod-h ?mod-v ?count]
        (rain-src _ ?dc)
        (thrift/unpack ?dc :> _ ?loc _ _ _)
        (thrift/unpack ?loc :> _ ?mod-h ?mod-v _ _)