Skip to content

Instantly share code, notes, and snippets.

View erasmas's full-sized avatar
🇺🇦

Dmytro Kobza erasmas

🇺🇦
View GitHub Profile
@erasmas
erasmas / WriteToParquetExample.java
Last active March 29, 2017 05:47
Cascalog workflow to copy data from CSV to Parquet. How do I fix this so that schema fields are not prepended with '?' ?
/**
* Same workflow but using Cascading, output fields in Parquet file are obviously fine and not prepended with '?'
*/
package cascading.sandbox;
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
You have an array of integers, and for each index you want to find the product of every integer except the integer at that index.
Write a function get_products_of_all_ints_except_at_index() that takes an array of integers and returns an array of the products.
For example, given:
[1, 7, 3, 4]
your function would return:
@erasmas
erasmas / gist:eaff8ee6f9feb7700622
Created November 12, 2014 16:54
Using aggregations in Cascalog
(def test-tap [["a" "x1" 0 1 0]
["a" "x2" 1 0 1]
["b" "bar" 0 1 0]
["b" "foo" 1 1 0]])
(defaggregatefn dosum
([] 0)
([state val] (+ state val))
([state] [state]))
@erasmas
erasmas / gist:dd94da1404f46abdcdca
Last active August 29, 2015 14:09
Collecting unique values from multiple columns in Cascalog. https://groups.google.com/forum/#!topic/cascalog-user/CjpApzUiwHw
(defparallelagg collect-set
:init-var (mapfn [s] #{s})
:combine-var into
:present-var identity)
(let [set->string (mapfn [s1 s2] (clojure.string/join "," (clojure.set/union s1 s2)))]
(??<- [?id ?fruits]
([["1" "banana" "grape"]
["1" "apple" "apple"]
["1" "apple" "lemon"]
(defn- transform-values [parse-tree values-map]
"Replaces all expressions in parsed tree with values from a given map."
(loop [loc (zip/vector-zip parse-tree)]
(if (zip/end? loc)
(zip/root loc)
(if (zip/branch? loc)
(let [id (last (zip/children loc))]
(if (contains? values-map id)
(recur (zip/next (zip/replace loc (zip/node [(get values-map id)]))))
(recur (zip/next (zip/edit loc #(into [] (butlast %)))))))
@erasmas
erasmas / project.clj
Created December 17, 2014 11:18
project.clj example for Cascalog app
(defproject myapp "0.1.0-SNAPSHOT"
:license {:name "Eclipse Public License"
:url "http://www.eclipse.org/legal/epl-v10.html"}
:repositories [["conjars.org" "http://conjars.org/repo"]]
:dependencies [[org.clojure/clojure "1.6.0"]]
:aot [myapp.core]
:main myapp.core
:profiles {:provided {:dependencies [[org.apache.hadoop/hadoop-client "2.4.0"]
[org.apache.hadoop/hadoop-mapreduce-client-core "2.4.0"]]}
@erasmas
erasmas / gist:cbc348b3c95d961a9b19
Created December 26, 2014 15:20
Why combined query doesn't have :available-fields? #cascalog
(let [source1 [[1 2 3]]
source2 [[4 5 6]]
fields ["?a" "?b" "?c"]
query (combine
(<- fields
(source1 :>> fields))
(<- fields
(source2 :>> fields)))]
(:available-fields query))
=> nil
@erasmas
erasmas / gist:b8002b78b2dd51e183ed
Created December 29, 2014 09:43
List all files in HDFS given a glob pattern
(let [fs (FileSystem/getLocal (Configuration.))
path (Path. "/data/in/*.txt.gz")]
(map #(.getPath %) (.globStatus fs path)))
@erasmas
erasmas / wordcount.clj
Created January 16, 2015 10:41
WordCount in Cascalog
(ns cascalog-class.core
(:require [cascalog.api :refer :all]
[cascalog.ops :as c]))
(defmapcatop split
[^String sentence]
(.split sentence "\\s+"))
(def -main
(?<- (stdout)
2015-02-04 12:47:45,300 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:initialize(581)) - Using ResourceCalculatorProcessTree : null
2015-02-04 12:47:45,327 INFO [LocalJobRunner Map Task Executor #0] io.MultiInputSplit (MultiInputSplit.java:readFields(161)) - current split input path: file:/tmp/staging/data/staging-path/part-00000
2015-02-04 12:47:45,328 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:updateJobWithSplit(462)) - Processing split: cascading.tap.hadoop.io.MultiInputSplit@29033afe
2015-02-04 12:47:45,333 WARN [LocalJobRunner Map Task Executor #0] io.MultiInputFormat (Util.java:retry(768)) - unable to get record reader, but not retrying
java.io.IOException: file:/tmp/staging/data/staging-path/part-00000 not a SequenceFile
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1850)
at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1759)
a