Paul Lam Quantisan

@Quantisan
Quantisan / output
Created August 24, 2012 20:55
Impatient part 2
$ cat output/rain/part-00000
A 3
Australia 1
Broken 1
California's 1
DVD 1
Death 1
Land 1
Secrets 1
This 2
Quantisan / run log
Created August 28, 2012 17:58
Impatient part 6
Paul-Lams-computer:part6 paullam$ hadoop jar target/impatient.jar data/rain.txt output/wc data/en.stop output/tfidf output/trap output/check
2012-08-28 18:52:15.457 java[16966:1903] Unable to load realm info from SCDynamicStore
12/08/28 18:52:16 INFO util.HadoopUtil: using default application jar, may cause class not found exceptions on the cluster
12/08/28 18:52:16 INFO planner.HadoopPlanner: using application jar: /Users/paullam/Dropbox/Projects/Impatient/part6/target/impatient.jar
12/08/28 18:52:16 INFO property.AppProps: using app.id: D5424D7B027EC9418FCADE8F3552429B
12/08/28 18:52:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/08/28 18:52:16 WARN snappy.LoadSnappy: Snappy native library not loaded
12/08/28 18:52:16 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/28 18:52:16 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/28 18:52:16 INFO util.Version: Concurrent, Inc - Cascading
Quantisan / output
Created August 29, 2012 21:28
Impatient part 4
$ cat output/wc/part-00000
air 1
area 4
australia 1
broken 1
california's 1
cause 1
cloudcover 1
death 1
deserts 1
# read in data (the CSV files have no header row, so column names are set by hand)
c <- read.csv("data/cartier_sample_likes.csv", header=FALSE)
names(c) <- c("user.id", "page.id")
s <- read.csv("data/swarovski_sample_likes.csv", header=FALSE)
names(s) <- c("user.id", "page.id")
p <- read.csv("data/page_labels.csv", header=FALSE)
names(p) <- c("page.id", "followers", "name")
library(stringr)
p$name <- as.factor(str_trim(as.character(p$name))) ## trim whitespace
Quantisan / run log
Created October 6, 2012 14:04
Impatient part 5
$ hadoop jar ./target/impatient.jar data/rain.txt output/wc data/en.stop output/tfidf
2012-10-06 15:00:25.269 java[1097:1903] Unable to load realm info from SCDynamicStore
12/10/06 15:00:25 INFO util.HadoopUtil: resolving application jar from found main method on: impatient.core
12/10/06 15:00:25 INFO planner.HadoopPlanner: using application jar: /Users/paullam/Dropbox/Projects/Impatient/part5/./target/impatient.jar
12/10/06 15:00:25 INFO property.AppProps: using app.id: 63CBE2FEBFE8177789403D9EA7C81366
12/10/06 15:00:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/10/06 15:00:25 WARN snappy.LoadSnappy: Snappy native library not loaded
12/10/06 15:00:25 INFO mapred.FileInputFormat: Total input paths to process : 1
12/10/06 15:00:25 INFO mapred.FileInputFormat: Total input paths to process : 1
Quantisan / batch-import
Created October 15, 2012 10:21
neo4j batch import benchmark
Paul-Lams-computer:batch-import paullam$ java TestDataGenerator
Creating 7500000 and 191215478 Relationships took 146 seconds.
Paul-Lams-computer:batch-import paullam$ head nodes.csv
Node Rels Property Counter:int
0 12 TEST 0
1 14 TEST 1
2 25 TEST 2
3 28 TEST 3
4 4 TEST 4
5 7 TEST 5
Quantisan / variety-output
Created October 18, 2012 17:34
mongo schema analyzer
$ mongo twitter --eval "var collection = 'energy'" variety.js
MongoDB shell version: 2.2.0
connecting to: twitter
Variety: A MongoDB Schema Analyzer
Version 1.2.1, released 29 July 2012
Using limit of 582
Using maxDepth of 99
creating results collection: energyKeys
removing leaf arrays in results collection, and getting percentages
{ "_id" : { "key" : "_id" }, "value" : { "type" : "ObjectId" }, "totalOccurrences" : 582, "percentContaining" : 100 }
Quantisan / coin.R
Last active December 10, 2015 22:29
illustrating statistical type I and II errors
test.flips <- function(N, A.prop=0.5, B.prop=0.5, attr="p.value") {
  heads.A <- rbinom(1, N, A.prop)
  heads.B <- rbinom(1, N, B.prop)
  test <- prop.test(c(heads.A, heads.B), n=c(N, N), alternative="two.sided")
  return(as.numeric(test[attr]))
}
## vary number of flips
N <- seq(1, 1001, by=10)
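The function above flips two coins N times each and tests whether their proportions of heads differ. A rough Python sketch of the same idea follows. It assumes a plain pooled two-proportion z-test in place of R's `prop.test` (which applies a continuity correction by default), so its p-values will not match R's exactly. The names `two_prop_p_value` and `flips_p_value` are illustrative and do not come from the gist.

```python
import math
import random


def two_prop_p_value(heads_a, heads_b, n):
    """Two-sided p-value from a pooled two-proportion z-test.

    Assumption: no continuity correction, unlike R's prop.test default.
    """
    pooled = (heads_a + heads_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        # Both samples are all heads or all tails: no evidence of a difference.
        return 1.0
    z = (heads_a - heads_b) / (n * se)
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def flips_p_value(n, a_prop=0.5, b_prop=0.5, rng=random):
    """Flip coin A and coin B n times each and return the test's p-value."""
    heads_a = sum(rng.random() < a_prop for _ in range(n))
    heads_b = sum(rng.random() < b_prop for _ in range(n))
    return two_prop_p_value(heads_a, heads_b, n)
```

With two fair coins, the share of repeated runs where `flips_p_value(n) < 0.05` estimates the type I error rate. With `a_prop != b_prop`, the share of runs where the p-value stays at or above 0.05 estimates the type II error rate for that sample size.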
Quantisan / right-join-recent.clj
Last active December 21, 2015 03:08
Cascalog pattern to right join only the most recent record to left
;; TEST
(deftest join-right-recent-test
  (let [left (memory-source-tap ["?timestamp" "?uscc" "?x"]
                                [[1000 "u1" "AAA"]
                                 [2000 "u2" "BBB"]])
        right (memory-source-tap ["?timestamp" "?uscc" "?y" "?z"]
                                 [[500 "u1" "a1" "a2"]
                                  [800 "u1" "b1" "b2"]
                                  [1100 "u1" "c1" "c2"]])]
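The expected output of this test is cut off in the preview, so the exact join semantics are not visible here. One plausible reading, sketched in Python under that assumption: for each left row, attach the right row with the same `?uscc` whose timestamp is the latest one not after the left row's timestamp, and pad with `None` when no such row exists. The function name `join_most_recent` and the `None` padding are illustrative choices, not the gist's Cascalog code.

```python
def join_most_recent(left, right):
    """Join each left row to the most recent matching right row.

    left rows:  (timestamp, key, x)
    right rows: (timestamp, key, y, z)
    Assumption: "most recent" means the latest right timestamp that is
    not after the left row's timestamp.
    """
    result = []
    for l_ts, key, x in left:
        candidates = [r for r in right if r[1] == key and r[0] <= l_ts]
        if candidates:
            _, _, y, z = max(candidates, key=lambda r: r[0])
        else:
            # No earlier right record for this key: keep the left row, pad with None.
            y, z = None, None
        result.append((l_ts, key, x, y, z))
    return result


# Same rows as the memory-source-taps in the test above.
left = [(1000, "u1", "AAA"), (2000, "u2", "BBB")]
right = [(500, "u1", "a1", "a2"), (800, "u1", "b1", "b2"), (1100, "u1", "c1", "c2")]
rows = join_most_recent(left, right)
```

Under this reading, the `u1` row picks up the record stamped 800, since 1100 is later than the left timestamp, and `u2` has no match.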
Quantisan / field.clj
Created August 20, 2013 12:49
Cascalog field name manipulation convenience methods
(ns etl.field
  "Cascalog field name and variable functions."
  (:refer-clojure :exclude (replace))
  (:use [clojure.string :only (replace)]))

(defn replace-field [coll match replacement]
  (let [idx (.indexOf coll match)]
    (assoc coll idx replacement)))
(defn- agg-replace [suffix]