- Cube is awesome: streaming dashboards over timestamped event data
- Etsy says so
- flexible, pretty, and streaming: it meets our needs
- what's so great? the split between metrics and events
- an event is a raw, timestamped record of something that happened
- a metric is a calculation over events, cached per time step -- see the sketch just below
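The event/metric split is the heart of Cube's design: you pour raw events into the collector, and the evaluator answers metric queries by computing over those events and caching the result for each time step. Here is a minimal Ruby sketch of that round trip, assuming Cube's default local setup (collector on port 1080, evaluator on port 1081); the endpoint paths, expression syntax, and step values are assumptions based on Cube's documented defaults, not something to copy blindly.

    require 'net/http'
    require 'uri'
    require 'json'
    require 'time'

    # 1. An event: an immutable, timestamped blob of data, POSTed to the collector.
    event = {
      type: 'request',                                  # event type, roughly a table name
      time: Time.now.utc.iso8601,                       # when it happened
      data: { path: '/search', duration_ms: 42 }        # arbitrary JSON payload
    }
    Net::HTTP.post(
      URI('http://localhost:1080/1.0/event/put'),       # assumed default collector endpoint
      [event].to_json,
      'Content-Type' => 'application/json'
    )

    # 2. A metric: a calculation over events, cached per time step by the evaluator.
    metric_uri = URI('http://localhost:1081/1.0/metric') # assumed default evaluator endpoint
    metric_uri.query = URI.encode_www_form(
      expression: 'sum(request)',                       # count of 'request' events
      start:      (Time.now - 3600).utc.iso8601,
      stop:       Time.now.utc.iso8601,
      step:       300_000                               # 5-minute buckets, in milliseconds
    )
    puts Net::HTTP.get(metric_uri)                      # one cached value per step

Because each step's value is cached, re-running the same metric query is cheap; that is what "metrics are calculation caches" is getting at.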
    package com.infochimps.kafka.consumers;

    import java.io.IOException;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
Assuming these goals, from most to least important:
- that you, having finally learned (mostly) how to switch inputs, continue to have a system whose behavior you can predict
- reasonable-quality sound in den and living room ...
- ... while playing CDs
- ... while playing radio
- ---- line of essential ^^^ vs. important vvv ---
- ... while playing songs from computer
From this Language Log post about the mysterious language on two postcards from the 1910s --
    [core]
      excludesfile = ~/.gitignore
      editor = nano
      askpass = /Users/flip/bin/git-password
      # pager = less -FRSX
      # whitespace = fix,-indent-with-non-tab,trailing-space,cr-at-eol
    [credential]
      helper = osxkeychain
    [color]
      diff = auto
    require 'configliere' ; Settings.use :commandline
    require 'gorillib'
    require 'gorillib/data_munging'
    require 'pry'

    Settings.define :data_root, default: 's3n://bigdata.chimpy.us', description: "directory root for data to process"
    Settings.resolve!

    Pathname.register_paths(
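Once the settings are resolved and path handles registered (the register_paths call is cut off above), downstream code refers to everything by name rather than by hard-coded paths. A small usage sketch follows; it assumes configliere's hash-style Settings access and gorillib's Pathname.path_to helper, and the :data handle is purely illustrative.

    # Illustrative usage only; the :data handle stands in for whatever the
    # truncated register_paths call actually sets up, e.g. something like
    #   Pathname.register_paths(data: [Settings[:data_root], 'data'])
    puts Settings[:data_root]                        # 's3n://bigdata.chimpy.us' unless overridden with --data_root=...
    puts Pathname.path_to(:data, 'wikipedia_pageviews')
    # would expand the handle into a full path under the data root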
    #!/usr/bin/env ruby
    warn [ARGV.inspect, $0]
    WULIGN_VERSION = "1.0"
    USAGE = %Q{
    # h1. wulign -- format a tab-separated file as aligned columns
    #
    # wulign will intelligently reformat a tab-separated file into a tab-separated,
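The idea the usage text is describing is easy to sketch: scan every row to find the widest cell in each column, then pad each cell out to that width. This is an illustration of the technique, not wulign's actual implementation.

    # Sketch of column alignment, not wulign's real code: pad each cell to the
    # width of the widest cell seen in that column.
    rows   = ARGF.readlines.map { |line| line.chomp.split("\t", -1) }
    widths = []
    rows.each do |row|
      row.each_with_index { |cell, idx| widths[idx] = [widths[idx].to_i, cell.length].max }
    end
    rows.each do |row|
      puts row.each_with_index.map { |cell, idx| cell.ljust(widths[idx]) }.join("\t")
    end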
    {
      "fancy_chars.rb": {},
      "hello.txt|jinja": { "allinputs": true }
    }
== Overview of Datasets ==
The examples in this book use the "Chimpmark" datasets: a set of freely-redistributable datasets, converted to simple standard formats, with traceable provenance and documented schema. They are the same datasets as used in the upcoming Chimpmark Challenge big-data benchmark. The datasets are:
- Wikipedia English-language Article Corpus (wikipedia_corpus; 38 GB, 619 million records, 4 billion tokens): the full text of every English-language Wikipedia article, in ...
- Wikipedia Pagelink Graph (wikipedia_pagelinks; ) --
- Wikipedia Pageview Stats (wikipedia_pageviews; 2.3 TB, about 250 billion records (FIXME: verify num records)) -- hour-by-hour pageview counts (a parsing sketch follows this list)
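To give a feel for what one pageview record holds, here is a hedged sketch of parsing a line of the raw hourly pagecount files, whose rows carry a project code, page title, request count, and bytes transferred. The field names below are mine, and the Chimpmark-converted schema may name or arrange things differently.

    # Hedged sketch: one line of the raw hourly pagecount dumps looks roughly like
    #   en Chimpanzee 421 8056321
    # (project code, page title, request count, bytes transferred). Field names
    # here are illustrative; the converted Chimpmark schema may differ.
    PageviewRecord = Struct.new(:project, :page_title, :views, :bytes_sent)

    def parse_pageview(line)
      project, title, views, bytes = line.chomp.split(' ')
      PageviewRecord.new(project, title, views.to_i, bytes.to_i)
    end

    rec = parse_pageview('en Chimpanzee 421 8056321')
    puts [rec.page_title, rec.views].join("\t")      # => Chimpanzee  421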
    require 'formula'

    class Kindlegen < Formula
      url 'http://s3.amazonaws.com/kindlegen/KindleGen_Mac_i386_v2_5.zip'
      homepage 'http://www.amazon.com/gp/feature.html?docId=1000234621'
      md5 '8daf6956d54df8030b12ec9116945482'
      version '2.5'
      skip_clean 'bin'