Daniel Yokomizo dyokomizo

Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.

This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.

It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.

HyperLogLog hashes each value it sees, and then assigns them to a bucket based on the low order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:

select

Below is some cute code that I wrote for the OLPC Openfirmware HD Audio driver.

The firmware is allowed to hard-code knowledge of the physical design of the laptop and motherboard. This firmware code tells the audio chip explicit details such as the size and colour of each physical audio jack in the laptop. The operating system can later query the chip for this information and present it to a user, for example in an audio mixer application.

porta  config(  1/8" green left hp-out jack     )config
porta  config(  1/8" green left hp-out jack     )config
portb  config(  1/8" pink left mic-in jack      )config
portc  config(  builtin internal top mic-in no-detect other-analog )config
portd  config(  unused line-out no-detect       )config

	-- in reply to http://www.reddit.com/r/haskell/comments/21mja6/make_lllegal_state_transitions_unrepresentable/
	--
	-- We implement a tiny language with three commands: Open, Close, and Get.
	-- The first Get after an Open returns 1, the second Get returns 2, and so on.
	--
	-- Get is only valid while the state is open, and
	-- Open must always be matched by a Close.
	-- We enforce both restrictions via the type system.
	--
	-- There are two valid states: Opened and Closed.

	(* This is a demonstration of the use of the SML module system to
	encode (Generalized Algebraic Datatypes) GADTs via Church
	encodings. The basic idea is to use the Church encoding of GADTs in
	System Fomega and translate the System Fomega type into the module
	system. As I demonstrate below, this allows things like the
	singleton type of booleans, and the equality type, to be
	represented.

	This was inspired by Jon Sterling's blog post about encoding proofs
	in the SML module system:

	(* It is not possible in Standard ML to construct an identity type (or any other
	* indexed type), but that does not stop us from speculating. We can specify with
	* a signature equality should mean, and then use a functor to say, "If there
	* were a such thing as equality, then we could prove these things with it."
	* Likewise, we can use the same trick to define the booleans and their
	* induction principle at the type-level, and speculate what proofs we could
	* make if we indeed had the booleans and their induction principle.
	*
	* Functions may be defined by asserting their inputs and outputs as
	* propositional equalities in a signature; these "functions" do not compute,

	open import Data.Nat
	open import Data.String
	open import Function
	open import LabelledTypes
	module LabelledAdd1 where

	----------------------------------------------------------------------

	add : (m n : ℕ) → ⟨ "add" ∙ m ∙ n ∙ ε ∶ ℕ ⟩
	add m n = elimℕ (λ m' → ⟨ "add" ∙ m' ∙ n ∙ ε ∶ ℕ ⟩)

	> {-# LANGUAGE Rank2Types #-}
	>
	> import Prelude hiding (id)
	> import Data.Functor.Constant (Constant(Constant), getConstant)
	> import Control.Arrow (Arrow, arr, first, Kleisli(Kleisli), runKleisli, (&&&))
	> import Control.Category ((<<<))
	> import Control.Lens (sequenceAOf)
	> import qualified Control.Lens as L
	> import qualified Control.Lens.Internal.Setter as LS
	> import Data.Profunctor (Profunctor, rmap)

	{-# LANGUAGE QuantifiedConstraints, ScopedTypeVariables #-}

	import Data.Monoid

	-- Note: fails for instances that don't need the Monoid a or Semigroup a

	pureDefault :: forall f a. (Functor f, forall a. Monoid a => Monoid (f a)) => a -> f a
	pureDefault a = a <$ (mempty :: f ())

	apDefault :: (Functor f, forall a. Semigroup a => Semigroup (f a)) => f (a -> b) -> f a -> f b

	After spending the better part of the month implementing date support
	in RethinkDB, Mike Lucy sent the team the following e-mail. It would
	have been funny, if it didn't cause thousands of programmers so much
	pain. Read it, laugh, and weep!

	-----

	So, it turns out that we're only going to support dates between the
	year 1400 and the year 10000 (inclusive), because that's what boost
	supports.

	How do we add extra information to a tree? This has been called [The
	AST Typing
	Problem](http://blog.ezyang.com/2013/05/the-ast-typing-problem/).

	After being hit with this problem in Roy's new type-inference engine,
	I tried figuring out how to represent the algorithm. I eventually
	realised that it looked like a comonadic operation. Turns out it's
	been done before but I couldn't find any complete example.

	Below is some literate Haskell to show how to use the Cofree Comonad