Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.
This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn
, using NDV(mycolumn)
got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn)
took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.
It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash
function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.
HyperLogLog hashes each value it sees, and then assigns them to a bucket based on the low order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:
select