Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.
This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.
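For reference, the two queries look like this (mytable and mycolumn are just stand-ins for whatever table and column you're testing against):

-- approximate distinct count, HyperLogLog-based
select NDV(mycolumn) from mytable

-- exact distinct count, potentially much slower on high-cardinality columns
select count(distinct mycolumn) from mytable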
It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.
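If you want to see what fnv_hash produces first, a quick peek like this (again, the table and column names are placeholders) shows the raw values, which come back as signed 64-bit integers:

select
  mycolumn,
  fnv_hash(mycolumn) h
from
  mytable
limit 5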
HyperLogLog hashes each value it sees, and then assigns it to a bucket based on the low-order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:
select
(fnv_hash(mycolumn) & 1023) bucket
from
mytable
Here's some sample output:
+--------+
| bucket |
+--------+
| 837    |
| 566    |
| 642    |
| 674    |
| 486    |
+--------+
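To make the masking concrete: 1023 is ten 1-bits in binary, so & 1023 keeps just the low 10 bits of the hash, which always gives a bucket number between 0 and 1023 (for a non-negative value that's the same as taking it modulo 1024). You can check with an arbitrary literal:

select 123456789 & 1023 bucket
-- gives 277, the same as 123456789 % 1024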
HyperLogLog's estimate is based on the number of leading zero bits in the hash; it needs this value (plus one) for each bucket. It assumes an unsigned 32-bit hash, whereas fnv_hash is giving us a signed 64-bit value, so we'll first mask by 2^32-1 to keep just the low 32 bits. We can use the base 2 logarithm to find the position of the highest set bit, then subtract from 32 to get the leading zero count plus one. Let's call that value z:
select
(fnv_hash(mycolumn) & 1023) bucket,
(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
mytable
+--------+---+
| bucket | z |
+--------+---+
| 599    | 1 |
| 574    | 2 |
| 43     | 1 |
| 360    | 3 |
| 644    | 3 |
+--------+---+
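To spell out how one of those z values comes about: the & 4294967295 (that's 2^32 - 1) keeps only the low 32 bits of the hash. Suppose the masked value were 600000000 (an arbitrary number, just for illustration). Its highest set bit is bit 29, since 2^29 ≤ 600000000 < 2^30, so floor(log2(600000000)) = 29 and z = 32 - 29 = 3: two leading zero bits in the 32-bit value, plus one. You can confirm it directly:

select 32 - floor(log2(600000000)) z
-- gives 3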
Actually, all we care about is the maximum value of z for each bucket, so we can group by bucket and use max:
select
(fnv_hash(mycolumn) & 1023) bucket,
max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
mytable
group by
bucket
+--------+----+
| bucket | z  |
+--------+----+
| 283    | 22 |
| 977    | 17 |
| 630    | 16 |
| 208    | 15 |
| 315    | 20 |
+--------+----+
The estimate itself is derived from the harmonic mean of these bucket values. We can move the bucket creation to a nested subquery, and then use the outer query to sum up the buckets and multiply by some constants (this may seem a bit magical and arbitrary; maybe at some point I'll edit this gist to explain, but for now I'll just wave my hands and say Because Math).
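For the curious, here's my reading of where those constants come from; this is just the standard HyperLogLog estimator, mapped onto the query below. With m = 1024 buckets and M_j the max z for bucket j, the estimate is

E = \alpha_m \cdot m^2 \Big/ \sum_{j=1}^{m} 2^{-M_j}, \qquad \alpha_{1024} = \frac{0.7213}{1 + 1.079/1024} \approx 0.721

The sum(pow(2, z*-1)) term covers the buckets that actually show up in the subquery, and the + 1024 - count(*) term adds 2^0 = 1 for each bucket that never appeared, since an empty bucket has M_j = 0.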
select
floor((0.721 * 1024 * 1024) / (sum(pow(2,z*-1)) + 1024 - count(*))) estimate
from
(select
(fnv_hash(mycolumn) & 1023) bucket,
max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from mytable
group by bucket) bucket_values
And... it gives a reasonable-looking answer! In around 46 seconds: twice as slow as the built-in NDV, but still perfectly respectable.
+----------+
| estimate |
+----------+
| 94485825 |
+----------+
I've just come across this post. It would be interesting to know the structure and storage format of the table used in the original test.
I've run a few tests myself with a table in Parquet format, and even though NDV was faster in some cases, I could not reproduce the huge difference in run times that avibryant reported.