Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate the number of distinct values in a column, in parallel, in a fixed amount of space.
This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.
It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.
HyperLogLog hashes each value it sees, and then assigns it to a bucket based on the low-order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:
select
(fnv_hash(mycolumn) & 1023) bucket
from
mytable
Here's some sample output:
+--------+
| bucket |
+--------+
| 837    |
| 566    |
| 642    |
| 674    |
| 486    |
+--------+
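As an aside, the same hash-and-bucket step can be sketched outside SQL. Below is a minimal Python version using a hand-rolled 64-bit FNV-1a; Impala's fnv_hash may use a different FNV variant or byte handling, so the hash values (and therefore the buckets) won't necessarily match Impala's — the point is just the bitwise & with 1023.

```python
# Minimal 64-bit FNV-1a sketch. Impala's fnv_hash may use a different FNV
# variant or byte handling, so these values are illustrative only.
FNV64_OFFSET = 14695981039346656037
FNV64_PRIME = 1099511628211

def fnv1a_64(data: bytes) -> int:
    h = FNV64_OFFSET
    for b in data:
        h ^= b                                       # FNV-1a XORs first...
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF   # ...then multiplies, mod 2^64
    return h

# Bucket assignment: the low 10 bits of the hash, i.e. hash & 1023.
for value in ("alpha", "beta", "gamma"):
    h = fnv1a_64(value.encode())
    print(value, h & 1023)
```

With 1024 buckets, & 1023 and % 1024 are the same operation, since 1024 is a power of two.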
HyperLogLog's estimate is based on the number of leading zero bits in the hash; it needs this value (plus one) for each bucket. It assumes an unsigned 32-bit hash, whereas fnv_hash is giving us a signed 64-bit value, so we'll first mask by 2^32-1. We can use the base 2 logarithm to find the position of the highest non-zero bit, then get the leading-zero count plus one by subtracting from 32. Let's call that value z:
select
(fnv_hash(mycolumn) & 1023) bucket,
(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
mytable
+--------+---+
| bucket | z |
+--------+---+
| 599    | 1 |
| 574    | 2 |
| 43     | 1 |
| 360    | 3 |
| 644    | 3 |
+--------+---+
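As a sanity check on that arithmetic (in Python rather than SQL): 32 - floor(log2(x)), applied to the low 32 bits, really is the leading-zero count plus one.

```python
import math

def z_of(h: int) -> int:
    # Same arithmetic as the SQL: mask to 32 bits, then 32 - floor(log2(...)).
    masked = h & 0xFFFFFFFF
    return 32 - math.floor(math.log2(masked))

def leading_zeros_32(x: int) -> int:
    # Direct count of leading zero bits in a 32-bit value.
    return 32 - x.bit_length()

# z is always the leading-zero count plus one
for h in (0x80000000, 0x12345678, 1023, 1):
    assert z_of(h) == leading_zeros_32(h & 0xFFFFFFFF) + 1

print(z_of(0x80000000))  # top bit set, no leading zeros -> 1
print(z_of(1))           # only the lowest bit set, 31 leading zeros -> 32
```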
Actually, all we care about is the maximum value of this for each bucket, so we can group by bucket and use max:
select
(fnv_hash(mycolumn) & 1023) bucket,
max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
mytable
group by
bucket
+--------+----+
| bucket | z  |
+--------+----+
| 283    | 22 |
| 977    | 17 |
| 630    | 16 |
| 208    | 15 |
| 315    | 20 |
+--------+----+
The estimate itself is derived from the harmonic mean of these bucket values. We can move the bucket creation into a nested subquery, and then use the outer query to sum over the buckets and multiply by some constants (this may seem a bit magical and arbitrary; maybe at some point I'll edit this gist to explain, but for now I'll just wave my hands and say Because Math).
select
floor((0.721 * 1024 * 1024) / (sum(pow(2,z*-1)) + 1024 - count(*))) estimate
from
(select
(fnv_hash(mycolumn) & 1023) bucket,
max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from mytable
group by bucket) bucket_values
And... it gives a reasonable-looking answer! In around 46 seconds; twice as slow as the builtin, but still perfectly respectable.
+----------+
| estimate |
+----------+
| 94485825 |
+----------+
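To see that the final formula behaves, here's the whole pipeline as a self-contained Python sketch. It mirrors the SQL above (low 10 bits for the bucket, z from the low 32 bits, then 0.721 * 1024 * 1024 over the sum, with one term of 2^0 = 1 per empty bucket); md5 stands in for fnv_hash, so the specific estimate will differ from Impala's, but it should land within a few percent of the true distinct count.

```python
import hashlib
import math

M = 1024  # number of buckets, as in the SQL above

def hll_estimate(values) -> int:
    # max z seen per bucket, like the "group by bucket" / max(...) query
    max_z = {}
    for v in values:
        # md5 stands in for fnv_hash; any well-mixed 64-bit hash works here
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        bucket = h & (M - 1)          # low 10 bits, i.e. & 1023
        masked = h & 0xFFFFFFFF       # low 32 bits
        if masked == 0:               # log2(0) is undefined; vanishingly rare
            continue
        z = 32 - math.floor(math.log2(masked))
        if z > max_z.get(bucket, 0):
            max_z[bucket] = z
    # sum(2^-z) over seen buckets, plus 2^0 = 1 for each empty bucket
    denom = sum(2.0 ** -z for z in max_z.values()) + (M - len(max_z))
    return math.floor(0.721 * M * M / denom)

true_count = 100_000
print(hll_estimate(range(true_count)))  # typically within a few percent of 100000
```

With 1024 buckets the expected relative error is around 1.04/sqrt(1024), i.e. roughly 3%, which matches the ~5.5% miss in the Impala run above.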
~11.5 billion records in an Impala nested-array (.amft) Parquet table.
Test 1:
select ndv(concat_ws('|', part_code, cast(activity_dollars as string)))
from fact.amft
got: 9,460,217
in: 5.4 minutes.
Test 2:
select COUNT(DISTINCT part_code, activity_dollars)
from fact.amft
got: 9,298,947
in: 5.1 minutes.
So for Impala, the above case ran faster with the exact count distinct.
ps. Reran Test 1 again (after Test 2 was done) to eliminate disk-caching effects etc.; got 5.6 minutes - still slower.
I think it might have something to do with the fact that it is a Parquet table, so the count distinct doesn't have to read the whole data set; it just needs to read metadata from row chunks / column groups. To use ndv() I had to pass it a single argument, converting the columns to a string, and then Impala actually had to read the whole dataset.