Skip to content

Instantly share code, notes, and snippets.

@ogirginc
Created May 11, 2023 15:38
Show Gist options
  • Save ogirginc/1cdfc7bc6e6fb069b371aa762bccab51 to your computer and use it in GitHub Desktop.
Save ogirginc/1cdfc7bc6e6fb069b371aa762bccab51 to your computer and use it in GitHub Desktop.
Can you explain how HyperLogLog algorithm works with examples in Ruby? (Answered by GTP on phind.com )

The HyperLogLog (HLL) algorithm is a probabilistic data structure used for approximating the number of distinct elements in a multiset while using significantly less memory compared to exact methods Source 3. It is an extension of the earlier LogLog algorithm and derives from the 1984 Flajolet–Martin algorithm Source 3.

The main idea behind the HLL algorithm is to estimate the cardinality of a set by observing the maximum number of leading zeros in the binary representation of each number in the set Source 0. To achieve this, the following steps are taken:

  1. Apply a hash function to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset Source 3.

  2. Split the multiset into multiple subsets (buckets) Source 0.

  3. Calculate the maximum number of leading zeros in the numbers of each subset Source 3.

  4. Use a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set Source 3.

Here's an example in Ruby:

require 'digest'

# Hash function
def hash_function(element)
  Digest::SHA1.hexdigest(element.to_s).to_i(16)
end

# Extract bucket index and leading zeros count from hashed value
def extract_bucket_and_zeros(hashed_value, bucket_bits)
  bucket_mask = (1 << bucket_bits) - 1
  bucket_index = hashed_value & bucket_mask
  leading_zeros = hashed_value.bit_length - bucket_bits
  [bucket_index, leading_zeros]
end

# HyperLogLog algorithm
def hyperloglog(elements, bucket_bits)
  m = 1 << bucket_bits
  buckets = Array.new(m, 0)

  elements.each do |element|
    hashed_value = hash_function(element)
    bucket_index, leading_zeros = extract_bucket_and_zeros(hashed_value, bucket_bits)
    buckets[bucket_index] = [buckets[bucket_index], leading_zeros].max
  end

  harmonic_mean = m * (1.0 / buckets.map { |count| 1 << count }.reduce(0, :+))
  estimated_cardinality = harmonic_mean.round
  estimated_cardinality
end

# Example usage
elements = (1..10000).to_a.shuffle
bucket_bits = 10
estimated_cardinality = hyperloglog(elements, bucket_bits)
puts "Estimated cardinality: #{estimated_cardinality}"

This Ruby example demonstrates the HLL algorithm's steps: applying a hash function, splitting the multiset into buckets, calculating the maximum number of leading zeros in the numbers of each bucket, and using a harmonic mean to estimate the cardinality of the whole set. In this case, the hash function used is SHA1, but you can choose any other suitable hash function.

Keep in mind that HLL is an approximation algorithm, and its accuracy depends on the number of buckets used. The error rate is roughly 1.3/sqrt(number of buckets) Source 0. HLL can estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory Source 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment