HyperLogLog is a data structure for estimating the cardinality of large data sets with high accuracy while using very little memory. This implementation includes the original algorithm by Flajolet et al. as well as the HyperLogLog++ algorithm by Heule et al. Refer to the 'References' section for blog posts and papers that explain the inner workings of HyperLogLog.
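To make the idea concrete, here is a minimal sketch of the original HyperLogLog estimator in Python. It is an illustration only and not this project's Java code: the class name `TinyHLL`, the helper `hash64`, and the use of truncated SHA-1 as the hash function are all assumptions made for the example, and it omits the sparse encoding and bias-correction tables of HyperLogLog++.

```python
import hashlib
import math


def hash64(value: str) -> int:
    """Return a 64-bit hash of `value` (truncated SHA-1, an illustrative choice)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TinyHLL:
    """Toy HyperLogLog with 2**p registers (hypothetical class, not the library API)."""

    def __init__(self, p: int = 14) -> None:
        if not 4 <= p <= 16:
            raise ValueError("p must be between 4 and 16 (both inclusive)")
        self.p = p
        self.m = 1 << p                     # number of registers
        self.registers = bytearray(self.m)  # one byte per register, all zero

    def add(self, value: str) -> None:
        h = hash64(value)
        width = 64 - self.p                 # bits left after the register index
        index = h >> width                  # top p bits select a register
        rest = h & ((1 << width) - 1)
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits
        # (width + 1 when they are all zero).
        rank = width - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self) -> int:
        m = self.m
        if m == 16:
            alpha = 0.673
        elif m == 32:
            alpha = 0.697
        elif m == 64:
            alpha = 0.709
        else:
            alpha = 0.7213 / (1.0 + 1.079 / m)
        # Raw estimate: normalized harmonic mean of 2**register.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return round(m * math.log(m / zeros))
        return round(raw)


hll = TinyHLL(p=14)
for i in range(20000):
    hll.add(f"value-{i}")
print(hll.count())  # close to 20000; the exact value depends on the hash
```

Adding the same values a second time leaves the estimate unchanged, because each register only keeps the maximum rank it has seen.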
git clone https://github.com/prasanthj/hyperloglog.git hyperloglog
cd hyperloglog
mvn package -DskipTests
After running mvn package -DskipTests, run hll to display the usage options.
Example usage: hll -n 1000 <OR> hll -f /tmp/input.txt <OR> hll -d -i /tmp/out.hll
usage: HyperLogLog
-b,--enable-bitpacking <arg>   enable bit-packing of registers. default = true
-c,--no-bias <arg>             use bias correction table (no-bias algorithm). default = true
-d,--deserialize               deserialize hyperloglog from file. specify -i for input file
-e,--encoding <arg>            specify encoding to use (SPARSE or DENSE). default = SPARSE
-f,--file <arg>                specify file to read input data
-i,--input-file <arg>          specify input file for deserialization
-n,--num-random-values <arg>   number of random values to generate
-o,--output-file <arg>         specify output file for serialization
-p,--num-register-bits <arg>   number of bits from hashcode used as register index between 4 and 16 (both inclusive). default = 14
-r,--relative-error            print relative error calculation
-s,--serialize                 serialize hyperloglog to file. specify -o for output file
-t,--standard-in               read data from standard in
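The -p option trades memory for accuracy: p index bits give 2^p registers, and the HyperLogLog paper by Flajolet et al. puts the standard error of the estimate at roughly 1.04 / sqrt(2^p). The short Python sketch below tabulates this for a few values of p. The function names are hypothetical, and the figures describe the theoretical algorithm rather than measured behaviour of this implementation.

```python
import math


def register_count(p: int) -> int:
    """Number of registers for p index bits (hypothetical helper)."""
    if not 4 <= p <= 16:
        raise ValueError("p must be between 4 and 16 (both inclusive)")
    return 1 << p


def standard_error(p: int) -> float:
    """Theoretical standard error of HyperLogLog, 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(register_count(p))


for p in (4, 10, 14, 16):
    print(f"p={p:2d}  registers={register_count(p):6d}  "
          f"standard error={standard_error(p) * 100:.2f}%")
```

With the default p = 14 this gives 16384 registers and a standard error of about 0.81%.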
Test with 'n' random numbers
#./hll -r -n 20000
Actual count: 20000
Encoding: DENSE, p: 14, estimatedCardinality: 19993
Relative error: 0.034999847%
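The relative error figures in these sample runs are consistent with (actual - estimated) / actual, expressed as a percentage, so a positive value means the estimate is too low and a negative value means it is too high. The Python sketch below reproduces that arithmetic; the function name is hypothetical, and the formula is inferred from the printed numbers rather than taken from the source code.

```python
def relative_error_percent(actual: int, estimated: int) -> float:
    """Signed relative error in percent (formula inferred from the sample output)."""
    if actual == 0:
        raise ValueError("actual count must be non-zero")
    return (actual - estimated) / actual * 100.0


# Values taken from the sample runs in this README.
print(relative_error_percent(20000, 19993))          # about 0.035
print(relative_error_percent(100000000, 100069607))  # about -0.0696
print(relative_error_percent(84, 84))                # 0.0
```

The tool prints 0.034999847% rather than 0.035% for the first run, which suggests the calculation is done in single-precision floating point.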
Test with input file
#./hll -r -f /etc/passwd
Actual count: 84
Encoding: SPARSE, p: 14, estimatedCardinality: 84
Relative error: 0.0%
Test serialization
#./hll -r -n 100000000 -s -o /tmp/out.hll
Actual count: 100000000
Encoding: DENSE, p: 14, estimatedCardinality: 100069607
Relative error: -0.069606304%
Serialized hyperloglog to /tmp/out.hll
Serialized size: 10248 bytes
Serialization time: 20 ms
#./hll -r -f /etc/passwd -s -o /tmp/out.hll
Actual count: 84
Encoding: SPARSE, p: 14, estimatedCardinality: 84
Relative error: 0.0%
Serialized hyperloglog to /tmp/out.hll
Serialized size: 337 bytes
Serialization time: 5 ms
Test deserialization
#./hll -d -i /tmp/passwd.hll
Encoding: SPARSE, p: 14, estimatedCardinality: 84
Count after deserialization: 84
Deserialization time: 42 ms
Test disabling bit-packing of registers
#./hll -r -n 10000000 -b false -s -o /tmp/out.hll
Actual count: 10000000
Encoding: DENSE, p: 14, estimatedCardinality: 10052011
Relative error: -0.52011013%
Serialized hyperloglog to /tmp/out.hll
Serialized size: 16392 bytes
Serialization time: 27 ms
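The two DENSE serialized sizes in these runs (10248 bytes with bit-packing, 16392 bytes without) can be explained by simple arithmetic. With p = 14 there are 16384 registers; one byte per register gives 16384 bytes, and 5 bits per register gives 10240 bytes, leaving 8 bytes of overhead in both cases. The Python sketch below checks this. The 5-bit width and the 8-byte overhead are assumptions inferred from the reported sizes, not values confirmed from the serialization code, and the function name is hypothetical.

```python
def dense_size_bytes(p: int, bits_per_register: int, overhead_bytes: int = 8) -> int:
    """Estimated serialized size of a dense sketch.

    bits_per_register and overhead_bytes are assumptions inferred from the
    sample output, not documented properties of the file format.
    """
    registers = 1 << p
    payload = (registers * bits_per_register + 7) // 8  # round up to whole bytes
    return payload + overhead_bytes


print(dense_size_bytes(14, bits_per_register=5))  # 10248, matches the bit-packed run
print(dense_size_bytes(14, bits_per_register=8))  # 16392, matches the run with -b false
```

Under these assumptions, bit-packing saves about 37% of the serialized size at p = 14.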
Test reading from standard in
#cat /etc/passwd | ./hll -r -t
Actual count: 84
Encoding: SPARSE, p: 14, estimatedCardinality: 84
Relative error: 0.0%
Bug fixes and improvements are welcome! Please fork the project and send a pull request on GitHub, or report issues at https://github.com/prasanthj/hyperloglog/issues
Apache licensed.
References
[2] http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
[3] http://research.neustar.biz/tag/flajolet-martin-sketch/
[4] http://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/