If you've been following along at home the newer statistics developments,
you might have heard about this new
State of The Art Cardinality Estimation Algorithm called HyperLogLog. This
technique is now available for PostgreSQL in the extension postgresql-hll
available at https://github.com/aggregateknowledge/postgresql-hll and soon
to be in debian.

How to Compute Cardinality?
Installing postgresql-hll
It's as simple as CREATE EXTENSION hll; really, even if to get there you
must have installed the package on your system. We did some packaging work
for debian and the result should appear soon in a distro near you.
Then you also need to keep your data in some table, straight from the documentation we can use that schema:
-- Create the destination table CREATE TABLE daily_uniques ( DATE DATE UNIQUE, users hll );
Then to add some data for which you want to know the cardinality of, it's as
simple as in the following UPDATE statement:
UPDATE daily_uniques SET users = hll_add(users, hll_hash_text('123.123.123.123')) WHERE date = current_date;
So in our example what you see is that we want to decipher how many unique
IP addresses we saw, and we do that by first creating a hash of that source
data then calling hll_add() with the current value and the hash result.
The current value must be initialized using hll_empty().
Concurrency
The most awake readers among you have already spotted that: using an UPDATE
on the same row over and over again is a good recipe to kill any form of
concurrency, so you don't want to do that on your production setup unless
you don't care about those UPDATE waiting piling up in your system.
The idea is then to fill-in a queue of updates and asynchronously update the
daily_uniques table from that queue, possibly using the hll_add_agg
aggregate that the extension provides, so that you do only one update per
batch of values to process.
∅: Empty Set and NULL

Yes there's a unicode entry for that, ∅
Now, what happens when the batch of new unique values you want to update
from is itself empty? Well I would have expected hll_add_agg over an empty
set to return an empty hll value, the same as returned by hll_empty(), but
it turns out it's returning NULL instead.
And then hll_add(users, NULL) will happily return NULL. So the next UPDATE
is cancelling all the previous work, which is not nice. We had to cater for
that case explicitely in the UPDATE query that's working from the batch of
new values to add to our current HyperLogLog hash entry, and I can't resist
to show off one of the most awesome PostgreSQL features here: writable CTE.
WITH hll(agg) AS ( SELECT hll_add_agg(hll_hash_text(value)) FROM new_batch ) UPDATE daily_uniques SET users = CASE WHEN hll.agg IS NULL THEN users ELSE hll_union(users, hll.agg) END FROM hll WHERE date = current_date;
That's how you protect against an empty set being turned into a NULL. I
think the real fix would need to be included in postgresql-hll itself, in
making it so that the hll_add_agg aggregate returns hll_empty() on an empty
set, and I will report that bug (with that very article as the detailed
explanation of it).
Using postgresql-hll
When using postgresql-hll on the production system, we were able to get some
good looking numbers from our daily_uniques table:
with stats as ( select date, #users as daily, #hll_union_agg(users) over() as total from daily_uniques ) select date, round(daily) as daily, round((daily/total*100)::numeric, 2) as percent from stats order by date; date | daily | percent ------------+--------+--------- 2013-02-22 | 401677 | 25.19 2013-02-23 | 660187 | 41.41 2013-02-24 | 869980 | 54.56 2013-02-25 | 154996 | 9.72 (4 rows)
I coulnd't resist to show off two of my favorite SQL constructs in that
example query here, which are the Common Table Expressions (or CTE) and
window functions. If that over() clause reads strange to you, take a minute
now and go read about it. Yes, do that now, we're waiting.
The data here is showing that we did setup the facility in the middle of the first day, and that the morning's activity is quite low.
Conclusion

When using postgresql-hll you need to be careful not to kill your
application concurrency abilities, and you need to protect yourself against
the ∅ killer too. The other thing to keep in mind is that the numbers you
get out of the hll technique are estimates within a given precision, and you
might want to read some more about what it means for your intended usage of
the feature.
Tags
Previous Articles
- Another Great FOSDEM Monday, February 04 2013, 09:55
- Live Upgrading PGQ Friday, February 08 2013, 15:52
- Marking whole word Friday, February 08 2013, 17:15
- Playing with pgloader Tuesday, February 12 2013, 11:17
Next Articles
- HyperLogLog Unions Tuesday, February 26 2013, 12:44
- Emacs Conference Monday, March 04 2013, 13:58
- Batch Update Friday, March 15 2013, 10:47
- Bulk Replication Monday, March 18 2013, 14:54

