If you've been following along at home with the newer statistics developments,
you might have heard about this new state-of-the-art cardinality estimation
algorithm called HyperLogLog. This technique is now available for PostgreSQL
in the extension postgresql-hll, available at
https://github.com/aggregateknowledge/postgresql-hll, and soon to be in a
distribution near you.
How to Compute Cardinality?
It's as simple as CREATE EXTENSION hll; really, even if to get there you
first must have installed the package on your system. We did some packaging
work for debian and the result should appear soon in a distro near you.
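Once the extension is in place, a quick sanity check only needs functions
we'll meet again below (hll_empty, hll_add, hll_hash_text, and the #
cardinality operator):

-- add a single hashed value to an empty hll and ask for its cardinality
SELECT #hll_add(hll_empty(), hll_hash_text('hello'));
-- returns 1: exactly one distinct value has been added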
Then you also need to keep your data in some table; straight from the
documentation, we can use this schema:
-- Create the destination table
CREATE TABLE daily_uniques (
    date  date UNIQUE,
    users hll
);
Then, to add some data whose cardinality you want to track, it's as simple
as the following:
UPDATE daily_uniques
   SET users = hll_add(users, hll_hash_text('220.127.116.11'))
 WHERE date = current_date;
So in our example what you see is that we want to figure out how many unique
IP addresses we saw, and we do that by first computing a hash of that source
data, then calling hll_add() with the current value and the hash result.
The current value must have been initialized using hll_empty().
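For instance, the row for today can be created ahead of time with an empty
hll (a minimal sketch matching the daily_uniques schema above):

-- initialize today's row with an empty hll value
INSERT INTO daily_uniques (date, users)
     VALUES (current_date, hll_empty());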
The most awake readers among you will already have spotted it: issuing an
UPDATE against the same row over and over again is a good recipe to kill any
form of concurrency, so you don't want to do that on your production setup
unless you don't care about those UPDATE queries piling up in your system.
The idea is then to fill a queue of updates and asynchronously update the
daily_uniques table from that queue, possibly using the hll_add_agg
aggregate that the extension provides, so that you only have one batch of
values to process, in a single UPDATE.
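A minimal sketch of that queue could be a plain table of raw values which
the application only ever INSERTs into (the new_batch name here is the one
used in the query later in this article):

-- hypothetical queue table: appending to it takes no row lock on
-- daily_uniques, so application concurrency is preserved
CREATE TABLE new_batch (value text);

INSERT INTO new_batch (value) VALUES ('220.127.116.11');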
∅: Empty Set and NULL
Yes, there's a Unicode entry for that: ∅.
Now, what happens when the batch of new unique values you want to update
from is itself empty? Well, I would have expected hll_add_agg over an empty
set to return an empty hll value, the same as returned by hll_empty(), but
it turns out it's returning NULL instead. And hll_add(users, NULL) will
happily return NULL too. So the next UPDATE is cancelling all the previous
work, which is not nice. We had to cater for that case explicitly in the
UPDATE query that's working from the batch of new values to add to our
current HyperLogLog hash entry, and I can't resist showing off one of the
most awesome PostgreSQL features here: writable CTEs.
WITH hll(agg) AS (
    SELECT hll_add_agg(hll_hash_text(value))
      FROM new_batch
)
UPDATE daily_uniques
   SET users = CASE WHEN hll.agg IS NULL
                    THEN users
                    ELSE hll_union(users, hll.agg)
                END
  FROM hll
 WHERE date = current_date;
That's how you protect against an empty set being turned into a NULL. I
think the real fix would need to be included in postgresql-hll itself,
making it so that the hll_add_agg aggregate returns hll_empty() on an empty
set, and I will report that bug (with this very article as the detailed
explanation of it).
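As an aside, an equivalent guard can be written with coalesce, provided both
hll values were created with the same parameters (a sketch, not what we ran
in production):

WITH hll(agg) AS (
    SELECT hll_add_agg(hll_hash_text(value))
      FROM new_batch
)
UPDATE daily_uniques
   -- coalesce replaces the NULL aggregate with an empty hll,
   -- and a union with an empty hll leaves users unchanged
   SET users = hll_union(users, coalesce(hll.agg, hll_empty()))
  FROM hll
 WHERE date = current_date;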
After installing postgresql-hll on the production system, we were able to
get some good looking numbers from our daily_uniques table:
with stats as (
    select date,
           #users as daily,
           #hll_union_agg(users) over() as total
      from daily_uniques
)
select date,
       round(daily) as daily,
       round((daily/total*100)::numeric, 2) as percent
  from stats
 order by date;

    date    | daily  | percent
------------+--------+---------
 2013-02-22 | 401677 |   25.19
 2013-02-23 | 660187 |   41.41
 2013-02-24 | 869980 |   54.56
 2013-02-25 | 154996 |    9.72
(4 rows)
I couldn't resist showing off two of my favorite SQL constructs in that
example query, which are Common Table Expressions (or CTEs) and window
functions. If that over() clause reads strange to you, take a minute now
and go read about it. Yes, do that now, we're waiting.
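If you want a feel for it right away, here's a tiny standalone example: an
empty over() clause computes the aggregate across the whole result set while
keeping every detail row:

select x, sum(x) over () as total
  from generate_series(1, 3) as t(x);

 x | total
---+-------
 1 |     6
 2 |     6
 3 |     6
(3 rows)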
Back to our numbers: the data shows that we set up the facility in the
middle of the first day, and that the morning's activity is quite low.
When using postgresql-hll you need to be careful not to kill your
application's concurrency abilities, and you need to protect yourself
against the ∅ killer too. The other thing to keep in mind is that the
numbers you get out of the hll technique are estimates within a given
precision, and you might want to read some more about what that means for
your intended usage of it.
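As a rough guide (a sketch based on the postgresql-hll documentation, where
hll_empty accepts an explicit log2m parameter), the expected relative error
of HyperLogLog is about 1.04/√(2^log2m), so you can trade storage for
precision:

-- with the default log2m = 11 the error is about 2.3%;
-- log2m = 14 brings it down to roughly 0.8%, at the cost of more storage
SELECT #hll_add(hll_empty(14), hll_hash_text('220.127.116.11'));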