In the article from yesterday we talked about PostgreSQL HyperLogLog with some details. The real magic of that extension has been skimmed over though, and needs another very small article all by itself, in case you missed it.
Which Set Operation do you want for counting unique values?
The first query here has the default level of magic in it, really. What happens is that each time we do an update of the HyperLogLog hash value, we update some data which are allowing us to compute its cardinality.
=> select date, #users as daily, pg_column_size(users) as bytes from daily_uniques order by date; date | daily | bytes ------------+------------------+------- 2013-02-22 | 401676.779509985 | 1287 2013-02-23 | 660187.271908359 | 1287 2013-02-24 | 869980.029947449 | 1287 2013-02-25 | 580865.296677817 | 1287 2013-02-26 | 240569.492722719 | 1287 (5 rows)
And has advertized the data is kept in a static sized data structure. The
magic here all happens at
hll_add() time, the function you have to call to
update the data.
Now on to something way more magic!
Are those the aggregates you're looking for?
=> select to_char(date, 'YYYY/MM') as month, round(#hll_union_agg(users)) as monthly from daily_uniques group by 1; month | monthly ---------+--------- 2013/02 | 1960380 (1 row)
The HyperLogLog data structure is allowing the implementation of an union
algorithm that will be able to compute how many unique values you happen to
have registered in both one day and the next. Extended in its general form,
and doing SQL, what you get is an aggregate that you can use in
constructs and window functions. Did you read about them yet?