Thursday, August 26 2010
PostgreSQL, Tricks, YeSQL

Playing with bit strings

The idea of the day ain't directly from me, I'm just helping with a very thin subpart of the problem. The problem, I can't say much about, let's just assume you want to reduce the storage of MD5 in your database, so you want to abuse bit strings. A solution to use them works fine, but the datatype is still missing some facilities, for example going from and to hexadecimal representation in text.

create or replace function hex_to_varbit(h text)
 returns varbit
 language sql
as $$
  select ('X' || $1)::varbit;
$$;

create or replace function varbit_to_hex(b varbit)
 returns text
 language sql
as $$
  select array_to_string(array_agg(to_hex((b << (32*o))::bit(32)::bigint)), '')
    from (select b, generate_series(0, n-1) as o
            from (select $1, octet_length($1)/4) as t(b, n)) as x
$$;

To understand the magic in the second function, let's walk through the tests one could do when wanting to grasp how things work in the bitstring world (using also some reading of the fine documentation, too).

=# select ('101011001011100110010110'::varbit << 0)::bit(8);
   bit    
----------
 10101100
(1 row)

=# select ('101011001011100110010110'::varbit << 8)::bit(8);
   bit    
----------
 10111001
(1 row)

=# select ('101011001011100110010110'::varbit << 16)::bit(8);
   bit    
----------
 10010110
(1 row)

=# select * from *TEMP VERSION OF THE FUNCTION FOR TESTING*
 o |                b                 |    x     
---+----------------------------------+----------
 0 | 10101100101111010001100011011011 | acbd18db
 1 | 01001100110000101111100001011100 | 4cc2f85c
 2 | 11101101111011110110010101001111 | edef654f
 3 | 11001100110001001010010011011000 | ccc4a4d8
(4 rows)

What do we get from that, will you ask? Let's see a little example:

=# select hex_to_varbit(md5('foo'));
                                                          hex_to_varbit                                                           
----------------------------------------------------------------------------------------------------------------------------------
 10101100101111010001100011011011010011001100001011111000010111001110110111101111011001010100111111001100110001001010010011011000
(1 row)

=# select md5('foo'), varbit_to_hex(hex_to_varbit(md5('foo')));
               md5                |          varbit_to_hex           
----------------------------------+----------------------------------
 acbd18db4cc2f85cedef654fccc4a4d8 | acbd18db4cc2f85cedef654fccc4a4d8
(1 row)

Storing varbits rather than the text form of the MD5 allows us to go from 6510 MB down to 4976 MB on a sample table containing 100 millions rows. We're targeting more that that, so that's a great win down here!

In case you wonder, querying the main index on varbit rather than the one on text for a single result row, the cost of doing the conversion with varbit_to_hex seems to be around 28 µs. We can afford it.

Hope this helps!