Playing with bit strings
The idea of the day ain’t directly from me, I’m just helping with a very
thin subpart of the problem. The problem, I can’t say much about, let’s just
assume you want to reduce the storage of
MD5 in your database, so you want
to abuse
bit strings. A solution to use them works fine, but the datatype is
still missing some facilities, for example going from and to hexadecimal
representation in text.
create or replace function hex_to_varbit(h text)
returns varbit
language sql
as $$
select ('X' || $1)::varbit;
$$;
create or replace function varbit_to_hex(b varbit)
returns text
language sql
as $$
select array_to_string(array_agg(to_hex((b << (32*o))::bit(32)::bigint)), '')
from (select b, generate_series(0, n-1) as o
from (select $1, octet_length($1)/4) as t(b, n)) as x
$$;
To understand the magic in the second function, let’s walk through the tests
one could do when wanting to grasp how things work in the
bitstring world
(using also some reading of the fine documentation, too).
=# select ('101011001011100110010110'::varbit << 0)::bit(8);
bit
----------
10101100
(1 row)
=# select ('101011001011100110010110'::varbit << 8)::bit(8);
bit
----------
10111001
(1 row)
=# select ('101011001011100110010110'::varbit << 16)::bit(8);
bit
----------
10010110
(1 row)
=# select * from *TEMP VERSION OF THE FUNCTION FOR TESTING*
o | b | x
---+----------------------------------+----------
0 | 10101100101111010001100011011011 | acbd18db
1 | 01001100110000101111100001011100 | 4cc2f85c
2 | 11101101111011110110010101001111 | edef654f
3 | 11001100110001001010010011011000 | ccc4a4d8
(4 rows)
What do we get from that, will you ask? Let’s see a little example:
=# select hex_to_varbit(md5('foo'));
hex_to_varbit
----------------------------------------------------------------------------------------------------------------------------------
10101100101111010001100011011011010011001100001011111000010111001110110111101111011001010100111111001100110001001010010011011000
(1 row)
=# select md5('foo'), varbit_to_hex(hex_to_varbit(md5('foo')));
md5 | varbit_to_hex
----------------------------------+----------------------------------
acbd18db4cc2f85cedef654fccc4a4d8 | acbd18db4cc2f85cedef654fccc4a4d8
(1 row)
Storing
varbits rather than the
text form of the
MD5 allows us to go from
6510 MB down to
4976 MB on a sample table containing 100 millions
rows. We’re targeting more that that, so that’s a great win down here!
In case you wonder, querying the main index on
varbit rather than the one on
text for a single result row, the cost of doing the conversion with
varbit_to_hex seems to be around
28 µs. We can afford it.
Hope this helps!