The idea of the day ain't directly from me, I'm just helping with a very
thin subpart of the problem. The problem, I can't say much about, let's just
assume you want to reduce the storage of MD5 in your database, so you want
to abuse bit strings. A solution to use them works fine, but the datatype is
still missing some facilities, for example going from and to hexadecimal
representation in text.
create or replace function hex_to_varbit(h text) returns varbit language sql as $$ select ('X' || $1)::varbit; $$; create or replace function varbit_to_hex(b varbit) returns text language sql as $$ select array_to_string(array_agg(to_hex((b << (32*o))::bit(32)::bigint)), '') from (select b, generate_series(0, n-1) as o from (select $1, octet_length($1)/4) as t(b, n)) as x $$;
To understand the magic in the second function, let's walk through the tests
one could do when wanting to grasp how things work in the bitstring world
(using also some reading of the fine documentation, too).
=# select ('101011001011100110010110'::varbit << 0)::bit(8);
bit
----------
10101100
(1 row)
=# select ('101011001011100110010110'::varbit << 8)::bit(8);
bit
----------
10111001
(1 row)
=# select ('101011001011100110010110'::varbit << 16)::bit(8);
bit
----------
10010110
(1 row)
=# select * from *TEMP VERSION OF THE FUNCTION FOR TESTING*
o | b | x
---+----------------------------------+----------
0 | 10101100101111010001100011011011 | acbd18db
1 | 01001100110000101111100001011100 | 4cc2f85c
2 | 11101101111011110110010101001111 | edef654f
3 | 11001100110001001010010011011000 | ccc4a4d8
(4 rows)
What do we get from that, will you ask? Let's see a little example:
=# select hex_to_varbit(md5('foo'));
hex_to_varbit
----------------------------------------------------------------------------------------------------------------------------------
10101100101111010001100011011011010011001100001011111000010111001110110111101111011001010100111111001100110001001010010011011000
(1 row)
=# select md5('foo'), varbit_to_hex(hex_to_varbit(md5('foo')));
md5 | varbit_to_hex
----------------------------------+----------------------------------
acbd18db4cc2f85cedef654fccc4a4d8 | acbd18db4cc2f85cedef654fccc4a4d8
(1 row)
Storing varbits rather than the text form of the MD5 allows us to go from
6510 MB down to 4976 MB on a sample table containing 100 millions
rows. We're targeting more that that, so that's a great win down here!
In case you wonder, querying the main index on varbit rather than the one on
text for a single result row, the cost of doing the conversion with
varbit_to_hex seems to be around 28 µs. We can afford it.
Hope this helps!
Tags
Previous Articles
- debian packaging PostgreSQL extensions Friday, August 06 2010, 13:00
- Editing constants in constraints Monday, August 09 2010, 14:45
- el-get and dim-switch-window status update Monday, August 09 2010, 15:35
- el-get news Thursday, August 26 2010, 16:30
Next Articles
- welcome el-get scratch installer Friday, August 27 2010, 14:15
- Happy Numbers Monday, August 30 2010, 11:00
- Want to share your recipes? Tuesday, August 31 2010, 14:15
- Synchronous Replication Monday, September 06 2010, 18:05

