So, if you followed the previous blog entry, you now have a new database
containing all the static tables encoded in UTF-8 rather than SQL_ASCII.
And if it was not yet the case, you now severely distrust SQL_ASCII.
Now is the time to have a look at properly encoding the live data, the data
stored in tables that continue to receive write traffic. The idea is to use
the UPDATE facilities of PostgreSQL to tweak the data, and to fix the
applications so that they stop inserting badly encoded strings in there.
Finding non-UTF-8 data
First you want to find the badly encoded data. You can do that with this
helper function that RhodiumToad gave me on IRC. I had a version from the
archives before that, but the regexp was hard to maintain and quote into a
PL function. This is avoided here by two means: the first is to have a
separate SQL function for the regexp checking (so that you can index it
should you need to), and the other is to apply the regexp to a hex encoding
of the data. Here we go:
create or replace function public.utf8hex_valid(str text)
  returns boolean
  language sql
  immutable
as $f$
select $1 ~ $r$(?x)
^(?:(?:[0-7][0-9a-f])
  |(?:(?:c[2-9a-f]|d[0-9a-f])
    |e0[ab][0-9a-f]
    |ed[89][0-9a-f]
    |(?:(?:e[1-9abcef])
      |f0[9ab][0-9a-f]
      |f[1-3][89ab][0-9a-f]
      |f48[0-9a-f]
     )[89ab][0-9a-f]
   )[89ab][0-9a-f]
 )*$
$r$;
$f$;
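If you want to see what this pattern is doing, here is a rough Python transcription (purely for illustration, not part of the migration tooling): the bytes are hex-encoded, two hex digits per byte, and the regexp only matches when the whole string is made of well-formed UTF-8 sequences.

```python
import re

# Python sketch of the same check as public.utf8hex_valid: hex-encode the
# raw bytes, then match a pattern describing valid UTF-8 byte sequences.
UTF8_HEX = re.compile(r"""(?x)
^(?:(?:[0-7][0-9a-f])                 # single-byte ASCII, 00-7f
  |(?:(?:c[2-9a-f]|d[0-9a-f])         # two-byte lead, c2-df
    |e0[ab][0-9a-f]                   # e0 lead, no overlong forms
    |ed[89][0-9a-f]                   # ed lead, no UTF-16 surrogates
    |(?:(?:e[1-9abcef])               # other three-byte leads
      |f0[9ab][0-9a-f]                # f0 lead, no overlong forms
      |f[1-3][89ab][0-9a-f]           # f1-f3 leads
      |f48[0-9a-f]                    # f4 lead, up to U+10FFFF
     )[89ab][0-9a-f]
   )[89ab][0-9a-f]                    # trailing continuation byte, 80-bf
 )*$""")

def utf8hex_valid(raw: bytes) -> bool:
    return UTF8_HEX.match(raw.hex()) is not None
```

For instance, the latin-1 spelling of an accented word fails the check while its UTF-8 spelling passes, which is exactly what we want to hunt for.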
Now some little scripting around it in order to skip some intense, boring manual work (and see, some more catalog queries). Don't forget we will have to work on a per-column basis here...
create or replace function public.check_encoding_utf8
 (
   IN  schemaname text,
   IN  tablename  text,
   OUT relname    text,
   OUT attname    text,
   OUT count      bigint
 )
 returns setof record
 language plpgsql
as $f$
DECLARE
  v_sql text;
BEGIN
  FOR relname, attname IN
    SELECT c.relname, a.attname
      FROM pg_attribute a
           JOIN pg_class c ON a.attrelid = c.oid
           JOIN pg_namespace s ON s.oid = c.relnamespace
           JOIN pg_roles r ON r.oid = c.relowner
     WHERE s.nspname = schemaname
       AND atttypid IN (25, 1043) -- text, varchar
       AND relkind = 'r'          -- ordinary table
       AND r.rolname = 'some_specific_role'
       AND CASE WHEN tablename IS NOT NULL
                THEN c.relname ~ tablename
                ELSE true
           END
  LOOP
    v_sql := 'SELECT count(*) '
          || '  FROM ONLY ' || schemaname || '.' || relname
          || ' WHERE NOT public.utf8hex_valid(encode(textsend('
          || attname || '), ''hex''))';

    -- RAISE NOTICE 'Checking: %.%', relname, attname;
    -- RAISE NOTICE 'SQL: %', v_sql;

    EXECUTE v_sql INTO count;
    RETURN NEXT;
  END LOOP;
END;
$f$;
Note that the tablename is compared using the ~ operator, so that's a
regexp match there too. Also note that I wanted only to check those tables
that are owned by a specific role; your case may vary.
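To make the control flow concrete, here is a small Python analogue of that loop (all names made up for illustration): instead of catalog queries and dynamic SQL, it takes a mapping of (table, column) pairs to raw values and counts the entries that fail a UTF-8 check.

```python
# Illustration of what check_encoding_utf8 computes: for each
# (relname, attname) pair, count the values whose bytes are not valid
# UTF-8, playing the role of the dynamic
# "SELECT count(*) ... WHERE NOT public.utf8hex_valid(...)" query.
def is_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def check_encoding(columns: dict) -> dict:
    # columns: {(relname, attname): [bytes, ...]}
    return {key: sum(1 for v in vals if not is_utf8(v))
            for key, vals in columns.items()}
```

Feeding it one latin-1 value among clean UTF-8 ones reports a count of 1 for that column, which is the signal that it needs fixing.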
The way I used this function was like this:
create table leon.check_utf8 as select * from public.check_encoding_utf8();
Then you need to take action on the rows of the leon.check_utf8 table where
count > 0. Rinse and repeat, but you may soon realise that building the
table over and over again is costly.
Cleaning up the data
Up for some more helper tools? Unless you really want to manually fix this
huge amount of columns where some data ain't UTF-8 compatible... here's
some more code:
create or replace function leon.nettoyeur
 (
   IN  action     text,
   IN  encoding   text,
   IN  tablename  text,
   IN  columname  text,
   OUT orig       text,
   OUT utf8       text
 )
 returns setof record
 language plpgsql
as $f$
DECLARE
  p_convert text;
BEGIN
  IF encoding IS NULL THEN
    p_convert := 'translate(' || columname || ', '
              || $$'\211\203\202'$$ || ', ' || $$' '$$ || ') ';
  ELSE
    -- in 8.2, write convert using, in 8.3, the other expression
    -- p_convert := 'convert(' || columname || ' using ' || conversion || ') ';
    p_convert := 'convert(textsend(' || columname || '), '''
              || encoding || ''', ''utf-8'' ) ';
  END IF;

  IF action = 'select' THEN
    FOR orig, utf8 IN EXECUTE
      'SELECT ' || columname || ', ' || p_convert
      || ' FROM ONLY ' || tablename
      || ' WHERE not public.utf8hex_valid('
      || 'encode(textsend(' || columname || '), ''hex''))'
    LOOP
      RETURN NEXT;
    END LOOP;

  ELSIF action = 'update' THEN
    EXECUTE 'UPDATE ONLY ' || tablename
         || ' SET ' || columname || ' = ' || p_convert
         || ' WHERE not public.utf8hex_valid('
         || 'encode(textsend(' || columname || '), ''hex''))';

    FOR orig, utf8 IN
      SELECT * FROM leon.nettoyeur('select', encoding, tablename, columname)
    LOOP
      RETURN NEXT;
    END LOOP;

  ELSE
    RAISE EXCEPTION 'Léon, Nettoyeur, veut de l''action.';
  END IF;
END;
$f$;
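The core select-before-update idea can be sketched in Python (a hypothetical, simplified helper: here the no-encoding fallback blanks out every non-ASCII byte, rather than just the three specific bytes of the SQL translate() skeleton):

```python
# Sketch of nettoyeur's 'select' mode: for each raw value that is not
# already valid UTF-8, show what it would become, either decoded from a
# supposed source encoding (like convert(..., encoding, 'utf-8')) or,
# when no encoding is given, with non-ASCII bytes replaced by spaces.
def preview(rows, encoding=None):
    out = []
    for raw in rows:
        try:
            raw.decode("utf-8")
            continue                      # already valid, nothing to fix
        except UnicodeDecodeError:
            pass
        if encoding is None:
            fixed = bytes(b if b < 0x80 else 0x20 for b in raw).decode("ascii")
        else:
            fixed = raw.decode(encoding)
        out.append((raw, fixed))
    return out
```

Running it in 'select' style first, like the SQL function does, lets you eyeball each (orig, fixed) pair before committing to any UPDATE.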
As you can see, this function allows you to check the conversion process
from a given supposed encoding before actually converting the data in
place. This is very useful, as even when you're pretty sure which encoding
the non-utf8 data is in, sometimes you find it's windows-1252 or such. So
double check before telling leon.nettoyeur() to update your precious data!
Also, there's a facility to use translate() when none of the encodings
match your expectations. This is a skeleton that just replaces invalid
characters with spaces; tweak it at will!
Enjoy your clean database now, even if it still accepts new data that will
probably not pass the checks, so we still have to be careful about that and
re-clean every day until the migration is effective. Or maybe add a check
constraint that will reject badly encoded data...
In fact, here we're using Londiste to replicate the live data from the old
server to the new one, and that means the replication will break each time
new non-utf8 data gets written, as the new server is running PostgreSQL
8.4, which by design ain't very forgiving. Our plan is to clean up as we go
(remove the table from the subscriber, fix it, add it again) and migrate as
soon as possible!
Bonus points to those of you getting the convoluted reference :)