Tuesday, February 23 2010
PostgreSQL, plpgsql, Encoding

Getting out of SQL_ASCII, part 2

So, if you followed the previous blog entry, now you have a new database containing all the static tables encoded in UTF-8 rather than SQL_ASCII. Because if it was not yet the case, you now severely distrust this non-encoding.

Now is the time to have a look at properly encoding the live data, those stored in tables that continue to receive write traffic. The idea is to use the UPDATE facilities of PostgreSQL to tweak the data, and too fix the applications so as not to continue inserting badly encoded strings in there.

Finding non UTF-8 data

First you want to find out the badly encoded data. You can do that with this helper function that RhodiumToad gave me on IRC. I had a version from the archives before that, but the regexp was hard to maintain and quote into a PL function. This is avoided by two means, first one is to have a separate pure SQL function for the regexp checking (so that you can index it should you need to) and the other one is to apply the regexp to hex encoded data. Here we go:

create or replace function public.utf8hex_valid(str text) 
 returns boolean
 language sql immutable
as $f$
   select $1 ~ $r$(?x)
                  ^(?:(?:[0-7][0-9a-f])
                     |(?:(?:c[2-9a-f]|d[0-9a-f])
                        |e0[ab][0-9a-f]
                        |ed[89][0-9a-f]
                        |(?:(?:e[1-9abcef])
                           |f0[9ab][0-9a-f]
                           |f[1-3][89ab][0-9a-f]
                           |f48[0-9a-f]
                          )[89ab][0-9a-f]
                       )[89ab][0-9a-f]
                    )*$
                $r$;
$f$;

Now some little scripting around it in order to skip intense manual and boring work (and see, some more catalog queries). Don't forget we will have to work on a per-column basis here...

create or replace function public.check_encoding_utf8
 (
   IN schemaname text,
   IN tablename  text,
  OUT relname    text,
  OUT attname    text,
  OUT count      bigint
 )
 returns setof record
 language plpgsql
as $f$
DECLARE
  v_sql text;
BEGIN
  FOR relname, attname
   IN SELECT c.relname, a.attname 
        FROM pg_attribute a 
             JOIN pg_class c on a.attrelid = c.oid
             JOIN pg_namespace s on s.oid = c.relnamespace 
	     JOIN pg_roles r on r.oid = c.relowner
       WHERE s.nspname = schemaname
         AND atttypid IN (25, 1043) -- text, varchar
         AND relkind = 'r'          -- ordinary table
         AND r.rolname = 'some_specific_role'
	 AND CASE WHEN tablename IS NOT NULL
	     	  THEN c.relname ~ tablename
		  ELSE true
	      END
  LOOP
    v_sql := 'SELECT count(*) '
          || '  FROM ONLY '|| schemaname || '.' || relname 
          || ' WHERE NOT public.utf8hex_valid(encode(textsend(' 
          || attname
          || '), ''hex''))';

    -- RAISE NOTICE 'Checking: %.%', relname, attname;
    -- RAISE NOTICE 'SQL: %', v_sql;
    EXECUTE v_sql INTO count;
    RETURN NEXT;
  END LOOP;
END;
$f$; 

Note that the tablename is compared using the ~ operator, so that's regexp matching there too. Also note that I wanted only to check those tables that are owned by a specific role, your case may vary.

The way I used this function was like this:

create table leon.check_utf8 as
 select * 
   from public.check_encoding_utf8();

Then you need to take action on those lines in leon.check_utf8 table which have a count > 0. Rince and repeat, but you may soon realise building the table over and over again is costly.

Cleaning up the data

Up for some more helper tools? Unless you really want to manually fix this huge amount of columns where some data ain't UTF-8 compatible... here's some more:

create or replace function leon.nettoyeur
 (
  IN  action      text,
  IN  encoding    text,
  IN  tablename   text,
  IN  columname   text,

  OUT orig        text,
  OUT utf8        text
 )
 returns setof record
 language plpgsql
as $f$
DECLARE
  p_convert text;
BEGIN
  IF encoding IS NULL
  THEN
    p_convert := 'translate(' 
              || columname || ', ' 
              || $$'\211\203\202'$$ 
              || ', '
              || $$'   '$$
	      || ') ';
  ELSE
    -- in 8.2, write convert using, in 8.3, the other expression
    -- p_convert := 'convert(' || columname || ' using ' || conversion || ') ';
    p_convert := 'convert(textsend(' || columname || '), '''|| encoding ||''', ''utf-8'' ) ';
  END IF;

  IF action = 'select'
  THEN
    FOR orig, utf8
     IN EXECUTE 'SELECT ' || columname || ', '
         || p_convert
         || '  FROM ONLY ' || tablename
         || ' WHERE not public.utf8hex_valid('
         || 'encode(textsend('|| columname ||'), ''hex''))'
    LOOP
      RETURN NEXT;
    END LOOP;

  ELSIF action = 'update'
  THEN
    EXECUTE 'UPDATE ONLY ' || tablename 
         || ' SET ' || columname || ' = ' || p_convert
         || ' WHERE not public.utf8hex_valid('
         || 'encode(textsend('|| columname ||'), ''hex''))';

    FOR orig, utf8 
     IN SELECT * 
          FROM leon.nettoyeur('select', encoding, tablename, columname)
    LOOP
      RETURN NEXT;
    END LOOP;

  ELSE
    RAISE EXCEPTION 'Léon, Nettoyeur, veut de l''action.';

  END IF;
END;
$f$;

As you can see, this function allows to check the conversion process from a given supposed encoding before to actually convert the data in place. This is very useful as even when you're pretty sure the non-utf8 data is latin1, sometime you find it's windows-1252 or such. So double check before telling leon.nettoyeur() to update your precious data!

Also, there's a facility to use translate() when none of the encoding match your expectations. This is a skeleton just replacing invalid characters with a space, tweak it at will!

Conclusion

Enjoy your clean database now, even if it still accepts new data that will probably not pass the checks, so we still have to be careful about that and re-clean every day until the migration is effective. Or maybe add a CHECK clause that will reject badly encoded data...

In fact here we're using Londiste to replicate the live data from the old to the new server, and that means the replication will break each time there's new data written in non-utf8, as the new server is running 8.4, which by design ain't very forgiving. Our plan is to clean-up as we go (remove table from the subscriber, fix it, add it again) and migrate as soon as possible!

Bonus points to those of you getting the convoluted reference :)