3 Articles tagged “Encoding”

Continuing our series of PostgreSQL Data Types today we’re going to introduce the PostgreSQL text data type. The first notion to understand when processing text in any program is of course the notion of encoding.

So when addressing the text datatype we must mention encoding settings, and possibly also issues. An encoding is a particular representation of characters in bits and bytes. In the ASCII encoding the letter A is encoded as the 7-bits byte 1000001, or 65 in decimal, or 41 in hexadecimal. All those numbers are going to be written the same way on-disk, and the letter A too.

So, if you followed the previous blog entry, now you have a new database containing all the static tables encoded in UTF-8 rather than SQL_ASCII. Because if it was not yet the case, you now severely distrust this non-encoding.

Now is the time to have a look at properly encoding the live data, those stored in tables that continue to receive write traffic. The idea is to use the UPDATE facilities of PostgreSQL to tweak the data, and too fix the applications so as not to continue inserting badly encoded strings in there.

It happens that you have to manage databases designed by your predecessor, and it even happens that the team used to not have a DBA. Those histerical raisins can lead to having a SQL_ASCII database. The horror! What SQL_ASCII means, if you’re not already familiar with the consequences of such a choice, is that all the text and varchar data that you put in the database is accepted as-is. No checks.

Dimitri Fontaine

PostgreSQL Major Contributor

Open Source Software Engineer