The following story is only interesting to read if you like it when bad things happen, or if you don’t have a trustworthy backup policy in place. By trustworthy I mean that each backup you take must be tested with a test recovery job: only tested backups will prove useful when you need them. So go read our Backup and Restore documentation chapter, then learn how to set up Barman for handling physical backups and Point In Time Recovery. Come back when you have proper backups, including recovery testing, in place. We are waiting for you. Back? Ok, let’s see how bad you can end up without backups, and how to still recover. With luck.

*Set a trustworthy backup solution, and review your policy*

Did I mention that a trustworthy backup solution includes automated testing of your ability to recover from any and all backups you’ve been taking? That might be more important than you think.

This article is going to be quite long. Prepare yourself a cup of your favorite beverage.

The Setup

Most of the customers I visit have already laid out a backup strategy and implemented it. Most of them implemented it with custom in-house scripts. They hire highly skilled engineers who have been doing system administration for more than a decade, and who are more than able to throw a shell script at the problem.

Shell scripting must, in hindsight, be one of the most difficult things to do right, given how often a script ends up doing something else entirely from what its author thought it would. If you want another one of my rather blunt pieces of advice, stop doing any shell scripting today: a shell is a nice interactive tool, but if you are doing non-interactive scripting, that’s actually called system programming and you deserve a better tool than that.

*My take: shell scripting makes it harder to write production quality code.*

In this very case, the customer realized that a production setup had been approved and was running live before any backup solution was in place. Think about it for a minute. If you don’t have tested backups in place, it’s not production ready.

Well, the incident was taken seriously, and the usual backup scripts were deployed as soon as possible. Of course, those shell scripts depended in non-obvious ways on some parameters (environment variables, database connections, a database setup with special configuration tables and rows). And being shell scripts, they did not implement much verification that the setup was complete, you see.

The Horror Story

And guess what is the first thing that backup script does? Of course: making sure enough space is available on the file system to handle the next backup. That’s usually done by applying a retention policy and first removing the backups that are too old given said policy. And this script did exactly that too.

The problem is that, as some of you have already guessed (yes, I see those smiles trying to hide brains thinking as fast as possible to decide whether the same thing could happen to you too), the script’s configuration had not been done before entering production. So the script ran without its setup and, without much checking, began making bytes available. By removing any file more than 5 days old. Right. In. $PGDATA.
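I never saw the offending script itself, so here is only a minimal sketch of that failure mode, with made-up variable names, to show how little it takes:

# Hypothetical reconstruction, NOT the customer's actual script: in the real
# incident the unconfigured backup location ended up pointing at $PGDATA.
BACKUP_DIR=${BACKUP_DIR:-$PGDATA}              # the configuration step never ran
find "$BACKUP_DIR" -type f -mtime +5 -delete   # retention policy: drop anything older than 5 days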

But recently modified files are still there, right?

Exactly: not all the files of the database system had been removed. Surely something can be done to recover the data from a very small number of important tables? Let’s now switch to the present tense and see about it.

*Can you spell ***data loss***?*

Remember, there are no backups. The archive_command is set though, so that’s a first track to consider. After that, what we can do is try to start PostgreSQL on a copy of the remaining $PGDATA and massage it until it allows us to COPY the data out.

The desperate PITR

The WAL archive starts at file 000000010000000000000009, which makes it unusable without a corresponding base backup, which we don’t have. Well, unless maybe we tweak the system. We first need to edit the system identifier, then reset the system to begin replaying at the first file we do have. With some luck…

*A broken clock is still right twice a day, a broken backup never is...*

Time to try our luck here:

$ export PGDATA=/some/place
$ initdb                                # create a fresh cluster to play with
$ hexedit $PGDATA/global/pg_control     # patch the system identifier by hand (see below)
$ pg_controldata                        # double-check the control file contents
$ xlogdump /archives/000000010000000000000009   # inspect the archived WAL for xid/location hints
$ pg_resetxlog -f -l 1,9,19 -x 2126 -o 16667 $PGDATA   # force matching next-xid, next-oid, WAL start
$ cat > $PGDATA/recovery.conf <<EOF
restore_command = 'gunzip -c /archives/%f.gz > "%p"'
EOF
$ pg_ctl start                          # start, hoping archive recovery kicks in

Even using the transaction data we get from reading, with xlogdump, the first archived WAL file we have, and then feeding it to pg_resetxlog (thus accepting that we may lose some more data), we still can’t start the system in archive recovery mode: the system identifier in the WAL files doesn’t match the one in the system’s pg_controldata output.

So we tweak our fresh cluster to match, by changing the first 8 bytes of the control file, paying attention to the byte order. As I already had a Common Lisp REPL open on my laptop, the easiest way for me to convert the decimal representation of the database system identifier was this:

(format nil "~,,' ,2:x" 5923145491842547187)
"52 33 3D 71 52 3B 3D F3"

Paying attention to the byte order means that you need to edit the control file’s first 8 bytes in reverse order: F3 3D 3B 52 71 3D 33 52. But in our case, no amount of massaging seems to allow PostgreSQL to read from the archives we have.
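If you don’t happen to have a Lisp REPL around, a plain shell printf gives you the same digits (a sketch, assuming a shell with 64-bit integer arithmetic, such as bash on a 64-bit system):

$ printf '%016X\n' 5923145491842547187
52333D71523B3DF3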

On to massaging what is remaining in the old cluster then.

The Promotion

I don’t usually do promotion in such a prominent way, but I clearly solved this situation thanks to my colleagues from the 24/7 support at 2ndQuadrant, with a special mention to Andres Freund for inspiration and tricks:

*We also know how to recover your data, but we insist on proper backups first*

Oh, did I mention proper backups, and how you need to have tested them successfully before you can call a service production-ready or have any hope about your recovery abilities? I wasn’t sure I did…

Playing fast and loose with PostgreSQL

The damaged cluster is not starting, for lack of some important metadata files. The first thing missing is pg_filenode.map in the global directory. Using xlogdump, it would have been possible to recover just this file had it been changed in the WAL archives we have, but that’s not the case.

*Trying to salvage a damaged case*

pg_filenode.map

As this file is only used for shared relations and some bootstrapping situations (you can’t read the current pg_class file node from pg_class itself, as the file mapping is exactly the information you need in order to know which file to read), and knowing that the version on disk was more than 5 days old on a cluster only recently put into production, we can allow ourselves to try something: copy the pg_filenode.map from another, fresh cluster.

My understanding is that this file only changes when doing heavy maintenance on system tables, like CLUSTER or VACUUM FULL, which apparently didn’t get done here.
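The copy itself is nothing fancy. A sketch, assuming the fresh cluster from earlier lives in /some/place and using $DAMAGED_PGDATA as a name of my own for the damaged cluster’s data directory:

$ cp /some/place/global/pg_filenode.map $DAMAGED_PGDATA/global/pg_filenode.map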

By the way, here’s one of those tricks I learned in this exercise. The file stores pairs of (relation OID, file node), so in the following output you can read the second and fourth data columns as file names in the same directory:

od -j 8 -N $((512-8-8)) -td4 < $PGDATA/global/pg_filenode.map

So copying a default pg_filenode.map over allowed us to get past that error and on to the next one.

pg_clog

Next is the lack of some pg_clog files. That’s a little tricky because those binary files contain the commit log information and are used to quickly decide whether recent transactions are still in flight, already committed, or rolled back. We can easily trick the system and declare that all transactions older than 5 days (remember, the bug in the cleanup script was about exactly that) have in fact been committed. A commit in the CLOG is a 01 value, and a single byte can hold the status of as many as 4 transactions.

Here’s how to create those files from scratch, once you’ve noticed that 01010101 is in fact the ASCII code for the letter U.

(code-char #b01010101)
#\U

So, to create a series of clog files in which all transactions appear committed, so that we can see the data, we can use the following command line:

# run from inside $PGDATA/pg_clog
for c in 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 000A 000B 000C
do
    # each 256 kB segment is filled with 0x55 ('U'), i.e. four "committed" statuses per byte
    dd if=/dev/zero bs=256k count=1 | tr '\0' 'U' > $c
done
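A quick sanity check shows that each segment is indeed 256 kB of wall-to-wall “committed” statuses (od prints its offsets in octal and collapses the repeated lines):

$ od -c 0000
0000000   U   U   U   U   U   U   U   U   U   U   U   U   U   U   U   U
*
1000000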

pg_database

The next problem we are confronted with is that PostgreSQL has lost the backing files for the pg_database relation and has no idea what those directories in $PGDATA/base are supposed to be all about. We only have the numbers!

That said, the customer still had a history of the commands used to install the database server, so we knew in which order the databases were created. That gave us an OID-to-name mapping. How to apply it?

Well, pg_database is a shared catalog and the underlying file apparently isn’t that easy to hack around, so the easiest solution is to actually hack the CREATE DATABASE command and have it accept a WITH OIDS option (OIDS is already a PostgreSQL keyword, OID is not, and we’re not going to introduce new keywords just for that particular patch).

Equipped with that hacked version of PostgreSQL, it’s then possible to use the new command and create the databases we need with the OIDs we know.
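The exact syntax is whatever that one-off patch accepts, so take the following as a rough sketch only; the database name is invented, and 16384 stands for one of the orphaned directories found under $PGDATA/base:

$ # run against the cluster built with the patched PostgreSQL
$ psql -d postgres -c "CREATE DATABASE appdb WITH OIDS 16384;"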

Those OIDs are then to be found on disk in the file where pg_database is internally stored, and we can ask the system where that file is:

select oid, relname, pg_relation_filenode(oid)
  from pg_class
 where relname = 'pg_database';
 oid  |   relname   | pg_relation_filenode 
------+-------------+----------------------
 1262 | pg_database |                12319
(1 row)

Then without surprise we can see:

$ strings $PGDATA/global/12319
postgres
template0
template1

Once that file is copied over to the damaged cluster (which, as it happened, was running), it’s then possible to actually open a connection to a database. And that’s pretty impressive. But suddenly it didn’t work anymore…

System Indexes

This problem was fun to diagnose. The first psql call would be fine, but the second one would always complain with an error you might never have seen in the field. I sure hadn’t before.

FATAL:  database "dbname" does not exist
DETAIL:  Database OID 17838 now seems to belong to "otherdbname".

Part of PostgreSQL’s startup is building up some caches, and for that it uses indexes. Maybe we made a mistake somewhere, or maybe an index is corrupted, but apparently there’s a mismatch.

But your now all-time favourite development team knew that this would happen to you and is very careful that any feature included in the software is able to bootstrap itself without using any indexes, or that in bad situations the system knows how to cope with the lack of those indexes by turning the feature off. That is the case for Event Triggers, for example, as you can see in commit cd3413ec3683918c9cb9cfb39ae5b2c32f231e8b.

*Another kind of indexing system*

So yes, it is indeed possible to start PostgreSQL and have that marvellous production-ready system avoid all system indexes, for dealing with cases where you have reasons to think those are corrupted… or plain missing.

$ pg_ctl start -o "-P"                   # -P: ignore system indexes when starting up
$ cat >> $PGDATA/postgresql.conf <<EOF   # append, don't overwrite the existing file
enable_indexscan = off
enable_bitmapscan = off
enable_indexonlyscan = off
EOF
$ pg_ctl reload

While at it, we edit postgresql.conf and adjust some index-usage related settings, as you can see, because this problem will certainly also happen outside of the system indexes.

If you’re not using (only) PostgreSQL as your database system of choice, now is the time to check that you can actually start those other systems when their internal indexes are corrupted or missing, by the way. I think that tells a lot about the readiness of the system for production usage, and the attitude of the developers towards what happens in the field.

So we now have a running PostgreSQL service, serving the data that is still available. Well, not quite: we have a PostgreSQL service that agrees to start and allows connections to a specific database.

pg_proc, pg_operator, pg_cast, pg_aggregate, pg_amop and others

The first query I tried on the new database was against pg_class, to get details about the available tables. The psql command line tool issues a large number of queries in order to serve the \d output; the \dt one is usable in our case.

To see which queries psql sends to the server for each command, use the \set ECHO_HIDDEN toggle.

Just about any query now complains that the target database is missing files. To understand which relation a missing file belongs to, I used the following query in a fresh cluster. The following example is about an error message where base/16384/12062 is missing:

select oid, relname, pg_relation_filenode(oid)
  from pg_class
 where pg_relation_filenode(oid) = 12062;
 oid  | relname | pg_relation_filenode 
------+---------+----------------------
 1255 | pg_proc |                12062
(1 row)

In our specific case, no extensions were used. Check that before taking action here, or at least make sure that the tables you want to try to recover data from are not using extensions; that would make things so much more complex.

Here we can just use the default contents for most of the system catalogs: we are using the same set of functions, operators, casts, aggregates, etc. as any other 9.2 system, so we can directly take the files created by initdb and copy them over to wherever the error message leads.
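For instance, for the missing base/16384/12062 from above, the dance looks like this; a sketch, assuming the fresh cluster is reachable on its own port (5433 here is an assumption) and again using $DAMAGED_PGDATA as my name for the damaged cluster’s data directory:

$ # ask the fresh cluster where it keeps its own pg_proc
$ F=$(psql -p 5433 -d postgres -Atc "select pg_relation_filepath('pg_proc')")
$ cp /some/place/$F $DAMAGED_PGDATA/base/16384/12062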

pg_namespace

Some error messages are about things we should definitely not ignore. The content of the pg_namespace relation was lost in just about all our databases, and the applications here were using non-default schemas.

To recover from that situation, we need to better understand how this relation is actually stored:

# select oid, * from pg_namespace;
  oid  |      nspname       | nspowner |        nspacl        
-------+--------------------+----------+----------------------
    99 | pg_toast           |       10 | 
 11222 | pg_temp_1          |       10 | 
 11223 | pg_toast_temp_1    |       10 | 
    11 | pg_catalog         |       10 | {dim=UC/dim,=U/dim}
  2200 | public             |       10 | {dim=UC/dim,=UC/dim}
 11755 | information_schema |       10 | {dim=UC/dim,=U/dim}
(6 rows)

# copy pg_namespace to stdout with oids;
99	pg_toast	10	\N
11222	pg_temp_1	10	\N
11223	pg_toast_temp_1	10	\N
11	pg_catalog	10	{dim=UC/dim,=U/dim}
2200	public	10	{dim=UC/dim,=UC/dim}
11755	information_schema	10	{dim=UC/dim,=U/dim}

So it’s pretty easy here, actually, once you make the right connections: let’s import a default pg_namespace file, then append to it thanks to COPY IN, being quite careful to use tabs (well, unless you use the delimiter option, of course):

# copy pg_namespace from stdin with oids;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 16443	my_namespace	10	\N
>> \.

And now there’s a new schema in there with the OID we want. Wait, how do we figure out the OID we need?

# select c.oid, relname, relnamespace, nspname
    from pg_class c left join pg_namespace n on n.oid = c.relnamespace
   where relname = 'bar';
  oid  | relname | relnamespace | nspname 
-------+---------+--------------+---------
 16446 | bar     |        16443 | 
(1 row)

So in the result of that query we have no nspname, but we happen to know that the table bar is supposed to be in the schema my_namespace.

And believe it or not, that method actually allows you to create a schema in a database in a running cluster. We are directly editing the catalog, even setting the OID of the rows we are injecting.

The reason we couldn’t do that with pg_database, if you’re wondering, is that pg_database is a shared catalog and part of the bootstrapping, so it was impossible to start PostgreSQL until we fixed it, and the only implementation of COPY we have requires a running PostgreSQL instance.

pg_attribute and pg_attrdef

Now that we are able to actually refer to the right relation in a SQL command, we should be able to dump its content, right? Well, it so happens that in some cases we can and in some cases we can’t.

We are very lucky in this exercise in that pg_attribute is not missing. Had it been, we might have been able to rebuild it thanks to a pg_upgrade implementation detail, by forcing the OID of the next table to be created and then issuing the right command, as given by pg_dump. By the way, did I mention backups? And automated recovery tests?

*We need the data attributes*

In some cases though, we are missing the pg_attrdef relation wholesale. That relation stores the default expressions attached to columns, as we can see in the following example, taken on a working database server:

# \d a
                         Table "public.a"
 Column |  Type   |                   Modifiers                    
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('a_id_seq'::regclass)
 f1     | text    | 
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)

#  select adrelid, adnum, adsrc
     from pg_attrdef
    where adrelid = 'public.a'::regclass;
 adrelid | adnum |             adsrc             
---------+-------+-------------------------------
   16411 |     1 | nextval('a_id_seq'::regclass)
(1 row)

# select attnum, atthasdef
    from pg_attribute
   where     attrelid = 'public.a'::regclass
         and atthasdef;
 attnum | atthasdef 
--------+-----------
      1 | t
(1 row)

We need to remember that the goal here is to salvage some data out of an installation where a lot is missing; it’s not at all about ever being able to use that system again. Given that, what we can do here is simply ignore the default expressions of the columns, by directly updating the catalogs:

# update pg_attribute
     set atthasdef = false
   where attrelid = 'my_namespace.bar'::regclass;

COPY the data out! now!

At this point we are finally able to run the COPY command and store the interesting data in a plain file that can then be used on another system for analysis.
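In its simplest form, and reusing the names from the previous examples (the database name and output path here are placeholders), that’s all there is to it:

$ psql -d dbname -c "copy my_namespace.bar to '/tmp/bar.copy'"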

Not for every relation from the get-go, mind you: sometimes some default catalogs are still missing. But in this instance of the data recovery, we were able to replace all the missing pieces of the puzzle just by copying over the underlying files, as we did in the previous sections.

Conclusion

Really, PostgreSQL once again surprises me with its flexibility and resilience. Even after someone had tried quite hard to kill it dead, it was still possible to rebuild the cluster into shape piecemeal and get the interesting data back.

I should mention, maybe, that with a proper production setup including a Continuous Archiving and Point-in-Time Recovery solution such as Barman, walmgr, OmniPITR, or PITRtools, the recovery would have been really simple.
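To give you an idea, with a tool like Barman restoring the latest tested backup boils down to something like this (the server name main and the target directory are assumptions):

$ barman recover main latest /var/lib/postgresql/9.2/main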

Using a ready-made solution is often better because it doesn’t just take the backups: it also handles retention policies, archiving, and the recovery testing this whole story keeps coming back to.

It’s even one of those rare cases where using PostgreSQL replication would have been a solution: the removal of the files happened without PostgreSQL being involved, so it didn’t know it was happening and wouldn’t have replicated it to the standby.