On the PostgreSQL Hackers mailing list, Andrew Dunstan just proposed some new options for pg_dump and pg_restore to ease our lives. One of the replies mentioned some scripts that exploit the pg_restore listing you can play with using the -l and -L options, or their long versions --list and --use-list. The pg_staging tool allows you to easily exploit those lists too.

The pg_restore list is just a listing, one object per line, of all the objects contained in a custom dump, that is one made with pg_dump -Fc. You can then tweak this listing to comment out some objects (prepend a ; to their lines), and give your hacked file back to pg_restore --use-list so that it skips them.
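
If you want to go the manual route, here's a minimal sketch of it; the database names (mydb, mydb_copy) and the table name (foo) are made up for the example:

$ pg_dump -Fc mydb > mydb.dump
$ pg_restore -l mydb.dump > mydb.list
$ # comment out the data of table "foo", keeping its definition
$ sed -e '/ TABLE DATA public foo /s/^/;/' mydb.list > mydb.short.list
$ createdb mydb_copy
$ pg_restore -L mydb.short.list -d mydb_copy mydb.dump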

What’s pretty useful here, among other things, is that a table in fact has more than one line in the listing: one for the TABLE definition, another for the TABLE DATA. That’s how pg_staging is able to provide you with options to restore only some schemas, some schemas_nodata and even some tablename_nodata_regexp, to name the configuration options directly (sketched just below).
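
To give an idea, here's a sketch of what those options can look like once added to pg_staging.ini. The option names are the ones pg_staging uses; the archive schema is made up for the example, and I'm assuming schemas_nodata takes a list of schemas the same way schemas does:

# restore these schemas, definitions and data
schemas = public
# restore only the definitions, not the data, for these schemas
schemas_nodata = archive
# skip the data of any table whose name matches one of these regexps
tablename_nodata_regexp = parallel,reformat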

How do you do a very simple exclusion of some table's data when restoring a dump, you ask? Here we go. Let's first prepare an environment, where I have only a PostgreSQL server running.

$ git clone git://github.com/dimitri/pg_staging.git                                                         
$ git clone git://github.com/dimitri/pgloader.git
$ for s in */*.sql; do psql -f "$s"; done
$ pg_dump -Fc > pgloader.dump
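
By the way, if you want to peek at what just landed in the dump before doing anything clever with it, the plain pg_restore listing is one command away; the grep is just a quick way to narrow it down to the data entries:

$ pg_restore -l pgloader.dump
$ pg_restore -l pgloader.dump | grep 'TABLE DATA'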

Now that I have a dump with some nearly random SQL objects in it, let's filter the data of the tables named reformat and parallel out of the restore. We will take the sample setup from the pg_staging project. Going the quick route, we will not even change the default sample database name that's used, which is postgres. After all, the catalog command of pg_staging that we're using here is a developer command; you're supposed to be using pg_staging for a lot more services than just this one.

$ cp pg_staging/pg_staging.ini .            
$ (echo "schemas = public";
   echo "tablename_nodata_regexp = parallel,reformat") \
  >> pg_staging.ini            
$ echo "catalog postgres pgloader.dump" \
   | python pg_staging/pg_staging.py -c pg_staging.ini
; Archive created at Mon Aug 29 17:17:49 2011
;
; [EDITED OUTPUT]
;
; Selected TOC Entries:
;
3; 2615 2200 SCHEMA - public postgres
1864; 0 0 COMMENT - SCHEMA public postgres
1536; 1259 174935 TABLE public parallel dimitri
1537; 1259 174943 TABLE public partial dimitri
1538; 1259 174951 TABLE public reformat dimitri
;1853; 0 174935 TABLE DATA public parallel dimitri
1854; 0 174943 TABLE DATA public partial dimitri
;1855; 0 174951 TABLE DATA public reformat dimitri
1834; 2606 174942 CONSTRAINT public parallel_pkey dimitri
1836; 2606 174950 CONSTRAINT public partial_pkey dimitri
1838; 2606 174955 CONSTRAINT public reformat_pkey dimitri

We can see that the data for those two tables is indeed commented out, hence skipped. Now here's how to actually run the pg_restore with that list:

$ createdb foo
$ echo "catalog postgres pgloader.dump" \
   | python pg_staging/pg_staging.py -c pg_staging.ini > short.list
$ pg_restore -L short.list -d foo pgloader.dump                                                              
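
To check that the filtering really happened, count the rows in the two tables whose data we skipped; both should exist in foo (their definitions were restored) and both should come back empty:

$ psql -d foo -c "select count(*) from parallel"
$ psql -d foo -c "select count(*) from reformat"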

The little bonus with using pg_staging is that when filtering out a schema it will track all tables and triggers from that schema, and also the functions used in the trigger definitions, which is not as easy as it sounds, believe me!

The practical use case is filtering out PGQ and Londiste: the PGQ triggers then get automatically skipped by pg_staging, rather than polluting the pg_restore logs because the CREATE TRIGGER commands could not find the necessary implementation procedure.