Today pgloader v3.4.1 is released and available! This new release comes with 110 commits, as shown in the GitHub compare view.
This release of pgloader follows the tradition of simplifying things for users, or if you allow me to quote Alan Kay: simple things should be simple, complex things should be possible.
This article mentions pgloader features that are new as of version 3.4.1; check which version you're running with pgloader --version. It is easy enough to build pgloader from sources, and pgloader is also packaged in Debian and its derivatives, and in RPM-based distributions too.
Have a look at http://apt.postgresql.org and https://yum.postgresql.org/ for up-to-date packaging options for your distribution of choice and your PostgreSQL version of choice.
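On Debian and its derivatives, for instance, installing pgloader should be as simple as the following; note that the version shipped by your distribution might lag behind 3.4.1, hence the version check:

$ sudo apt-get install pgloader
$ pgloader --version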
Migrate from MySQL to PostgreSQL
An example of pgloader simplicity is that you can migrate a whole MySQL database, including its schema definition of tables and indexes, primary key and foreign key constraints, comments, and default values (even when those require installing a trigger in PostgreSQL), all of that in a single command line:
$ pgloader mysql://root@localhost/f1db pgsql:///f1db
2017-07-06T17:57:04.679000+02:00 LOG report summary reset
table name read imported errors total time
------------------------- --------- --------- --------- --------------
fetch meta data 33 33 0 0.206s
Create Schemas 0 0 0 0.026s
Create SQL Types 0 0 0 0.007s
Create tables 26 26 0 0.057s
Set Table OIDs 13 13 0 0.005s
------------------------- --------- --------- --------- --------------
f1db.circuits 73 73 0 0.093s
f1db.constructorresults 11011 11011 0 0.139s
f1db.constructors 208 208 0 0.047s
f1db.drivers 841 841 0 0.073s
f1db.constructorstandings 11766 11766 0 0.237s
f1db.laptimes 413578 413578 0 2.643s
f1db.driverstandings 31420 31420 0 0.392s
f1db.pitstops 5796 5796 0 0.132s
f1db.races 976 976 0 0.076s
f1db.seasons 68 68 0 0.053s
f1db.qualifying 7257 7257 0 0.091s
f1db.results 23514 23514 0 0.431s
f1db.status 133 133 0 0.137s
------------------------- --------- --------- --------- --------------
COPY Threads Completion 4 4 0 3.265s
Create Indexes 20 20 0 1.760s
Index Build Completion 20 20 0 0.715s
Reset Sequences 10 10 0 0.029s
Primary Keys 13 13 0 0.010s
Create Foreign Keys 0 0 0 0.000s
Create Triggers 0 0 0 0.000s
Install Comments 0 0 0 0.000s
------------------------- --------- --------- --------- --------------
Total import time 506641 506641 0 4.888s
You can easily reproduce that at home with the Ergast database, a historical record of motor racing data for non-commercial purposes. Download the f1db.sql.gz MySQL 5.1 database dump from there, install it in some MySQL instance and then run the command above.
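In case it helps, here's a minimal sketch of those preparation steps; the download URL is the one advertised by the Ergast project at the time of writing, so double-check it on their downloads page:

$ curl -O http://ergast.com/downloads/f1db.sql.gz
$ gunzip f1db.sql.gz
$ mysql -u root -e 'CREATE DATABASE f1db'
$ mysql -u root f1db < f1db.sql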
To be complete, let's mention that pgloader will not create the target database, so you have to do that yourself. Also, pgloader currently doesn't change the search_path of the target database, so you might want to do that yourself too. Which means you need the following three commands to be exhaustive (the migration itself only happens at the pgloader step):
$ createdb f1db
$ pgloader mysql://root@localhost/f1db pgsql:///f1db
$ psql -d f1db -c 'ALTER DATABASE f1db SET search_path TO f1db, public;'
pgloader needs a live MySQL database to operate its magic: it is less error-prone to query the system's catalogs than to learn how to parse all those SQL dialects.
You might have seen that the automated migration targets a PostgreSQL schema with the same name as the MySQL database. That's because MySQL doesn't really implement the SQL-standard notion of catalogs, which most of the industry calls databases. What MySQL names a database actually is what the standard (and PostgreSQL) refers to as a schema.
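To illustrate the mapping with a sketch, compare the top-level commands on each side; the schema is created inside an already existing PostgreSQL database:

-- MySQL: what is called a database there...
CREATE DATABASE f1db;

-- ...is what PostgreSQL calls a schema:
CREATE SCHEMA f1db;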
MySQL spatial indexes over geometry
Starting with this new release of pgloader, MySQL spatial keys are automatically converted to PostgreSQL GiST indexes. In previous versions of pgloader, when playing with the Sakila Sample Database you would get the following error:
ERROR PostgreSQL Database error 42704: data type point has no default operator class for access method "btree"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
QUERY: CREATE INDEX idx_188012_idx_location ON pagila.address (location);
In that case, what pgloader should do instead is issue the following CREATE INDEX command to PostgreSQL:
CREATE INDEX idx_188012_idx_location ON pagila.address USING gist(location);
And guess what? Yes, of course: the statement above is a paste from a run of the current pgloader version, where this is implemented in a way that is transparent to the user. Also, rather than hardcoding which PostgreSQL types need special treatment here, pgloader runs the following SQL query against the PostgreSQL catalogs, as found in the source tree at src/pgsql/sql/list-typenames-without-btree-support.sql:
select typname,
       array_agg(amname order by amname <> 'gist', amname <> 'gin')
  from pg_type
       join pg_opclass on pg_opclass.opcintype = pg_type.oid
       join pg_am on pg_am.oid = pg_opclass.opcmethod
 where substring(typname from 1 for 1) <> '_'
   and not exists
       (
         select amname
           from pg_am am
                join pg_opclass c on am.oid = c.opcmethod
                join pg_type t on c.opcintype = t.oid
          where amname = 'btree' and t.oid = pg_type.oid
       )
group by typname;
A typical result of the query follows. The important part of discovering data types without btree support at run time is that pgloader will then support locally installed extensions it knows nothing about, and that you can target those types in the user-defined casting rules (see below).
typname │ array_agg
════════════╪══════════════════════
aclitem │ {hash}
box │ {gist,brin,spgist}
cid │ {hash}
circle │ {gist}
int2vector │ {hash}
point │ {gist,spgist,spgist}
polygon │ {gist}
xid │ {hash}
(8 rows)
In version 3.4.1 of pgloader this feature is limited to single-column indexes. If you need support for multi-column indexes, let me know by opening an issue.
Migrate from SQLite to PostgreSQL
In much the same vein, it is possible to migrate a full SQLite database into PostgreSQL in a single command line. This time, as SQLite databases can be distributed as a single file, we can even have pgloader download that file over HTTP for us. Here's a full example using the Chinook database, where again pgloader takes care of the schema definition, indexes, primary keys, foreign keys, default values, comments… the whole thing:
$ pgloader https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite pgsql:///chinook
2017-07-06T18:16:52.256000+02:00 LOG Fetching 'https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite_AutoIncrementPKs.sqlite'
2017-07-06T18:16:54.118000+02:00 ERROR PostgreSQL Database error 42P16: multiple primary keys for table "playlisttrack" are not allowed
QUERY: ALTER TABLE playlisttrack ADD PRIMARY KEY USING INDEX idx_189226_sqlite_autoindex_playlisttrack_1;
2017-07-06T18:16:54.119000+02:00 LOG report summary reset
table name read imported errors total time
----------------------- --------- --------- --------- --------------
fetch 0 0 0 1.305s
fetch meta data 33 33 0 0.034s
Create Schemas 0 0 0 0.001s
Create SQL Types 0 0 0 0.006s
Create tables 22 22 0 0.096s
Set Table OIDs 11 11 0 0.007s
----------------------- --------- --------- --------- --------------
album 347 347 0 0.020s
artist 275 275 0 0.017s
customer 59 59 0 0.024s
genre 25 25 0 0.024s
invoiceline 2240 2240 0 0.049s
employee 8 8 0 0.018s
invoice 412 412 0 0.032s
playlist 18 18 0 0.026s
mediatype 5 5 0 0.030s
playlisttrack 8715 8715 0 0.064s
track 3503 3503 0 0.131s
----------------------- --------- --------- --------- --------------
COPY Threads Completion 4 4 0 0.188s
Create Indexes 22 22 0 0.168s
Index Build Completion 22 22 0 0.066s
Reset Sequences 0 0 0 0.029s
Primary Keys 12 11 1 0.013s
Create Foreign Keys 11 11 0 0.042s
Create Triggers 0 0 0 0.001s
Install Comments 0 0 0 0.000s
----------------------- --------- --------- --------- --------------
Total import time 15607 15607 0 2.123s
The SQLite catalog inspection isn't as straightforward as one would want it to be, and it's been difficult to understand when exactly an index is a primary key index. The documentation of ROWIDs and the INTEGER PRIMARY KEY did shed some light on the issues, and we're good now, it seems. Well, I'm not too sure about the error you see above, complaining that multiple primary keys for table "playlisttrack" are not allowed. We might still have some SQLite catalog introspection hacking to do before we're all good.
Migrate from MS SQL to PostgreSQL
The new release of pgloader also comes with lots of small improvements to the MS SQL support, where it now mostly just works. That's thanks to users who provided me with test cases to reproduce and then fix the bugs. That really is a life changer, in particular with proprietary software, where I can't have a local copy running for my tests.
Load CSV files
At the beginning, pgloader was meant to load data from CSV files into PostgreSQL, for a rather large and flexible definition of CSV. It now also includes support for the COPY file format, fixed-width files, dBase files, and IXF files.
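To give an idea of what that looks like, here's a CSV load command adapted from the pgloader quick start documentation (the command language is covered in the next section); the file name, field list, and target table are of course placeholders:

load csv
     from '/tmp/data.csv' (x, y, a, b, c, d)
     into postgresql:///pgloader?csv (a, b, d, c)
     with truncate,
          skip header = 1,
          fields optionally enclosed by '"',
          fields escaped by double-quote,
          fields terminated by ',';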
The load command and new options
As mentioned earlier, simple things are simple, and sometimes you need to do more complex things; those advanced use cases are handled by the pgloader command language, a DSL that pgloader compiles down to machine code at run time.
The ORM case: migrating to an existing schema
A classic complex case is when you're using an ORM that handles your database schema for your application. It usually makes migrating easier, and everything else more difficult, so I don't recommend using ORMs. See my How to write SQL article on this blog for more details, and maybe the good old The Vietnam of Computer Science too, if you haven't already.
Anyway, disliking ORMs is not reason enough to leave users out in the cold. Of course pgloader supports this use case: with the data only option you can have it run a migration by discovering both the source and target schemas, then merging the data from the source into the target.
New in this release is the ability to target a pre-existing schema and still benefit from pgloader's indexes processing, which is a very nice boost. When loading lots of data into PostgreSQL it's best to (see the plain-SQL sketch right after this list):
- remove existing indexes, thus foreign keys that depend on them,
- load data,
- create indexes again, all at once in parallel,
- upgrade unique indexes into primary keys with ALTER TABLE,
- reinstall foreign key constraints with indexes in place.
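In plain SQL, and with hypothetical object names made up for the f1db example, the cycle looks roughly like this:

-- drop the foreign key first, then the index it depends on
ALTER TABLE f1db.results DROP CONSTRAINT results_raceid_fkey;
DROP INDEX f1db.idx_results_raceid;

-- load the data with COPY, then build the indexes, in parallel sessions
CREATE UNIQUE INDEX idx_results_pkey ON f1db.results(resultid);
CREATE INDEX idx_results_raceid ON f1db.results(raceid);

-- upgrade the unique index to a primary key (its column must be NOT NULL),
-- then reinstall the foreign key now that the indexes are in place
ALTER TABLE f1db.results ADD PRIMARY KEY USING INDEX idx_results_pkey;
ALTER TABLE f1db.results ADD CONSTRAINT results_raceid_fkey
      FOREIGN KEY (raceid) REFERENCES f1db.races(raceid);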
And of course, as it's the best way to do things, that's exactly what pgloader does, now even when the indexes are pre-defined in the target schema, provided that you use the drop indexes option in the WITH clause.
Note that you can still use it on the command line, without a command file, as in this example:
$ pgloader --with "data only" \
--with truncate \
--with "drop indexes" \
--verbose \
mysql://root@localhost/f1db \
pgsql:///f1db
For this command to work, the f1db PostgreSQL target database must have been created beforehand, with the right set of SQL objects in it of course. Which is exactly what ORMs are doing for you, I've been told.
Schema mapping options
In the command language it is possible to map schema names and table names from your source to your target database, and to add options to the load scenario, as in the following example:
load database
     from mysql://root@localhost/sakila
     into postgresql:///sakila

     ALTER TABLE NAMES MATCHING 'sales_by_store' RENAME TO 'sales_by_store_list'
     ALTER TABLE NAMES MATCHING 'film' RENAME TO 'films'
     ALTER TABLE NAMES MATCHING ~/./ SET (fillfactor='40')

     ALTER SCHEMA 'sakila' RENAME TO 'pagila'
Here the ALTER commands are not executed against any database, neither the source nor the target. They are used as mapping specifications from the source to the target, so that tables found in the sakila schema of the source are created in the pagila schema of the target.
Also, you can see that we change the fillfactor for all the tables in the target. The ALTER clauses in the pgloader command are applied to the internal pgloader catalogs, so the actual implementation happens at CREATE TABLE time in PostgreSQL.
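For instance, the net effect for one of the renamed tables could look like the following statement; the column list here is hypothetical, only the schema name and the storage parameter come from the ALTER clauses above:

-- a sketch of what pgloader ends up issuing at CREATE TABLE time
CREATE TABLE pagila.films
(
    film_id   integer NOT NULL,
    title     text
)
WITH (fillfactor = 40);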
User defined casting
The new version doesn't add anything to pgloader's capability for user-defined type casting from the source to the target database. Some default cast rules have been improved though, and you can read the pgloader reference documentation for the details.
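As a reminder of what such casting rules look like, here's a sketch adapted from the MySQL examples in the reference documentation, handling MySQL's zero dates with the zero-dates-to-null transformation:

load database
     from mysql://root@localhost/sakila
     into postgresql:///sakila

     CAST type datetime to timestamptz
                drop default drop not null using zero-dates-to-null,
          type date drop not null drop default using zero-dates-to-null;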
MySQL multiple readers support
An experimental feature made it into this release of pgloader; it's currently available only for the MySQL source type. It is now possible for pgloader to have several concurrent readers working on the same source table.
The idea is that the readers split the reading using WHERE clauses, each covering a distinct range of the data. That allows MySQL to use an Index Range Scan, as documented on its Range Optimization documentation page.
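To give an idea, the concurrent readers issue queries of roughly the following shape; the table, column, and boundary values here are made up for illustration, not pgloader's actual generated queries:

-- reader 1
SELECT * FROM laptimes WHERE raceId >= 1   AND raceId < 500;
-- reader 2
SELECT * FROM laptimes WHERE raceId >= 500 AND raceId < 1000;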
In my testing I didn't have access to a machine with enough CPUs to make this optimisation worthwhile, but it might still be useful to you. Have a try if you need it, and tell me if it helped!
Here’s an example of a setup that uses the multiple readers per thread feature, new in pgloader v3.4.1:
load database
     from mysql://root@localhost/sakila
     into postgresql:///sakila

     WITH concurrency = 2, workers = 6,
          prefetch rows = 25000,
          multiple readers per thread, rows per range = 50000;
Conclusion
Now that you’re using PostgreSQL, enjoy a powerful SQL dialect with plenty of advanced functions to process your data exactly the way you need it!
You might be interested in Exploring a Data Set in SQL if you're new to PostgreSQL or to the data set you've just migrated.
Also, if it makes sense for your business to contribute financially to pgloader's development, consider buying a pgloader Moral License. Cases where it makes sense include a migration completed successfully, on a low budget and within the time constraints, thanks to pgloader! You may then (at your choice) join the pgloader sponsors page!