pgloader: what's next?
pgloader is a tool that helps load data into PostgreSQL, adding some error
management to the COPY command.
COPY is the fast way of loading data into
PostgreSQL and is transaction safe. That means that if a single error
appears within your bulk of data, you will have loaded none of it.
pgloader will submit the data again in smaller chunks until it’s able to
isolate the bad from the good, and then the good is loaded in.
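The retry idea can be sketched in a few lines of Python. This is only an illustration of the "smaller chunks" principle, not pgloader's actual code: splitting a failed batch in halves is an assumption, and `copy_batch` and `fake_copy` are hypothetical stand-ins for a real COPY call that either loads a whole batch or raises.

```python
from typing import Callable, List, Sequence, Tuple


def load_isolating_errors(
    rows: Sequence[str],
    copy_batch: Callable[[Sequence[str]], None],
) -> Tuple[int, List[str]]:
    """Load rows via copy_batch, retrying failed batches in smaller chunks.

    copy_batch is a hypothetical callable that loads a whole batch
    atomically and raises ValueError if any row in it is rejected.
    Returns (number of rows loaded, list of rejected rows).
    """
    if not rows:
        return 0, []
    try:
        copy_batch(rows)          # fast path: the whole batch goes in at once
        return len(rows), []
    except ValueError:
        if len(rows) == 1:        # a batch of one that fails is a bad row
            return 0, [rows[0]]
        mid = len(rows) // 2      # otherwise split in half and retry each part
        left_ok, left_bad = load_isolating_errors(rows[:mid], copy_batch)
        right_ok, right_bad = load_isolating_errors(rows[mid:], copy_batch)
        return left_ok + right_ok, left_bad + right_bad


# Stand-in loader: rejects any batch containing a non-integer value.
loaded: List[str] = []


def fake_copy(batch: Sequence[str]) -> None:
    for value in batch:
        int(value)                # raises ValueError on bad input
    loaded.extend(batch)          # "commit" only if every row was valid


count, rejected = load_isolating_errors(
    ["1", "2", "oops", "4", "x", "6"], fake_copy
)
print(count, rejected)
```

With the sample input above, the four valid rows end up loaded and the two bad ones are returned for inspection. The cost is that every bad row triggers a handful of extra round trips, which is cheap as long as errors are rare.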
In a recent migration project where we freed data from MySQL and moved it
into PostgreSQL, we used pgloader again. But the loading time was not fast
enough for the service downtime window that we had. Indeed, Python is not
known for being the fastest solution around. It’s easy to use and to ship to
production, but sometimes being efficient when writing code is not enough:
you also need the code to actually run fast.
Faster data loading
So I began writing a little dedicated tool for that migration in Common Lisp, which is growing on me as my personal answer to the burning question: Python 2 or Python 3? I find Common Lisp to offer an even more dynamic programming environment and an easier language to use, and the result often has performance characteristics way beyond what I can get with Python: between 5 and 121 times faster in some quite stupid benchmark.
Here, with real data, my one-shot attempt has been running more than twice as fast as the Python version, after about a day of programming.
The other thing here is that I’ve attempted to get pgloader to work in
parallel, but at the time I didn’t know about the Global Interpreter Lock,
which still hasn’t been removed in Python 3, by the way. So my threading
attempts at making pgloader work in parallel are pretty useless.
Whereas in Common Lisp I can just use the lparallel library, which exposes threading facilities along with queueing facilities as a means of passing data between workers, and have my code easily work in parallel for real.
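The shape of that design, workers passing batches to each other through a queue, can be sketched with Python's standard `queue` and `threading` modules. This is not lparallel's API, and the names `reader`, `writer` and `run_pipeline` are made up for the illustration. Under CPython the Global Interpreter Lock still serializes CPU-bound work, so the sketch shows how the workers communicate rather than the speedup a Common Lisp version would get.

```python
import queue
import threading
from typing import Iterable, List

_DONE = object()  # sentinel telling the writer that no more batches are coming


def reader(source: Iterable[str], out: "queue.Queue", batch_size: int) -> None:
    """Read rows from source and push fixed-size batches onto the queue."""
    batch: List[str] = []
    for row in source:
        batch.append(row)
        if len(batch) == batch_size:
            out.put(batch)
            batch = []
    if batch:                     # flush the last, possibly short, batch
        out.put(batch)
    out.put(_DONE)


def writer(inp: "queue.Queue", sink: List[str]) -> None:
    """Pull batches off the queue and 'load' them (here: append to a list)."""
    while True:
        batch = inp.get()
        if batch is _DONE:
            break
        sink.extend(row.upper() for row in batch)  # stand-in for the real load


def run_pipeline(source: Iterable[str], batch_size: int = 2) -> List[str]:
    """Run one reader and one writer thread connected by a bounded queue."""
    q: "queue.Queue" = queue.Queue(maxsize=4)  # bounded: a slow writer throttles the reader
    sink: List[str] = []
    threads = [
        threading.Thread(target=reader, args=(source, q, batch_size)),
        threading.Thread(target=writer, args=(q, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


result = run_pipeline(["a", "b", "c", "d", "e"])
print(result)
```

With a single writer the rows come out in the order they were read. Adding more writers would mean sending one sentinel per writer and giving up that ordering guarantee.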
The only drawback that I can see here is that if you’ve been writing your
reformatting modules in Python for pgloader (yes, you can implement your own
reformatting module for pgloader), then you would have to port them to
Common Lisp. Drop me an email if that’s your case.
So, I think we’re going to have a pgloader 3 someday that will be way
faster than the current one and bundle some more features: real parallel
behavior and the ability to fetch non-local data (connecting to MySQL
directly, or HTTP, S3, etc). I’m also thinking about offering a COPY-like
syntax to drive the loading, while at it. Also, the ability to discover the
set of data to load all by itself when you want to load a whole database:
think of it as a migration mode of operation.
Some feature requests can’t be solved easily when keeping the old
syntax cruft, so it’s high time to implement some kind of a real command
language. I have several ideas about those, in between the
COPY syntax and
SQL*Loader configuration format, which is both clunky and quite
After a beginning in TCL and a complete rewrite in Python in 2005, it looks
like 2013 is going to be the year of pgloader 3, in Common Lisp.