As you might have noticed if you've read my blog before, I decided that pgloader needed a full rewrite in order to enter the current decade as a relevant tool. pgloader used to be written in the Python programming language, which is used by lots of people and generally quite appreciated by its users.
Still, Python is not without problems, the main ones I had to deal with being poor performance and a lack of threading capabilities. Also, the pgloader setup design was pretty hard to maintain, and adding compatibility with competing loader products was harder than it should have been.
After some basic performance checks, as seen in my Common Lisp Sudoku Solver project where I got code up to ten times faster than the Python version, it felt like the amazing feature set of the language could be put to good use here.
So, what about performance after the rewrite?
The main reason I'm now writing this blog post is that I've been receiving emails from pgloader users who are puzzled by the speedup. Let's look at the numbers one user sent me, as a data point:
select rows, v2, v3,
       round((extract(epoch from v2) / extract(epoch from v3))::numeric, 2) as speedup
  from timing;

  rows   |        v2        |       v3        | speedup
---------+------------------+-----------------+---------
 4768765 | @ 37 mins 10.878 | @ 1 min 26.917  |   25.67
 3115880 | @ 36 mins 5.881  | @ 1 min 10.994  |   30.51
 3865750 | @ 33 mins 40.233 | @ 1 min 15.33   |   26.82
 3994483 | @ 29 mins 30.028 | @ 1 min 18.484  |   22.55
(4 rows)
So what we see in this fairly typical CSV loading test case is, at best, a 30-times-faster import. Which, of course, raises some questions.
Wait, you're still using COPY?

pgloader is still using COPY. This time the protocol implementation is to be found in the Common Lisp Postmodern driver, which is really great. Before that, back when pgloader was Python code, it used the very good psycopg driver, which also exposes the COPY protocol.
So, what happened here?
Well, it happens that pgloader is now built using Common Lisp technologies, and those are really great, powerful, and fast!
Not only is Common Lisp code compiled to machine code by most Common Lisp implementations, such as SBCL or Clozure Common Lisp; it's also possible to actually benefit from parallel computing and threads in Common Lisp.
In the pgloader case, I've been using the lparallel library, in particular its queuing facility, to implement asynchronous IO: one thread reads the source data and preprocesses it, filling a buffer one batch at a time; each batch is then pushed down to the writer thread, which handles the COPY protocol and operations.
So my current analysis is that the new thread-based architecture, combined with a very powerful compiler for a high-level language like Common Lisp, allows pgloader to enter a whole new league of data loading performance.
Rather than a configuration file, the new pgloader is driven by a command language that has been designed to look as much like SQL as possible in the pgloader context, to make it easy on its users. Implementation-wise, it should now be simple enough to add compatibility with the data loading software shipped by some PostgreSQL competitors.
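To give the flavor of that command language, a CSV load command looks roughly like this (the exact clauses may differ in the released syntax; check the documentation):

```
LOAD CSV
     FROM 'path/to/file.csv' (x, y, a, b, c, d)
     INTO postgresql:///pgloader?csv (a, b, d, c)
     WITH truncate,
          skip header = 1,
          fields terminated by ','
```

The source column list, target column list, and WITH options read like SQL, which is the whole point.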
Also, the new code base and feature set seem to attract far more users than the previous implementation ever did, despite being written in a less popular programming language.
You can already download pgloader binary packages for debian-based distributions and CentOS-based ones too, and you will even find a Mac OS X package file (.pkg) that makes /usr/local/bin/pgloader available on the command line. If you need a Windows binary, drop me an email.
The first stable release of the new pgloader utility is scheduled to be 3.1.0 and should happen quite soon. We are hard at work on the debian packaging, and you can have a look at the Quicklisp to debian project if you want to help us get there!