Wednesday, May 14 2014
PostgreSQL, pgloader, Common-Lisp

Why is pgloader so much faster?

pgloader loads data into PostgreSQL. The new version is stable enough nowadays that it's soon to be released, the last piece of the 3.1.0 puzzle being full debian packaging of the tool.

The pgloader logo is a loader truck, just because.

As you might have noticed if you've read my blog before, I decided that pgloader needed a full rewrite in order for it to be able to enter the current decade as a relevant tool. pgloader used to be written in the python programming language, which is used by lots of people and generally quite appreciated by its users.

Why changing

Still, python is not without problems, the main ones I had to deal with being poor performances and lack of threading capabilities. Also, the pgloader setup design was pretty hard to maintain, and adding compatiblity to other loader products from competitors was harder than it should.

As I said in my pgloader lightning talk at the 7th European Lisp Symposium last week, in searching for a modern programming language the best candidate I found was actually Common Lisp.

After some basic performances checking as seen in my Common Lisp Sudoku Solver project where I did get up to ten times faster code when compared to python, it felt like the amazing set of features of the language could be put to good use here.

So, what about performances after rewrite?

The main reason why I'm now writing this blog post is receiving emails from pgloader users with strange feelings about the speedup. Let's see at the numbers one user gave me, for some data point:

 select rows, v2, v3,
        round((  extract(epoch from v2)
               / extract(epoch from v3))::numeric, 2) as speedup
   from timing;
        
  rows   |        v2         |       v3        | speedup 
---------+-------------------+-----------------+---------
 4768765 | @ 37 mins 10.878  | @ 1 min 26.917  |   25.67
 3115880 | @ 36 mins 5.881   | @ 1 min 10.994  |   30.51
 3865750 | @ 33 mins 40.233  | @ 1 min 15.33   |   26.82
 3994483 | @ 29 mins 30.028  | @ 1 min 18.484  |   22.55
(4 rows)
The raw numbers have been loaded into a PostgreSQL table

So what we see in this quite typical CSV Loading test case is a best case of 30 times faster import. Which brings some questions on the table, of course.

Wait, you're still using COPY right?

The PostgreSQL database system provides a really neat COPY command, which in turn is only exposing the COPY Streaming Protocol, that pgloader is using.

So yes, pgloader is still using COPY. This time the protocol implementation is to be found in the Common Lisp Postmodern driver, which is really great. Before that, back when pgloader was python code, it was using the very good psycopg driver, which also exposes the COPY protocol.

So, what did happen here?

Well it happens that pgloader is now built using Common Lisp technologies, and those are really great, powerful and fast!

Not only is Common Lisp code compiled to machine code when using most Common Lisp Implementations such as SBCL or Clozure Common Lisp; it's also possible to actually benefit from parallel computing and threads in Common Lisp.

That's not how I did it!

In the pgloader case I've been using the lparallel utilities, in particular its queuing facility to be able to implement asynchronous IOs where a thread reads the source data and preprocess it, fills up a batch at a time in a buffer that is then pushed down to the writer thread, that handles the COPY protocol and operations.

So my current analysis is that the new thread based architecture used with a very powerful compiler for the Common Lisp high-level language are allowing pgloader to enter a whole new field of data loading performances.

Conclusion

Not only is pgloader so much faster now, it's also full of new capabilities and supports several sources of data such as dBase files, SQLite database files or even MySQL live connections.

Rather than a configuration file, the way to use the new pgloader is using a command language that has been designed to look as much like SQL as possible in the pgloader context, to make it easy for its users. Implementation wise, it should now be trivial enough to implement compatibility with other data load software that some PostgreSQL competitor products do have.

Also, the new code base and feature set seems to attract way more users than the previous implementation ever did, despite using a less popular programming language.

You can already download pgloader binary packages for debian based distributions and centos based ones too, and you will even find a Mac OS X package file ( .pkg) that will make /usr/local/bin/pgloader available for you on the command line. If you need a windows binary, drop me an email.

The first stable release of the new pgloader utility is scheduled to be named 3.1.0 and to happen quite soon. We are hard at work on packaging the dependencies for debian, and you can have a look at the Quicklisp to debian project if you want to help us get there!