Monday, August 01 2011
PostgreSQL, pgloader

Parallel pgloader

This article continues the series that began with How To Use PgLoader then detailed How to Setup pgloader. We have some more fine points to talk about here, today's article is about loading your data in parallel with pgloader.

several files at a time

Parallelism is implemented in 3 different ways in pgloader. First, you can load more than one file at a time thanks to the max_parallel_sections parameter, that has to be setup in the global section of the file.

This setting is quite simple and already allows the most common use case.

several workers per file

The other use case is when you have huge files to load into the database. Then you want to be able to have more than one process reading the file at the same time. Using pgloader, you already did the compromise to load the whole content in more than one transaction, so there's no further drawback here about having those multiple transactions per file spread to more than one load worker.

There are basically two ways to split the work between several workers here, and both are implemented in pgloader.

N workers, N splits of the file

section_threads    = 4
split_file_reading = True

Setup this way, pgloader will launch 4 different threads (see the caveat section of this article). Each thread is then given a part of the input data file and will run the whole usual pgloader processing on its own. For this to work you need to be able to seek in the input stream, which might not always be convenient.

one reader, N workers

section_threads    = 4
split_file_reading = False
rrqueue_size       = 5000

With such a setup, pgloader will start 4 different worker threads that will receive the data input in an internal python queue. Another active thread will be responsible of reading the input file and filling the queues in a round robin fashion, but will hand all the processing of the data to each worker, of course.

how many threads?

If you're using a mix and match of max_parallel_sections and section_threads with split_file_reading set to True of False, it's uneasy to know exactly how many threads will run at any time in the loading. How to ascertain which section will run in parallel when it depends on the timing of the loading?

The advice here is the usual one, don't overestimate the capabilities of your system unless you are in a position to check before by doing trial runs.

caveat

Current implementation of all the parallelism in pgloader has been done with the python threading API. While this is easy enough to use when you want to exchange data between threads, it's suffering from the Global Interpreter Lock issue. This means that while the code is doing its processing in parallel, the runtime not so much. You might still benefit from the current implementation if you have hard to parse files, or custom reformat modules that are part of the loading bottleneck.

future

The solution would be to switch to using the newer python multiprocessing API, and some preliminary work has been done in pgloader to allow for that. If you're interested in real parallel bulk loading, contact-me!