This file is expected in the INI format, with a global section then one section per file you want to import. The global section defines some default options and how to connect to the PostgreSQL database server.
The configuration setup is fully documented in the pgloader man page, which you can easily find online. As with all unix-style man pages, though, it's more a complete reference than introductory material. Let's review.
Here's the global section of the examples/pgloader.conf file from the sources. Well, some of its options are really debugger-only options, so I changed their values so that what you see here is a better starting point.
[pgsql]
base                                  = pgloader
log_file                              = /tmp/pgloader.log
log_min_messages                      = INFO
client_min_messages                   = WARNING
lc_messages                           = C
pg_option_client_encoding             = 'utf-8'
pg_option_standard_conforming_strings = on
pg_option_work_mem                    = 128MB
copy_every                            = 15000
null                                  = ""
empty_string                          = "\ "
max_parallel_sections                 = 4
You don't see all of the connection setup here; base was enough. You might need to set up user, and maybe even pass, too, to be able to connect to the PostgreSQL server.
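As a rough sketch of what happens with these options, here's how such INI entries could be turned into a libpq-style connection string. Only base, user and pass are mentioned in the article; the host and port keys below are assumptions added for illustration.

```python
# Hypothetical sketch: mapping [pgsql] connection options to a
# libpq-style connection string. base/user/pass come from the article;
# host and port are assumed extras for illustration.
def make_dsn(options):
    """Build a libpq-style DSN from an INI options dict."""
    mapping = {"base": "dbname", "host": "host", "port": "port",
               "user": "user", "pass": "password"}
    return " ".join(f"{libpq}={options[key]}"
                    for key, libpq in mapping.items()
                    if key in options)

print(make_dsn({"base": "pgloader", "user": "dim", "pass": "secret"}))
# -> dbname=pgloader user=dim password=secret
```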
The logging options allow you to set a file where all messages get logged; messages are categorized by severity, from DEBUG up to CRITICAL. The log_min_messages and client_min_messages options are another good idea stolen from PostgreSQL: they allow you to set up the level of chatter you want to see on the interactive console (standard output and standard error streams) and in the log file, respectively.
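The behaviour of those two thresholds can be sketched with Python's own logging module: one logger, two handlers, each filtering at its own level.

```python
# Sketch: two destinations with independent thresholds, mirroring
# client_min_messages (console) and log_min_messages (log file).
import logging, os, sys, tempfile

fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)

logger = logging.getLogger("pgloader-sketch")
logger.setLevel(logging.DEBUG)            # let the handlers do the filtering

console = logging.StreamHandler(sys.stderr)
console.setLevel(logging.WARNING)         # client_min_messages = WARNING

logfile = logging.FileHandler(log_path)
logfile.setLevel(logging.INFO)            # log_min_messages = INFO

logger.addHandler(console)
logger.addHandler(logfile)

logger.info("only in the log file")       # below the console threshold
logger.warning("on the console and in the log file")
```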
Please note that the DEBUG level will produce more than 3 times as much data as the data file you're importing. Unless you're a pgloader contributor, or helping them debug it, you want to avoid setting the log chatter to DEBUG.
The client_encoding will be set by pgloader on the PostgreSQL connection it establishes. You can now even set any parameter you want by using the pg_option_parameter_name magic settings. Note that the matching command line option (-o, for brevity) allows you to override that.
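The pg_option_ prefix idea can be sketched as follows; this is not pgloader's actual code, just an illustration of turning such keys into per-connection SET commands (the SQL strings are only built, not executed, and quoting is simplified).

```python
# Sketch: translating pg_option_<parameter_name> entries into SET
# commands for the PostgreSQL session. Illustration only.
def pg_option_sets(options):
    """Yield one SET statement per pg_option_* entry."""
    prefix = "pg_option_"
    for key, value in options.items():
        if key.startswith(prefix):
            yield f"SET {key[len(prefix):]} TO {value};"

settings = {
    "base": "pgloader",                       # not a pg_option_*: skipped
    "pg_option_client_encoding": "'utf-8'",
    "pg_option_work_mem": "'128MB'",
}
print(list(pg_option_sets(settings)))
# -> ["SET client_encoding TO 'utf-8';", "SET work_mem TO '128MB';"]
```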
The copy_every parameter is set to 5 in the examples file, because the test files contain fewer than 10 lines and we want to exercise several rounds of commits when using them. So for your real loading, stick to the default (10 000 lines per COPY command), or more. You can play with this parameter: depending on the network (or local access) and disk system you're using, you might see improvements by reducing it or enlarging it. There's not so much a theory of operation here as empirical testing and setting. For a one-off operation, just remove the lines from the configuration.
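The batching idea behind copy_every can be sketched in a few lines: split the input rows into fixed-size batches and commit after each one, so a problem only affects its own batch rather than the whole load.

```python
# Sketch: copy_every-style batching; each yielded batch would be sent
# as one COPY command and committed on its own.
from itertools import islice

def batches(rows, copy_every):
    """Yield successive lists of at most copy_every rows."""
    it = iter(rows)
    while batch := list(islice(it, copy_every)):
        yield batch

sizes = [len(b) for b in batches(range(12), 5)]
print(sizes)   # -> [5, 5, 2]
```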
The null and empty_string parameters are related to interpreting the data in the text or csv files you have, and the documentation is quite clear about them. Note that you have a global setting and per-section settings, too.
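Those two settings amount to recognizing sentinel values per field; with the global section's values (null = "" and empty_string = "\ "), a sketch of the interpretation looks like this:

```python
# Sketch: reinterpreting raw fields with the null and empty_string
# settings shown in the global section above.
NULL_SENTINEL = ""        # null = ""
EMPTY_SENTINEL = "\\ "    # empty_string = "\ " (backslash, space)

def interpret(field):
    if field == NULL_SENTINEL:
        return None       # loaded as SQL NULL
    if field == EMPTY_SENTINEL:
        return ""         # loaded as an empty string
    return field

print([interpret(f) for f in ["42", "", "\\ "]])   # -> ['42', None, '']
```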
The last parameter of this example,
max_parallel_sections, is detailed later
in the article.
After the global section come as many sections as you have files to load. You can also have template sections, which are only there so that you can share a bunch of parameters among more than one section. Picture a series of data files all in the same format, where the only thing that changes from one to the next is the filename. Use a template section in this case!
Let's see an example:
[simple_tmpl]
template        = True
format          = text
datestyle       = dmy
field_sep       = |
trailing_sep    = True

[simple]
use_template    = simple_tmpl
table           = simple
filename        = simple/simple.data
columns         = a:1, b:3, c:2
skip_head_lines = 2

# those reject settings are the default ones
reject_log      = /tmp/simple.rej.log
reject_data     = /tmp/simple.rej

[partial]
table           = partial
format          = text
filename        = partial/partial.data
field_sep       = %
columns         = *
only_cols       = 1-3, 5
Those are 2 of the examples from the examples/pgloader.conf file, split into 3 sections so that we see one template example. Of course, with a single section using the template, the template is only here for the sake of the example.
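The template mechanism above boils down to merging two INI sections, the concrete section's own options winning over the template's. A minimal sketch with Python's configparser (resolve is a hypothetical helper, not pgloader's actual code):

```python
# Sketch: resolving use_template by layering a section's options on
# top of its template section's options.
import configparser, io

conf = io.StringIO("""
[simple_tmpl]
template     = True
format       = text
field_sep    = |

[simple]
use_template = simple_tmpl
table        = simple
filename     = simple/simple.data
""")

parser = configparser.ConfigParser()
parser.read_file(conf)

def resolve(parser, name):
    """Return the section's options with its template merged in."""
    opts = dict(parser[name])
    tmpl = opts.pop("use_template", None)
    if tmpl:
        merged = dict(parser[tmpl])
        merged.pop("template", None)   # drop the marker flag
        merged.update(opts)            # section options win
        return merged
    return opts

print(resolve(parser, "simple"))
```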
data file format
The most important setting that you have to care about is the file format.
Your choice here is either text, csv, or fixed. Mostly, what we are given is csv. You might remember having read that the nice thing about standards is that there are so many to choose from... well, csv land is one where it's pretty hard to find different producers that understand it the same way.
So when you fail to have pgloader load your mostly-csv files with a csv setup, it's time to consider using text instead. The text file format accepts a lot of tunables to adapt to crazy situations, but it is all Python code, whereas the python csv module is a C-coded module, and thus more efficient.
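For contrast, here's what a well-behaved file looks like to Python's C-coded csv module; the sample data below is made up for illustration.

```python
# Sketch: a well-behaved csv file read with Python's csv module,
# using a custom delimiter and quoting.
import csv, io

data = io.StringIO('1;"some text";42\n2;"other; text";7\n')
rows = list(csv.reader(data, delimiter=";", quotechar='"'))
print(rows)   # -> [['1', 'some text', '42'], ['2', 'other; text', '7']]
```

Note how the quoted delimiter inside "other; text" is handled for free; that's the kind of thing the text format makes you tune by hand.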
If you're wondering what kind of format we're talking about here, here's the cluttered pgloader example for your reading pleasure, using ^ (caret) as the field separator:
1^some multi\
line text with\
newline escaping^and some other data following^
2^and another line^clean^
3^and\
a last multiline\
escaped line
with a missing\
escaping^just to test^
4^\ ^empty value^
5^^null value^
6^multi line\
escaped value\
\
with empty line\
embeded^last line^
And here's what we get by loading that:
pgloader/examples$ pgloader -c pgloader.conf -s cluttered
Table name        |    duration |    size |  copy rows |     errors 
====================================================================
cluttered         |      0.193s |       - |          6 |          0

pgloader/examples$ psql pgloader -c "table cluttered;"
 a |               b               |        c         
---+-------------------------------+------------------
 1 | and some other data following | some multi
   :                               : line text with
   :                               : newline escaping
 2 | clean                         | and another line
 3 | just to test                  | and
   :                               : a last multiline
   :                               : escaped line
   :                               : with a missing
   :                               : escaping
 4 | empty value                   | 
 5 | null value                    | 
 6 | last line                     | multi line
   :                               : escaped value
   :                               : 
   :                               : with empty line
   :                               : embeded
(6 rows)
So when you have that kind of data, well, it might be that pgloader is still able to help you!
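A heavily simplified sketch of reading that text format: a physical line ending in a backslash continues into the next one (the newline is kept in the value), ^ separates fields, and the trailing ^ is dropped. This is not pgloader's actual parser, which has many more tunables and even recovers from row 3's missing escape.

```python
# Simplified sketch of the backslash-continuation text format above.
# Illustration only; the real text format is far more tolerant.
def parse(raw):
    logical, current = [], ""
    for line in raw.splitlines():
        if line.endswith("\\"):
            current += line[:-1] + "\n"   # keep the newline in the value
        else:
            logical.append(current + line)
            current = ""
    # split fields on ^ and drop the trailing separator
    return [l.split("^")[:-1] for l in logical]

sample = "2^and another line^clean^\n5^^null value^\n"
print(parse(sample))
# -> [['2', 'and another line', 'clean'], ['5', '', 'null value']]
```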
Please refer to the pgloader man page to learn about each and every parameter that you can define, the accepted values, etc. As for the fixed data format, it is to be used when you're given not a field separator but field positions in the file. Yes, we still encounter those from time to time. Who needs variable-size storage, after all?