A long time ago we talked about how to Import fixed width data with pgloader, following up on other stories still online at Postgres OnLine Journal and on David Fetter’s blog.
Back then, I showed that using pgloader made it easier to import the data,
but also showed quite poor performances characteristics due to using the
debug
mode in the timings. Let’s update that article with
current pgloader wonders!
Redoing the python based test
Let’s be fair, hardware did evolve in those past 3 years, and the test that ran in 14 seconds was done with debug information level, which is the wrong way to do tests.
$ ~/dev/pgloader-v2/pgloader.py -R. -Tsc census.conf
Table name | duration | size | copy rows | errors
====================================================================
fixed | 1.834s | - | 25375 | 0
I got timings anywhere betseen 1.5 seconds and 1.834 seconds here.
The new pgloader
Now with the current version of pgloader, what do we get:
$ pgloader.exe census-place.load
2013-11-18T12:02:35.001000+01:00 LOG Starting pgloader, log system is ready.
2013-11-18T12:02:35.003000+01:00 LOG Parsing commands from file "/Users/dim/dev/pgloader/test/census-places.load"
table name read imported errors time
------------------------------ --------- --------- --------- --------------
download 0 0 0 1.587s
extract 0 0 0 1.010s
before load 0 0 0 0.014s
------------------------------ --------- --------- --------- --------------
places 25375 25375 0 0.366s
------------------------------ --------- --------- --------- --------------
Total import time 25375 25375 0 2.977s
So the loading itself took as much as 366 milliseconds to run. To be fair that’s kind of a best run, with run times varying between that and 700 milliseconds.
So the new version is about 3 to 9 times faster depending on the story you want to tell. Let’s stick with much faster for the sake of this article.
The command
The new loader takes a full command as its input, rather than a configuration file. Here’s what the command I’ve used this time looks like:
LOAD ARCHIVE
FROM http://www.census.gov/geo/maps-data/data/docs/gazetteer/places2k.zip
INTO postgresql:///pgloader
BEFORE LOAD DO
$$ drop table if exists places; $$,
$$ create table places
(
usps char(2) not null,
fips char(2) not null,
fips_code char(5),
loc_name varchar(64)
);
$$
LOAD FIXED
FROM FILENAME MATCHING ~/places2k.txt/
WITH ENCODING latin-1
(
-- name start length
usps 0 2,
fips 2 2,
fips_code 4 5,
loc_name 9 64,
p 73 9,
h 82 9,
land 91 14,
water 105 14,
ldm 119 14,
wtm 131 14,
lat 143 10,
long 153 11
)
INTO postgresql:///pgloader?places
(
usps, fips, fips_code,
loc_name text using (right-trim loc_name)
);
First, the URL used in the blog post of april 2010 is no longer valid. You can find a list of other interesting data at the Census 2000 Gazetteer Files page, and of course you have more recent version of the data available. In another format entirely, this time tab-based csv-like, so better for general usage, but not for this test where I wanted to reuse the same data source as 3 years ago.
What we can see in that command is that pgloader will actually download the zip archive file from its http source URL, unzip it locally then work on the filename from the archive matching the one we know about: we don’t want to hardcode in the command the name of the directory contained in the zip file.
Also, contrary to the previous version, it’s quite easy to just trim the
loc_name
column as we load the data. Here I’ve been adding a new function to
do that, because I wanted to play with optimizing it (adding type
declarations and inlining it), but the loading works about as well with just
the following (just timed 3 runs at
0.771s
,
0.654s
and
0.862s
) :
INTO postgresql:///pgloader?places
(
usps, fips, fips_code,
loc_name text using (string-right-trim " " loc_name)
);
The string-right-trim function is part of the Common Lisp Standard. The optimisation here looks like:
(declaim (inline right-trim))
(defun right-trim (string)
"Remove whitespaces at end of STRING."
(declare (type simple-string string))
(string-right-trim '(#\Space) string))
Note that you could be providing that definition in your own
trim.lisp
file
and provide it using the
--load trim.lisp
command line option, pgloader
would then compile that to machine code for you before processing your data
file.
Conclusion
If you’re already using pgloader, you will enjoy the new version of it! The new version comes with a command line option to migrate the old configuration file into a command string, making upgrades even easier.
Of course, if you’re interested, consider giving the release candidate a try, it’s available on the pgloader github repository already.