So, following previous blog entries about importing fixed width data, from Postgres Online Journal and David (perl) Fetter, I couldn't resist following the meme and showing how to achieve the same thing with pgloader.
I can't say how much I dislike such things as the following, and I can't help thinking that non IT people are right looking at us like this when encountering such prose.
map {s/\D*(\d+)-(\d+).*/$a.="A".(1+$2-$1). " "/e} split(/\n/,<<'EOT');
So, the pgloader way. First you need to have setup a database, I called it
pgloader here. Then you need the same CREATE TABLE as on the original
article, here is it for completeness:
CREATE TABLE places(usps char(2) NOT NULL,
fips char(2) NOT NULL,
fips_code char(5),
loc_name varchar(64));
Now the data file I've taken here: http://www.census.gov/tiger/tms/gazetteer/places2k.txt.
Then we translate the file description into pgloader setup:
[pgsql] host = localhost port = 5432 base = pgloader user = dim pass = None log_file = /tmp/pgloader.log log_min_messages = DEBUG client_min_messages = WARNING client_encoding = 'latin1' lc_messages = C pg_option_standard_conforming_strings = on [fixed] table = places format = fixed filename = places2k.txt columns = * fixed_specs = usps:0:2, fips:2:2, fips_code:4:5, loc_name:9:64, p:73:9, h:82:9, land:91:14, water:105:14, ldm:119:14, wtm:131:14, lat:143:10, long:153:11
We're ready to import the data now:
dim ~/PostgreSQL/examples pgloader -vsTc pgloader.conf
pgloader INFO Logger initialized
pgloader WARNING path entry '/usr/share/python-support/pgloader/reformat' does not exists, ignored
pgloader INFO Reformat path is []
pgloader INFO Will consider following sections:
pgloader INFO fixed
pgloader INFO Will load 1 section at a time
fixed INFO columns = *, got [('usps', 1), ('fips', 2), ('fips_code', 3), ('loc_name', 4)]
fixed INFO Loading threads: 1
fixed INFO closing current database connection
fixed INFO fixed processing
fixed INFO TRUNCATE TABLE places;
pgloader INFO All threads are started, wait for them to terminate
fixed INFO COPY 1: 10000 rows copied in 5.769s
fixed INFO COPY 2: 10000 rows copied in 5.904s
fixed INFO COPY 3: 5375 rows copied in 3.187s
fixed INFO No data were rejected
fixed INFO 25375 rows copied in 3 commits took 14.907 seconds
fixed INFO No database error occured
fixed INFO closing current database connection
fixed INFO releasing fixed semaphore
fixed INFO Announce it's over
Table name | duration | size | copy rows | errors
====================================================================
fixed | 14.901s | - | 25375 | 0
Note the -T option is for TRUNCATE, which you only need when you want to
redo the loading, I've come to always mention it in interactive usage. The
-v option is for some more verbosity and the -s for the summary at end of
operations.
With the pgloader.conf and places2k.txt in the current directory, and an
empty table, just typing in pgloader at the prompt would have done the job.
Oh, the pg_option_standard_conforming_strings bit is from the git HEAD, the
current released version has no support for setting any PostgreSQL knob
yet. Still, it's not necessary here, so you can forget about it.
You will also notice that pgloader didn't trim the data for you, which ain't funny for the places column. That's a drawback of the fixed width format that you can work on two ways here, either by means of
UPDATE places SET loc_name = trim(loc_name) ;or a custom reformat module for pgloader. I guess the latter solution is overkill, but it allows for pipe style processing of the data and a single database write.
Send me a mail if you want me to show here how to setup such a reformatting module in a next blog entry!
