Performance consulting involves some tricks that you have to teach over and over again. One of them is that SQL tends to be so much better at dealing with plenty of rows in a single statement when compared to running as many statements, each one against a single row.
Another kind of Batch to update
So when you need to
UPDATE a bunch of rows from a given source, remember
that you can actually use a
JOIN in the update statement. Either the source
of data is already in the database, in which case it's as simple as using
FROM clause in the update statement, or it's not, and we're getting back
to that in a minute.
It's all about using that
FROM clause in an update statement, right?
UPDATE target t SET counter = t.counter + s.counter, FROM source s WHERE t.id = s.id
Using that, you can actually update thousands of rows in our target table in a single statement, and you can't really get faster than that.
Preparing the Batch
Now, if you happen to have the source data in your application process' memory, the previous bits is not doing you any good, you think. Well, the trick is that pushing your in-memory data into the database and then joining against the now local source of data is generally faster than looping in the application and having to do a whole network round trip per row.
What about that round trip?
Let's see how it goes:
CREATE TEMP TABLE source(LIKE target INCLUDING ALL) ON COMMIT DROP; COPY source FROM STDIN; UPDATE target t SET counter = t.counter + s.counter, FROM source s WHERE t.id = s.id
As we're talking about performances, the trick here is to use the COPY
protocol to fill in the temporary table we just create to hold our data. So
we're now sending the whole data set in a temporary location in the
database, then using that as the
UPDATE source. And that's way faster than
doing a separate
UPDATE statement per row in your batch, even for small
Also, rather than using the SQL
COPY command, you might want to look up the
docs of the PostgreSQL driver you are currently using in your application,
it certainly includes some higher level facilities to deal with pushing the
data into the streaming protocol.
Insert or Update
And now sometime some of the rows in the batch have to be updated while some
others are new and must be inserted. How do you do that? Well, PostgreSQL
9.1 brings on the table
WITH support for all DML queries, which means that
you can do the following just fine:
WITH upd AS ( UPDATE target t SET counter = t.counter + s.counter, FROM source s WHERE t.id = s.id RETURNING s.id ) INSERT INTO target(id, counter) SELECT id, sum(counter) FROM source s LEFT JOIN upd t USING(id) WHERE t.id IS NULL GROUP BY s.id RETURNING t.id
That query here is updating all the rows that are known in both the target and the source and returns what we took from the source in the operation, so that we can do an anti-join in the next step of the query, where we're inserting any row that was not taken care of in the update part of the statement.
Note that when the batch gets to bigger size it's usually better to join
against the target table in the
INSERT statement, because that will have an
index on the join key.
Now, you will tell me that we just solved the
UPSERT problem. Well what
happens if more than one transaction is trying to do the
INSERT dance at the same time? It's a single statement, so it's a single
snapshot. What can go wrong?
What happens is that as soon as the concurrent sources contain some data for
the same primary key, you get a duplicate key error on the insert. As both
the transactions are concurrent, they are seeing the same target table where
the new data does not exists, and both will conclude that they need to
INSERT the new data into the target table.
There are two things that you can do to avoid the problem. The first thing is to make it so that you're doing only one batch update at any time, by architecting your application around that constraint. That's the most effective way around the problem, but not the most practical.
The other thing you can do, is force the concurrent transactions to serialize one after the other, using an explicit locking statement:
LOCK TABLE target IN SHARE ROW EXCLUSIVE MODE;
That lock level is not automatically acquired by any PostgreSQL command, so the only way it helps you is when you're doing that for every transaction you want to serialize. When you know you're not at risk (that is, when not playing the insert or update dance), you can omit taking that lock.
The SQL language has its quirks, that's true. It's been made for efficient data processing, and with recent enough PostgreSQL releases you even have some advanced pipelining facilities included in the language. Properly learning how to make the most out of that old component of your programming stack still makes a lot of sense today!