Thanks to the Postgres Weekly issue #89 and a post to Hacker News front page (see Pgloader: A High-speed PostgreSQL Swiss Army Knife, Written in Lisp it well seems that I just had my first Slashdot effect...
PostgreSQL comes with an awesome bulk copy protocol and tooling best known
\copy commands. Being a transactional system, PostgreSQL
COPY implementation will
ROLLBACK any work done if a single error is found
in the data set you're importing. That's the reason why
started: it provides with error handling for the
A couple of week ago I had the chance to participate into the PGConf NYC 2014 Conference, one of the biggest conferences about PostgreSQL worldwide.
Last week some PostgreSQL users, contributors and advocates have organized a really great conference in Stockholm, Sweden, where I had the please to give the following talk:
In our previous article Aggregating NBA data, PostgreSQL vs MongoDB we spent time comparing the pretty new MongoDB Aggregation Framework with the decades old SQL aggregates. Today, let's showcase more of those SQL aggregates, producing a nice histogram right from our SQL console.
When reading the article Crunching 30 Years of NBA Data with MongoDB Aggregation I coulnd't help but think that we've been enjoying aggregates in SQL for 3 or 4 decades already. When using PostgreSQL it's even easy to actually add your own aggregates given the SQL command create aggregate.
A long time ago we talked about how to
Import fixed width data with pgloader, following up on other stories still
Postgres OnLine Journal and on
David Fetter's blog. Back then, I
showed that using pgloader made it easier to import the data, but also
showed quite poor performances characteristics due to using the
in the timings. Let's update that article with current
As presented at the PostgreSQL Conference Europe the new version of pgloader is now able to fully migrate a MySQL database, including discovering the schema, casting data types, transforming data and default values. Sakila is the traditional MySQL example database, in this article we're going to fully migrate it over to PostgreSQL.
Last week I had the pleasure to present two talks at the awesome PostgreSQL Conference Europe. The first one was actually a tutorial about Writing & using Postgres Extensions where we spent 3 hours on what are PostgreSQL Extensions, what you can expect from them, and how to develop a new one. Then I also had the opportunity to present the new version of pgloader in a talk about Migrating from MySQL to PostgreSQL.
In our Tour of Extensions today's article is about advanced tag indexing. We have a great data collection to play with and our goal today is to be able to quickly find data matching a complex set of tags. So, let's find out those lastfm tracks that are tagged as blues and rhythm and blues, for instance.
PostgreSQL is an all round impressive Relational DataBase Management System which implements the SQL standard (see the very useful reference page Comparison of different SQL implementations for details). PostgreSQL also provides with unique solutions in the database market and has been leading innovation for some years now. Still, there's no support for Autonomous Transactions within the server itself. Let's have a look at how to easily implement them with PL/Proxy.
Let's get back to our Tour of Extensions that had to be kept aside for awhile with other concerns such as last chance PostgreSQL data recovery. Now that we have a data loading tool up to the task (read about it in the Loading Geolocation Data article) we're going to be able to play with the awesome ip4r extension from RhodiumToad.
Last Friday I had the chance to be speaking at the Open World Forum in the NewSQL track, where we had lots of interest and excitement around the NoSQL offerings. Of course, my talk was about explaining how PostgreSQL is Web Scale with some historical background and technical examples about what this database engine is currently capable of.
In our previous article about Loading Geolocation Data, we did load some data into PostgreSQL and saw the quite noticable impact of a user transformation. As it happens, the function that did the integer to IP representation was so naive as to scratch the micro optimisation itch of some Common Lisp hackers: thanks a lot guys, in particular stassats who came up with the solution we're seeing now.
The following story is only interesting to read if you like it when bad things happen, or if you don't have a trustworthy backup policy in place. By trustworthy I mean that each backup you take must be tested with a test recovery job. Only tested backups will prove useful when you need them. So go read our Backup and Restore documentation chapter then learn how to setup Barman for handling physical backups and Point In Time Recovery. Get back when you have proper backups, including recovery testing in place. We are waiting for you. Back? Ok, let's see how bad you can end up without backups, and how to still recover. With luck.
In our ongoing Tour of Extensions we played with earth distance in How far is the nearest pub? then with hstore in a series about trigger, first to generalize Trigger Parameters then to enable us to Auditing Changes with Hstore. Today we are going to work with pg_trgm which is the trigrams PostgreSQL extension: its usage got seriously enhanced in recent PostgreSQL releases and it's now a poor's man Full Text Search engine.
In a previous article about Trigger Parameters we have been using the extension hstore in order to compute some extra field in our records, where the fields used both for the computation and for storing the results were passed in as dynamic parameters. Today we're going to see another trigger use case for hstore: we are going to record changes made to our tuples.
Sometimes you want to compute values automatically at
INSERT time, like for
duration column out of a
start and an
end column, both
timestamptz. It's easy enough to do with a
BEFORE TRIGGER on your table.
What's more complex is to come up with a parametrized spelling of the
trigger, where you can attach the same
stored procedure to any table even
when the column names are different from one another.
There was SQL before window functions and SQL after window functions: that's how powerful this tool is. Being that of a deal breaker unfortunately means that it can be quite hard to grasp the feature. This article aims at making it crystal clear so that you can begin using it today and are able to reason about it and recognize cases where you want to be using window functions.
About the only time when I will accept to work with MySQL is when you need help to migrate away from it because you decided to move to PostgreSQL instead. And it's already been too much of a pain really, so after all this time I began consolidating what I know about that topic and am writing a software to help me here. Consider it the MySQL Migration Toolkit.
In our recent article about The Most Popular Pub Names we did have a look at how to find the pubs nearby, but didn't compute the distance in between that pub and us. That's because how to compute a distance given a position on the earth expressed as longitude and latitude is not that easy. Today, we are going to solve that problem nonetheless, thanks to PostgreSQL Extensions.
Those days feel really lucky to me. I'm currently visiting friends and customers in San Francisco, and really enjoying my trip here! Of course Josh Berkus took the opportunity to organise a SFPUG meeting and I had the pleasure of being the speaker over there.
Last week was held the CHAR(13) conference in a great venue in the UK countryside. Not only did we discover UK under good weather conditions and some local beers, we also did share a lot of good ideas!
In a recent article here we've been talking about how do do Batch Updates in a very efficient way, using the Writable CTE features available in PostgreSQL 9.1. I sometime read how Common Table Expressions changed the life of fellow DBAs and developers, and would say that Writable CTE are at least the same boost again.
In a recent article Craig Kerstiens from Heroku did demo the really useful crosstab extension. That function allows you to pivot a table so that you can see the data from different categories in separate columns in the same row rather than in separate rows. The article from Craig is Pivoting in Postgres.
Recently I've been to some more conferences and didn't take the time to blog about them, even though I really did have great fun over there. So I felt I should take some time and report about my experience at those conferences. And of course, some more is on the way, as the PostgreSQL Conference Tour gets busier each year it seems.
Tonight I had the pleasure to present a talk at the Dublin PostgreSQL User Group using remote technologies. The talk is about how to make the most ouf of PostgreSQL when using SQL as a developer, and tries to convince you to dive into mastering SQL by showing how to solve an application example all in SQL, using window functions and common table expressions.
In this article, we want to find the town with the greatest number of inhabitants near a given location.
In the previous article here we talked about how to properly update more than one row at a time, under the title Batch Update. We did consider performances, including network round trips, and did look at the behavior of our results when used concurrently.
Performance consulting involves some tricks that you have to teach over and over again. One of them is that SQL tends to be so much better at dealing with plenty of rows in a single statement when compared to running as many statements, each one against a single row.
In the article from yesterday we talked about PostgreSQL HyperLogLog with some details. The real magic of that extension has been skimmed over though, and needs another very small article all by itself, in case you missed it.
If you've been following along at home the newer statistics developments,
you might have heard about this new
State of The Art Cardinality Estimation Algorithm called
technique is now available for PostgreSQL in the extension
https://github.com/aggregateknowledge/postgresql-hll and soon
to be in
skytools related new today, it's been a while. For those who where at
FOSDEM's talk about
Implementing High Availability you might have heard
that I really like working with
PGQ. A new version has been released a while
ago, and the most recent verion is now
3.1.3, as announced in the
Skytools 3.1.3 email.
This year's FOSDEM has been a great edition, in particular the FOSDEM PGDAY 2013 was a great way to begin a 3 days marathon of talking about PostgreSQL with people not only from our community but also from plenty other Open Source communities too: users!
pgloader is a tool to help loading data into
PostgreSQL, adding some error
management to the
COPY is the fast way of loading data into
PostgreSQL and is transaction safe. That means that if a single error
appears within your bulk of data, you will have loaded none of it.
will submit the data again in smaller chunks until it's able to isolate the
bad from the good, and then the good is loaded in.
Another day, another migration from MySQL to PostgreSQL... or at least that's how it feels sometimes. This time again I've been using some quite old scripts to help me do the migration.
In a recent article titled Inline Extensions we detailed the problem of how to distribute an extension's package to a remote server without having access to its file system at all. The solution to that problem is non trivial, let's say. But thanks to the awesome PostgreSQL Community we finaly have some practical ideas on how to address the problem as discussed on pgsql-hackers, our development mailing list.
We've been having the
CREATE EXTENSION feature in
PostgreSQL for a couple of
releases now, so let's talk about how to go from here. The first goal of the
extension facility has been to allow for a clean
restore process of
contrib modules. As such it's been tailored to the needs of deploying files
file system because there's no escaping from that when you have to
executable files, those infamous
As Guillaume says, we've been enjoying a great evening conference in Lyon 2 days ago, presenting PostgreSQL to developers. He did the first hour presenting the project and the main things you want to know to start using PostgreSQL in production, then I took the opportunity to be talking to developers to show off some SQL.
Last week was
PostgreSQL Conference Europe 2012 in Prague, and it's been
awesome. Many thanks to the organisers who did manage to host a very smooth
290 attendees, including speakers. That means you kept
walking into interesting people to talk to, and in particular the
Track has been a giant success.
It's been a long time since I last had some time to spend on the
PostgreSQL extension and its
prefix_range data type. With PostgreSQL 9.2
out, some users wanted me to update the extension for that release, and
hinted me that it was high time that I fix that old bug for which I already
had a patch.
I've been given a nice puzzle that I think is a good blog article opportunity, as it involves some thinking and window functions.
PostgreSQL 9.2 is released! It's an awesome new release that I urge you to
consider trying and adopting, an upgrade from even
9.1 should be very well
worth it, as your hardware could suddenly be able to process a much higher
load. Indeed, better performances mean more work done on the same budget,
that's the name of the game!
The french PostgreSQL Conference, pgday.fr, was yesterday in Lyon. We had a very good time and a great schedule with a single track packed with 7 talks, addressing a diverse set of PostgreSQL related topics, from GIS to fuzzy logic, including replication.
Last week was the annual PostgreSQL Hackers gathering in Canada, thanks to the awesome pgcon conference. This year's issue has been packed with good things, beginning with the Cluster Summit then followed the next day by the Developer Meeting just followed (yes, in the same day) with the In Core Replication Meeting. That was a packed shedule!
Now that you're all using the wonders of Cooperative Consumers to help you efficiently and reliably implement your business constraints and offload them from the main user transactions, you're reaching a point where you have to clean up your development environment (because that's what happens to development environments, right?), and you want a way to start again from a clean empty place.
While working a new
PostgreSQL architecture for an high scale project that
used to be in the top 10 of internet popular web sites (in terms of
visitors), I needed to be able to off load some processing from the main
path: that's called a
batch job. This needs to be
transactional: don't run
the job if we did
rollback; the transaction, process all
events that were
part of the same transaction in the same transaction, etc.
PostgreSQL 9.1 includes proper extension support, as you might well know if you ever read this very blog here. Some hosting facilities are playing with PostgreSQL at big scale (hello Heroku!) and still meet with small caveats making their life uneasy.
It seems that if you search for a
munin plugin for
pgbouncer it's easy
enough to reach an old page of mine with an old version of my plugin, and a
broken link. Let's remedy that by publishing here the newer version of the
plugin. To be honest, I though it already made its way into the official
1.4 set of plugins, but I've not been following closely enough.
Another great conference took place last week, PostgreSQL Conference Europe 2011 was in Amsterdam and plenty of us PostgreSQL geeks were too. I attended to lot of talks and did learn some more about our project, its community and its features, but more than that it was a perfect occasion to meet with the community.
I've been asked about my opinion on backup strategy and best practices, and it so happens that I have some kind of an opinion on the matter.
In the news recently stored procedures where used as an excuse for moving away logic from the database layer to application layer, and to migrate away from a powerful technology to a simpler one, now that there's no logic anymore in the database.
The next PostgreSQL conference is approaching very fast now, I hope you have your ticket already: it's a very promissing event! If you want some help in deciding whether to register or not, just have another look at the schedule. Pick the talks you want to see. It's hard, given how packed with good ones the schedule is. When you're mind is all set, review the list. Registered?
PostgreSQL Hackers mailing lists,
Andrew Dunstan just proposed some
new options for
pg_restore to ease our lives. One of the
answers was talking about some scripts available to exploit the
listing that you play with using options
-L, or the long name
pg_staging tool allows you to easily
exploit those lists too.
You can find skytools3 in debian experimental already, it's in release candidate status. What's missing is the documentation, so here's an idea: I'm going to make a blog post series about skytools next features, how to use them, what they are good for, etc. This first article of the series will just list what are those new features.
Back to our series about pgloader. The previous articles detailed How To Use PgLoader then How to Setup pgloader, then what to expect from a parallel pgloader setup. This article will detail how to reformat input columns so that what PostgreSQL sees is not what's in the data file, but the result of a transformation from this data into something acceptable as an input for the target data type.
When you do partition your tables monthly, then comes the question of when to create next partitions. I tend to create them just the week before next month and I have some nice nagios scripts to alert me in case I've forgotten to do so. How to check that by hand in the end of a month?
CHAR(11) finished somewhen in the night leading to today, if you consider the social events to be part of it, which I definitely do. This conference has been a very good one, both on the organisation side of things and of course for its content.
We still have this problem to solve with extensions and their packaging.
How to best organize things so that your extension is compatible with before
9.1 and following releases of
It's this time of the year again, the main international PostgreSQL Conference is next week in Ottawa, Canada. If previous years are any indication, this will be great event where to meet with a lot of the members of your community. The core team will be there, developers will be there, and we will meet with users and their challenging use cases.
Let's say you need to
ALTER TABLE foo ALTER COLUMN bar TYPE bigint;, and
PostgreSQL is helpfully telling you that no you can't because such and such
views depend on the column. The basic way to deal with that is to copy
paste from the error message the names of the views involved, then prepare a
script wherein you first
DROP VIEW ...; then
ALTER TABLE and finally
VIEW again, all in the same transaction.
While currently too busy at work to deliver much Open Source contributions,
let's debunk an old habit of
PostgreSQL extension authors. It's all down to
copy pasting from
contrib, and there's no reason to continue doing
this way ever since
I've been working on
skytools3 packaging lately. I've been pushing quite a
lot of work into it, in order to have exactly what I needed out of the box,
after some 3 years of production and experiences with the products. Plural,
yes, because even if
plproxy are siblings to the projets (same
developers team, separate life cycle and releases), then
includes several sub-projects.
If you don't remember about what pg_staging is all about, it's a central console from where to control all your PostgreSQL databases. Typically you use it to manage your development and pre-production setup, where developers ask you pretty often to install them some newer dump from the production, and you want that operation streamlined and easy.
If you've not been following closely you might have missed out on extensions integration. Well, Tom spent some time on the patches I've been preparing for the last 4 months. And not only did he commit most of the work but he also enhanced some parts of the code (better factoring) and basically finished it.
This year we were in the main building of the conference, and apparently the booth went very well, solding lots of PostgreSQL merchandise etc. I had the pleasure to once again meet with the community, but being there only 1 day I didn't spend as much time as I would have liked with some of the people there.
A quick blog entry to say that yes:
IRC channel is a good place to be, for all the very good help
you can get there, because people are always wanting to remain helpful,
because of the off-topics discussions sometime, or to get to talk with
community core members. And to start up your day too.
Yeah I'm back on working on my part of the extension thing in PostgreSQL.
The drawback of hosting a static only website is, obviously, the lack of
comments. What happens actually, though, is that I receive very few comments
by direct mail. As I don't get another
spam source to cleanup, I'm left
unconvinced that's such a drawback. I still miss the low probability of
seeing blog readers exchange directly, but I think a
list would be my answer, here...
8.4 came out there was all those comments about how getting
window functions was an awesome addition. Now, it seems that a lot of people
seeking for help in
#postgresql just don't know what kind of problem this
feature helps solving. I've already been using them in some cases here in
this blog, for getting some nice overview about
Partitioning: relation size per “group”.
Although the new asynchronous replication facility that ships with 9.0 ain't
released to the wide public yet, our hackers hero are already working on the
synchronous version of it. A part of the facility is rather easy to design,
we want something comparable to
DRBD flexibility, but specific to our
database world. So
synchronous would either mean
depending on what you need the
standby to have already done when the master
COMMIT. Let's call that the
After discovering the excellent
Gwene service, which allows you to subscribe
newsgroups to read
RSS content (
commits, etc), I came to
read this nice article about
Happy Numbers. That's a little problem that
fits well an interview style question, so I first solved it yesterday
Emacs Lisp as that's the language I use the most those days.
The idea of the day ain't directly from me, I'm just helping with a very
thin subpart of the problem. The problem, I can't say much about, let's just
assume you want to reduce the storage of
MD5 in your database, so you want
bit strings. A solution to use them works fine, but the datatype is
still missing some facilities, for example going from and to hexadecimal
representation in text.
We're using constants in some constraints here, for example in cases where several servers are replicating to the same federating one: each origin server has his own schema, and all is replicated nicely on the central host, thanks to Londiste, as you might have guessed already.
In trying to help an extension debian packaging effort, I've once again proposed to handle it. That's because I now begin to know how to do it, as you can see in my package overview page at debian QA facility. There's a reason why I proposed myself here, it's that yet another tool of mine is now to be found in debian, and should greatly help extension packaging there. You can already check for the postgresql-server-dev-all package page if you're that impatient!
Some user on
IRC was reading the releases notes in order to plan for a minor
upgrade of his
8.3.3 installation, and was puzzled about potential needs for
GIST indexes. That's from the
8.3.5 release notes, and from the
8.3.8 notes you see that you need to consider
hash indexes on
columns too. Now the question is, how to find out if any such beasts are in
use in your database?
Today I'm being told once again about SQLite as an embedded database software. That one ain't a database server but a software library that you can use straight into your main program. I'm yet to use it, but it looks like its SQL support is good enough for simple things — and that covers loads of things. I guess read-only cache and configuration storage would be the obvious ones, because it seems that SQLite use cases aren't including mixed concurrency, that is workloads with concurrent readers and writers.
This time, we are trying to figure out where is the bulk of the data on
disk. The trick is that we're using
DDL partitioning, but we want a “nice”
view of size per
partition set. Meaning that if you have for example a
foo with partitions
foo_201007, you would want
to see a single category
foo containing the accumulated size of all the
There's currently a thread on hackers about bg worker: overview and a series of 6 patches. Thanks a lot Markus! This is all about generalizing a concept already in use in the autovacuum process, where you have an independent subsystem that require having an autonomous daemon running and able to start its own workers.
Nowadays to analyze logs and provide insights, the more common tool to use
pgfouine, which does an excellent job. But there has been some
improvements in logs capabilities that we're not benefiting from yet, and
I'm thinking about the
CSV log format.
There's a big trend nowadays about using column storage as opposed to what PostgreSQL is doing, which would be row storage. The difference is that if you have the same column value in a lot of rows, you could get to a point where you have this value only once in the underlying storage file. That means high compression. Then you tweak the executor to be able to load this value only once, not once per row, and you win another huge source of data traffic (often enough, from disk).
It surely does not feel like a full month and some more went by since we were enjoying PGCon 2010, but in fact it was already the time for CHAR(10). The venue was most excellent, as Oxford is a very beautiful city. Also, the college was like a city in the city, and having the accomodation all in there really smoothed it all.
Yes. This pgloader project is still maintained and somewhat active. Development happens when I receive a complaint, either about a bug in existing code or a feature in yet-to-write code. If you have a bug to report, just send me an email!
This time we're having a database where sequences were used, but not systematically as a default value of a given column. It's mainly an historic bad idea, but you know the usual excuse with bad ideas and bad code: the first 6 months it's experimental, after that it's historic.
So, if you followed the previous blog entry, now you have a new database
containing all the
static tables encoded in
UTF-8 rather than
SQL_ASCII. Because if it was not yet the case, you now severely distrust
It happens that you have to manage databases
designed by your predecessor,
and it even happens that the team used to not have a
raisins can lead to having a
SQL_ASCII database. The horror!
So, after restoring a production dump with intermediate filtering, none of our sequences were set to the right value. I could have tried to review the process of filtering the dump here, but it's a one-shot action and you know what that sometimes mean. With some pressure you don't script enough of it and you just crawl more and more.
So I had two
prefix in less than a week. It means several
things, one of them is that my code is getting used in the wild, which is
nice. The other side of the coin is that people do find bugs in there. This
one is about the behavior of the
btree opclass of the type
prefix range. We
cheat a lot there by simply having written one, because a range does not
have a strict ordering: is
[1-3] before of after
[2-4]? But when you know
you have no overlapping intervals in your
prefix_range column, being able to
have it part of a
primary key is damn useful.
So there it is, at long last, the final
1.0.0 release of prefix! It's on its
way into the debian repository (targetting sid, in testing in 10 days) and
I've been having problem with building both
postgresql-8.4-prefix debian packages from the same source package, and
fixing the packaging issue forced me into modifying the main
Makefile. So while reaching
rc2, I tried to think about missing pieces easy
to add this late in the game: and there's one, that's a function
length(prefix_range), so that you don't have to cast to text no more in the
following wildspread query:
I can't really compare PgCon 2009 with previous years versions, last time I enjoyed the event it was in 2006, in Toronto. But still I found the experience to be a great one, and I hope I'll be there next year too!
On the performance mailing list, a recent
thread drew my attention. It
devired to be about using a connection pool software and prepared statements
in order to increase scalability of PostgreSQL when confronted to a lot of
concurrent clients all doing simple
select queries. The advantage of the
pooler is to reduce the number of
backends needed to serve the queries, thus
reducing PostgreSQL internal bookkeeping. Of course, my choice of software
here is clear:
PgBouncer is an excellent top grade solution, performs real
well (it won't parse queries), reliable, flexible.
The prefix project is about matching a literal against prefixes in your table, the typical example being a telecom routing table. Thanks to the excellent work around generic indexes in PostgreSQL with GiST, indexing prefix matches is easy to support in an external module. Which is what the prefix extension is all about.
The problem was raised this week on
IRC and this time again I felt it would
be a good occasion for a blog entry: how to load an
XML file content into a
In this russian page you'll see a nice presentation of Skype databases architectures by Asko Oja himself. It's the talk at Russian PostgreSQL Community meeting, October 2008, Moscow, and it's a good read.
As it happens, I've got some environments where I want to make sure
Heap Only Tuples) is in use. Because we're doing so much updates a second
that I want to get sure it's not killing my database server. I not only
wrote some checking view to see about it, but also made a
about it in the
French PostgreSQL website. Handling around in
means that I'm now bound to write about it in English too!
So, you're using
londiste and the
ticker has not been running all night
long, due to some restart glitch in your procedures, and the
on call admin
didn't notice the restart failure. If you blindly restart the replication
daemon, it will load in memory all those events produced during the night,
at once, because you now have only one tick where to put them all.
This is a test of a fake entry to see how muse will manage this.