Friday, August 02 2013
PostgreSQL, KNN, Extensions, YeSQL

The Most Popular Pub Names

In his article titled The Most Popular Pub Names Ross Lawley did show us how to perform some quite interesting geographic queries against MongoDB, using some nice Open Data found at the Open Street Map project.

The Open Street Map project publishes a lot of information!

I found the idea behind that article really neat: using easily accessible data produced by an Open Source project to show off some nice queries with real data is what we should do more often. Also, as a PostgreSQL guy I couldn't help but feel distracted by the query language used in that article and thinking that it would be so much simpler to express the same queries in SQL.

The idea behind SQL is that the syntax is made for the queries to be easy enough to read and maintain while being insanely powerful at times. My take on SQL is that it's often enough the easiest way to express your queries, and that even if it can get crazy complex at times, that's only because SQL has the power you need to pass all your complexity down to it.

At least that's my thinking, and this article is my try at sharing this viewpoint with you.

Loading the data

The data itself is available in some kind of an XML format where they managed to handle the data in a EAV model:

<node id="262706" lat="51.0350300" lon="-0.7251785">
  <tag k="amenity" v="pub"/>
  <tag k="created_by" v="Potlatch 0.10f"/>
  <tag k="name" v="Kings Arms"/>
</node>

For the sake of this article, we're going to use the simplest schema possible. I didn't want to try and see if the id actually is unique and never omitted, for example, so here's the schema I've been working with:

create table if not exists pubnames (id bigint, pos point, name text);

Where the MongoDB article used imposm python library to load the data, I wanted to take the opportunity to loading it as a stream: a SAX like API to read the XML should allow to send the data as we parse it in a COPY stream, right?

Here's a slightly edited portion of the code I've been using to parse and load the data, available as the pubnames project on GitHub:

(defun parse-osm-file (&key
			 (pathname *pub-xml-pathname*)
			 (truncate t)
			 (drop nil))
  "Parse the given PATHNAME file, formated as OSM XML."
  (maybe-create-postgresql-table :drop drop :truncate truncate)

  (klacks:with-open-source (s (cxml:make-source pathname))
    (loop
       with stream =
	 (cl-postgres:open-db-writer (remove :port *pgconn*) *pg-table-name* nil)
       for key = (klacks:peek s)
       while key
       do
	 (case key
	   (:start-element (parse-osm-start-element s))
           ; parse-osm-end-element calls cl-postgres:db-write-row
	   (:end-element   (parse-osm-end-element s stream)))
	 (klacks:consume s)

       finally (return (cl-postgres:close-db-writer stream)))))

Given that code, we can parse the data in the XML file and load it into our PostgreSQL table in about 5.5 seconds on my laptop. If I had to optimize that loading time I'd try having two concurrent threads, one of them reading from the XML file and pushing to a queue, and the other one pulling from the same queue and filling our COPY buffer.

Normalizing the data

The first query of our reference article The Most Popular Pub Names shows the python code they've been using in order to normalize the data so that it's then possible to list The Most Popular Pub Names in the United Kingdom. Here, we didn't process the OSM data at all, so what about normalizing it directly within a SQL query?

#  select array_to_string(array_agg(distinct(name) order by name), ', '),
          count(*)
     from pubnames
 group by replace(replace(name, 'The ', ''), 'And', '&')
 order by 2 desc
 limit 5;
       array_to_string        | count 
------------------------------+-------
 Red Lion, The Red Lion       |   350
 Royal Oak, The Royal Oak     |   287
 Crown, The Crown             |   204
 The White Hart, White Hart   |   180
 The White Horse, White Horse |   163
(5 rows)

Time: 152.786 ms

The array_to_string function allows us to tweak the output to our convenience, as the array_agg(distinct(name) order by name) aggregate is doing all the work for us here, in grouping all names together and keeping an ordered set of a unique entry per variant.

Which names do we group together will you ask me? Well, those having the same name apart from some spelling variants: we don't want to consider The to be a difference so we replace it with an empty string, and we do want to consider both And and & as the same thing too.

Again, I'm reproducing the same processing as with the MongoDB article.

Geolocating nearest pub (KNN search)

The spelling of the KNN search in PostgreSQL involves ordering the result set with a distance operator, which is itself spelled <->. Here's the full SQL for searching the pubs nearby our position, or actually the position given as an example in the MongoDB article:

#  select id, name, pos
     from pubnames
 order by pos <-> point(51.516,-0.12)
    limit 3;

     id     |          name          |           pos           
------------+------------------------+-------------------------
   21593238 | All Bar One            | (51.5163499,-0.1192746)
   26848690 | The Shakespeare's Head | (51.5167871,-0.1194731)
  371049718 | The Newton Arms        | (51.5163032,-0.1209811)
(3 rows)

Time: 18.679 ms

As we're using the point datatype in PostgreSQL, there's no simple way that I know of to convert that distance into something like meters or maybe yards here. That's of course possible to do, even considering the actual shape of the earth, thanks to some PostgreSQL Extensions such as earthdistance or the full blown PostGIS. The details about that are for another article though.

PostgreSQL has a very rich and powerful datatype system that goes well beyond storing numbers, text and dates. If you're not familiar with that idea, you should read about it, maybe beginning with PostgreSQL Data Types chapter of the documentation.

Using a KNN specific index

With a dataset of 27878 rows having an answer in about 20ms is not a great achievement. Indeed, we didn't create any indexing whatsoever on the table yet, so the query planner has no other choice but to scan the whole content on disk and filter it as it goes.

It would be way better for performances if we could instead evaluate our query constraints (here, the ORDER BY and LIMIT clauses) using some index search instead.

That's exactly the kind of situation that GiST and SP-GiST indexes have been designed to be able to solve for you in PostgreSQL, and in particular the KNN GiST support. Let's have a try at it:

# create index on pubnames using gist(pos);

# select id, name, pos from pubnames order by pos <-> point(51.516,-0.12) limit 3;
     id     |          name          |           pos           
------------+------------------------+-------------------------
   21593238 | All Bar One            | (51.5163499,-0.1192746)
   26848690 | The Shakespeare's Head | (51.5167871,-0.1194731)
  371049718 | The Newton Arms        | (51.5163032,-0.1209811)
(3 rows)

Time: 0.849 ms

Now we talk! With a dataset of 27878 rows in total, finding the 3 nearest pubs in less than a millisecond is something we can actually be happy with, and can use directly in a web application. I would expect this performances to remain in the right ballpark even for a much larger dataset, and leave it as an exercise for you to find that dataset and test the KNN GiST indexes on it!

Conclusion

PostgreSQL at the center of your dataverse

What I want to take home from this article is the idea that the plain old SQL language still has lots to offer to modern data analysis needs, in particular when you're using PostgreSQL.

That database system knows how to stay relevant in a fast evolving environment, where your needs are more and more demanding: more data, more analysis, more users.

The past few releases of PostgreSQL each come with plenty of new features to better support your demanding use cases. We do a solid new release each year, and you can check the feature matrix to see by yourself the amazing pace at which we are able to improve our system.

If you're using PostgreSQL you have very few reasons to look for another solution. Some cases of course are still best handled in system more tolerant of data loss for example. When that happen though, in my experience, it's always a complementary service that will run alongside PostgreSQL. And for them to coexist peacefully, we even offer you Foreign Data Wrappers!