Continuing our series of PostgreSQL Data Types today we’re going to introduce the PostgreSQL Point type.
To put the Point data type in a context where it makes sense, we’re going to download a complete geolocation data set and normalize it, putting to good use both normalization best practices and the other PostgreSQL data types we’ve covered in the previous articles of this series.
Buckle up, this is a long article with a lot of SQL inside.
Geonames
The GeoNames geographical database covers all countries and contains over eleven million place names that are available for download free of charge.
The website offers online querying and all the data is made available to download and use. As is often the case, it comes in an ad-hoc format and requires some processing and normalization before it’s usable in a PostgreSQL database.
begin;
create schema if not exists raw;
create table raw.geonames
(
geonameid bigint,
name text,
asciiname text,
alternatenames text,
latitude double precision,
longitude double precision,
feature_class text,
feature_code text,
country_code text,
cc2 text,
admin1_code text,
admin2_code text,
admin3_code text,
admin4_code text,
population bigint,
elevation bigint,
dem bigint,
timezone text,
modification date
);
create table raw.country
(
iso text,
iso3 text,
isocode integer,
fips text,
name text,
capital text,
area double precision,
population bigint,
continent text,
tld text,
currency_code text,
currency_name text,
phone text,
postal_code_format text,
postal_code_regex text,
languages text,
geonameid bigint,
neighbours text,
fips_equiv text
);
\copy raw.country from 'countryInfoData.txt' with csv delimiter E'\t'
create table raw.feature
(
code text,
description text,
comment text
);
\copy raw.feature from 'featureCodes_en.txt' with csv delimiter E'\t'
create table raw.admin1
(
code text,
name text,
ascii_name text,
geonameid bigint
);
\copy raw.admin1 from 'admin1CodesASCII.txt' with csv delimiter E'\t'
create table raw.admin2
(
code text,
name text,
ascii_name text,
geonameid bigint
);
\copy raw.admin2 from 'admin2Codes.txt' with csv delimiter E'\t'
commit;
Once we have loaded the raw data from the published files at http://download.geonames.org/export/dump/, we can normalize the content and begin to use the data.
You might notice that the SQL file above is missing the \copy command for the raw.geonames table. That’s because copy failed to load the file properly: some location names include single and double quotes, and those are not properly quoted… and not properly escaped. So we resorted to pgloader to load the file, with the following command:
load csv
from /tmp/geonames/allCountries.txt
into pgsql://appdev@/appdev
target table raw.geonames
with fields terminated by '\t',
fields optionally enclosed by '§',
fields escaped by '%',
truncate;
Here’s the summary obtained when loading the dataset on the laptop used to prepare this book:
             table name     errors       rows      bytes      total time
-----------------------  ---------  ---------  ---------  --------------
                  fetch          0          0                     0.009s
-----------------------  ---------  ---------  ---------  --------------
           raw.geonames          0   11540466     1.5 GB       6m43.218s
-----------------------  ---------  ---------  ---------  --------------
        Files Processed          0          1                     0.026s
COPY Threads Completion          0          2                  6m43.319s
-----------------------  ---------  ---------  ---------  --------------
      Total import time          ✓          3     1.5 GB       6m43.345s
Normalizing our Geonames data model
To normalize the schema, we apply the rules from the definitions of the normal forms. Basically, we want to avoid any dependency between the attributes of our models. Any such dependency means that we need to create a separate table in which to manage the set of data that makes sense in isolation.
I did a full talk on the topic of Data Modeling, Normalization and Denormalization at pgday Nordic this year in Oslo, and the slides are of course available in the Talks section here. Check it out!
The raw.geonames table relies on several sets of reference data that GeoNames provides as separate downloads. We need to begin by fixing the reference data used in the model.
Features
The GeoNames model tags all of its geolocation data with a feature class and a feature. The descriptions for those codes are detailed on the GeoNames codes page and available for download in the featureCodes_en.txt file. Some of the information we need is only available in text form and has to be entered manually.
begin;
create schema if not exists geoname;
create table geoname.class
(
class char(1) not null primary key,
description text
);
insert into geoname.class (class, description)
values ('A', 'country, state, region,...'),
('H', 'stream, lake, ...'),
('L', 'parks,area, ...'),
('P', 'city, village,...'),
('R', 'road, railroad '),
('S', 'spot, building, farm'),
('T', 'mountain,hill,rock,... '),
('U', 'undersea'),
('V', 'forest,heath,...');
create table geoname.feature
(
class char(1) not null references geoname.class(class),
feature text not null,
description text,
comment text,
primary key(class, feature)
);
insert into geoname.feature
select substring(code from 1 for 1) as class,
substring(code from 3) as feature,
description,
comment
from raw.feature
where feature.code <> 'null';
commit;
As we can see in this file, we have to deal with an explicit ’null’ entry: the last line contains the four-letter text null, which we don’t want to load.
Also, the provided file uses the notation A.ADM1 for an entry of class A and feature ADM1, which we split into proper attributes in our normalization process. The natural key for the geoname.feature table is the combination of the class and the feature.
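To make that split explicit, here’s a quick illustrative check of the two substring calls used in the insert statement above (not part of the original script):
select substring('A.ADM1' from 1 for 1) as class,    -- 'A'
       substring('A.ADM1' from 3)       as feature;  -- 'ADM1'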
Once all the data is loaded and normalized, we can get some nice statistics:
select class, feature, description, count(*)
from feature
left join geoname using(class,feature)
group by class, feature
order by count desc
limit 10;
This is a very simple top-10 query, per feature:
class │ feature │ description │ count
═══════╪═════════╪═════════════════╪═════════
P │ PPL │ populated place │ 1711458
H │ STM │ stream │ 300283
S │ CH │ church │ 236394
S │ FRM │ farm │ 234536
S │ SCH │ school │ 223402
T │ HLL │ hill │ 212659
T │ MT │ mountain │ 192454
S │ HTL │ hotel │ 170896
H │ LK │ lake │ 162922
S │ BLDG │ building(s) │ 143742
(10 rows)
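Note that the top-10 query above references the feature and geoname tables without their schema prefix; the assumption here, not spelled out in the article, is that the search_path has been set to the geoname schema beforehand, for instance:
set search_path to geoname, public;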
Countries
The raw.country table has several normalization issues. Before we list them, having a look at some data will help us:
─[ RECORD 1 ]──────┬─────────────────────────
iso │ FR
iso3 │ FRA
isocode │ 250
fips │ FR
name │ France
capital │ Paris
area │ 547030
population │ 64768389
continent │ EU
tld │ .fr
currency_code │ EUR
currency_name │ Euro
phone │ 33
postal_code_format │ #####
postal_code_regex │ ^(\d{5})$
languages │ fr-FR,frp,br,co,ca,eu,oc
geonameid │ 3017382
neighbours │ CH,DE,BE,LU,IT,AD,MC,ES
fips_equiv │ ¤
The main normalization failures we see are:
- Nothing guarantees the absence of duplicate rows in the table, so we need to add a primary key constraint. Here the isocode attribute looks like the best choice, as it’s both unique and an integer (see the quick check right after this list).
- The languages and neighbours attributes both contain multiple-valued content, a comma-separated list of either languages or country codes.
- To reach 2NF, every non-key attribute should depend on the whole of the key, and the currencies and postal code formats are not dependent on the country itself.
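As a quick sanity check on that primary key choice, here’s a small query of my own (not from the article); if isocode is a sound key, the three counts should all match:
select count(*)                as total_rows,
       count(isocode)          as non_null_isocodes,
       count(distinct isocode) as distinct_isocodes
  from raw.country;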
A good way to check for dependencies on the key attributes is with the following type of query:
select currency_code, currency_name, count(*)
from raw.country
group by currency_code, currency_name
order by count desc
limit 5;
In our dataset, we have the following result, showing 34 countries using the Euro currency:
currency_code │ currency_name │ count
═══════════════╪═══════════════╪═══════
EUR │ Euro │ 34
USD │ Dollar │ 16
AUD │ Dollar │ 8
XOF │ Franc │ 8
XCD │ Dollar │ 8
(5 rows)
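For completeness, here’s a minimal sketch of what pulling the currencies out into their own table could look like; this is an illustration of the principle only, not part of the schema built in this article:
create table geoname.currency
(
    code text primary key,   -- e.g. 'EUR'
    name text                -- e.g. 'Euro'
);

insert into geoname.currency(code, name)
     select currency_code,
            min(currency_name)  -- guards against a code mapping to several spellings
       from raw.country
      where currency_code is not null
   group by currency_code;
A currency column on geoname.country referencing this table would then complete that part of the normalization; we leave it aside to keep the focus on geolocation.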
In the context of this article, we’re going to skip the currencies, languages, and postal code formats of countries and focus on a subset of the information only. That gives us the following normalization process:
begin;
create schema if not exists geoname;
create table geoname.continent
(
code char(2) primary key,
name text
);
insert into geoname.continent(code, name)
values ('AF', 'Africa'),
('NA', 'North America'),
('OC', 'Oceania'),
('AN', 'Antarctica'),
('AS', 'Asia'),
('EU', 'Europe'),
('SA', 'South America');
create table geoname.country
(
isocode integer primary key,
iso char(2) not null,
iso3 char(3) not null,
fips text,
name text,
capital text,
continent char(2) references geoname.continent(code),
tld text,
geonameid bigint
);
insert into geoname.country
select isocode, iso, iso3, fips, name,
capital, continent, tld, geonameid
from raw.country;
create table geoname.neighbour
(
isocode integer not null references geoname.country(isocode),
neighbour integer not null references geoname.country(isocode),
primary key(isocode, neighbour)
);
insert into geoname.neighbour
with n as(
select isocode,
regexp_split_to_table(neighbours, ',') as neighbour
from raw.country
)
select n.isocode,
country.isocode
from n
join geoname.country
on country.iso = n.neighbour;
commit;
Note that we add the continent list (for completeness in the region drill-down) and then introduce the geoname.neighbour part of the model. Having an association table that links every country with its neighbours on the map (a neighbour being a country that shares a common border) allows us to easily query for that information:
select neighbour.iso,
neighbour.name,
neighbour.capital,
neighbour.tld
from geoname.neighbour as border
join geoname.country as country
on border.isocode = country.isocode
join geoname.country as neighbour
on border.neighbour = neighbour.isocode
where country.iso = 'FR';
So we get the following list of neighbouring countries for France:
iso │ name │ capital │ tld
═════╪═════════════╪══════════════════╪═════
CH │ Switzerland │ Bern │ .ch
DE │ Germany │ Berlin │ .de
BE │ Belgium │ Brussels │ .be
LU │ Luxembourg │ Luxembourg │ .lu
IT │ Italy │ Rome │ .it
AD │ Andorra │ Andorra la Vella │ .ad
MC │ Monaco │ Monaco │ .mc
ES │ Spain │ Madrid │ .es
(8 rows)
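As a further illustration (a query of my own, not from the article), the same association table makes it easy to rank countries by their number of land neighbours:
select country.iso,
       country.name,
       count(*) as neighbour_count
  from geoname.neighbour as border
  join geoname.country as country using(isocode)
 group by country.iso, country.name
 order by neighbour_count desc
 limit 5;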
Administrative Zoning
The raw data from the GeoNames website then offers an interesting geographical breakdown in the country_code, admin1_code, and admin2_code fields:
select geonameid, name, admin1_code, admin2_code
from raw.geonames
where country_code = 'FR'
limit 5
offset 50;
To get an interesting result set, we select randomly from the data for France, where the codes have to be expanded to be meaningful. With a US-based data set, we would get state codes as admin1_code (e.g. IL for Illinois), and the need for normalized data might then be less visible.
Of course, never use offset in your application queries, as explained on No Offset. Here, we are doing interactive discovery of the data, so it is acceptable, to some extent, to play with the offset facility.
Here’s the data set we get:
geonameid │ name │ admin1_code │ admin2_code
═══════════╪═════════════════════╪═════════════╪═════════════
2967132 │ Zintzel du Nord │ 44 │ 67
2967133 │ Zinswiller │ 44 │ 67
2967134 │ Ruisseau de Zingajo │ 94 │ 2B
2967135 │ Zincourt │ 44 │ 88
2967136 │ Zimming │ 44 │ 57
(5 rows)
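As an aside, since offset is to be avoided in application code, the usual alternative is keyset pagination; a minimal sketch over the same data, with a hypothetical last-seen geonameid carried over from the previous page, could look like this:
select geonameid, name, admin1_code, admin2_code
  from raw.geonames
 where country_code = 'FR'
   and geonameid > 2967131   -- hypothetical last geonameid from the previous page
 order by geonameid
 limit 5;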
The GeoNames website provides the files admin1CodesASCII.txt and admin2Codes.txt for us to use to normalize our data. Those files again use admin codes spelled as AD.06 and AF.01.1125426, whereas the raw.geonames table stores them as separate fields. That’s a good reason to split them now.
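To see exactly how those composite codes come apart, here’s a quick illustrative call to regexp_split_to_array, the function the script below relies on:
select regexp_split_to_array('AD.06', '[.]')         as admin1_parts,  -- {AD,06}
       regexp_split_to_array('AF.01.1125426', '[.]') as admin2_parts;  -- {AF,01,1125426}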
Here’s the SQL to normalize the admin breakdowns, splitting the codes and adding necessary constraints, to ensure data quality:
begin;
create schema if not exists geoname;
create table geoname.region
(
isocode integer not null references geoname.country(isocode),
regcode text not null,
name text,
geonameid bigint,
primary key(isocode, regcode)
);
insert into geoname.region
with admin as
(
select regexp_split_to_array(code, '[.]') as code,
name,
geonameid
from raw.admin1
)
select country.isocode as isocode,
code[2] as regcode,
admin.name,
admin.geonameid
from admin
join geoname.country
on country.iso = code[1];
create table geoname.district
(
isocode integer not null,
regcode text not null,
discode text not null,
name text,
geonameid bigint,
primary key(isocode, regcode, discode),
foreign key(isocode, regcode)
references geoname.region(isocode, regcode)
);
insert into geoname.district
with admin as
(
select regexp_split_to_array(code, '[.]') as code,
name,
geonameid
from raw.admin2
)
select region.isocode,
region.regcode,
code[3],
admin.name,
admin.geonameid
from admin
join geoname.country
on country.iso = code[1]
join geoname.region
on region.isocode = country.isocode
and region.regcode = code[2];
commit;
The previous query can now be rewritten to show region and district names rather than admin1_code and admin2_code, which we still keep internally in case we need them, of course.
select r.name, reg.name as region, d.name as district
from raw.geonames r
left join geoname.country
on country.iso = r.country_code
left join geoname.region reg
on reg.isocode = country.isocode
and reg.regcode = r.admin1_code
left join geoname.district d
on d.isocode = country.isocode
and d.regcode = r.admin1_code
and d.discode = r.admin2_code
where country_code = 'FR'
limit 5
offset 50;
The query uses left join operations because we have geo-location data without the admin1 or admin2 levels of details — more on that later. Here’s the same list of French areas, this time with proper names:
name │ region │ district
═════════════════════╪═══════════╪═══════════════════════════════
Zintzel du Nord │ Grand Est │ Département du Bas-Rhin
Zinswiller │ Grand Est │ Département du Bas-Rhin
Ruisseau de Zingajo │ Corsica │ Département de la Haute-Corse
Zincourt │ Grand Est │ Département des Vosges
Zimming │ Grand Est │ Département de la Moselle
(5 rows)
Geolocation Data
Now that we have loaded the reference data, we can load the main geolocation data with the following script. Note that we skip parts of the data we don’t need for this book, but that you might want to load in your application’s background data.
Before loading the raw data into a normalized version of the table, which will make heavy use of the references we normalized before, we have to study and understand how the breakdown works:
select count(*) as all,
count(*) filter(where country_code is null) as no_country,
count(*) filter(where admin1_code is null) as no_region,
count(*) filter(where admin2_code is null) as no_district,
count(*) filter(where feature_class is null) as no_class,
count(*) filter(where feature_code is null) as no_feat
from raw.geonames ;
We have lots of entries without a country reference, and even more without the detailed breakdown (admin1_code and admin2_code are not always part of the data). Moreover, we also have points without any feature class or feature code, some of them in the Arctic.
all │ no_country │ no_region │ no_district │ no_class │ no_feat
══════════╪════════════╪═══════════╪═════════════╪══════════╪═════════
11540466 │ 5821 │ 45819 │ 5528455 │ 5074 │ 95368
(1 row)
Given that, our normalization query must be careful to use left join operations, so as to allow fields to be null when the foreign key reference doesn’t exist. Be careful to drill down properly to the country, then the region, and only then the district, as the data set contains points at several levels of precision, as seen in the query above.
begin;
create table geoname.geoname
(
geonameid bigint primary key,
name text,
location point,
isocode integer,
regcode text,
discode text,
class char(1),
feature text,
population bigint,
elevation bigint,
timezone text,
foreign key(isocode)
references geoname.country(isocode),
foreign key(isocode, regcode)
references geoname.region(isocode, regcode),
foreign key(isocode, regcode, discode)
references geoname.district(isocode, regcode, discode),
foreign key(class)
references geoname.class(class),
foreign key(class, feature)
references geoname.feature(class, feature)
);
insert into geoname.geoname
with geo as
(
select geonameid,
name,
point(longitude, latitude) as location,
country_code,
admin1_code,
admin2_code,
feature_class,
feature_code,
population,
elevation,
timezone
from raw.geonames
)
select geo.geonameid,
geo.name,
geo.location,
country.isocode,
region.regcode,
district.discode,
feature.class,
feature.feature,
population,
elevation,
timezone
from geo
left join geoname.country
on country.iso = geo.country_code
left join geoname.region
on region.isocode = country.isocode
and region.regcode = geo.admin1_code
left join geoname.district
on district.isocode = country.isocode
and district.regcode = geo.admin1_code
and district.discode = geo.admin2_code
left join geoname.feature
on feature.class = geo.feature_class
and feature.feature = geo.feature_code;
create index on geoname.geoname using gist(location);
commit;
Now that we have a proper data set loaded, it’s easier to make sense of the administrative breakdowns and the geo-location data.
The real use case for this data comes later: thanks to the GiST index on the location column, we are now fully equipped to look up place names from geo-localized information. For now, let’s have a look at how the data set breaks down per continent:
select continent.name,
count(*),
round(100.0 * count(*) / sum(count(*)) over(), 2) as pct,
repeat('■', (100 * count(*) / sum(count(*)) over())::int) as hist
from geoname.geoname
join geoname.country using(isocode)
join geoname.continent
on continent.code = country.continent
group by continent.name order by continent.name;
We can see that the GeoNames data is highly skewed towards Asia, North America, and then Europe. Of course, the Antarctica data is not very dense.
name │ count │ pct │ hist
═══════════════╪═════════╪═══════╪═══════════════════════════════════
Africa │ 1170043 │ 10.14 │ ■■■■■■■■■■
Antarctica │ 21125 │ 0.18 │
Asia │ 3772195 │ 32.70 │ ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Europe │ 2488807 │ 21.58 │ ■■■■■■■■■■■■■■■■■■■■■■
North America │ 3210802 │ 27.84 │ ■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Oceania │ 354325 │ 3.07 │ ■■■
South America │ 517347 │ 4.49 │ ■■■■
(7 rows)
Geolocation GiST Indexing
The previous geoname table creation script contains the following index definition:
create index on geoname.geoname using gist(location);
Such an index is useful when searching for a specific location within our table, which contains about 11.5 million entries. PostgreSQL supports index-scan-based lookups in several situations, including the kNN lookup, also known as the nearest neighbor lookup.
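In its simplest form, such a kNN lookup reads as follows; this is a quick illustrative query of mine, using approximate coordinates for central Paris:
select name,
       location <-> point(2.3488, 48.8534) as dist
  from geoname.geoname
 order by location <-> point(2.3488, 48.8534)
 limit 3;
Thanks to the GiST index, PostgreSQL can serve the order by … limit clause with an index scan rather than sorting all 11.5 million rows.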
In the earlier example about arrays as a non-relational data type, we loaded a data set of 200,000 geo-localized tweets into the hashtag table. Here’s an extract of this table’s content:
─[ RECORD 1 ]────────────────────────────────────────────────
id │ 720553458596757504
date │ 2016-04-14 10:05:00+02
uname │ Police Calls 32801
message │ #DrugViolation at 335 N Magnolia Ave. #orlpol #opd
location │ (-81.3769794,28.5469591)
hashtags │ {#DrugViolation,#opd,#orlpol}
It’s possible to retrieve more information from the GeoNames data thanks to the following lateral left join lookup, in which we implement a kNN search with an order by ... <-> ... limit k clause:
select id,
round((hashtag.location <-> geoname.location)::numeric, 3) as dist,
country.iso,
region.name as region,
district.name as district
from hashtag
left join lateral
(
select geonameid, isocode, regcode, discode, location
from geoname.geoname
order by location <-> hashtag.location
limit 1
)
as geoname
on true
left join geoname.country using(isocode)
left join geoname.region using(isocode, regcode)
left join geoname.district using(isocode, regcode, discode)
order by id
limit 5;
The <-> operator computes the distance between its arguments, and by using the limit 1 clause we select the nearest known entry in the geoname.geoname table for each entry in the hashtag table.
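For intuition, the operator works on any two points and returns a flat euclidean distance expressed in the units of the coordinates, which are degrees of longitude and latitude in our data set. A quick illustrative check, not taken from the article:
select point(0,0) <-> point(3,4) as distance;   -- 5, the classic 3-4-5 triangle
For real distances in meters over the Earth’s surface you would reach for the earthdistance extension or PostGIS; the plain point type knows nothing about the geodesic.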
Then it’s easy to add our normalized GeoNames information from the country, region and district tables. Here’s the result we get:
id │ dist │ iso │ region │ district
════════════════════╪═══════╪═════╪══════════════╪═════════════════════
720553447402160128 │ 0.004 │ US │ Florida │ Orange County
720553457015324672 │ 0.004 │ US │ Texas │ Smith County
720553458596757504 │ 0.001 │ US │ Florida │ Orange County
720553466804989952 │ 0.001 │ US │ Pennsylvania │ Philadelphia County
720553475923271680 │ 0.000 │ US │ New York │ Nassau County
(5 rows)
To check that our GiST index is actually used, we use the explain command of PostgreSQL, with the spelling explain (costs off) followed by the whole query as above, and we get the following query plan:
\pset format wrapped
\pset columns 70
QUERY PLAN
══════════════════════════════════════════════════════════════════════
Limit
-> Nested Loop Left Join
-> Nested Loop Left Join
-> Nested Loop Left Join
Join Filter: (geoname.isocode = country.isocode)
-> Nested Loop Left Join
-> Index Scan using hashtag_pkey on hasht…
…ag
-> Limit
-> Index Scan using geoname_locatio…
…n_idx on geoname
Order By: (location <-> hashta…
…g.location)
-> Materialize
-> Seq Scan on country
-> Index Scan using region_pkey on region
Index Cond: ((geoname.isocode = isocode) AND (ge…
…oname.regcode = regcode))
-> Index Scan using district_pkey on district
Index Cond: ((geoname.isocode = isocode) AND (geoname.…
…regcode = regcode) AND (geoname.discode = discode))
(16 rows)
The index scan using geoname_location_idx on geoname is clear: the index has been used. On the laptop on which this book has been written, we get the result in about 13 milliseconds.
A Sampling of Countries
This dataset of more than 11 million rows is not practical to include in the book’s material for the Full Edition and Enterprise Edition, where you have a database dump or Docker image to play with. We instead take a random sample of 1% of the table’s content, and here’s how the magic is done:
begin;
create schema if not exists sample;
drop table if exists sample.geonames;
create table sample.geonames
as select /*
* We restrict the “export” to some columns only, so as to
* further reduce the size of the exported file available to
* download with the book.
*/
geonameid,
name,
longitude,
latitude,
feature_class,
feature_code,
country_code,
admin1_code,
admin2_code,
population,
elevation,
timezone
/*
* We only keep 1% of the 11 millions rows here.
*/
from raw.geonames TABLESAMPLE bernoulli(1);
\copy sample.geonames to 'allCountries.sample.copy'
commit;
In this script, we use the tablesample feature of PostgreSQL to keep only a random selection of 1% of the rows in the table. The tablesample clause accepts several methods, and you can even implement your own, as described in the PostgreSQL documentation chapter entitled Writing A Table Sampling Method.
Here’s what the from clause documentation of the select statement has to say about the choice between bernoulli and system, the two methods included by default in PostgreSQL:
The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression. (Other sampling methods might accept more or different arguments.) These two methods each return a randomly-chosen sample of the table that will contain approximately the specified percentage of the table’s rows. The BERNOULLI method scans the whole table and selects or ignores individual rows independently with the specified probability. The SYSTEM method does block-level sampling with each block having the specified chance of being selected; all rows in each selected block are returned. The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.
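To get a feel for the difference between the two built-in methods, one can simply try both on the raw table; this is an illustrative comparison of mine, and the counts and timings will of course vary from run to run:
select count(*) from raw.geonames tablesample system (1);
select count(*) from raw.geonames tablesample bernoulli (1);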
Running the script, here’s what we get:
yesql# \i geonames.sample.sql
BEGIN
CREATE SCHEMA
DROP TABLE
SELECT 115904
COPY 115904
COMMIT
Our sample.geonames table contains only 115,904 rows. Another run of the same query yielded 115,071 rows instead; after all, the sampling follows a random algorithm.
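When a reproducible sample is needed, the repeatable clause fixes the seed of the sampling algorithm so that consecutive runs return the same rows; 42 below is an arbitrary seed of mine:
select count(*)
  from raw.geonames tablesample bernoulli (1) repeatable (42);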
Conclusion
When dealing with geolocation data, it’s possible to put the PostgreSQL data type POINT to good use. The PostgreSQL support for GiST indexes makes it easy to then query the data, including for kNN searches and in complex queries.
While it’s possible to just use a flat table with a point column, it might be best to normalize your geolocation database schema. This helps with data quality!
This article is an extract from my book The Art of PostgreSQL, which teaches SQL to developers so that they may replace thousands of lines of code with very simple queries. The book has a full chapter about data types in PostgreSQL, check it out!