In this article, we want to find the town with the greatest number of inhabitants near a given location.
A very localized example
We first need to find and import some data, and I found at the following place a CSV listing of french cities with coordinates and population and some numbers of interest for the exercise here.
To import the data set, we first need a table, then a
COPY
command:
CREATE TABLE lion1906 (
insee text,
nom text,
altitude integer,
code_postal text,
longitude double precision,
latitude double precision,
pop99 bigint,
surface double precision
);
\copy lion1906 from 'villes.csv' with csv header delimiter ';' encoding 'latin1'
With that data in place, we can find the 10 nearest towns of a random choosing of us, let’s pick Villeurbanne which is in the region of Lyon.
select code_postal, nom, pop99
from lion1906
order by point(longitude, latitude) <->
(select point(longitude, latitude)
from lion1906
where nom = 'Villeurbanne')
limit 10;
code_postal | nom | pop99
-------------+------------------------+--------
69100 | Villeurbanne | 124215
69300 | Caluire-et-Cuire | 41233
69120 | Vaulx-en-Velin | 39154
69580 | Sathonay-Camp | 4336
69140 | Rillieux-la-Pape | 28367
69000 | Lyon | 445452
69500 | Bron | 37369
69580 | Sathonay-Village | 1693
01700 | Neyron | 2157
69660 | Collonges-au-Mont-d'Or | 3420
(10 rows)
We find Lyon in our list in there, and we want the query now to return only that one as it has the greatest number of inhabitants in the list:
with neighbours as (
select code_postal, nom, pop99
from lion1906
order by point(longitude, latitude) <->
(select point(longitude, latitude)
from lion1906 where nom = 'Villeurbanne')
limit 10
)
select *
from neighbours
order by pop99 desc
limit 1;
code_postal | nom | pop99
-------------+------+--------
69000 | Lyon | 445452
(1 row)
Well, thank you PostgreSQL, that was easy!
Note that you can actually index such queries, that’s called a
KNN index.
PostgreSQL knows how to use some kind of indexes to fetch data matching an
expression such as
ORDER BY a <-> b
, which allow you to consider a
KNN
search in your application.
Let’s get worldwide
The real scope of our exercise is to associate every known town in the world with some big city around, so let’s first fetch and import some worldwide data this time, from http://download.maxmind.com/download/worldcities/worldcitiespop.txt.gz.
CREATE TABLE maxmind_worldcities (
country_code text,
city_lower text,
city_normal text,
region_code text DEFAULT '',
population INT DEFAULT '0',
latitude float8 DEFAULT '0',
longitude float8 DEFAULT '0'
);
\copy maxmind_worldcities FROM '/tmp/worldcitiespop.txt' WITH DELIMITER ',' QUOTE E'\f' CSV HEADER ENCODING 'LATIN1';
alter table maxmind_worldcities add column loc point;
update maxmind_worldcities set loc = point(longitude, latitude);
This time you can see that I created an extra column with the location in there, so that I don’t have to compute it each time I need it, like I did before.
Now is the time to test that data set and hopefully fetch the same result as before when we only had french cities loaded:
with neighbours as (
select country_code, city_lower, population
from maxmind_worldcities
where population is not null
order by loc <->
(select loc
from maxmind_worldcities
where city_lower = 'villeurbanne')
limit 10
)
select * from neighbours order by population desc limit 1;
country_code | city_lower | population
--------------+------------+------------
fr | lyon | 463700
(1 row)
Ok, looks like we’re all set for the real problem. Now we want to pick for each of those cities it’s nearest neighboor, so here’s how to do that:
create index on maxmind_worldcities(country_code, region_code, city_lower);
create index on maxmind_worldcities using gist(loc);
create table maxmind_neighbours as
select country_code, region_code, city_lower,
(with neighbours as (
select country_code, city_lower, population
from maxmind_worldcities
where population is not null
and country_code = wc.country_code
and region_code = wc.region_code
order by loc <-> wc.loc
limit 10)
select city_lower
from neighbours
order by population desc
limit 1
) as neighbour
from maxmind_worldcities wc ;
To be fair, I have to tell you that this query took almost 2 hours to
complete on my laptop here, but as I’m doing that for friend and a blog
article, I’ve been lazy and didn’t try to optimise it. It could be using
LATERAL
for sure, I don’t know if that would help very much with
performances: I didn’t try.
With that in hands we can now check some cities and their biggest neighbours, as in the following query:
select * from maxmind_neighbours where city_lower = 'villeurbanne';
country_code | region_code | city_lower | neighbour
--------------+-------------+--------------+-----------
fr | B9 | villeurbanne | lyon
(1 row)
And looking for New-York City suburbs I did find a chinatown, which is a pretty common smaller town name apparently:
select * from maxmind_neighbours where city_lower = 'chinatown';
country_code | region_code | city_lower | neighbour
--------------+-------------+------------+---------------
sb | 08 | chinatown | honiara
us | CA | chinatown | san francisco
us | DC | chinatown | washington
us | HI | chinatown | honolulu
us | IL | chinatown | chicago
us | MT | chinatown | missoula
us | NV | chinatown | reno
us | NY | chinatown | new york
(8 rows)
Big Cities in the big world
So, let’s see how many smaller towns each of those random big cities have:
select country_code, region_code, neighbour, count(*)
from maxmind_neighbours
where neighbour in ('london', 'new york', 'moscow',
'paris', 'tokyo', 'sao polo', 'chicago')
group by country_code, region_code, neighbour;
country_code | region_code | neighbour | count
--------------+-------------+-----------+-------
gb | H9 | london | 2
jp | 40 | tokyo | 414
us | NY | new york | 131
ca | 08 | london | 16
ru | 48 | moscow | 245
fr | A8 | paris | 16
us | IL | chicago | 13
(7 rows)
And now let’s be fair and see where are the cities with the greatest number of towns nearby them, with the following query:
select country_code, region_code, neighbour, count(*)
from maxmind_neighbours
where neighbour is not null
group by country_code, region_code, neighbour
order by 4 desc
limit 25;
country_code | region_code | neighbour | count
--------------+-------------+------------+-------
cn | 03 | nanchang | 16759
cn | 26 | xian | 12864
id | 18 | kupang | 10715
cn | 24 | taiyuan | 10550
mm | 11 | taunggyi | 10253
id | 38 | makasar | 9471
ir | 15 | ahvaz | 9461
id | 01 | banda aceh | 9161
cn | 14 | lasa | 8841
cn | 15 | lanzhou | 8618
ir | 29 | kerman | 8579
id | 26 | medan | 7787
ir | 04 | iranshahr | 7249
ir | 07 | shiraz | 7219
ma | 55 | agadir | 7121
ir | 42 | mashhad | 7107
af | 08 | gazni | 7011
ir | 33 | tabriz | 6586
cn | 01 | hefei | 6521
bd | 81 | dhaka | 6480
ir | 08 | rasht | 6471
id | 17 | mataram | 6467
id | 33 | cilegon | 6287
af | 23 | qandahar | 6213
cn | 07 | fuzhou | 6089
(25 rows)