Playing with Unicode

The reason why I like Unicode a lot is because it allows me to code in text based environments and still have nice output. Today, we’re going to play with Regional Indicator Symbol, which is implemented as a Unicode combinaison of letters from 🇦 to 🇿. For instance, if you display 🇫 then 🇷 concatenated together, you get 🇫🇷. Let’s try that from our PostgreSQL prompt!

Context and Necessary data

In the historical record of motor racing data database we have all the data to reproduce the table from the List of Formula One seasons with driver’s champions and constructors champion.

Grab the MySQL version of it, the only one available now, and import its schema and data in a single command line with pgloader:

$ createdb f1db
$ pgloader mysql://root@localhost/f1db pgsql:///f1db
$ psql -d f1db -c 'ALTER DATABASE f1db SET search_path TO f1db, public;'

In the List of Formula One seasons page is included a flag from the country of each driver. The f1db database model only includes a nationality column for each driver, that we can see here:

  select count(*) as victories,
         forename, surname, nationality
    from drivers
         left join results
                on drivers.driverid = results.driverid
               and results.position = 1
group by drivers.driverid
order by victories desc
   limit 10;

Which gives the following result:

 victories │ forename  │  surname   │ nationality 
═══════════╪═══════════╪════════════╪═════════════
        91 │ Michael   │ Schumacher │ German
        56 │ Lewis     │ Hamilton   │ British
        51 │ Alain     │ Prost      │ French
        45 │ Sebastian │ Vettel     │ German
        41 │ Ayrton    │ Senna      │ Brazilian
        32 │ Fernando  │ Alonso     │ Spanish
        31 │ Nigel     │ Mansell    │ British
        27 │ Jackie    │ Stewart    │ British
        25 │ Jim       │ Clark      │ British
        25 │ Niki      │ Lauda      │ Austrian
(10 rows)

Can we display directly the champion’s country flag in the result set?

The unicode trick

If we apply what we read before in the Regional Indicator Symbol page, displaying a Unicode country flag is as easy as concatenating the country codes together in the right Unicode range. Of course PostgreSQL knows how to do that, thanks to its advanced set of String Functions and Operators:

select chr(127462+ascii('F')-ascii('A')) as a,
       chr(127462+ascii('R')-ascii('A')) as b,
          chr(127462+ascii('F')-ascii('A'))
       || chr(127462+ascii('R')-ascii('A')) as flag;

And yes that’s a good reading of the Unicode specifications for Regional Indicator Symbol, it seems:

 a | b | flag 
---+---+------
 🇫 | 🇷 | 🇫🇷
(1 row)

As it is quite cumbersome to write this expression each time, we can code a PostgreSQL function that would do that for us:

create or replace function flag
 (
   code text
 )
 returns text
language sql
as $$
 select    chr(  127462
               + ascii(substring(code from 1 for 1))
               - ascii('A'))

        || chr(  127462
               + ascii(substring(code from 2 for 1))
               - ascii('A'))
$$;

With that it’s easier:

select name, code, flag(code)
  from country
 where code is not null
 limit 10;

And here’s a nice list of flags now:

       name        | code | flag 
-------------------+------+------
 Afghanistan       | AF   | 🇦🇫
 Albania           | AL   | 🇦🇱
 Algeria           | DZ   | 🇩🇿
 American Samoa    | AS   | 🇦🇸
 Andorra           | AD   | 🇦🇩
 Angola            | AO   | 🇦🇴
 Anguilla          | AI   | 🇦🇮
 Antarctica        | AQ   | 🇦🇶
 Antigua & Barbuda | AG   | 🇦🇬
 Argentina         | AR   | 🇦🇷
(10 rows)

Country Codes

This country table comes from http://data.okfn.org/data/core/country-codes. They have a GitHub repository at https://github.com/datasets/country-codes from which you can easily grab the CSV data file then integrate it into PostgreSQL:

begin;

create table public.country
 (
   name                             text,
   official_name_en                 text,
   official_name_fr                 text,
   code                             text,
   trigram                          text,
   M49                              text,
   ITU                              text,
   MARC                             text,
   WMO                              text,
   DS                               text,
   Dial                             text,
   FIFA                             text,
   FIPS                             text,
   GAUL                             text,
   IOC                              text,
   currency_alphabetic_code         text,
   currency_country_name            text,
   currency_minor_unit              text,
   currency_name                    text,
   currency_numeric_code            text,
   is_independent                   text,
   Capital                          text,
   Continent                        text,
   TLD                              text,
   Languages                        text,
   Geoname                          text,
   EDGAR                            text
 );

\copy public.country from 'country-codes.csv' with delimiter ',' csv header quote '"'

commit;

Nationality and Country Code Mapping

We said before that our Formula One database only has a nationality column for our drivers. How can we map that into country codes, so that we can diplay our nice little Unicode flag?

Well this time I failed to find a ready to use dataset, but I found an easy to scrap web page with the needed information at http://www.ef.com/english-resources/english-grammar/nationalities/.

Some lines of python later:

#! /usr/bin/env python3

import re

def parse_nationalities(filename):
    regexp = '\t\t\t\t(.*)</td>$'

    # Given that regexp we want to keep the first match as the country name
    # then the second match as the nationality, then skip the third line
    current = None
    skip = False

    with open(filename, 'r') as f:
        for line in f:
            match = re.search(regexp, line)
            if match:
                if skip:
                    skip = False
                elif current is None:
                    current = match.group(1)
                else:
                    print("%s;%s" % (current, match.group(1)))
                    current = None
                    skip = True

if __name__ == "__main__":
    parse_nationalities('nationalities.html')

It is now possible to load the data set thanks to the following SQL:

begin;

create table public.nationalities
 (
   country     text,
   nationality text
 );


\copy public.nationalities from 'nationalities.csv' with delimiter ';' csv

insert into public.nationalities(nationality, country)
     values ('American', 'US'),
            ('American-Italian', 'US'),
            ('Argentine-Italian', 'Argentina'),
            ('Chilean', 'Chile'),
            ('East German', 'Germany'),
            ('Liechtensteiner', 'Liechtenstein'),
            ('Monegasque', 'Monaco'),
            ('New Zealander', 'New Zealand'),
            ('Rhodesian', 'Zimbabwe'),
            ('Venezuelan', 'Venezuela');

update public.nationalities
   set country = 'Netherlands'
 where country = 'Holland';

update public.nationalities
   set country = 'US'
 where country = 'The United States';

update public.nationalities
   set country = 'UK'
 where country = 'Britain';

update public.nationalities
   set country = 'Czech Republic'
 where country = 'the Czech Republic';

commit;

Note that to have matches for all our drivers, we did need to edit the list provided so that it would match with the set of nationalities found in the f1db database. You can list those with the following query:

select distinct(nationality) from drivers;

And more to the point, you can find any missing nationality in your new table set with the following anti-join query:

select nationality, country.code
  from drivers
       left join nationalities using(nationality)
       left join country
              on country.name = nationalities.country
 where country.code is null;

Thanks to the modification we did previously after importing the dataset, this query should return an empty result set.

Formula One drivers and country flags

We now have enough data to finally play our unicode trick and display driver’s country flags directly in our query result set on the console:

  select count(*) as victories,
         forename, surname,
         flag(country.code)
    from drivers
         left join results
                on drivers.driverid = results.driverid
               and results.position = 1
         join nationalities using(nationality)
         join country
           on country.name = nationalities.country
group by drivers.driverid, country.code
order by victories desc
   limit 10;

Conclusion

And here’s our result with inline colored flags in the terminal:

 victories | forename  |  surname   | flag 
-----------+-----------+------------+------
        91 | Michael   | Schumacher | 🇩🇪
        56 | Lewis     | Hamilton   | 🇬🇧
        51 | Alain     | Prost      | 🇫🇷
        45 | Sebastian | Vettel     | 🇩🇪
        41 | Ayrton    | Senna      | 🇧🇷
        32 | Fernando  | Alonso     | 🇪🇸
        31 | Nigel     | Mansell    | 🇬🇧
        27 | Jackie    | Stewart    | 🇬🇧
        25 | Jim       | Clark      | 🇬🇧
        25 | Niki      | Lauda      | 🇦🇹
(10 rows)

Fun, right?