Playing with Unicode
The reason why I like Unicode a lot is because it allows me to code in text based environments and still have nice output. Today, we’re going to play with Regional Indicator Symbol, which is implemented as a Unicode combinaison of letters from 🇦 to 🇿. For instance, if you display 🇫 then 🇷 concatenated together, you get 🇫🇷. Let’s try that from our PostgreSQL prompt!
Table of Contents
Context and Necessary data
In the historical record of motor racing data database we have all the data to reproduce the table from the List of Formula One seasons with driver’s champions and constructors champion.
Grab the MySQL version of it, the only one available now, and import its schema and data in a single command line with pgloader:
$ createdb f1db $ pgloader mysql://[email protected]/f1db pgsql:///f1db $ psql -d f1db -c 'ALTER DATABASE f1db SET search_path TO f1db, public;'
In the List of Formula One seasons page is included a flag from the country of each driver. The f1db database model only includes a nationality column for each driver, that we can see here:
select count(*) as victories, forename, surname, nationality from drivers left join results on drivers.driverid = results.driverid and results.position = 1 group by drivers.driverid order by victories desc limit 10;
Which gives the following result:
victories │ forename │ surname │ nationality ═══════════╪═══════════╪════════════╪═════════════ 91 │ Michael │ Schumacher │ German 56 │ Lewis │ Hamilton │ British 51 │ Alain │ Prost │ French 45 │ Sebastian │ Vettel │ German 41 │ Ayrton │ Senna │ Brazilian 32 │ Fernando │ Alonso │ Spanish 31 │ Nigel │ Mansell │ British 27 │ Jackie │ Stewart │ British 25 │ Jim │ Clark │ British 25 │ Niki │ Lauda │ Austrian (10 rows)
Can we display directly the champion’s country flag in the result set?
The unicode trick
If we apply what we read before in the Regional Indicator Symbol page, displaying a Unicode country flag is as easy as concatenating the country codes together in the right Unicode range. Of course PostgreSQL knows how to do that, thanks to its advanced set of String Functions and Operators:
select chr(127462+ascii('F')-ascii('A')) as a, chr(127462+ascii('R')-ascii('A')) as b, chr(127462+ascii('F')-ascii('A')) || chr(127462+ascii('R')-ascii('A')) as flag;
And yes that’s a good reading of the Unicode specifications for Regional Indicator Symbol, it seems:
a | b | flag ---+---+------ 🇫 | 🇷 | 🇫🇷 (1 row)
As it is quite cumbersome to write this expression each time, we can code a PostgreSQL function that would do that for us:
create or replace function flag ( code text ) returns text language sql as $$ select chr( 127462 + ascii(substring(code from 1 for 1)) - ascii('A')) || chr( 127462 + ascii(substring(code from 2 for 1)) - ascii('A')) $$;
With that it’s easier:
select name, code, flag(code) from country where code is not null limit 10;
And here’s a nice list of flags now:
name | code | flag -------------------+------+------ Afghanistan | AF | 🇦🇫 Albania | AL | 🇦🇱 Algeria | DZ | 🇩🇿 American Samoa | AS | 🇦🇸 Andorra | AD | 🇦🇩 Angola | AO | 🇦🇴 Anguilla | AI | 🇦🇮 Antarctica | AQ | 🇦🇶 Antigua & Barbuda | AG | 🇦🇬 Argentina | AR | 🇦🇷 (10 rows)
This country table comes from http://data.okfn.org/data/core/country-codes. They have a GitHub repository at https://github.com/datasets/country-codes from which you can easily grab the CSV data file then integrate it into PostgreSQL:
begin; create table public.country ( name text, official_name_en text, official_name_fr text, code text, trigram text, M49 text, ITU text, MARC text, WMO text, DS text, Dial text, FIFA text, FIPS text, GAUL text, IOC text, currency_alphabetic_code text, currency_country_name text, currency_minor_unit text, currency_name text, currency_numeric_code text, is_independent text, Capital text, Continent text, TLD text, Languages text, Geoname text, EDGAR text ); \copy public.country from 'country-codes.csv' with delimiter ',' csv header quote '"' commit;
Nationality and Country Code Mapping
We said before that our Formula One database only has a nationality column for our drivers. How can we map that into country codes, so that we can diplay our nice little Unicode flag?
Well this time I failed to find a ready to use dataset, but I found an easy to scrap web page with the needed information at http://www.ef.com/english-resources/english-grammar/nationalities/.
Some lines of python later:
#! /usr/bin/env python3 import re def parse_nationalities(filename): regexp = '\t\t\t\t(.*)</td>$' # Given that regexp we want to keep the first match as the country name # then the second match as the nationality, then skip the third line current = None skip = False with open(filename, 'r') as f: for line in f: match = re.search(regexp, line) if match: if skip: skip = False elif current is None: current = match.group(1) else: print("%s;%s" % (current, match.group(1))) current = None skip = True if __name__ == "__main__": parse_nationalities('nationalities.html')
It is now possible to load the data set thanks to the following SQL:
begin; create table public.nationalities ( country text, nationality text ); \copy public.nationalities from 'nationalities.csv' with delimiter ';' csv insert into public.nationalities(nationality, country) values ('American', 'US'), ('American-Italian', 'US'), ('Argentine-Italian', 'Argentina'), ('Chilean', 'Chile'), ('East German', 'Germany'), ('Liechtensteiner', 'Liechtenstein'), ('Monegasque', 'Monaco'), ('New Zealander', 'New Zealand'), ('Rhodesian', 'Zimbabwe'), ('Venezuelan', 'Venezuela'); update public.nationalities set country = 'Netherlands' where country = 'Holland'; update public.nationalities set country = 'US' where country = 'The United States'; update public.nationalities set country = 'UK' where country = 'Britain'; update public.nationalities set country = 'Czech Republic' where country = 'the Czech Republic'; commit;
Note that to have matches for all our drivers, we did need to edit the list provided so that it would match with the set of nationalities found in the f1db database. You can list those with the following query:
select distinct(nationality) from drivers;
And more to the point, you can find any missing nationality in your new table set with the following anti-join query:
select nationality, country.code from drivers left join nationalities using(nationality) left join country on country.name = nationalities.country where country.code is null;
Thanks to the modification we did previously after importing the dataset, this query should return an empty result set.
Formula One drivers and country flags
We now have enough data to finally play our unicode trick and display driver’s country flags directly in our query result set on the console:
select count(*) as victories, forename, surname, flag(country.code) from drivers left join results on drivers.driverid = results.driverid and results.position = 1 join nationalities using(nationality) join country on country.name = nationalities.country group by drivers.driverid, country.code order by victories desc limit 10;
And here’s our result with inline colored flags in the terminal:
victories | forename | surname | flag -----------+-----------+------------+------ 91 | Michael | Schumacher | 🇩🇪 56 | Lewis | Hamilton | 🇬🇧 51 | Alain | Prost | 🇫🇷 45 | Sebastian | Vettel | 🇩🇪 41 | Ayrton | Senna | 🇧🇷 32 | Fernando | Alonso | 🇪🇸 31 | Nigel | Mansell | 🇬🇧 27 | Jackie | Stewart | 🇬🇧 25 | Jim | Clark | 🇬🇧 25 | Niki | Lauda | 🇦🇹 (10 rows)