Continuing our series of PostgreSQL Data Types today we’re going to introduce the PostgreSQL array data types.
Arrays can be used to denormalize data and avoid lookup tables. A good rule of thumb for using them that way is that you mostly use the array as a whole, even if you might at times search for elements in the array. Heavier processing is going to be more complex than a lookup table.
Table of Contents
A Data Type to Play With
A classic example of a good use case for PostgreSQL arrays is user-defined tags. For the next example, 200,000 USA geolocated tweets have been loaded into PostgreSQL thanks to the following script:
begin;
create table tweet
(
id bigint primary key,
date date,
hour time,
uname text,
nickname text,
bio text,
message text,
favs bigint,
rts bigint,
latitude double precision,
longitude double precision,
country text,
place text,
picture text,
followers bigint,
following bigint,
listed bigint,
lang text,
url text
);
\copy tweet from 'tweets.csv' with csv header delimiter ';'
commit;
Once the data is loaded we can have a look at it:
\pset format wrapped
\pset columns 70
table tweet limit 1;
Here’s what it looks like:
─[ RECORD 1 ]────────────────────────────────────────────────────────
id │ 721318437075685382
date │ 2016-04-16
hour │ 12:44:00
uname │ Bill Schulhoff
nickname │ BillSchulhoff
bio │ Husband,Dad,GrandDad,Ordained Minister, Umpire, Poker Pla…
│…yer, Mets, Jets, Rangers, LI Ducks, Sons of Anarchy, Surv…
│…ivor, Apprentice, O&A, & a good cigar
message │ Wind 3.2 mph NNE. Barometer 30.20 in, Rising slowly. Temp…
│…erature 49.3 °F. Rain today 0.00 in. Humidity 32%
favs │ ¤
rts │ ¤
latitude │ 40.76027778
longitude │ -72.95472222
country │ US
place │ East Patchogue, NY
picture │ http://pbs.twimg.com/profile_images/378800000718469152/53…
│…5032cf772ca04524e0fe075d3b4767_normal.jpeg
followers │ 386
following │ 705
listed │ 24
lang │ en
url │ http://www.twitter.com/BillSchulhoff/status/7213184370756…
│…85382
We can see that the raw import schema is not a good fit for PostgreSQL capabilities. The date and hour fields are separated for no good reason, and it makes processing them less easy than when they form a timestamptz together. PostgreSQL does know how to handle longitude and latitude as a single point entry, allowing much more interesting processing again. We can create a simpler relation to manage and process a subset of the data we’re interested in for this chapter.
Introduction to Arrays
PostgreSQL has built-in support for arrays, which are documented in the Arrays and the Array Functions and Operators chapters. As introduced above, what’s interesting with PostgreSQL is its ability to process array elements from SQL directly. This capability includes indexing facilities thanks to GIN indexing.
As we are interested in the tags used in the messages, the next query also extracts all the tags from the Twitter messages as an array of text.
begin;
create table hashtag
(
id bigint primary key,
date timestamptz,
uname text,
message text,
location point,
hashtags text[]
);
with matches as (
select id,
regexp_matches(message, '(#[^ ,]+)', 'g') as match
from tweet
),
hashtags as (
select id,
array_agg(match[1] order by match[1]) as hashtags
from matches
group by id
)
insert into hashtag(id, date, uname, message, location, hashtags)
select id,
date + hour as date,
uname,
message,
point(longitude, latitude),
hashtags
from hashtags
join tweet using(id);
commit;
The PostgreSQL matching function regexp_matches() implements what we need here, with the g flag to return every match found and not just the first tag in a message. Those multiple matches are returned one per row, so we then group by tweet id and array_agg over them, building our array of tags. Here’s what the computed data looks like:
select id, hashtags
from hashtag
limit 10;
In the following data output, you can see that we kept the # signs in front of the hashtags, making it easier to recognize what this data is:
id │ hashtags
════════════════════╪═════════════════════════════════════════════════
720553447402160128 │ {#CriminalMischief,#ocso,#orlpol}
720553457015324672 │ {#txwx}
720553458596757504 │ {#DrugViolation,#opd,#orlpol}
720553466804989952 │ {#Philadelphia,#quiz}
720553475923271680 │ {#Retail,#hiring!,#job}
720553508190052352 │ {#downtown,#early…,#ghosttown,#longisland,#morn…
│…ing,#portjeff,#portjefferson}
720553522966581248 │ {"#CapitolHeights,",#Retail,#hiring!,#job}
720553530088669185 │ {#NY17}
720553531665682434 │ {#Endomondo,#endorphins}
720553532273795072 │ {#Job,#Nursing,"#Omaha,",#hiring!}
(10 rows)
Indexing PostgreSQL Arrays for Statistics and Profit
Before processing the tags, we create a specialized GIN index. This index access method allows PostgreSQL to index the contents of the arrays, the tags themselves, rather than each array as an opaque value.
create index on hashtag using gin (hashtags);
A popular tag in the dataset is #job, and we can easily see how many times it’s been used, and confirm that our previous index makes sense for looking inside the hashtags array:
explain (analyze, verbose, costs off, buffers)
select count(*)
from hashtag
where hashtags @> array['#job'];
And we can see the execution plan:
QUERY PLAN
══════════════════════════════════════════════════════════════════════
Aggregate (actual time=27.227..27.227 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=3715
-> Bitmap Heap Scan on public.hashtag (actual time=13.023..23.453…
… rows=17763 loops=1)
Output: id, date, uname, message, location, hashtags
Recheck Cond: (hashtag.hashtags @> '{#job}'::text[])
Heap Blocks: exact=3707
Buffers: shared hit=3715
-> Bitmap Index Scan on hashtag_hashtags_idx (actual time=1…
…1.030..11.030 rows=17763 loops=1)
Index Cond: (hashtag.hashtags @> '{#job}'::text[])
Buffers: shared hit=8
Planning time: 0.596 ms
Execution time: 27.313 ms
(13 rows)
That was done supposing we already know one of the popular tags. How do we get to discover that information, given our data model and data set? We do it with the following query:
select tag, count(*)
from hashtag, unnest(hashtags) as t(tag)
group by tag
order by count desc
limit 10;
This time, as the query must scan all the hashtags in the table, it won’t use the previous index of course. The unnest() function is a must-have when dealing with arrays in PostgreSQL, as it allows processing the array’s content as if it were just another relation. And SQL comes with all the tooling to process relations, as we see in great details in my book The Art of PostgresQL.
So we can see the most popular hashtags in our dataset:
tag │ count
══════════════╪═══════
#Hiring │ 37964
#Jobs │ 24776
#CareerArc │ 21845
#Job │ 21368
#job │ 17763
#Retail │ 7867
#Hospitality │ 7664
#job? │ 7569
#hiring! │ 6860
#Job: │ 5953
(10 rows)
The hiring theme is huge in this dataset. We could then search for mentions of job opportunities in the #Retail sector (another popular hashtag we just discovered into the data set), and have a look at the locations where they are saying they’re hiring:
select name,
substring(timezone, '/(.*)') as tz,
count(*)
from hashtag
left join lateral
(
select *
from geonames
order by location <-> hashtag.location
limit 1
)
as geoname
on true
where hashtags @> array['#Hiring', '#Retail']
group by name, tz
order by count desc
limit 10;
For this query a dataset of geonames has been imported. The left join lateral allows picking the nearest location to the tweet location from our geoname reference table. The where clause only matches the hashtag arrays containing both the #Hiring and the #Retail tags. Finally, we order the data set by most promising opportunities:
name │ tz │ count
══════════════════════════════════════════════════╪═════════════╪═══════
San Jose City Hall │ Los_Angeles │ 31
Sleep Inn & Suites Intercontinental Airport East │ Chicago │ 19
Los Angeles │ Los_Angeles │ 14
Dallas City Hall Plaza │ Chicago │ 12
New York City Hall │ New_York │ 11
Jw Marriott Miami Downtown │ New_York │ 11
Gold Spike Hotel & Casino │ Los_Angeles │ 10
San Antonio │ Chicago │ 10
Shoppes at 104 │ New_York │ 9
Fruitville Elementary School │ New_York │ 8
(10 rows)
Conclusion
PostgreSQL arrays are very powerful, and GIN indexing support makes them efficient to work with. Nonetheless, it’s still not so efficient that you would replace a lookup table with an array in situations where you do a lot of lookups, though.
Also, some PostgreSQL array functions show a quadratic behavior: looping over arrays elements really is inefficient, so learn to use unnest() instead, and filter elements with a where clause. If you see yourself doing that a lot, it might be a good sign that you really needed a lookup table!
This article is an extract from my book The Art of PostgresQL, which teaches SQL to developers so that they may replace thousands of lines of code with very simple queries. The book has a full chapter about data types in PostgreSQL, check it out!