In our Tour of Extensions today’s article is about advanced tag indexing. We have a great data collection to play with and our goal today is to be able to quickly find data matching a complex set of tags. So, let’s find out those lastfm tracks that are tagged as blues and rhythm and blues, for instance.
We’re going to use the Last.fm dataset from the Million Song Dataset project here. The article where they present their similarity searches is providing the whole dataset they have as a SQLite database and they propose to read their python script demo_tags_db.py to make sense of it.
The Setup
First, we need to import this dataset into a PostgreSQL database. To do that the easier path I could think of was to hack the capability into pgloader of course, so here we go:
table name read imported errors time
------------------------------ --------- --------- --------- ---------
tags 0 522366 0 3.922
tids 0 505216 0 3.746
tid_tag 0 8598630 0 115.349
index build completion 0 0 0 33.099
------------------------------ --------- --------- --------- ---------
create index 0 8 0 38.839
reset sequences 0 0 0 0.064
------------------------------ --------- --------- --------- ---------
Total streaming time 0 9626212 0 2m36.180s
Here,
pgloader extracted the table and index definitions from the SQLite
database using the
sqlite_master
catalog and the
PRAGMA table_info()
commands, and migrated the data in a streaming fashion down to PostgreSQL,
using the
COPY protocol.
Having a look at the demo_tags.py script we can actually see how to use the relations here, and we realize they are using the 64-bit signed integer ROWID system column. We need something comparable to be able to make sense of the data:
> alter table tags add column rowid serial;
ALTER TABLE
Time: 3177.603 ms
> alter table tids add column rowid serial;
ALTER TABLE
Time: 2528.680 ms
> SELECT tags.tag, COUNT(tid_tag.tid)
FROM tid_tag, tags
WHERE tid_tag.tag=tags.ROWID and tags.tag ~* 'setzer'
GROUP BY tags.tag;
tag | count
-----------------------------+-------
the brian setzer orchestra | 1
Setzer | 13
rockabilly Setzer style | 4
setzer is a true guitarhero | 9
brian setzer orchestra | 3
brian setzer is GOD | 1
Brian Setzer | 1
brain setzer orchestra | 2
(8 rows)
Time: 644.826 ms
Here the query is mainly doing a
JOIN in between the
tid table (containing
track ids) and the
tid_tag table (containing association in between tracks
and tags), filtering on the
case insensitive regular expression
'setzer'
. As
we can imagine from reading the query execution time, we don’t have any
index to implement the filtering here.
Advanced tag indexing
PostgreSQL comes with plenty of interesting datatypes, one of them is known as the Arrays Type. PostgreSQL also provides a very rich set of extensions, some of them found under the contrib package; one of them is intarray. Let me quote for you the most interesting part of the documentation for that extension:
The @@ and ~~ operators test whether an array satisfies a query, which is expressed as a value of a specialized data type query_int. A query consists of integer values that are checked against the elements of the array, possibly combined using the operators & (AND), | (OR), and ! (NOT). Parentheses can be used as needed. For example, the query 1&(2|3) matches arrays that contain 1 and also contain either 2 or 3.
> create extension intarray;
CREATE EXTENSION
The way the intarray extension works, we need to build a new table that contains for each track the list of tags it’s been associated with, as an array of integers. We’re going to use our rowid identifier for that purpose, as in the following query:
> SELECT tt.tid, array_agg(tags.rowid) as tags
FROM tags JOIN tid_tag tt ON tags.rowid = tt.tag
GROUP BY tt.tid
LIMIT 3;
tid | tags
-----+-----------
1 | {1,2}
2 | {3,4}
3 | {5,6,7,8}
(3 rows)
Time: 942.074 ms
So let’s build the full table then index it:
> CREATE TABLE track_tags AS
SELECT tt.tid, array_agg(tags.rowid) as tags
FROM tags join tid_tag tt on tags.rowid = tt.tag
GROUP BY tt.tid;
SELECT 505216
Time: 45388.424 ms
> create index on track_tags using gin(tags gin__int_ops);
CREATE INDEX
Time: 18645.931 ms
Searches
Now PostgreSQL is ready for the real magic. Let’s find all the tracks we have that have been tagged as both blues and rhythm and blues:
> select array_agg(rowid)
from tags
where tag = 'blues' or tag = 'rhythm and blues';
array_agg
-----------
{3,739}
(1 row)
Time: 0.684 ms
Now what we want is a
query_int query string, which looks like
'(1880&179879)'
, so rather than just
array_agg
we’re going to use the
following query:
> select format('(%s)',
array_to_string(array_agg(rowid), '&')
)::query_int as query
from tags
where tag = 'blues' or tag = 'rhythm and blues';
query
---------
3 & 739
(1 row)
Time: 0.747 ms
That query here allows us to easily inject as many tags as we want to, so that it’s easy to use it as a template from within an application where the user is going to provide for the tags list. The intarray extension’s query format also accepts other operators ( or and not) as we saw before, so if you want to expose those to your users you would need to tweak the query_int building part of the SQL.
Now, how many tracks have been tagged with both the blues and the rhythm and blues tags, will you ask me:
> with t(query) as (
select format('(%s)',
array_to_string(array_agg(rowid), '&')
)::query_int as query
from tags
where tag = 'blues' or tag = 'rhythm and blues'
)
select count(*) from track_tags, t
where tags @@ query;
count
-------
2278
(1 row)
Time: 8.242 ms
Now of course you might want to fetch some track meta-data, here the only one we have is the track hash id:
> with t(query) as (
select format('(%s)',
array_to_string(array_agg(rowid), '&')
)::query_int as query
from tags
where tag = 'blues' or tag = 'rhythm and blues'
)
select track.tid
from track_tags tt join tids track on tt.tid = track.rowid, t
where tt.tags @@ t.query
limit 10;
tid
--------------------
TRCJLCC12903CBF4AE
TRCIFOV128F92F6F4C
TRCYUVJ128F425C8F1
TRCNTFO128F92F6564
TRCDRGT12903CE64BF
TRCWAED128F42A837B
TRCWFEM128F9320F94
TRCQCQH128F932E707
TRCUMTA12903CD67EE
TRJJYUT12903CFB13B
(10 rows)
Time: 7.630 ms
Conclusion
The usual way to handle a set of user defined tags and query against it involves join against a reference table of tags, but then it’s quite complicated to express the full search query: we want tracks tagged with both blues and rhythm and blues and might want then to exclude finger picking.
The intarray extension provides a powerful query specialized language with direct index support, so that you can build dynamic indexes searches directly from your application.