Using trigrams against typos

In our ongoing Tour of Extensions we played with earth distance in How far is the nearest pub? then with hstore in a series about trigger, first to generalize Trigger Parameters then to enable us to Auditing Changes with Hstore. Today we are going to work with pg_trgm which is the trigrams PostgreSQL extension: its usage got seriously enhanced in recent PostgreSQL releases and it’s now a poor’s man Full Text Search engine.

Contents

Of course we also have the rich men version with Text Search Parsers and several kinds of dictionnaries with support for stemming, thesaurus or synomyms support, and a full text query language and tools for ranking search result. So if what you need really is Full Text Search then go check the docs.

The use trigrams is often complementary to Full Text Search. With trigrams we can implement typing correction suggestions or index like and POSIX Regular Expressions searches.

Whatever the use case, it all begins as usual by enabling the extension within your database server. If you’re running from PostgreSQL packages be sure to always install the contrib package, really. A time will come when you need it and you will then be happy to only have to type CREATE EXTENSION to get started.

# create extension pg_trgm;
CREATE EXTENSION

Setting up the use case

The use case I want to show today is to suggest corrections to some words the user did obviously typoed, because your search form is not finding any result. Or to offer suggest as you type feature maybe, doing a database search for approximate matching strings in a kind of catalog that you have to offer auto-completion.

One easy to use catalog here is the Dell DVD Store Database Test Suite that you can download also as a ready to use PostgreSQL text dump at http://pgfoundry.org/frs/download.php/543/dellstore2-normal-1.0.tar.gz.

This small database offers ten thousands products and simplifies the schema so much as to offer a single column actor in the products table. Let’s pretend we just filled in a search box to find products by actor name, but we don’t know the right spelling of the actor’s name or maybe the cat really wanted to help us on the keyboard that day.

The trigram extension comes with two operators of interest for this situation here, which are the similarity operator named % and the distance operator named <->. The similarity operator will compare the list of trigrams extracted from the query terms with those extracted from each data of our table, and filter out those rows where the data is considered not similar enough.

> select show_trgm('tomy') as tomy,
         show_trgm('Tomy') as "Tomy",
         show_trgm('tom torn') as "tom torn",
         similarity('tomy', 'tom'),
         similarity('dim', 'tom');

-[ RECORD 1 ]-------------------------------------
tomy       | {"  t"," to","my ",omy,tom}
Tomy       | {"  t"," to","my ",omy,tom}
tom torn   | {"  t"," to","om ",orn,"rn ",tom,tor}
similarity | 0.5
similarity | 0

As you can read in the PostgreSQL trigram extension documentation the default similarity threshold is 0.3 and you can tweak it by using the functions set_limit().

Now let’s find out all those actors whose name looks like tomy, as clearly the user did enter that in the search box but we found no exact match for it:

> select *
    from products
   where actor ~* 'tomy';
 
 prod_id | category | title | actor | price | special | common_prod_id 
---------+----------+-------+-------+-------+---------+----------------
(0 rows)

> select actor
    from products
   where actor % 'tomy';
  
  actor   
----------
 TOM TORN
 TOM DAY
(2 rows)

Time: 26.972 ms

Trigram indexing

That’s a little too much time on that query when we consider only 10,000 entries in our table, let’s try and do better than that:

# create index on products using gist(actor gist_trgm_ops);
CREATE INDEX

Now if we run the exact same query we get our result in less than 3 milliseconds, which is more like something we can push to production.

> select actor
    from products
   where actor % 'tomy';
  
  actor   
----------
 TOM TORN
 TOM DAY
(2 rows)

Time: 2.695 ms

Oh and by the way, did you know that the ~* operator we used above to discover that there’s not a single Tony actor in our products table, that ~* operator implements a case insensitive posix regex search in PostgreSQL? Isn’t that awesome? Now, on to the next surprise, have a look at that explain plan:

> explain (costs off)
  select * from products where actor ~* 'tomy';
                   QUERY PLAN                    
-------------------------------------------------
 Index Scan using products_actor_idx on products
   Index Cond: ((actor)::text ~* 'tomy'::text)
(2 rows)

In PostgreSQL 9.3 the trigram extension is able to solve regular expression searches. The first production release of 9.3 should happen as soon as next week, I hope you’re ready for it!

Auto Completion

What if you want to offer as-you-type completion to the names of the actors we know in our catalog? Then maybe you will find the following query useful:

>   select actor
      from products
     where actor % 'fran'
  order by actor <-> 'fran'
     limit 10;
    actor     
--------------
 FRANK HAWKE
 FRANK BERRY
 FRANK POSEY
 FRANK HAWKE
 FRANCES DEE
 FRANK LEIGH
 FRANCES DAY
 FRANK FOSTER
 FRANK HORNE
 FRANK TOMEI
(10 rows)

Time: 2.960 ms

Note that without the WHERE clause to filter on the trigram similarity I get run times of 30ms rather than 3ms in my tests here, because the GiST index is not used then. As usual EXPLAIN is your friend and remember that a query plan will change depending on the volume of your data set as known by the PostgreSQL planner statistics.

Conclusion

The trigram extension allows indexing like searches and regular expression searches, and also know how to compute similarity and distance in between texts, and how to index that. That’s another power tool included with PostgreSQL. Another reason why you won’t believe how much behind the other database systems you know of really are, if you ask me.

Oh, and get ready for PostgreSQL 9.3. Another release packed with awesome.