Sunday, September 26 2010
PostgreSQL, Emacs

Regexp performances and Finite Automata

The major reason why I dislike perl so much, and ruby too, and the thing I'd want different in the Emacs Lisp API so far is how they set developers mind into using regexp. You know the quote, don't you?

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

That said, some situations require the use of regexp — or are so much simpler to solve using them than the maintenance hell you're building here ain't that big a drag. The given expressiveness is hard to match with any other solution, to the point I sometime use them in my code (well I use rx to lower the burden sometime, just see this example).

(rx bol (zero-or-more blank) (one-or-more digit) ":")

The thing you might want to know about regexp is that computing them is an heavy task usually involving parsing their representation, compiling it to some executable code, and then executing generated code. It's been showed in the past (as soon as 1968) that a regexp is just another way to write a finite automata, at least as soon as you don't need backtracking. The writing of this article is my reaction to reading Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...), a very interesting article — see the benchmarks in there.

The bulk of it is that we find mainly two categories of regexp engine in the wild, those that are using NFA and DFA intermediate representation techniques, and the others. Our beloved PostgreSQL sure offers the feature, it's the ~ and ~* operators. The implementation here is based on Henry Spencer's work, which the aforementioned article says

became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on.

Having a look at the actual implementation shows that indeed, current PostgreSQL code for regexp matching uses intermediate representations of them as NFA and DFA. The code is quite complex, even more than I though it would be, and I didn't have the time it would take to check it against the proposed one from the simple and fast article.

  -rw-r--r--   1 dim  staff   4362 Sep 25 20:59 COPYRIGHT
  -rw-r--r--   1 dim  staff    614 Sep 25 20:59 Makefile
  -rw-r--r--   1 dim  staff  28217 Sep 25 20:59 re_syntax.n
  -rw-r--r--   1 dim  staff  16589 Sep 25 20:59 regc_color.c
  -rw-r--r--   1 dim  staff   3464 Sep 25 20:59 regc_cvec.c
  -rw-r--r--   1 dim  staff  25036 Sep 25 20:59 regc_lex.c
  -rw-r--r--   1 dim  staff  16845 Sep 25 20:59 regc_locale.c
  -rw-r--r--   1 dim  staff  35917 Sep 25 20:59 regc_nfa.c
  -rw-r--r--   1 dim  staff  50714 Sep 25 20:59 regcomp.c
  -rw-r--r--   1 dim  staff  17368 Sep 25 20:59 rege_dfa.c
  -rw-r--r--   1 dim  staff   3627 Sep 25 20:59 regerror.c
  -rw-r--r--   1 dim  staff  27664 Sep 25 20:59 regexec.c
  -rw-r--r--   1 dim  staff   2122 Sep 25 20:59 regfree.c

So all in all, I'll continue avoiding regexp as much as I currently do, and will maintain my tendency to using awk when I need them on files (it allows to refine the searching without resorting to more and more pipes in the command line). And as far as resorting to using regexp in PostgreSQL is concerned, it seems that the code here is already about topnotch. Once more.