Thursday, 1 December 2011

why I avoid QRegExp in Qt 4 and so should you

A few times, when I've been working on something performance-critical, I've had people suggest (or ask me to review code) using QRegExp. I usually tell them that "this is a bad idea, QRegExp is slow/unmaintained", but I never actually sat down to do benchmarks. Well, in Qt 5, the subject of replacing QRegExp has been discussed a bit, and we have a volunteer: Giuseppe D'Angelo, a Qt hacker and italian student living in the UK (looking for work, by the way!).

One of the first steps he has taken has been to quantify the performance of two of our leading candidates for replacements: PCRE, and ICU's regex engine. He also took the liberty of benchmarking Qt's existing QRegExp - here's the results:

(Caveat: Giuseppe has asked for someone more familiar with ICU to look over the code there to make sure the results aren't negatively impacted.)


RESULT : REBenchmark::PCREBenchmark():"URI":
     1,123 msecs per iteration (total: 1,123, iterations: 1)
RESULT : REBenchmark::PCREBenchmark():"Email":
     1,798 msecs per iteration (total: 1,798, iterations: 1)
RESULT : REBenchmark::PCREBenchmark():"Date":
     99 msecs per iteration (total: 99, iterations: 1)
RESULT : REBenchmark::PCREBenchmark():"URI|Email":
     2,650 msecs per iteration (total: 2,650, iterations: 1)

RESULT : REBenchmark::ICUBenchmark():"URI":
     11,674 msecs per iteration (total: 11,674, iterations: 1)
RESULT : REBenchmark::ICUBenchmark():"Email":
     17,056 msecs per iteration (total: 17,056, iterations: 1)
RESULT : REBenchmark::ICUBenchmark():"Date":
     392 msecs per iteration (total: 392, iterations: 1)
RESULT : REBenchmark::ICUBenchmark():"URI|Email":
     30,552 msecs per iteration (total: 30,552, iterations: 1)

RESULT : REBenchmark::QRegExpBenchmark():"URI":
     21,579 msecs per iteration (total: 21,579, iterations: 1)
RESULT : REBenchmark::QRegExpBenchmark():"Email":
     55,426 msecs per iteration (total: 55,426, iterations: 1)
RESULT : REBenchmark::QRegExpBenchmark():"Date":
     1,357 msecs per iteration (total: 1,357, iterations: 1)
RESULT : REBenchmark::QRegExpBenchmark():"URI|Email":
     77,224 msecs per iteration (total: 77,224, iterations: 1)

(These tests were run on: Ubuntu 10.04 LTS 32 bit, Core(TM)2 Duo CPU P9600  @ 2.66GHz, 4GB RAM, GCC 4.4.3, Qt 4.7.4, PCRE 8.20, ICU 4.8.1.1)

As we see from these results, PCRE is the clear winner, probably thanks to its use of JIT in evaluating the expressions. We see ICU showing decent results, and getting second place (see the caveat above). Leading the rear of the pack, we see QRegExp, 10x-30x slower than PCRE - and these are hardly complicated regular expressions.

In conclusion, for now, if you want performance, you should probably steer clear of QRegExp, and find other ways to do what you want, either using QString directly if it's a simple job, or looking into a light shim around another regex engine until Qt 5, when hopefully, your problems will go away with a shiny new class.

16 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Interesting measurement. Still would be interesting to see numbers for code that keeps a compiled regexp around.

    Also I'd slightly modify your advice: I'd say try QRegExp, but measure if performance is acceptable for your code. Not using the batteries included your toolkit always has a maintenance cost, and then Qt5 also isn't too far away.

    ReplyDelete
  3. that unknown is me, btw. the blogs behind blogger.com just don't know how to deal with relational databases.

    ReplyDelete
  4. QRegExp implicitly keeps a cache of compiled forms internally. There's no seperate form of a compiled and raw pattern.

    (Unless you define QT_NO_REGEXP_OPTIM, for some strange reason.)

    ReplyDelete
  5. I always use PCRE for regexes, it's fast and has rich functionality which I use, for example retrieving named subpatterns (which most other regex libraries simply don't support).

    ReplyDelete
  6. I doubt that regular expression libraries should be used at all in production quality code. It is practical for doing some quick conversions, but you should not forget that it is hackish, for example there are not any compile time checks when using such libraries. And in many cases regular expressions are not even theoretically appropriate to process the given language, in these cases they tend to be unreliable heuristics.

    ReplyDelete
  7. If I were to replace QRegExp in Qt5, I'd probably try replacing it with re2 rather than PCRE or ICU: http://code.google.com/p/re2/

    ReplyDelete
  8. hopefully by "shiny new class" it isn't meant "we'll continue to have QRegExp, just as slow as always, and another regular expression solution that is fast but has a different API." two APIs for the same thing is going to be messy, moving QRegExp into a qt4 support library is going to mean a lot of apps will end up linking to the support library just for that unless the replacement is very close in API ... in short .. there's not enough detail here to know whether to be excited or concerned.

    can you clarify what the plans are for regexp in Qt5? thanks ...

    ReplyDelete
  9. @Aaron: primarily for backwards compatibility, it's not possible to do this as an in-place modification of QRegExp. Different engines have slightly different syntax, so just changing the engine out from underneath everyone is probably going to result in a lot of previously working code suddenly not working, or not working quite as expected.

    Granted, a lot of those cases will be trivial, or trivial to fix - but for Qt 5, where compatibility with Qt 4 is paramount, it's really not worth the risk of breaking something that worked before.

    QRegExp's API is also not considered to be good in some ways - this isn't fixable without a new class.

    This having been said, I trust the people working on it - and the people who will be reviewing the new API - to point out that compatibility should be a goal where possible, to make for a smooth and easy transition.

    @Eike: My understanding is that re2 lacks a number of features that PCRE has, a number of which have been requested over time.

    ReplyDelete
  10. @The User: Are they? Regular expressions perfectly match any regular language. Of course they fail at parsing context free languages, still for whatever strange reason, they usually are used to build the grammar rule of context free languages. Must miss something.

    @Robin: Yes, seems that hidden cache is good enough. Still (like said on IRC) it's worth to point out that the text file used in those tests is ridiculously huge: Several megabytes. When cut to just 10 kLOC, QRegExp finishes all tests within less than 200 msec. Good enough for practical use cases. Therefore would say: No point in ripping QRegExp out of your apps today. You usually can wait until Qt5 delivers.

    ReplyDelete
  11. @Shmeri: About named capture groups, they are not too hard to implement with QRegExp[1]. Wondering if it's worth to still propose such an enhancement for rosty QRegExp still.

    [1] https://gitorious.org/qtaschenorakel/qtaschenorakel/blobs/master/src/sundance/dispatcher.cpp#line15

    ReplyDelete
  12. Can we make use of QRegExp::setPatternSyntax() to deal with the issue of a new version using a slightly different syntax?

    I agree with Aaron that it would be sucky to have two RegExp classes. Is Qt 5 promising source-code compatibility? If it's not 100% assured, I think it's better to change the API now, if necessary. It'll reduce confusion in the long run, and will alert people to the fact that existing QRegExp is slow. (I certainly didn't know this before.)

    ReplyDelete
  13. @Giddie: that doesn't fix the case of QRegExp's API not being optimal, nor is it that nice a solution. It would be forced (due to BC) to keep giving everyone the slow QRegExp implementation, making yet another pitfall that people have to be aware of to get good application performance.

    Qt 5 is promising near-100% SC, only being broken in cases that really warrant it, and while it could be discussed on the ML, I don't think you'd find this case would be one of them, given how much QRegExp is used.

    ReplyDelete
  14. Hi everybody. I'll try to answer as many comments as possible, however I encourage everybody to write on the Qt5 development ML :)

    About the problems with the QRegExp API and feature set: you can check out a list of issues here (it's a WIP page, any help is welcome).

    The biggest problem for keeping the current API is the fact that a QRegExp object is both the pattern and the result of a match. Matching a string with a QRegExp will modify that object, and force you to have N QRegExp objects (for the same regexp) to match N different strings. That's a poor API which needs to be fixed by introducing at least another class, namely, the result of a match.

    QRegExp will not go away in order not to break existing software, and the new API should make porting software as straightforward as possible (static polymorphism, etc.).

    About RE2: unfortunately, it lacks some features, f.i. lookaheads and lookbehinds, which were requested in the list of issues above.

    About the tests: as I said, I'll do more benchmarking covering some other use case, f.i. a one-shot regexp in short to medium texts (this requires some tradeoffs between spending time while compiling the regexp vs. matching).

    @Shmeri: PCRE, RE2 etc. support named patterns. ICU doesn't. Don't ask me why. :-)

    ReplyDelete
  15. Of course regular expressions are fine, I talked about such libraries and their usage.

    ReplyDelete
  16. Is there a way to test Boost::Regex aka std::regex and the other regex systems in boost?

    ReplyDelete