Friday, February 14, 2014

Converting PPQT: which RE lib?

I'm working on a lengthy project to make version 2 of PPQT, a large Python/Qt app. I'm documenting some of the things I learn in occasional blog posts.

PPQT 1 makes frequent use of regular expressions, mostly through Qt's QRegExp class. That has to change for two reasons. One is that QRegExp falls quite a bit short of PCRE compatibility. Qt5 includes a new class, QRegularExpression, which does claim PCRE compatibility as well as better performance, so at the very least I want to convert the old regexes to the longer-named class.

The second reason is a big difference from PyQt4 to PyQt5: the "new API" that abolishes use of QString. In PyQt4 many class methods take, or return, QStrings, and PPQT uses lots of QString objects. QStrings and QRegExps work well together; QRegExp.indexIn() takes a QString, and QString.find() takes a QRegExp.

In PyQt5, all classes that (in the C++ documentation) take or return a QString, now take or return a simple Python string value, with PyQt5 doing automatic conversion. There is no "QString" class in PyQt5—at all! That means there is no way to call QString.find(), and if you call QRegExp.indexIn(string), there will be a hidden conversion from Python to QString. Which means—why use Qt regexes at all? Since all program-accessible strings are Python strings, why not use Python's own regular expression support?

Standard Python support is the "re" lib. It also is not PCRE compatible (although closer than QRegExp) and not known for speed. But there is another: the "regex" module, which intends to become the Python standard but now is an optional install. It is PCRE-compatible, with the Unicode property searches and Unicode case-folding that are lacking in QRegExp and in the re module. It actually adds more functionality, including "fuzzy" matches that could be very useful to me in PPQT. The class and method names are the same as the standard re module.
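Because the names match, trying out regex can be as small a change as swapping the import. Here is a minimal sketch of that idea. The fallback to re, the alias relib, the pattern, and the function name are all made up for illustration and are not from PPQT:

```python
# Prefer the third-party "regex" module; fall back to the standard "re".
# The alias "relib" is only an illustrative name.
try:
    import regex as relib  # optional install, e.g. "pip install regex"
except ImportError:
    import re as relib

# Hypothetical pattern: a word, whitespace, then a run of digits.
word_number = relib.compile(r'(\w+)\s+(\d+)')

def first_number(text):
    """Return the digits of the first 'word number' pair, or None."""
    match = word_number.search(text)
    return match.group(2) if match else None
```

The code after the import uses only compile(), search() and group(), which both modules provide under the same names. Anything specific to regex, such as fuzzy matching, would of course not work with the fallback.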

Code Changes

One design difference will force some code changes. It separates Python's re/regex and QRegularExpression on one side from QRegExp, which PPQT 1 uses heavily, on the other.

An instance of QRegExp is not reentrant: when it is used, it stores information about the match position and capture groups in the regex object. Such an object shouldn't be a global or class variable shared between instances of a class, because activity in one using method could overwrite a match found from another. But based on its design, PPQT 1 had frequent uses like this:

    if 0 <= re_object.indexIn(string):
        cap1 = re_object.cap(1)

Both re/regex and QRegularExpression take a different approach: the regex object knows about the search pattern but is otherwise immutable. When you perform a search with it, it returns a match object that encodes the positions and lengths of the matched and captured substrings. The regex object can be a global; every using method gets its own private match object to work with. However, code like that above has to be rewritten (using Python re/regex) as:

    match = re_object.search(string) # search() scans the whole string, like indexIn()
    if match: # i.e. result was not None
        cap1 = match.group(1)
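To make the sharing point concrete, here is a small sketch using the standard re module. The pattern and the function names are hypothetical, chosen only for illustration. One compiled regex lives at module level, and each search gets its own match object:

```python
import re

# Compiled once and shared by every caller; hypothetical "key=value" pattern.
KEY_VALUE = re.compile(r'(\w+)=(\w+)')

def get_key(text):
    """Return the key of the first key=value pair, or None."""
    match = KEY_VALUE.search(text)  # a match object private to this call
    return match.group(1) if match else None

def two_searches():
    """Search twice with the same regex; the first result stays intact."""
    first = KEY_VALUE.search('alpha=1')
    second = KEY_VALUE.search('beta=2')  # does not disturb 'first'
    return first.group(1), second.group(1)
```

With a shared QRegExp, the second search would have overwritten the captures from the first. Here both match objects remain usable side by side.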

In Python's re/regex, a search returns None on failure, or a match object on success. None evaluates as False, so "if match" is equivalent to "if a match was found." QRegularExpression.match(), by contrast, always returns a (take a deep breath) QRegularExpressionMatch object, which you then have to ask whether it matched, so the equivalent would be:

    match = re_object.match(string)
    if match.hasMatch(): # match succeeded
        cap1 = match.captured(1)

Not only is this many more keystrokes to write, it entails two pointless auto-conversions between Python and Qt string types: from Python to QString in the match() call, and from QString to Python in the captured(1) call. All told, the Python re/regex approach seems the better choice and I plan to use it exclusively in PPQT 2.
