Thursday, February 27, 2014

A really dumb error and hard to find...

Just a little post about a stupid mistake to keep the blog alive.

So I am to the point of beginning unit-test of another module. (So far there are two (2!!!) modules that are coded and have unit-test drivers that invoke 100% code coverage and they run, ta-daa!) So this will be the third. (Third in a series of a-number-too-large-to-contemplate-on-an-empty-stomach.)

So worddata.py is the first module that needs access to resources created by The Book, the thing that holds all resources unique to one document/book. Specifically it needs access to the metadata manager so it can register readers and writers for four kinds of metadata, and access to a spellcheck object, and to The Document, an enhanced QTextDocument that is the data model for the editor.

Which meant I had to create some kind of stubbed-out Book class that could create those three resources (editdata and metadata are the two coded-and-running modules, and spellcheck is another stub so far), and return references to them on request. So I did, it's 20 lines of code starting with the usual stuff,

class Book(QObject):
    def _init_(self, main_window): #TODO: API?
        super().__init__(main_window)
        #
        # Create the metadata manager
        #
        self.metamgr = metadata.MetaMgr(self)
        etc etc...

Notice anything wrong? I didn't, and my fingers typed it.

So execute this puppy: the line the_book = book.Book(fake_main_window) executes instantly, but later code fails because "Book object has no attribute 'metamgr'". WTF? It's right there, second line in the initializer. Put a breakpoint on that line; sure enough, it is never executed.

Really, it took me ten minutes to notice the shortage of underscores on _init_. Of course in Python _init_ is a valid function name. It just won't be called to initialize a new object... You might ask, how could I get it wrong for my own class but call the correct name when initializing the super-class? Because Wing IDE recognizes "super()." and pops up a list with the only two options, "__class__" and "__init__" and I just blindly clicked the one I wanted.
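The failure mode is easy to reproduce in miniature (a toy example, not PPQT code):

```python
# Python happily treats _init_ as an ordinary method name; it is just
# never invoked automatically when an object is constructed.

class Broken:
    def _init_(self):           # oops: one underscore short on each side
        self.value = 42

class Fixed:
    def __init__(self):
        self.value = 42

b = Broken()                    # _init_ is never called
print(hasattr(b, 'value'))     # False: no 'value' attribute exists
f = Fixed()
print(f.value)                 # 42
```

The first attribute access on a Broken instance raises exactly the AttributeError described above.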

OK while I am venting, before I could even get to this point, I had to solve another issue. The worddata module depends critically on blist.sorteddict, but the "from blist import sorteddict" statement failed. No blist installed. Oh right, I tried it out under Python 2.7; need to install it in Python 3.3 now, too. "pip install blist" fails with obscure error message. Google around and search SO and figure out it must be back-level setuptools. "pip install -U setuptools" fails with a different, obscure message. More googling, at least an hour wasted. However, "easy_install -U setuptools" does work, after which "pip install blist" works.

And may I say the current state of python distribution tools is horrendous! There is pip and easy_install which rely on setuptools or distutils or distutils2, and there are zips and eggs and now wheels, and numpy is off on its own thing with conda... argle bargle. Piffle.

Tuesday, February 18, 2014

Model-View design and user expectations of performance

PPQT presents

  • A table of the words in the document with their properties such as uppercase, numeric, misspelled.
  • A table of the characters in the document with their counts and Unicode categories
  • A table of the book pages, derived from the original PGDP page-separator lines

Each of these tables is derived from a "census" in which every line, word-token, and character in the document is counted. In v.1 this census is done the first time a book is opened, and again any time the user wants to "refresh" the display of word or character counts. It's very time-consuming, 5 to 20 seconds for a large book. Getting the time down for v.2 would be a good thing. So would avoiding a big delay during first opening of a new book.

In v.1 the census is done in one rather massive block of code that fetches each line from the QTextDocument in turn as a QString and parses each, counting the characters and using a massive regex to pick out wordlike tokens. This process is complicated by:

  • The need to handle the PG codes for non-Latin-1 chars such as [oe].
  • The need to recognize HTML-like productions: some like <i> and <sc> are common from the start, and later in the book-production process there might be thousands of HTML codes; we count them for characters but not for "words".
  • But also the need to spot the lang=code property embedded in HTML codes, to signal use of an alternate spelling dictionary.

For v.2 I want to break up the management of all these metadata along MVC lines, with a "data" module and a "view" module for each type, so worddata.py manages the list of words while wordview.py contains the code to present that data using a QTableView and assorted buttons. Similarly for chardata/charview and pagedata/pageview. But will this complicate the census process? Will it slow it down?

Complicate it? Not exactly; more like "distribute" it. I will move each type of census to its data model: worddata will take a word census, chardata a char census, pagedata a page census. So a full census could potentially entail three passes over the document.

However, when this separation is done, it becomes clear that the only census that really needs to be done the first time a book is opened is the page census. That's because the module that displays the matching page-scan image as the user moves through the text needs to know the position of each page's start. In other words, pagedata is the data model for both the page table and the image-display panel. Images need to be displayed immediately, so the page data needs to be censused the first time a book is opened.

The word and char censii, however, can wait. The char data is the model only for the Char panel. If that panel is showing an empty table, the user knows to click its "Refresh" button to make a char census happen, so the table updates.

The word data is the model for the Word panel, and again, if the user opens a new book and goes to the Word panel, and sees an empty table, it's a no-brainer to click Refresh and update the table. In either case, the user knows they've asked for something, and should be content to wait while the progress bar turns and the census finishes.

The word data is also the model, however, for the display of misspelled words with a red underline, and the display of "scannos", highlighted document words that appear in a file of likely OCR errors. These features of the editor are turned on with a menu choice (? or perhaps a check box in v.2? TBS). If either highlighter is set ON when a new book is opened, the highlights won't happen because the word data isn't known until a census is taken.

Easy solution: we know when we are opening a new book (we don't see a matching metadata file from a prior save), and in that case we force OFF the spellcheck and scanno highlight choices. Then if/when the user clicks spelling or scanno highlights ON, we can run a census at that time. Again the potentially slow process is initiated by an explicit user action.
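That open-time decision is simple enough to sketch. The names below (open_book, the highlights dict, the '.meta' suffix) are illustrative only, not PPQT's actual API:

```python
import os

def open_book(book_path, highlights):
    # Hypothetical sketch of the open-time logic described above.
    meta_path = book_path + '.meta'
    if not os.path.exists(meta_path):
        # New book: no metadata from a prior save, hence no word census
        # yet. Force the highlight features off; a census can be run
        # later when the user explicitly turns one of them back on.
        highlights['spellcheck'] = False
        highlights['scannos'] = False
    return highlights

# A path with no matching .meta file simulates opening a brand-new book.
prefs = open_book('no-such-book.txt', {'spellcheck': True, 'scannos': True})
print(prefs)
```

When a .meta file from a prior save does exist, the user's saved highlight choices pass through untouched.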

What about (perceived) performance? It should be snappier. If you Refresh the Chars panel it will rip through the document counting characters, but not spend time on the big word-token regex html-skipping process. Refresh the Words panel and its census will at least not be slowed by counting characters.

Great, but I already started coding worddata on the assumption it would be the data model for both chars and words. Now I have to split it up.

Monday, February 17, 2014

Python logging and unit testing

PPQT 2 is to be pretty much a complete rewrite of version 1. I built the first version in an ad-hoc way, adding features one at a time to the basic editor, and as a result its software structure is rather ramshackle. Information about different data structures and formats leaks all over. Now that I know where it's going, the next version can be properly compartmentalized and structured.

And better-tested! V1 got "tested" by my using it. V2, I am determined, will have a separate unit-test driver for each module, and every added function means adding test code to exercise it. We be professional here!

And logging! V1 has no logging of any kind. There may be one or two places where an except clause has a print statement in it (blush) but that's it. So I read up on Python logging, and each module will have its named logger and log some occasional INFO lines, always WARN lines where the module is working around some problem, and occasionally ERROR lines.

So the first module finished (yay!) is metadata.py and it has several places where it detects and logs errors. So how, in the matching metadata_test.py, can I test whether the module wrote the expected thing to the log?

There may be better ways, but this is how I'm doing it. First, at the top of the test module is this, which I expect will be boilerplate repeated in every test driver.

# set up logging to a stream
import io
import logging

log_stream = io.StringIO()
logging.basicConfig(stream=log_stream, level=logging.INFO)

def check_log(text):
    '''Check that log_stream contains text; rewind and clear the log; return True/False.'''
    log_data = log_stream.getvalue()
    log_stream.seek(0)
    log_stream.truncate()
    return text in log_data

During execution of the unit test, log output is directed to an in-memory stream. In the test code, the module under test is provoked into seeing an error that should cause it to write a log line. Then you can just code assert check_log('some text the test should have logged'). The assertion fails if the string isn't in the log. If it succeeds, execution continues with the log cleared out for the next test.

Looking at it now, I think maybe check_log() should take two parameters, the text and the level, so as to verify that the message is at the expected level:

assert check_log('whatever',logging.WARN)

I'll leave that as an exercise. Meaning, I'm too lazy to do it now.

Incidentally, another goal of V2 is to have localized (e.g. translated) text in the visible UI. Perhaps log messages should also be translated but... nah.

Friday, February 14, 2014

Converting PPQT: which RE lib?

I'm working on a lengthy project to make version 2 of PPQT, a large Python/Qt app. I'm documenting some of the things I learn in occasional blog posts.

PPQT 1 makes frequent use of regular expressions, mostly using Qt's QRegExp class. That has to change for two reasons. One is that QRegExp falls quite a bit short of PCRE compatibility. Qt5 includes a new class, QRegularExpression, which does claim PCRE compatibility as well as performance, so at least I want to convert the old ones to the longer-named type.

However, one big difference from PyQt4 to 5 is the "new API" that abolishes use of QString. In PyQt4 many class methods take, or return, QStrings, and PPQT uses lots of QString objects. QStrings and QRegExps work well together; QRegExp.indexIn() takes a QString, and QString.indexOf() takes a QRegExp.

In PyQt5, all classes that (in the C++ documentation) take or return a QString now take or return a simple Python string value, with PyQt5 doing the conversion automatically. There is no "QString" class in PyQt5 at all! That means there is no way to call QString.indexOf(), and if you call QRegExp.indexIn(string), there will be a hidden conversion from Python to QString. Which raises the question: why use Qt regexes at all? Since all program-accessible strings are Python strings, why not use Python's own regular expression support?

Standard Python support is the "re" lib. It also is not PCRE compatible (although closer than QRegExp) and not known for speed. But there is another: the "regex" module, which is intended to become the Python standard but for now is an optional install. It is PCRE-compatible, with the Unicode property searches and Unicode case-folding that are lacking in QRegExp and in the re module. It actually adds more functionality, including "fuzzy" matches that could be very useful to me in PPQT. The class and method names are the same as the standard re module's, so it is a drop-in replacement.

Code Changes

One design difference between Python's re/regex and QRegularExpression on one side, and the QRegExp that PPQT 1 uses so many of on the other, will cause some code changes.

An instance of QRegExp is not reentrant: when it is used, it stores information about the match position and capture groups in the regex object. Such an object shouldn't be a global or class variable shared between instances of a class, because activity in one using method could overwrite a match found from another. But based on its design, PPQT 1 had frequent uses like this:

    if 0 <= re_object.indexIn(string):
        cap1 = re_object.cap(1)

Both re/regex and QRegularExpression take a different approach: the regex object knows the search pattern but is otherwise immutable. When you perform a search with it, it returns a match object that encodes the positions and lengths of the matched and captured substrings. The regex object can be a global; every using method gets its own private match object to work with. However, code like the above has to be rewritten (using Python re/regex) as:

    match = re_object.search(string)
    if match : # i.e. result was not None
        cap1 = match.group(1)

Python re/regex search() returns None on failure, or a match object on success. None evaluates as False, so "if match" is equivalent to "if a match was found." (Note that search() is the equivalent of indexIn(), since match() only matches at the start of the string.) The return value of a QRegularExpression match is always a (take a deep breath) QRegularExpressionMatch object, so the equivalent would be:

    match = re_object.match(string)
    if match.hasMatch() : # match succeeded
        cap1 = match.captured(1)

Not only is this many more keystrokes to write, it entails two pointless auto-conversions between Python and Qt string types: from Python to QString in the match() call, and from QString back to Python in captured(1). All told, the Python re/regex library seems the better choice and I plan to use it exclusively in PPQT 2.
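To make the re version concrete, here is a runnable demonstration of the points above: the compiled pattern is safe to share at module level because all match state lives in the returned match object, and search() scans anywhere in the string (like indexIn()) while match() anchors at position 0.

```python
import re

# A module-level compiled pattern is safe to share: all match state
# lives in the match object each call returns, not in the pattern.
RE_TOKEN = re.compile(r'(\w+)-(\w+)')

m = RE_TOKEN.search('a page-separator line')
print(m is not None)    # True: search() scans the whole string
print(m.group(1))       # page
print(m.group(2))       # separator

# match() anchors at index 0, so the same pattern fails here:
print(RE_TOKEN.match('a page-separator line'))  # None
```

The same code runs unchanged under the third-party regex module, since it mirrors re's class and method names.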