Thursday, April 2, 2015

A darn good day's work

Monday and Tuesday I implemented the production of sort vectors in worddata, basically recreating what I had done in my table sorting testbed. Today I made the many detail changes in wordview to remove its QSortFilterProxyModel and perform all the sorting and filtering with my own code. Many many little changes. But now it is working nicely. When I opened a 1.5MB text file that uses over 13,000 unique words, it took just over 2 seconds to do the "Refresh" operation. That includes scanning the whole file and counting and categorizing the words, plus doing the initial ascending sort on column 0. Subsequent sort actions that need a new vector, that is, clicking on the head of column 2 or column 3, take about a second. Sort actions that can re-use a cached vector, for example going back to ascending sort on column 0 after sorting on a different column or different order, are effectively instantaneous.
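The caching behavior described above can be sketched roughly as follows. This is only an illustration, not the actual worddata code: the class name `SortVectorCache`, its methods, and the `builds` counter are all hypothetical, and it assumes each table row is a simple tuple. A sort vector here is a list of row indexes in display order, built once per (column, order) pair and reused afterward.

```python
class SortVectorCache:
    """Hypothetical sketch: keep one index vector per (column, descending) pair."""

    def __init__(self, rows):
        self.rows = rows        # list of tuples, one tuple per table row
        self._vectors = {}      # (column, descending) -> list of row indexes
        self.builds = 0         # how many vectors were actually computed

    def invalidate(self):
        # Would be called after a Refresh, when the underlying rows change.
        self._vectors.clear()

    def vector(self, column, descending=False):
        key = (column, descending)
        if key not in self._vectors:
            # The slow path: sort the row indexes by the chosen column.
            self.builds += 1
            self._vectors[key] = sorted(
                range(len(self.rows)),
                key=lambda i: self.rows[i][column],
                reverse=descending,
            )
        # The fast path: a vector built earlier is returned as-is.
        return self._vectors[key]


rows = [("pear", 3), ("apple", 7), ("fig", 1)]
cache = SortVectorCache(rows)
by_word = cache.vector(0)                         # built on first request
by_count_desc = cache.vector(1, descending=True)  # built on first request
by_word_again = cache.vector(0)                   # served from the cache
```

The view then maps display row n to `rows[vector[n]]`, so returning to an earlier sort costs only a dictionary lookup.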

In the course of this I added support for the Home and End keys, so I could quickly pop to the top or bottom of the table.

I also found an old bug, one that probably is present in version 1 as well. Part of the Refresh is to note the properties of a word-token. The logic went something like this:

    if the word contains an apostrophe,
        note the AP property
        strip out the apostrophe(s)
    if the word contains a hyphen,
        note the HY property
        strip out the hyphen(s)
    if not word.isalpha() : # word contains some digits
        note the ND property
    ...

The problem here is that I was assuming that a return of False from the Python str.isalpha() string method meant there were digits in the word. Not so; it means there is some character in the word that does not have the Unicode Letter property. Well, I'd removed hyphens and apostrophes; the only non-Letters would be digits, no?

No! I was forgetting the DP convention of representing non-ASCII characters with bracket notation, for example [~n] for ñ or [c,] for ç. So a legitimate word could contain several non-letters and still not be numeric. I had to change the logic to look specifically for digit characters.
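A minimal sketch of the corrected test, assuming the word arrives as a plain Python string. The function name `word_properties` and the use of a set are my own illustration here, not the real PPQT code; only the AP, HY and ND property names come from the logic above.

```python
def word_properties(word):
    """Hypothetical sketch: return the set of property codes for one word-token."""
    props = set()
    if "'" in word:
        props.add("AP")                 # word contains an apostrophe
        word = word.replace("'", "")
    if "-" in word:
        props.add("HY")                 # word contains a hyphen
        word = word.replace("-", "")
    # Look specifically for digit characters instead of relying on
    # "not word.isalpha()", which is also true for bracket notation.
    if any(ch.isdigit() for ch in word):
        props.add("ND")
    return props


bracketed = word_properties("ma[~n]ana")   # brackets and tilde, but no digits
numeric = word_properties("1st")
mixed = word_properties("'49-er")
```

With this test a word such as ma[~n]ana gets no ND property even though `"ma[~n]ana".isalpha()` returns False.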

I also ran into an unexpected problem with the natsort package, which I posted as an issue at the package's GitHub site, and was pleased when the maintainer got back to me in less than an hour. I need to up my support game. I've had this kind of super-responsive support before from amateur maintainers and it is so satisfying. I must try to remember that and do as well.

Anyway, next up is upgrading my laptop to Yosemite, which I've been putting off, and upgrading my desktop system to [Py]Qt5.4.1. And then, one more try to see if Nuitka can package PPQT. Next week will be devoted to packaging up the alpha level of PPQT2 for Mac, Linux and Windows -- using Nuitka if possible, else PyInstaller.
