Tuesday, March 24, 2015

Wrapping up table sort experiments

Yesterday I speculated that perhaps the basic database didn't need to be a SortedDict, because it is always accessed through a sort vector. However, that is not the case. The reason is that the actual access to the database is by indexing a values-view on it. The base Python 3 dict class supports a values view, returned by dictname.values(). However, that view is only iterable: it can be used to process the dict values, as in for v in dictname.values(), but it cannot be indexed, as in dictname.values()[j]. An indexable values view is an added ability of the SortedDict class. So, even though its natural ordering will not be used, the base dict has to be a SortedDict.
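To make the difference concrete, here is a minimal stdlib-only sketch. The indexable view belongs to SortedDict (presumably from the sortedcontainers package); a list comprehension stands in for it here so the snippet has no third-party dependency:

```python
d = {'banana': 2, 'apple': 1}

# A plain dict values view is iterable but not indexable:
for v in d.values():
    pass                      # iteration works fine
try:
    d.values()[0]
except TypeError:
    pass                      # 'dict_values' object is not subscriptable

# An indexable, key-ordered equivalent requires materializing a list,
# which is what SortedDict's values view avoids:
vals = [d[k] for k in sorted(d)]
assert vals[0] == 1           # value for 'apple', the lowest key
```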

I added a filtering capability to the test-bed, with a popup menu to select a filter on the flags column. The final program looks like this:

The little popup has four choices. Choice 0 is "All", no filtering. The others request an X in a column of the flag value, for instance '.X.' selects only rows where the flag field has an X in the second position. Selecting one of these triggers a signal into this slot:

    filters = [
        None,
        lambda w, c, f: f[0]=='X',
        lambda w, c, f: f[1]=='X',
        lambda w, c, f: f[2]=='X'
        ]
    def set_filter(self, choice) :
        self.table_model.filter_func = self.filters[ choice ]
        self.table_model.do_sort()

The table model code then executes a sort, which centers on this:

self.sort_vector = get_column_vector( col, order, self.sort_key_func, self.filter_func )

And that leads to executing this:

KEY_GETTERS = [
    lambda j : VALUES_BY_ROW[j][0],
    lambda j : '{:05}{}'.format( VALUES_BY_ROW[j][1], VALUES_BY_ROW[j][0] ),
    lambda j : '{}{}'.format( VALUES_BY_ROW[j][2], VALUES_BY_ROW[j][0] )
    ]
def get_column_vector( col, order, key_func = None, filter_func = None ):
    global DATABASE, VALUES_BY_ROW, SORT_UP_DICTS, SORT_UP_VECTORS, SORT_DOWN_VECTORS
    if filter_func is None :
snip...
    else :
        # Build a filtered vector for the requested column, order and filter.
        getter_func = KEY_GETTERS[ col ]
        sort_dict = SortedDict( key_func )
        for j in range( len(DATABASE) ) :
            if filter_func( *VALUES_BY_ROW[j] ) :
                k = getter_func( j )
                sort_dict[ k ] = j
        vector = sort_dict.values()
        if order == Qt.DescendingOrder : # convert the up-vector to a down one
            vector = [ j for j in reversed( vector ) ]
    return vector

I ran into one misunderstanding about Python syntax. The core expression above is if filter_func(*VALUES_BY_ROW[j])... This selects the value of row j, which is a three-element tuple, and passes it into the given function. Initially I supposed I could write that function as, e.g., lambda (w,c,f): test-on-f, expecting the three parts of the input tuple to be assigned to the three local names; but in Python 3 that is a syntax error (tuple parameters were removed from function signatures). How does one pass the elements of a tuple into a lambda (or any other function)? One uses a leading asterisk to unpack it:

filter_func(VALUES_BY_ROW[j]) passes a single value: filter_func receives one argument, the whole tuple.

filter_func(*VALUES_BY_ROW[j]) passes the elements of the tuple as individual arguments: filter_func receives, in this case, three distinct values, so it can be coded lambda w,c,f : test-on-f.
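A tiny illustration of the difference, using a hypothetical row value with the same three-field shape as the test database:

```python
row = ('word', 42, '.X.')   # (word, count, flags) as in the fake database

filter_func = lambda w, c, f: f[1] == 'X'

# With the asterisk the tuple is unpacked into three arguments:
assert filter_func(*row) is True

# Without it, the lambda receives one argument (the tuple) and fails:
try:
    filter_func(row)
except TypeError:
    pass   # missing 2 required positional arguments: 'c' and 'f'
```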

The code is here in PasteBin. You can execute it from the command line, giving as an argument the number of rows in the fake database, e.g. python table-ownsort.py 20000. I tested its operation up to 64,000 rows. At more typical sizes of 10K-20K it takes 1-2 seconds to sort a column for the first time, and negligible time to change the sort order thereafter. The growth in time appears to be just about linear in the number of rows:

 2000 : 0.099
 4000 : 0.209
 8000 : 0.422
16000 : 0.856
32000 : 1.749
64000 : 3.394

That is a huge improvement over the sorting time of QSortFilterProxyModel, which grew much faster than linearly and was nearly unusable at 10000 rows.

The next big job is to retrofit this whole mechanism of sort vectors back into the existing PPQT code. I may get a start on that if there's some free time in Oklahoma City (and hey, after one visits the National Cowboy Museum and the Myriad Botanical Gardens, there is not a lot more to do...)

Monday, March 23, 2015

Filtering adds some issues

"And now on to filtering..." Um, not so fast, cowboy. There are some issues here.

First issue arises because I had been using QSFPM for two different kinds of filtering. The QSFPM lets you filter the rows of a table in two ways. One way is that you can simply set filterKeyColumn to a column number and plug a QRegExp into filterRegExp. Then the parent QSFPM object will apply that regex to the display value of that column in every row. If there's a match, the row is accepted for display; if not, not. I was using that method for all the filtering on a word's properties. Each regex tested the properties column for some combination of flags. So if the user wanted to filter on all-cap words, the regex looked for an "A" in that flag field. Words that didn't have the "A" for all-cap property were not displayed. It was simple and clean.

The other way to filter is to code a custom version of the filterAcceptsRow() method. It receives a row index and does whatever tests it wants—presumably calling the data() method for column values to examine—and returns True or False to show whether the table should display the row or not. I used this for the case where the user wants to display the Similar, or First- or Second-harmonic word sets. The user right-clicks on a word and selects, say, First-harmonic from the context menu. My code runs through the database (directly, not going through the overhead of the model data() method) and puts every first-harmonic word in a Python set. (In version 1 I used a Python Levenshtein module for this. In V2 I am taking advantage of the fact that the superb regex package has a generalized version of 1st and 2nd harmonic matching built-in, so the test is done in C code.) Anyway, the set of selected words is saved. The QSFPM filterRegExp business is cleared out and filterAcceptsRow() takes over. It tests the word in any row for membership in the current set of words. So only the similar, or 1st- or 2nd-harmonic set of words is displayed.
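For illustration, here is one way the first-harmonic test could be expressed in pure Python. This is a hypothetical stand-in for the Levenshtein module or the regex package's C-coded fuzzy matching, taking "first-harmonic" to mean words at edit distance exactly 1 from the clicked word:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def first_harmonic(target, words):
    # The set of words at edit distance exactly 1 from the target.
    return {w for w in words if edit_distance(target, w) == 1}

assert first_harmonic('cat', ['cat', 'bat', 'cart', 'dog']) == {'bat', 'cart'}
```

The resulting set plays the role of the saved word set that filterAcceptsRow() (or, later, a filter_func) tests membership against.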

I sat and stared quite a while at the table sort test code, slowly realizing that both kinds of filter would have to be somehow integrated into get_column_vector(). I had been supposing that the caller (that's the sort() method of the model) would pass in an executable, a reference to an equivalent to filterAcceptsRow(). Then the central loop of get_column_vector() would look something like this:

            # Sweep through the actual database collecting keys and relating them
            # to their index position.
            for j in range( len(DATABASE) ) :
                if filter_func( VALUES_BY_ROW[j] ) :
                    k = key_getter( j )
                    sort_dict[ k ] = j

So that only the filtered rows would be entered into the sort vector. And something like this will still likely happen. However, I was realizing that such a filter_func might be quite complex. More time is being piled onto the central loop of a time-critical function. And the model would have to have quite a selection of filter_funcs and choose the correct one. It was looking a bit messy, always a bad sign.

But there's a more serious problem. The table-sorting I've devised so far relies heavily on caching those sort-vectors. It's time-consuming to build one, a solid 1-second delay. So if you first click on a column head there's that delay. If you click a second time to invert the order, there's only a 0.1 second delay while the reverse vector is built from the cached sort-up vector. From then on clicking on that column reverses the sort in a few milliseconds because it is just grabbing the cached vector.

But filtering invalidates the cache! If I come into get_column_vector() with a filter_func parameter, it can only use the cached vector if that vector was made with the identical filter_func.

Say the table is displaying all rows, sorted up (the typical and default condition). Now the user wants to see the all-cap words. So we make a new vector with just those rows. Then the user says, wait, I really want to see the mixed-case rows, and selects that filter. We must not re-use the prior vector made with a different filter_func. We have to build a new one.

But filtering is only temporary; the user soon wants to go back to the full table. If we discard the previous, all-rows vector, we'll incur a big delay rebuilding it. So I don't want to lose any of the non-filtered vectors that may have been cached. So now I need to treat these sort vectors separately. The cached all-row vectors are saved, but they can be superseded by a single, recent, filter-based vector. We will not attempt to cache filtered vectors: whenever the user chooses a filter, a new filtered vector is built. When the user goes back to the "All" filter, a previously cached all-rows vector can be used.
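A minimal sketch of that caching policy, with hypothetical names (the real code keeps per-column vectors in module globals rather than a class):

```python
class SortVectorCache:
    # Sketch: all-rows vectors are cached per (column, order);
    # filtered vectors are always rebuilt and never cached.
    def __init__(self):
        self.unfiltered = {}   # (col, order) -> cached vector
        self.builds = 0        # count of rebuilds, for demonstration only

    def get(self, col, order, filter_func, build):
        if filter_func is not None:   # a filter invalidates any cached vector
            self.builds += 1
            return build(col, order, filter_func)
        key = (col, order)
        if key not in self.unfiltered:
            self.builds += 1
            self.unfiltered[key] = build(col, order, None)
        return self.unfiltered[key]

# Demonstration with a trivial builder over a tiny fake database:
rows = [('b', 2), ('a', 1), ('c', 3)]
def build(col, order, f):
    idx = [j for j in range(len(rows)) if f is None or f(*rows[j])]
    return sorted(idx, key=lambda j: rows[j][col], reverse=(order == 'down'))

cache = SortVectorCache()
cache.get(0, 'up', None, build)                  # built and cached
cache.get(0, 'up', None, build)                  # served from cache
cache.get(0, 'up', lambda w, c: c > 1, build)    # filtered: rebuilt
cache.get(0, 'up', None, build)                  # all-rows cache still valid
assert cache.builds == 2
```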

Oh, here's a nasty thought. The table is sorted on column 0 descending (say). The user requests a filter. The table is still sorted on column 0 descending, but now displays fewer rows. While it is still displaying a filtered set, the user clicks on a different column. So we have to build a new, filtered vector on a different column or different sort order.

No, that's OK. The mechanism of applying a filter is this: One, the user triggers a UI element requesting a filter. That can be the combo-box listing the property filters, or it can be a context menu selection. Two, the code that handles that event has the job of setting the current filter function, and then Three, it calls the sort() method. So filtering is always a sort, using the current sort column, the current sort order, and the latest filter-func. And if the user clicks on a different column while filtering, that just means another call to sort() with a different sort column, and/or a different order, but the filter-func is still there.

What does a filter-func look like? Probably they can be lambdas. Supposing there is an array of compiled regexes, one for each property class, the property funcs are like lambda (w, c, f) : self.property_regex[1].match(f). And if the set of harmonic words is stored in self.filter_set, that filter is lambda (w, c, f) : w in self.filter_set. (That parenthesized-tuple parameter, as I would discover the next day, is not valid Python 3.)

The Stanford women won this afternoon, a very exciting come-from-behind win, so we will follow them to the next NCAA round, meaning we will be spending a long weekend in Oklahoma City this weekend. That will slow down coding a bit, perhaps. I'll try to get some of the above implemented tomorrow, anyway.

Implementing locale-aware sorting

So first thing I set out to implement the locale-aware sort in my test program. And right away realized that some changes were needed.

The nut of the matter is to use a key-generator function while building the SortedDict, so as to ensure that the items are stored in the proper order. However, doing that in the base database turned out not to be possible.

The basic database is built on a primary key (a concept I know well from my days documenting the SQL at Informix) and a row of data. This is implemented as a SortedDict. In the case of the Words panel the primary key is a word-token (it may include hyphens and/or apostrophes), and the data are its count and its properties.

It's all very nice knowing that the SortedDict keeps the word tokens in a default ascending sorted order, but what if that order is not the order the View wants to display? I was using QSFPM to solve this, but now I have to address the question directly. What if the View (representing the user) wants the data sorted ignoring case? Or locale-aware in (for example) the de_DE locale? I don't want to have to propagate that knowledge down to where the database is built.

In my prior trial, I had assumed that the default sort imposed by SortedDict was the correct basic ascending-order sort. So I used a simple range(len(database)) as the default sort vector. But that assumption is bad. The default database order can't be trusted to be what the default View wants to display.

For non-default orders—descending order, other column order—I interposed a sort vector. Now I realized I need to interpose a sort vector even for the simple primary key ascending sort.

Well, that turned out not to be too difficult. I required that the sort() method of the model must be called when the database is refreshed. Not a problem, that was already happening. And, I initialized my collection of sort order vectors, which previously started out with the ascending-column-0 sort in place, to all-nulls. And finally, I made the magic get_column_vector() method take a key-function argument.

I had to modify this function considerably. Before, it never had to build a sort vector for column 0; that was assumed to be made when the database was initialized. Now it had to do that. I won't post the whole program until the filtering is done, but here's how that one function looks now (it will change yet more):

def get_column_vector( col, order, key_func=None ):
    global DATABASE, VALUES_BY_ROW, SORT_UP_DICTS, SORT_UP_VECTORS, SORT_DOWN_VECTORS
    vector = SORT_UP_VECTORS[ col ] if order == Qt.AscendingOrder else SORT_DOWN_VECTORS[ col ]
    if vector is None : # we need to make one
        if SORT_UP_VECTORS[ col ] is None : # we have no up-vector
            # Create a key_getter function to extract the key for this column
            if col == 0 :
                # key for column 0 is just the primary key string
                key_getter = lambda j : VALUES_BY_ROW[j][0]
            elif col == 1 :
                # key for column 1 is nnnnnPRIMARYKEY
                key_getter = lambda j : '{:05}{}'.format( VALUES_BY_ROW[j][1], VALUES_BY_ROW[j][0] )
            else : # col == 2
                # key for column 2 is XXXXXXXXPRIMARYKEY
                key_getter = lambda j : '{}{}'.format( VALUES_BY_ROW[j][2], VALUES_BY_ROW[j][0] )
            # Create the sorteddict, it is happy with key_func of None
            sort_dict = SortedDict( key_func )
            # Sweep through the actual database collecting keys and relating them
            # to their index position.
            for j in range( len(DATABASE) ) :
                k = key_getter( j )
                sort_dict[ k ] = j
            # Store the dict for reference.
            SORT_UP_DICTS[ col ] = sort_dict
            # Cache the values-view of it for use by the view
            SORT_UP_VECTORS[ col ] = sort_dict.values()
        # We have a sort-up vector, or was it sort-down we need?
        vector = SORT_UP_VECTORS[ col ] # be optimistic
        if order == Qt.DescendingOrder : # ok, make the reverse
            SORT_DOWN_VECTORS[ col ] = [j for j in reversed( SORT_UP_VECTORS[ col ] ) ]
            vector = SORT_DOWN_VECTORS [ col ]
    return vector

This does display the random keys of the fake database in proper order: all lowercase variants of a letter together, all uppercase variants of the same letter together, or all mixed together when case is ignored. I'm wondering: now that all the ordering is imposed by vectors, is there any need to make the bottom-level database a SortedDict at all? Maybe it can be a normal Python dict.

With the extra code, sorting now takes around 1 second for a 10,000 row database. That's still a tenth the QSFPM time, and when I actually implement it in PPQT I think I can make it faster.

Now on to filtering. The above function will get yet another parameter!

Sunday, March 22, 2015

Getting the sort right

I mentioned a couple of posts back how the natsort package wasn't behaving correctly for me. I opened an issue on the github page and got an immediate and helpful reply from the author. He underscored what he had already noted in his doc page: locale support in both Mac OS X and BSD seems to be broken, and a simple fix is to install the ICU package (International Components for Unicode) and its Python binding, PyICU. natsort attempts to import PyICU and, if that succeeds, uses it; otherwise it falls back to the native locale support.

I was reluctant to do this because I didn't want to have to support Yet Another Goddam Third Party Module (YAGTPM, which when pronounced, conveys my feelings) on all the intended platforms for PPQT. But in fact, it appears it is only needed on OSX. (Hopefully Linux and Windows will Just Work.) So I went ahead and did it. Here's the sequence of operations one needs, if you have HomeBrew installed, slightly expanded from the natsort author's suggestions:

brew install icu4c
CFLAGS=-I/usr/local/opt/icu4c/include
export CFLAGS
LDFLAGS=-L/usr/local/opt/icu4c/lib
export LDFLAGS
pip install pyicu

The CFLAGS and LDFLAGS are to tell pip where HomeBrew leaves things.

After doing this, I tested natsort and got the desired results,

words = ['apple', 'åpple', 'Apple', 'Äpple', 'Epple', 'Èpple', 'épple', 'epple']
key_func_L = natsort.natsort_keygen( alg = natsort.ns.LOCALE )
print( sorted( words, key=key_func_L ) )

['apple', 'Apple', 'åpple', 'Äpple', 'epple', 'Epple', 'épple', 'Èpple']

key_func_L = natsort.natsort_keygen( alg = (natsort.ns.LOCALE | natsort.ns.IGNORECASE ) )
print( sorted( words, key=key_func_L ) )

['apple', 'Apple', 'åpple', 'Äpple', 'Epple', 'epple', 'épple', 'Èpple']

This is what Qt's built-in table sort does. So that's solved.

Tomorrow I am going to do several things. One, I am going to modify my table sorting test platform to use natsort as the key generator in the sorted dicts it creates. That should produce proper sorting in the table. (That might be a bit hard to see given the nature of the generated "words" but it should be visible.)

Two, I am going to add filtering to the test platform in the manner of the Word table of PPQT and make sure that my plan for doing that works.

Three, if time permits, I am going to build a better test case to prove the problem with QSFPM. I believe I know how to time exactly and only the sort operations, so it will not be necessary to display the time used for endResetModel. The idea of resetting the table model seemed to perplex the responders on the Qt forum, so I'd like to eliminate it as a consideration and just show drastically long sort times.

If I can do that in a fairly clean test case, I will file a bug on QSFPM. I've not had good results filing bugs when the demonstration case is PyQt code. But maybe it will work. But I won't spend more than a couple of hours on that.

Tuesday I hope I will be able to put what I've learned from this exercise back into PPQT, first in the loupeview table and then in the Words table.

Friday, March 20, 2015

Kvetch blog

Do I come across very negative? I don't want to be the grumpy old fart. But really, other people ought to pick up their game a bit, you know?

Take Apple's Numbers. I used Numbers to make a little old spreadsheet composed of some numbers from running my Qt table sort test. I wanted to confirm my suspicion that it was displaying a slow-down that was exponential in N. So I took some numbers and I put them in the table and then I selected the two columns of data:

100 0.336
200 0.356
400 1.366
800 1.614
1600 5.596
3200 7.129
4800 17.389
6400 29.368
7200 38.460
8000 48.738
8800 19.603
9400 22.177
10000 24.371
15000 50.324

And selected Insert > Chart > 2D Line. It seems falling-off-a-log obvious to me that I want the first column to be the X-values and the second column the Y. Go out 100 on the X-axis, go up 0.336, that's point one. Go to 200, up to 0.356, point two. No?

Well, no, not according to Numbers, nor to Open Office when I tried that. In both cases the default chart treats each column as an independent data series to be plotted on the Y-axis against the integers 1-14 on the X. And I spent a gorram half hour trying to figure out how to tell the stupid software to do what I wanted. At the end of which I didn't have a chart, and felt a lot like a grumpy old fart.

Well, take a look at that data. It's neither linear nor a simple exponential. There's a big spike around 8000 rows, then 8800 is nearly equal to 4800, and a new rise starts. This probably tells something very interesting about the internals of QSortFilterProxyModel.

I spent a couple of hours figuring out how to get a look at the Qt source code, and when I found it, reading qsortfilterproxymodel.cpp. I didn't come away any wiser, really. I see where it does the actual sorting: it calls std::stable_sort, the C++ standard library's stable sort. So it is presumably not something silly like an insertion sort.

What I did learn was that QSFPM implements everything an abstract item model does. It's a proxy for all the features a model might provide like showing and hiding columns and rows, and implementing row insertion and deletion and so on. So it is quite complicated and has to be very generalized. It was a faint hope that I'd be able to see any problem with it—you know, like a comment /* TODO this sort blows up over 1000 rows */ or something like that—and of course I didn't see any such thing.

I have figured out how, in concept, to integrate filtering with my sorting scheme. The magic function of the real database model, get_column_vector( col, order ), will take a third parameter. It can be None, indicating "all rows", or a compiled regex to be applied against the third column. So the sort vector returned will not always have as many rows as the database offers. This seems to be alright with the Qt table view. I verified that it is calling the rowCount() method very frequently. So when the sort() routine emits the layoutChanged signal, the view will call rowCount(), which will return not a fixed global count but len(self.sort_vector). It should all work out nicely.

There, "nicely". I'm not a grumpy old fart all the time.

Thursday, March 19, 2015

Locally frustrated

A good part of the coding part of the day was spent fixing a bad pull request I'd offered to PyInstaller. And in the course of testing that, I discovered another issue. Since the last time I tried PyInstaller on PPQT2, I've added the helpview module, which calls on QWebView, which lives in QtWebKitWidgets. Only thing, the Python3 branch of PyInstaller didn't know about QtWebKitWidgets, so I had to create a "hook" for it. Then it ran, making a working app. So that's good, but it chewed up some time.

Then I turned to the issue of making SortedDict sort its keys in a locale-aware fashion. Here I've run into a problem, or at least a difference between native Python and Qt, where for once Qt looks smarter.

Recently I added self.setSortLocaleAware(True) to the sort filter proxy that is currently the sorting mechanism in PPQT2. And immediately it began to sort "correctly" in that the following sequence holds:

apple Apple ápple åpple Äpple bapple

In other words, the accented forms of a sort immediately after the un-accented a.

I've spent the last couple of hours trying to get Python sorted() to emulate that, without success. The default of course is not what's wanted.

sorted( ['apple','Apple','ápple','åpple','Äpple','bapple'] )
['Apple', 'apple', 'bapple', 'Äpple', 'ápple', 'åpple']

It's basically the numerical values of the code-points. Localization is supposedly supported by the locale module, in particular locale.strxfrm(), which "transforms a string to one that can be used in locale-aware comparisons." It can be given to sorted() as a key-function.

sorted( ['apple','Apple','ápple','åpple','Äpple','bapple'], key=locale.strxfrm)
['Apple', 'apple', 'bapple', 'Äpple', 'ápple', 'åpple']

No change. Shall we try it in French?

locale.setlocale(locale.LC_ALL,'fr_FR.UTF-8')
'fr_FR.UTF-8'
sorted( ['apple','Apple','ápple','åpple','Äpple','bapple'], key=locale.strxfrm)
['Apple', 'apple', 'bapple', 'Äpple', 'ápple', 'åpple']

And no change. So it doesn't seem to do anything.

The natsort package aims to support "natural" sorting, including grouping of characters. It also can mint a key function to give to sorted(). Let's see.

ns_key = natsort.natsort_keygen( alg = natsort.ns.LOCALE )
sorted( ['apple','Apple','ápple','åpple','Äpple','bapple'], key = ns_key )
['Apple', 'apple', 'bapple', 'Äpple', 'ápple', 'åpple']
ns_key = natsort.natsort_keygen( alg = natsort.ns.GROUPLETTERS )
sorted( ['apple','Apple','ápple','åpple','Äpple','bapple'], key = ns_key )
['Apple', 'apple', 'bapple', 'ápple', 'Äpple', 'åpple']

Interesting. The "group letters" option changed the order of the accented letters although not in any obviously useful way. The accented forms still sort after the b.

NatSort is supposed to support ignore-case, and this at least is a function I need. I totally forgot, in my pleasure at working out how to do my own table sorting, that I need to allow for re-sorting when the user changes the Respect Case switch. So let's see if NatSort handles that at least.

ns_key = natsort.natsort_keygen( alg = natsort.ns.IGNORECASE )
sorted( ['apple','Apple','ápple','åpple','Äpple','bapple'], key = ns_key )
['apple', 'Apple', 'bapple', 'ápple', 'Äpple', 'åpple']

Yes and no; at least it now sorts a ahead of A, but it is ignoring the Unicode definition which should surely put Ä next to A; they are both "LATIN CAPITAL LETTER" types. Adding the locale-aware flag doesn't help, either:

ns_key = natsort.natsort_keygen( alg = (natsort.ns.IGNORECASE | natsort.ns.LOCALE) )
sorted( ['apple','Apple','ápple','åpple','Äpple','bapple'], key = ns_key )
['apple', 'Apple', 'bapple', 'ápple', 'Äpple', 'åpple']

This is all with the locale still set to 'fr_FR', by the way.

There is a note in the natsort documentation to the effect that Python locale support is broken on some platforms, notably Mac OS X, and to install the IBM ICU package. I am not going to do this for two reasons. One, it is a huge package with lots of C++ code, so it would have to be built separately on every platform and included in a distribution package. And two, Qt is doing it right without any help.

However, I'm not sure what to do next to make my hand-made table sort to work.

Tuesday, March 17, 2015

Implementing sort() in a table model

Done, and well done. I believe I will start with results and then go over the key bits of the code. Now that I control the sort() method I can time it, so I added a label at the top of the window to display the time spent in that function. Here is the window after "refreshing" (loading) the fake database, clicking in the first column head to make it sort descending, and then refreshing again.

Using sortFilterProxy in this sequence resulted in a model-reset time in the tens of seconds and nearly a million calls to the data() method. Here it is 14 milliseconds. Of that, 3 milliseconds are spent in the sort (top right). This 3ms penalty to do the first sort-descending on column 0 is caused, as we will see, by building a list of 10000 integers.

Once that penalty has been paid, however, re-sorting on the same column takes negligible time. Repeated clicks on the head of column 0 come up with sort times around 90-150μs.

There is a larger penalty to be paid when one sorts on another column. Clicking once on the head of the second column takes a hefty 0.4 second:

Clicking again to sort that column descending, gets a lesser penalty of 15ms.

But from then on, repeated clicking on either column happens in minimal time.

The story is similar with the third column: half a second overhead for the first sort, 15-20ms for the descending sort, then tiny delays. Resetting the database causes all these sort penalties to be incurred again, of course. But the situation is just immensely, massively better than before.

Let's look at the code which can be seen in Pastebin. It is basically the original test program with some extra code, and the names of the key globals made long and explicit. The first thing to look at is the initialization of the table model.

class Model( QAbstractTableModel ):
    tableSorted = pyqtSignal()
    def __init__ ( self, parent=None ) :
        super().__init__( parent )
        clear_the_db()
        self.access_counts = [0, 0, 0]
        self.sort_order = Qt.AscendingOrder
        self.sort_column = 0
        self.sort_vector = []
        self.sort_time = 0.0 # TIMING

It has acquired a signal, which is emitted at the end of any sort so the main window can update the sort-time label. Three members have been added: sort_order and sort_column retain the column and order last used. Following a Refresh, that sort is applied again. The sort_vector member is a reference to a list of integers used in the data() method to transform the row number. Let's look at data().

    def data(self, index, role ) :
        global VALUES_BY_ROW
        if role != Qt.DisplayRole :
            return None
        row = index.row()
        col = index.column()
        self.access_counts[col] += 1
        sort_row = self.sort_vector[ row ]
        return VALUES_BY_ROW[ sort_row ][ col ]

The only change from before is that it translates the requested row number through the sort_vector before using it. That's it for sorting, finished, let's go for coffee?
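The indirection is easy to see in isolation, using tiny fake data and hypothetical names:

```python
# data() answers the view's request for display row r with database row
# sort_vector[r]; the database itself is never reordered.
VALUES_BY_ROW = [('pear', 7), ('apple', 3), ('fig', 5)]
sort_vector = [1, 2, 0]        # ascending order by column 0

def data(row, col):
    return VALUES_BY_ROW[sort_vector[row]][col]

assert data(0, 0) == 'apple'   # first displayed row
assert data(2, 1) == 7         # last displayed row's count
```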

OK, not quite, there is the sort() method to read, also.

    def sort( self, col, order ) :
        t0 = time.process_time() # TIMING
        self.sort_column = col
        self.sort_order = order
        self.sort_vector = get_column_vector( col, order )
        self.layoutChanged.emit()
        self.sort_time = time.process_time() - t0    # TIMING
        self.tableSorted.emit()    # TIMING

It saves the column and order for use after a refresh. Then it calls to some mysterious get_column_vector() routine to get a translation vector to put in the sort_vector member for data() to use.

Also it emits the layoutChanged signal. This tells the table view it needs to repopulate itself. Yesterday I said I presumed that the table view would know, if it had to call sort(), that it should repopulate. Wrong. It's not that smart. So after clicking the column head, nothing would change until one scrolled a little, then it would repaint correctly. The key turned out to be this signal, which makes it repaint instantly.

Anyway, that's it for table sorting, finished! Anyone for beer?

OK, no, there is the question of how the sort vectors are created. These are the operations that are taking up to half a second when one first sorts on a column other than the primary-key column.

There is a whole lot of commentary in the full source file linked above so I won't repeat it all. Rebuilding the database is exactly as before: a SortedDict is loaded with 10000 rows of fake data in 3 columns. A valuesView on it is placed in VALUES_BY_ROW for use from the data() method. The main addition is a mechanism of sort vectors.

SORT_UP_VECTORS = [ None, None, None ]
SORT_DOWN_VECTORS = [ None, None, None ]
SORT_UP_DICTS = [ None, None, None ] # only 1 & 2 used

def get_column_vector( col, order ):
    global DATABASE, VALUES_BY_ROW, SORT_UP_VECTORS, SORT_DOWN_VECTORS
    vector = SORT_UP_VECTORS[ col ] if order == Qt.AscendingOrder else SORT_DOWN_VECTORS[ col ]
    if vector is None : # we need to make one
        if SORT_UP_VECTORS[ col ] is None : # we have no up-vector
            # build a format string for col 1 int:str or col 2 str:str
            key_format = '{:05}{}' if col == 1 else '{}{}'
            sort_dict = SortedDict()
            for j in range( len(DATABASE) ) :
                k = key_format.format( VALUES_BY_ROW[j][col], VALUES_BY_ROW[j][0] )
                sort_dict[k] = j
            SORT_UP_DICTS[ col ] = sort_dict # save the dict itself
            SORT_UP_VECTORS[ col ] = sort_dict.values()
        vector = SORT_UP_VECTORS[ col ] # be optimistic
        # We have a sort-up vector, or was it sort-down we need?
        if order == Qt.DescendingOrder : # ok, make the reverse
            SORT_DOWN_VECTORS[ col ] = [j for j in reversed( SORT_UP_VECTORS[ col ] ) ]
            vector = SORT_DOWN_VECTORS[ col ]
    return vector

This is where the magic happens, and I think it's rather clever. The caller wants a vector of index numbers that will return the rows of VALUES_BY_ROW in some sequence. There are six possible sequences (three columns, two orders). Initially only the up-vector for column 0 exists; the others are None. As they are requested by the user, they are built and cached (I believe "memoized" is the trendy term?) in SORT_UP_VECTORS and SORT_DOWN_VECTORS. On first entry we pick up the up or down vector requested. If it is not None, we're finished, return it (sort time 150μs).

The sort-down vectors are made from the sort-up vectors. So the next test, regardless of the requested order, is to see if we have a sort-up vector for this column. If not, we have to build it.

A sort-up vector is a list of indices to VALUES_BY_ROW. To make that list we build a new SortedDict. It has as many items as the database. Each key is a string built from the value of that row of the column being sorted. The primary key (the "word" from column 0) is appended so the key is unique.

The value of each item is the index to that row. The SortedDict automatically puts all the keys in sequence. So if we read out the values of the SortedDict in sequence, they comprise a sort vector that selects the rows of the database in sequence by the sorted column.

A valuesView is exactly that, a way to read out values in key sequence. Because a valuesView is indexable, it behaves exactly like a list (strictly speaking, more like a real big tuple) of integers. It is this valuesView that is returned as a sort vector for an ascending sort.

If the sort is descending, a little more is needed. A descending sort is simply the ascending-sort vector reversed. The Python reversed() built-in returns an iterator; we use it to generate the full list of 10000 sort indexes in backward order, and return that.
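For anyone without SortedContainers at hand, the key-building trick can be demonstrated with the standard library alone. Here sorted() plays the role of the SortedDict's ordering, and the three rows are invented for the demo.

```python
# Three fake rows in the shape (word, count, flags); the key format appends
# the unique word so equal counts still sort deterministically.
VALUES_BY_ROW = [('apple', 3, 'X..'), ('pear', 12, '.X.'), ('fig', 3, '..X')]

def make_up_vector(col):
    # column 1 is an int, zero-padded so it sorts correctly as a string
    key_format = '{:05}{}' if col == 1 else '{}{}'
    key_to_row = {key_format.format(VALUES_BY_ROW[j][col], VALUES_BY_ROW[j][0]): j
                  for j in range(len(VALUES_BY_ROW))}
    return [key_to_row[k] for k in sorted(key_to_row)]

up = make_up_vector(1)        # rows in ascending count order
down = list(reversed(up))     # the descending vector is just the reverse
```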

Revisiting the examples at the top of this post, the first example shows 0.003 seconds to sort column 0 descending. That's the time to run [j for j in reversed(SORT_UP_VECTORS[col])] to create the reversed sort for that column. Then repeated clicks on column 0 just swap the references to the two lists.

The first click on column 1 incurs the 0.4 second penalty. That's the time needed to convert 10000 integers to strings, concatenate them to their "word" keys, and build a SortedDict from them. The second click on that column costs 155ms; that's the time to read through that dict's values and build the reversed list.

OK, that's it for sorting. I'm quite pleased with this result, but a number of issues remain. Not least is the job of taking these concepts back into the existing app and applying them there. The same ideas will work but the code may get messy. Edit: another issue is that this sort is not "locale-aware"; á will sort far away from a. Need to find out how to make SortedDict do locale-aware sorting!

Another important issue is that of filtering. I need to add code to this test-bed to filter the table on the values in column 2. In the existing app, I do that using the filterAcceptsRow() method of the sortFilterProxy. I really don't know how I'll accomplish it when the proxy is gone. That will be my Thursday job. See you then.

Sorting the table by my ownself

In the prior post, I said that what I expected a Qt sort routine to do was "...fetch all the key values into an array; sort the array using any available standard library routine; save a vector of indexes internally..." That notion was based on what I learned about sorting a long time ago. I learned about sorting when I was working on APL\360 in the early 1970s. APL has a built-in operator called grade up. In my day it only applied to vector types (Python: list types); in its modern incarnation it has been extended to handle arrays as well. Anyway, the result of applying grade-up to a vector of orderable values was a vector of integers, the origin-0 indexes that would sort the input.

      L ← 2 7 1 8 2 8
      ⍋ L
2 0 4 1 3 5
      L[ ⍋ L ]
1 2 2 7 8 8

As a side-note, APL programmers did lots of tricks with grade-up. For example, applying it to its own output produces a list of the rank orders of the input.

      L ← 2 7 1 8 2 8
      ⍋ L
2 0 4 1 3 5
      ⍋ ⍋ L
1 3 0 4 2 5

That is, the first element of L is the 2nd-lowest (1), the second element is 4th-lowest (3), the third element is the lowest, and so on. But I digress.
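For the Python-minded, both tricks translate directly; this is a minimal equivalent of the APL examples above.

```python
def grade_up(v):
    # indices that would sort v ascending; Python's stable sort keeps
    # tied elements in original order, matching the APL results above
    return sorted(range(len(v)), key=v.__getitem__)

L = [2, 7, 1, 8, 2, 8]
gu = grade_up(L)                 # [2, 0, 4, 1, 3, 5]
sorted_L = [L[i] for i in gu]    # [1, 2, 2, 7, 8, 8]
ranks = grade_up(gu)             # [1, 3, 0, 4, 2, 5] -- grade-up of its own output
```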

Since the QSortFilterProxyModel sort seems to be broken, maybe I should take the advice I got from "Andre" on the Qt forum, and implement sort myself. One reason I have been reluctant to do this is that I also use QSortFilterProxyModel to filter the table. If I toss it and do my own sort, I will have to find a way to do filtering myself.

The QAbstractTableModel inherits a sort() method which in the default form does nothing. It receives two arguments, the column number to sort and the sort direction, Qt.AscendingOrder or Qt.DescendingOrder. The docs don't say so but I am going to assume that the QTableView is smart enough to know that if it has to call model.sort() (because the user clicked a column head), there may be some change in the order of the rows, and it needs to repopulate the visible rows after the sort call.(edit: this turns out to be wrong.)

Sorting has to start with the model's data() method. When the table has been sorted, the data() method needs to access the database through an indirection vector. In APL terms, instead of returning values from L[row], it needs to return values from L[ (⍋ L)[row] ]. But how can data() know that a sort is in effect?

Well, why does it need to know? Let's make data() always go through an indirection. When the table is in the default sort order, which is the natural order of the valuesView, the indirection can be a simple range(DBS), a vector of integers from 0 to max. It will add a little overhead but otherwise not change the operation.

When sort() is called, it can swap in a different indirection vector. For example, if the sort is descending on the first column the indirection vector can be reversed(range(DBS)). It will be a little more complicated arranging for indirection vectors based on sorts of the other two columns, but the SortedDict type will be a help.
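The concept in miniature, without any Qt at all; the names here are invented for the sketch.

```python
DBS = 6
rows = ['row%d' % j for j in range(DBS)]      # stands in for the database
sort_vector = list(range(DBS))                # default: natural order

def data(row):
    # the model always indexes through the vector, sorted or not
    return rows[sort_vector[row]]

unsorted_first = data(0)                      # natural order: first row
sort_vector = list(reversed(range(DBS)))      # "sort" descending: swap the vector
sorted_first = data(0)                        # now the last row comes first
```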

I'm posting this on Tuesday morning before I've written more than a few lines of the code to implement this concept. I can hardly wait to find out how (or if?) it is going to work. With any luck, next post will display some sorting code.

Monday, March 16, 2015

Testing table with QSortFilterProxy

With the test framework set up as described in the previous post, it is quite simple to implement real sorting in a table. The modified code is here in Pastebin. It is almost identical to the basic code.

The process of adding a proxy is described in the doc page for QSortFilterProxyModel. The proxy model object is given to the table view as its model. The proxy is given the actual table model as its source model. Here is the added code to do the job. First define the proxy class.

from PyQt5.QtCore import Qt, QSortFilterProxyModel
class Proxy( QSortFilterProxyModel ) :
    def __init__ ( self, parent=None ) :
        super().__init__( parent )
        self.setSortCaseSensitivity( Qt.CaseSensitive )
        self.setSortLocaleAware( True )

In the setup code in the main window, interpose it between the model and the view.

        self.table_view = View( parent=self )
        self.table_model = Model( parent=self )
        self.table_proxy = Proxy( parent=self )
        self.table_proxy.setSourceModel( self.table_model )
        self.table_view.setModel( self.table_proxy )

That's all! Now clicking column heads makes actual sorting happen.

When you run this program it looks just like the basic version. And if you do not click a column head, it performs just like the basic one as you can see in this capture.

Just as with the unsorted table, the view is only fetching the rows it needs to fill the window. The time to complete a refresh is a few milliseconds longer than the database rebuild.

Things change drastically if one simply clicks once on the head of column 1, to request a change of sort order. First, there is an agonizing pause of 9 seconds. The Mac OS "beachball" cursor is on through most of this. Then the data displays in the new sort order.

(I still do not know a good way to insert instrumentation into the process of a column sort. The QHeaderView emits a signal when the sort order changes, but not when the sort ends. Perhaps I could override the sort() method of the proxy? But it is not necessary.)

From now on, with sorting in effect, clicking Refresh causes a very different reaction than before.

I think we have a result here.

If the sort is moved to column 2 (by clicking once in the column 2 header), there is again a 9-second delay. Then clicking Refresh gets similar but not quite the same results:

The number of data() calls is lower, possibly reflecting the presence of a significant number of duplicate values in column 2. The elapsed time is proportional to the fewer calls. But the numbers are still extraordinarily big.

To me there is no way to account for the huge number of calls to the model data() function. A sane implementation would call 10,000 times to fetch all the key values into an array; sort the array using any available standard library routine; save a vector of indexes internally; and make a further 12 calls to fill up the display. Total, 10012 calls to the sorted column and 12 to the others.
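That expected call count is easy to demonstrate in miniature. Here fetch() stands in for one model data() call (with invented key values), and the counter shows the sort itself never touches the "model" again after the initial pass.

```python
calls = 0
fake_column = [(i * 37) % 101 for i in range(1000)]   # invented key values

def fetch(row):
    # stands in for one model.data() call
    global calls
    calls += 1
    return fake_column[row]

keys = [fetch(r) for r in range(len(fake_column))]    # exactly N fetches
vector = sorted(range(len(keys)), key=keys.__getitem__)  # no further fetches
```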

That is clearly not what the proxy code is doing. A standard sort should make O(N*log2(N)) comparisons; in this case, about 133,000 comparisons. Even if we suppose that every comparison involved fetching two values from the model—which it certainly should not—that would still be only about 266,000 calls. I simply cannot imagine what the proxy is doing to require so many calls to the model.

Well, whatever it is doing, it is not doing it because I mis-coded something. This is absolutely minimal code. (Yes, I tried turning off case sensitivity and locale-aware sorting. It didn't make a difference to the sorting in column 1, the words. It might have slightly sped up sorting column 2, integers.)

In the next post, I will explore ways to do my own sorting, taking advantage of the abilities of the SortedContainers package.

Basic table testing framework

To test the table sorting mechanism in a way that was somewhat relevant to my app, I wrote a program that generates a fake database of 10,000 rows with contents that are similar to the PPQT Word table. The code for the program is stored in Pastebin. I will examine parts of it here.

The program starts with several functions to generate random data:

  • xdi() returns a list of a given number of random integers from the exponential distribution. I use the Beta value to skew the list in certain ways. For example a Beta of 0.05 and a max of 100 gives a list primarily of small integers. This emulates the distribution of word-counts in a vocabulary: many 1s and 2s and a few 50s or 100s.
  • rwt() returns a word-like token with, typically, more vowels than consonants.
  • rft() returns a token that looks something like the property flag values of the Word table.

You can pretty much ignore the code of these; they work and that's all I care about.
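For the record, here is one plausible shape for those generators. The names match the post, but the exact signatures and distributions are guesses, not the actual code.

```python
import random
import string

def xdi(beta, n, maxval):
    # n ints from an exponential distribution with rate beta, clamped to
    # 1..maxval; a small beta skews the list toward small values
    # (a guess at the post's parameterization)
    return [max(1, min(maxval, int(random.expovariate(beta)))) for _ in range(n)]

def rwt(maxlen):
    # a word-like token of 3..maxlen letters, biased toward vowels
    letters = 'aeiou' * 3 + string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(random.randint(3, maxlen)))

def rft():
    # a three-position flag token of 'X' and '.'
    return ''.join(random.choice('X.') for _ in range(3))
```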

The database is a SortedDict from the excellent SortedContainers package (that link is to its doc; here it is on PyPI). The SortedContainers package is pure Python, minimizing cross-platform issues, and has performance as good as, and sometimes better than, packages based on C code such as blist. The database is populated with this code:

    col2 = xdi( 0.05, DBS, 100 )
    clear_the_db()
    for k in col2 :
        w = rwt(8)
        f = rft()
        DBM[ w ] = ( w, k, f )
    DBV = DBM.values()

DBS contains 10000. So this makes a dict with 10,000 items, with keys of 8-char random words. The dict is sorted. The values view gives indexed access to the sequence of values. That is, DBV[j] is the jth value in key sequence.

This code, we will see later, takes about 0.7 seconds to complete on a 3.5GHz Core i5. Lotsa computation going on.

You might notice a suspicious redundancy. The primary key (in the SQL sense) is w, and it is being stored both as the dict key and again in the tuple that is the dict value. There is a reason for this.

The rest of the code is very standard, and minimal, PyQt5. There is a table model implemented in part as follows:

class Model( QAbstractTableModel ):
    def __init__ ( self, parent=None ) :
        super().__init__( parent )
        self.access_counts = [0, 0, 0]
 
    def rowCount( self, index ) :
        global DBM
        if index.isValid() : return 0
        return len(DBM)
  
    def data(self, index, role ) :
        global DBV
        if role != Qt.DisplayRole :
            return None
        row = index.row()
        col = index.column()
        self.access_counts[col] += 1
        return DBV[ row ][ col ]

The data() method counts the number of DisplayRole calls for each column. Its last line demonstrates the usefulness of the redundant storing of the key word in the value. Otherwise there would have to be (and initially was, until I thought of this trick) a test in the code:

        if col : # col is 1 or 2
            return DBV[ row ][ col-1 ]
        else : # col is 0, the key itself
            return DBK[ row ]

Return either one of two values stored as the dict value, or return the key. But storing the key redundantly in the value allows the data() method to eliminate one test. And with a multi-gigabyte laptop, who cares about a little redundancy? 10,000 extra string values? What's that, 0.00001 of a gig? Pfffffttt!

Below the table model comes the lengthy code needed to create a basic window and populate it with stuff. The only interesting bit is the slot attached to the clicked signal of the Refresh button:

    def do_refresh( self ) :
        self.table_model.clear_counts()
        self.table_model.beginResetModel()
        self.times[0] = time.process_time()
        rebuild_the_db()
        self.times[1] = time.process_time()
        self.table_model.endResetModel()
        QApplication.processEvents()
        self.times[2] = time.process_time()
        self.update_labels()

The update_labels() routine displays the counts of accesses, and the elapsed times for rebuilding the database and for completing the update of the table, in labels at the bottom of the window. Here is what the window looks like, with some typical values.

The values at the bottom of the window are typical. It takes around 0.73 of a second to recreate the database. That's not long enough to justify putting up a progress-bar, although it's close. The time to complete the model reset and redisplay the table is trivial, imperceptible to the user.

The access counts are interesting. Initially they were zero. Then I put the call to QApplication.processEvents() in, and that forced the table display code to run. But even then it only called for 12 rows of data. My first thought was, why not 10,000? Then I realized, oh, it's being clever and only getting enough data to populate the visible rows in the window. And that was proven when I dragged the window taller. Then a Refresh caused it to fetch 29 rows, exactly the number that filled the larger window. Also, it always fetches only that many rows, even when I have scrolled the table to the end. It knows what row indices it needs, and calls the model only for those.

However, this table doesn't sort. The table view initializes itself with

        self.setSortingEnabled( True )
        self.sortByColumn( 0, Qt.AscendingOrder )

That is why the table shows a sort order indicator (small triangle in the Column 1 header). One can click the column headings and the sort order indicator will move around and change as if sorting were happening. And presumably the sort() method of the table model (inherited from QAbstractItemModel) is being called, but it does nothing.

In the next post, I will show what happens when we interpose a sort filter proxy.

Sunday, March 15, 2015

Building a case against QSortFilterProxy

This is part 1 of a multi-part post on speeding up the sorting of tables in Qt.

I've been disturbed by the apparent slowness of the Words table in PPQT2. It takes 15 or more seconds to refresh the table (rebuild it from scratch), and 8-10 seconds of "spinning beach-ball" cursor to re-sort on any column.

Then this week my α-test user reported that, for a large book, the bookloupe refresh was adding at least 12 seconds over the time that the bookloupe program required when run from the command line. That's consistent with the word-table delay: about a second for every 1000 rows of the table. (Not that I think the delay is a linear function of N; it might well be exponential!)

Nothing that a word-processing app does using local data should take more than a second of elapsed time. Not on a modern computer. Maybe there are some natural-language algorithms, or neural-network or genetic problems, that might tie up a 64-bit, 2.8-gigahertz CPU with eight gigabytes of RAM for substantial time. (Parenthetically, those numbers still boggle my mind; I spent so many hours at the keyboard of a 64-kilobyte CP/M system with a 2-megahertz Z80...) A modern laptop should not spend any perceptible time sorting 10,000 rows of 3-column text. It does, and that means something's not right.

I did some initial fiddling around a couple months back, stuck timing statements around the Word table refresh operation and so on. From that I verified that it was taking my code much less than a second to sweep a large document and collect all the words in a vocabulary table. That included a few tens of thousands of calls to a fairly elaborate regex, and execution of another 30-odd lines of Python for each token it found.

Then my code told the QAbstractTableModel that it was time to endResetModel(), i.e. the data is all good now, redisplay the table. And that's when several seconds would elapse.

I didn't find any way to instrument the sort that happens when the user clicks on a column header. But I timed it manually at 8-10 seconds from click to redisplay of the inverted sort.

At the time I posted a query in the Qt forum. The answers I got were that, one, beginResetModel/endResetModel was a bad thing to do and should be avoided, and two, that sortFilterProxy is going to be slow and,

If you need quick sorting and you have access to the raw data in the model, you should just implement the sorting in your model itself. Just reimplement the sort function.

I wasn't very happy with this, because I thought I had revealed some pretty horrendous behavior from the sortFilterProxy. But I didn't have time to really nail the case for the prosecution. So I put the problem off. But now my user has shown the problem exists in the Loupe table also, I've gotta do something.

The first thing to do is to build a simple test-bed where I can experiment with the table code. I want to use that first to prove to myself that Qt's sort is actually slow, and not simply because I've done something stupid to slow it down. If that's the case, then I'll use that code to work out how to do my own sort.

In the next post of this sequence I will introduce that test-bed code.

Tuesday, March 10, 2015

Frittering time away

"Frittering" is a word I picked up in childhood. It must be horribly outdated now. I don't know the derivation; it just meant frivolously or thoughtlessly wasting something. Anyway I didn't really fritter my time away yesterday or today. In fact, yesterday on the plane ride from Seattle to San Jose I got some good work done cleaning up and editing my plan for the translation API. But then we were home from a ten-day trip: unpack, sort the accumulated mail, take care of neglected chores, and fall into bed. And today went just as fast and similarly: I got a 5K run in, that was a priority; and we bought our groceries for the week and got the car washed, and made a start at cleaning out the little green weeds that are trying to poke up through the mulch on our landscaped yard, and went out to get our monthly haircuts, and I did some research on means of making our 143GB collection of photos accessible from more than just my back-room computer. And it was time to make supper.

Tomorrow is my day at the CHM cataloging (probably) more pieces of the Whirlwind.

But come Thursday, by gum, it will be my priority to kill the issues that Bibimbop raised on the alpha of PPQT. You betcha.

Monday, March 2, 2015

Reinventing the wheel, cont.

A translator will be a module of Python code that has a straightforward job to do. It will be given a file-like thing into which it dumps its output, and a generator that produces semantic units parsed from the input file. It does for item in generator to the end, producing output into its file-like thing. I've got that API worked out pretty clearly. The translator's logic may be complex but it should not need to include any Python modules except perhaps regex, or unicodedata.
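That contract might be sketched like so. The (kind, text) event shape and the function name are invented for the illustration, not PPQT's actual API.

```python
import io

def translate(output, events):
    # output: any file-like object; events: a generator of semantic units
    for kind, text in events:
        if kind == 'head':
            output.write('<h2>' + text + '</h2>\n')
        elif kind == 'para':
            output.write('<p>' + text + '</p>\n')

buf = io.StringIO()
translate(buf, iter([('head', 'Chapter I'), ('para', 'It was a dark night.')]))
result = buf.getvalue()
```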

In particular I do not want it to need any PyQt modules. It doesn't have a UI as such.

Well, yes, but it might. In sketching out the translators I will write, especially the ASCII one, I realized that there are questions the user needs to answer. The ASCII translator (what was "Reflow" in V1) needs to know user choices such as:

  • How to treat <i> markups: delete, change to underscore, leave as-is?
  • Similar questions about <b> and <sc> markups
  • Maximum and preferred line-length

These and other user-set parameters were presented as radio buttons and similar widgets in the Reflow panel of V1. How to present them in V2, when they are input to a translator module? Any translator, I figure, may have similar questions to ask.

So my solution is to provide as basic an API for a user dialog as I can devise. When PPQT comes up, it will look for translator modules. It will use the "import machinery" to load each one into a Python namespace. This is standard stuff; PyInstaller uses exactly this mechanism to load "hook" modules. Each namespace is an object. PPQT will look in the namespace for certain global names.
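The "import machinery" step might look something like this sketch; the folder layout and function name are assumptions.

```python
import importlib.util
import pathlib

def load_translators(folder):
    # import every .py file in folder; each loaded module is a namespace
    # object PPQT can inspect for well-known global names
    namespaces = {}
    for path in sorted(pathlib.Path(folder).glob('*.py')):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        namespaces[path.stem] = module
    return namespaces
```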

One of the names a translator may define is USER_DIALOG. This is a list of dicts, each dict describing a user-input widget. For example, a translator might start out like this:

import xlate_utilities as X
max_width = {
    X.UD_TYPE: X.UD_NUMBER,
    X.UD_LABEL: "Max width",
    X.UD_TOOLTIP: "Enter the maximum line width for any text.",
    X.UD_RESULT: "75"
    }
i_markup = {
    X.UD_TYPE: X.UD_CHOICE,
    X.UD_LABEL: "Convert <i>",
    X.UD_TOOLTIP: "Choose what to do if the input contains <i>: \
omit it, convert it, or pass it on",
    X.UD_RESULT: "0",
    X.UD_CHOICES: [
        ("Omit","Just omit <i> and </i> from the output"),
        ("Use _","Replace <i> and </i> with a single underscore_"),
        ("Keep","Retain <i> and </i> in the output") ]
    }
USER_DIALOG = [ max_width, i_markup ]

This code describes a dialog with two widgets. The first is a numeric spinbox; the second is a radio set with three options.

When the user wants to initiate a translator, PPQT will look at the namespace it made by loading that translator. If there is a USER_DIALOG list, PPQT will dynamically assemble a modal dialog as a stack of the widgets described in the list. It will display that dialog, with OK and Cancel buttons at the bottom. The widgets will be labelled as shown, with tooltip strings as specified, and initialized from the X.UD_RESULT items. In the example, the spinner is initialized to 75, and the first (zeroth) radio button is checked.

If the user clicks Cancel, the translator will not be run. When OK is clicked, the values from the widgets are stored back into the X.UD_RESULT members of their respective dicts. The translator code can query them when it is executed. Since these are globals, they will still be there if the translator is run again, so the dialog will open with the user's prior selections as its initial values.
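The round trip of UD_RESULT values can be mimicked without Qt. The constant names follow the example above, but they are plain strings here so the sketch is self-contained.

```python
UD_TYPE, UD_LABEL, UD_RESULT = 'UD_TYPE', 'UD_LABEL', 'UD_RESULT'

max_width = {UD_TYPE: 'number', UD_LABEL: 'Max width', UD_RESULT: '75'}
i_markup = {UD_TYPE: 'choice', UD_LABEL: 'Convert <i>', UD_RESULT: '0'}
USER_DIALOG = [max_width, i_markup]

def store_dialog_results(dialog_list, widget_values):
    # widget_values: label -> value string, as if read from the widgets on OK;
    # writing back into the dicts is what makes the choices persist between runs
    for d in dialog_list:
        if d[UD_LABEL] in widget_values:
            d[UD_RESULT] = widget_values[d[UD_LABEL]]

store_dialog_results(USER_DIALOG, {'Max width': '68', 'Convert <i>': '1'})
```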

The reinventing the wheel part is that I'm basically reinventing QtQuickWidgets in a simplified form. I looked at QtQuick and QuickWidgets before getting into this. It's much too complicated and generalized. Anyway, I absolutely do not want to impose any requirement for Qt knowledge on the writer of a translator.

Using this API, the writer of a translator can get input as numbers, yes/no checkboxes, radio sets for multi-way choices, and literal strings. The interface is static, declarative; no executable code at all.

Sunday, March 1, 2015

Reinventing the wheel

For PPQT2, a "Translator" is, or will be, a single module of Python code. It will need to be self-contained or nearly so; the only things it can import are modules that are also imported by PPQT. Those are the modules that are bundled with PPQT for distribution. But the work it has to do is such that few if any imports are needed. Basically it is a markup-to-markup transformer: it receives the semantic elements of a book text as formatted to DP standards, and it emits those elements decorated as appropriate for its output markup style, for example, HTML. Or it might be the "fp" markup whose increasing use in the DP community is what made me come up with the whole idea of modular translators in the first place.

Well, code that takes designated elements of text in and writes marked-up text out is not a new concept. Probably the best current model is Pandoc. It consists of a large repertoire of document readers, each a module that reads some style of markup and emits semantic elements in JSON format, and an equally large repertoire of writers, each reading those elements and writing a document marked up in some style. In principle Pandoc can convert just about any of the dozens of current markup styles to any other markup style.

For some time I considered using Pandoc itself. I would write a single built-in translator that would emit the Pandoc intermediate form. The user could save that as a file and then process it through Pandoc to get any other style, including LaTeX and EPUB. Basically I would have added the "DP Formatting Guidelines style" as another input markup form to Pandoc.

I reluctantly decided against this approach for several reasons. First, I could not extract a clean, comprehensible specification of the Pandoc intermediate format from the Pan-documentation. Prof. MacFarlane seems to think that the code is self-documenting. The answer to any detailed question is, "read the code." Sorry, life is too short.

Second, Pandoc is written in Haskell. I certainly did not want to require my users to have to install Haskell, or to distribute Haskell myself as part of PPQT. It is not clear if Pandoc binary executables are available for my target platforms. (Everyone on the Pandoc mailing list seems to have Haskell installed.) Even if there are binaries, I don't really want to have to distribute another bulky binary, and try myself to train my users to use it.

And, finally, there are no Pandoc writer modules for two of my target markup styles, "fp" and "fpgen" from DP Canada. If a translator is in Python, there is some faint hope I can get another person to write a translator (fp and fpgen are both implemented in Python). But if writing a translator means writing in Haskell, it would all be down to me. I have no problem with functional programming—the first four years of my programming career I was in a group writing an APL interpreter, after all!—but the idea of learning Haskell and supporting modules written in Haskell, combined with the distribution problems just mentioned, put the lid on that idea's coffin.

So I decided that PPQT would take the general concept of Pandoc but implement it internally. Back in Version 1, I had code to parse a well-formed book and reduce it to elements. It was the front end to the "ASCII reflow" and "Automatic HTML" features, which I now recognize as translators. In V1, the translation process was embedded in the main program; and the translation was in-place: its output replaced the contents of the current book. The conceptual differences for V2 are three. First, the output of translation goes to make a new file instead of replacing the old one. This is enabled by the V2 ability to have multiple open books. After a successful translation, the user finds a new tab in the Edit panel. It contains the translated text, which can be inspected and saved, or discarded.

Second, there will be a clear, arms-length API between PPQT and the translators. PPQT hands a translator a file-like object to write into, and feeds it with document elements one at a time. At the end, the contents of the "file" (a MemoryStream instance) go to create the new document.

Third, the translators will be dynamically loaded. I know how to do this, even from a bundled application. At startup, PPQT looks in, I think, the Extras folder to find translators, and populates a File > Translate sub-menu with their names. When the user selects one, the fun begins.

This is all preamble to what I was going to write about. We can get to the real "reinventing the wheel" part tomorrow.