Tuesday, March 17, 2015

Implementing sort() in a table model

Done, and well done. I believe I will start with results and then go over the key bits of the code. Now that I control the sort() method I can time it, so I added a label at the top of the window to display the time spent in that function. Here is the window after "refreshing" (loading) the fake database, clicking in the first column head to make it sort descending, and then refreshing again.

Using sortFilterProxy in this sequence resulted in a model-reset time in the tens of seconds and nearly a million calls to the data() method. Here it is 14 milliseconds. Of that, 3 milliseconds are spent in the sort (top right). This 3ms penalty to do the first sort-descending on column 0 is caused, as we will see, by building a list of 10000 integers.

Once that penalty has been paid, however, re-sorting on the same column takes negligible time. Repeated clicks on the head of column 0 come up with sort times around 90-150μs.

There is a larger penalty to be paid when one sorts on another column. Clicking once on the head of the second column takes a hefty 0.4 second:

Clicking again to sort that column descending, gets a lesser penalty of 15ms.

But from then on, repeated clicking on either column happens in minimal time.

The story is similar with the third column: half a second overhead for the first sort, 15-20ms for the descending sort, then tiny delays. Resetting the database causes all these sort penalties to be incurred again, of course. But the situation is just immensely, massively better than before.

Let's look at the code which can be seen in Pastebin. It is basically the original test program with some extra code, and the names of the key globals made long and explicit. The first thing to look at is the initialization of the table model.

class Model( QAbstractTableModel ):
    tableSorted = pyqtSignal()
    def __init__ ( self, parent=None ) :
        super().__init__( parent )
        clear_the_db()
        self.access_counts = [0, 0, 0]
        self.sort_order = Qt.AscendingOrder
        self.sort_column = 0
        self.sort_vector = []
        self.sort_time = 0.0 # TIMING

It has acquired a signal, which is emitted at the end of any sort so the main window can update the sort-time label. Three members have been added: sort_order and sort_column retain the column and order last used. Following a Refresh, that sort is applied again. The sort_vector member is a reference to a list of integers used in the data() method to transform the row number. Let's look at data().

    def data(self, index, role ) :
        global VALUES_BY_ROW
        if role != Qt.DisplayRole :
            return None
        row = index.row()
        col = index.column()
        self.access_counts[col] += 1
        sort_row = self.sort_vector[ row ]
        return VALUES_BY_ROW[ sort_row ][ col ]

The only change from before is that it translates the requested row number through the sort_vector before using it. That's it for sorting, finished, let's go for coffee?

OK, not quite, there is the sort() method to read, also.

    def sort( self, col, order ) :
        t0 = time.process_time() # TIMING
        self.sort_column = col
        self.sort_order = order
        self.sort_vector = get_column_vector( col, order )
        self.layoutChanged.emit()
        self.sort_time = time.process_time() - t0    # TIMING
        self.tableSorted.emit()    # TIMING

It saves the column and order for use after a refresh. Then it calls to some mysterious get_column_vector() routine to get a translation vector to put in the sort_vector member for data() to use.

Also it emits the layoutChanged signal. This tells the table view it needs to repopulate itself. Yesterday I said I presumed that the table view would know, if it had to call sort(), that it should repopulate. Wrong. It's not that smart. So after clicking the column head, nothing would change until one scrolled a little, then it would repaint correctly. The key turned out to be this signal, which makes it repaint instantly.

Anyway, that's it for table sorting, finished! Anyone for beer?

OK, no, there is the question of how the sort vectors are created. These are the operations that are taking up to half a second when one first sorts on a column other than the primary-key column.

There is a whole lot of commentary in the full source file linked above so I won't repeat it all. Rebuilding the database is exactly as before: a SortedDict is loaded with 10000 rows of fake data in 3 columns. A valuesView on it is placed in VALUES_BY_ROW for use from the data() method. The main addition is a mechanism of sort vectors.

SORT_UP_VECTORS = [ None, None, None ]
SORT_DOWN_VECTORS = [ None, None, None ]
SORT_UP_DICTS = [ None, None, None ] # only 1 & 2 used

def get_column_vector( col, order ):
    global DATABASE, VALUES_BY_ROW, SORT_UP_VECTORS, SORT_DOWN_VECTORS
    vector = SORT_UP_VECTORS[ col ] if order == Qt.AscendingOrder else SORT_DOWN_VECTORS[ col ]
    if vector is None : # we need to make one
        if SORT_UP_VECTORS[ col ] is None : # we have no up-vector
            # build a format string for col 1 int:str or col 2 str:str
            key_format = '{:05}{}' if col == 1 else '{}{}'
            sort_dict = SortedDict()
            for j in range( len(DATABASE) ) :
                k = key_format.format( VALUES_BY_ROW[j][col], VALUES_BY_ROW[j][0] )
                sort_dict[k] = j
            SORT_UP_DICTS[ col ] = sort_dict # save the dict itself
            SORT_UP_VECTORS[ col ] = sort_dict.values()
        vector = SORT_UP_VECTORS[ col ] # be optimistic
        # We have a sort-up vector, or was it sort-down we need?
        if order == Qt.DescendingOrder : # ok, make the reverse
            SORT_DOWN_VECTORS[ col ] = [j for j in reversed( SORT_UP_VECTORS[ col ] ) ]
            vector = SORT_DOWN_VECTORS [ col ]
    return vector

This is where the magic happens, and I think it's rather clever. The caller wants a vector of index numbers that will return the rows of VALUES_BY_ROW in some sequence. There are six possible sequences (three columns, two orders). Initially only the up-vector for column 0 exists; the others are None. As they are requested by the user, they are built and cached (I believe "memoized" is the trendy term?) in SORT_UP_VECTORS and SORT_DOWN_VECTORS. On first entry we pick up the up or down vector requested. If it is not None, we're finished, return it (sort time 150μs).

The sort-down vectors are made from the sort-up vectors. So the next test, regardless of the requested order, is to see if we have a sort-up vector for this column. If not, we have to build it.

A sort-up vector is a list of indices to VALUES_BY_ROW. To make that list we build a new SortedDict. It has as many items as the database. Each key is a string built from the value of that row of the column being sorted. The primary key (the "word" from column 0) is appended so the key is unique.

The value of each item is the index to that row. The SortedDict automatically puts all the keys in sequence. So if we read out the values of the SortedDict in sequence, they comprise a sort vector that selects the rows of the database in sequence by the sorted column.

A valuesView is exactly that, a way to read out values in key sequence. Because a valuesView is indexable, it behaves exactly like a list (strictly speaking, more like a real big tuple) of integers. It is this valuesView that is returned as a sort vector for an ascending sort.

If the sort is descending, a little more is needed. A descending sort is simply the ascending-sort vector reversed. The Python reversed() method returns an iterator; we use it to generate the full list of 10000 sort indexes in backward order, and return that.

Revisiting the examples at the top of this post, the first example shows 0.003 seconds to sort column 0 descending. That's the time to run [j for j in reversed(SORT_UP_VECTORS[col])] to create the reversed sort for that column. Then repeated clicks on column 0 just swap the references to the two lists.

The first click on column 1 incurs the 0.4 second penalty. That's the time needed to convert 10000 integers to strings, concatenate them to their "word" keys, and build a SortedDict from them. The second click on that column costs 155ms; that's the time to read through that dict's values and build the reversed list.

OK, that's it for sorting. I'm quite pleased with this result, but a number of issues remain. Not least is the job of taking these concepts back into the existing app and applying them there. The same ideas will work but the code may get messy. Edit: another issue is that this sort is not "locale-aware"; á will sort far away from a. Need to find out how to make SortedDict do locale-aware sorting!

Another important issue is that of filtering. I need to add code to this test-bed to filter the table on the values in column 2. In the existing app, I do that using the filterAcceptsRow() method of the sortFilterProxy. I really don't know how I'll accomplish it when the proxy is gone. That will be my Thursday job. See you then.

No comments: