Tuesday, March 24, 2015

Wrapping up table sort experiments

Yesterday I speculated that perhaps the basic database didn't need to be a SortedDict, because it is always to be accessed through a sort vector. However, that is not the case. The reason is that the actual access to the database is by indexing a values-view on it. The base Python (Python 3) dict() class supports a values view; it is returned by dictname.values(). However, in this case, the view is only an iterator. It can be used to process the dict values as in for v in dictname.values() but it cannot be indexed, as in dictname.values()[j]. That is an added ability of the SortedDict class. So, even though its natural ordering will not be used, the base dict has to be a SortedDict.

I added a filtering capability to the test-bed, with a popup menu to select a filter on the flags column. The final program looks like this:

The little popup has four choices. Choice 0 is "All", no filtering. The others request an X in a column of the flag value, for instance '.X.' selects only rows where the flag field has an X in the second position. Selecting one of these triggers a signal into this slot:

    filters = [
        None,
        lambda w, c, f: f[0]=='X',
        lambda w, c, f: f[1]=='X',
        lambda w, c, f: f[2]=='X'
        ]
    def set_filter(self, choice) :
        self.table_model.filter_func = self.filters[ choice ]
        self.table_model.do_sort()

The table model code then executes a sort, which centers on this:

self.sort_vector = get_column_vector( col, order, self.sort_key_func, self.filter_func )

And that leads to executing this:

KEY_GETTERS = [
    lambda j : VALUES_BY_ROW[j][0],
    lambda j : '{:05}{}'.format( VALUES_BY_ROW[j][1], VALUES_BY_ROW[j][0] ),
    lambda j : '{}{}'.format( VALUES_BY_ROW[j][2], VALUES_BY_ROW[j][0] )
    ]
def get_column_vector( col, order, key_func = None, filter_func = None ):
    global DATABASE, VALUES_BY_ROW, SORT_UP_DICTS, SORT_UP_VECTORS, SORT_DOWN_VECTORS
    if filter_func is None :
snip...
    else :
        # Build a filtered vector for the requested column, order and filter.
        getter_func = KEY_GETTERS[ col ]
        sort_dict = SortedDict( key_func )
        for j in range( len(DATABASE) ) :
            if filter_func( *VALUES_BY_ROW[j] ) :
                k = getter_func( j )
                sort_dict[ k ] = j
        vector = sort_dict.values()
        if order == Qt.DescendingOrder : # convert the up-vector to a down one
            vector = [ j for j in reversed( vector ) ]
    return vector

I ran into one misunderstanding about Python syntax. The core expression above is if filter_func(*VALUES_BY_ROW[j])... This selects the value of row j, which is a three-element tuple, and passes it into the given function. Initially I supposed I could write that function as, e.g.,lambda (w,c,f): test-on-f, expecting the three parts of the input tuple would be assigned to the three local names; but that is a syntax error. How does one pass the elements of a tuple into a lambda (or any other function)? One uses a leading asterisk to unpack it:

filter_func(VALUES_BY_ROW[j]) passes a single value which is a tuple. filter_func receives one value, a tuple.

filter_func(*VALUES_BY_ROW[j]) passes the elements of the tuple as individual items. filter_func receives, in this case, three distinct values, so it can be coded lambda w,c,f : test-on-f.

The code is here in PasteBin. You can execute it from the command line, giving as an argument the number of rows in the fake database, e.g. python table-ownsort.py 20000. I tested its operation up to 64,000 rows. At more typical sizes of 10K-20K it takes 1-2 seconds to sort a column for the first time, and negligible time to change the sort order thereafter. The growth in time appears to be just about linear in the number of rows:

 2000 : 0.099
 4000 : 0.209
 8000 : 0.422
16000 : 0.856
32000 : 1.749
64000 : 3.394

That is a huge improvement over the sorting time of QSortFilterProxyModel, which was roughly exponential and nearly unusable at 10000 rows.

The next big job is to retrofit this whole mechanism of sort vectors back into the existing PPQT code. I may get a start on that if there's some free time in Oklahoma City (and hey, after one visits the National Cowboy Museum and the Myriad Botanical Gardens, there is not a lot more to do...)

No comments: