Monday, March 23, 2015

Filtering adds some issues

"And now on to filtering..." Um, not so fast, cowboy. There are some issues here.

First issue arises because I had been using QSFPM for two different kinds of filtering. The QSFPM lets you filter the rows of a table in two ways. One way is that you can simply set filterKeyColumn to a column number and plug a QRegExp into filterRegExp. Then the parent QSFPM object will apply that regex to the display value of that column in every row. If there's a match, the row is accepted for display; if not, not. I was using that method for all the filtering on a word's properties. Each regex tested the properties column for some combination of flags. So if the user wanted to filter on all-cap words, the regex looked for an "A" in that flag field. Words that didn't have the "A" for all-cap property were not displayed. It was simple and clean.

The other way to filter is to code a custom version of the filterAcceptsRow() method. It receives a row index and does whatever tests it wants—presumably calling the data() method for column values to examine—and returns True or False to show whether the table should display the row or not. I used this for the case where the user wants to display the Similar, or First- or Second-harmonic word sets. The user right-clicks on a word and selects, say, First-harmonic from the context menu. My code runs through the database (directly, not going through the overhead of the model data() method) and puts every first-harmonic word in a Python set. (In version 1 I used a Python Levenshtein module for this. In V2 I am taking advantage of the fact that the superb regex package has a generalized version of 1st and 2nd harmonic matching built-in, so the test is done in C code.) Anyway, the set of selected words is saved. The QSFPM filterRegExp business is cleared out and filterAcceptsRow() takes over. It tests the word in any row for membership in the current set of words. So only the similar, or 1st- or 2nd-harmonic set of words is displayed.

I sat and stared quite a while at the table sort test code, slowly realizing that both kinds of filter would have to be somehow integrated into get_column_vector(). I had been supposing that the caller (that's the sort() method of the model) would pass in an executable, a reference to an equivalent to filterAcceptsRow(). Then the central loop of get_column_vector() would look something like this:

            # Sweep through the actual database collecting keys and relating them
            # to their index position.
            for j in range( len(DATABASE) ) :
                if filter_func( VALUES_BY_ROW[j] ) :
                    k = key_getter( j )
                    sort_dict[ k ] = j

So that only the filtered rows would be entered into the sort vector. And something like this will still likely happen. However, I was realizing that such a filter_func might be quite complex. More time is being piled onto the central loop of a time-critical function. And the model would have to have quite a selection of filter_funcs and choose the correct one. It was looking a bit messy, always a bad sign.

But there's a more serious problem. The table-sorting I've devised so far relies heavily on caching those sort-vectors. It's time-consuming to build one, a solid 1-second delay. So if you first click on a column head there's that delay. If you click a second time to invert the order, there's only a 0.1 second delay while the reverse vector is built from the cached sort-up vector. From then on clicking on that column reverses the sort in a few milliseconds because it is just grabbing the cached vector.

But filtering invalidates the cache! If I come into get_column_vector() with a filter_func parameter, it can only use the cached vector if that vector was made with the identical filter_func.

Say the table is displaying all rows, sorted up (the typical and default condition). Now the user wants to see the all-cap words. So we make a new vector with just those rows. Then the user says, wait, I really want to see the mixed-case rows, and selects that filter. We must not re-use the prior vector made with a different filter_func. We have to build a new one.

But filtering is only temporary; the user soon wants to go back to the full table. If we discard the previous, all-rows vector, we'll incur a big delay rebuilding it. So I don't want to lose any of the non-filtered vectors that may have been cached. So now I need to treat these sort vectors separately. The cached all-row vectors are saved, but they can be superceded by a single, recent, filter-based vector. Whenever the user chooses a new filter, a single, new, filtered vector is built. (We will not attempt to cache filtered vectors. Whenever the user chooses a filter, we build a new vector.) When the user goes back to the "All" filter, a previous cached all-rows vector can be used.

Oh, here's a nasty thought. The table is sorted on column 0 descending (say). The user requests a filter. The table is still sorted on column 0 descending, but now displays fewer rows. While it is still displaying a filtered set, the user clicks on a different column. So we have to build a new, filtered vector on a different column or different sort order.

No, that's OK. The mechanism of applying a filter is this: One, the user triggers a UI element requesting a filter. That can be the combo-box listing the property filters, or it can be a context menu selection. Two, the code that handles that event has the job of setting the current filter function, and then Three, it calls the sort() method. So filtering is always a sort, using the current sort column, the current sort order, and the latest filter-func. And if the user clicks on a different column while filtering, that just means another call to sort() with a different sort column, and/or a different order, but the filter-func is still there.

What does a filter-func look like? Probably they can be lambdas. Supposing there is an array of compile regexes, one for each property class, the property funcs are like lambda (w, c, f) : self.property_regex[1].match(f). And if the set of harmonic words is stored in self.filter_set, that filter is lambda (w, c, f) : w in self.filter_set.

The Stanford women won this afternoon, a very exciting come-from-behind win, so we will follow them to the next NCAA round, meaning we will be spending a long weekend in Oklahoma City this weekend. That will slow down coding a bit, perhaps. I'll try to get some of the above implemented tomorrow, anyway.

No comments: