Monday, March 16, 2015

Basic table testing framework

To test the table sorting mechanism in a way that was somewhat relevant to my app, I wrote a program that generates a fake database of 10,000 rows with contents that are similar to the PPQT Word table. The code for the program is stored in Pastebin. I will examine parts of it here.

The program starts with several functions to generate random data:

  • xdi() returns a list of a given number of random integers from the exponential distribution. I use the Beta value to skew the list in certain ways. For example a Beta of 0.05 and a max of 100 gives a list primarily of small integers. This emulates the distribution of word-counts in a vocabulary: many 1s and 2s and a few 50s or 100s.
  • rwt() returns a word-like token with, typically, more vowels than consonants.
  • rft() returns a token that looks something like the property flag values of the Word table.

You can pretty much ignore the code of these; they work and that's all I care about.

The database is a SortedDict from the excellent SortedContainers package (that link is to its doc; here it is on PyPI). The SortedContainers package is pure Python, minimizing cross-platform issues, and has performance as good as, and sometimes better than, packages based on C code such as blist. The database is populated with this code:

    col2 = xdi( 0.05, DBS, 100 )
    clear_the_db()
    for k in col2 :
        w = rwt(8)
        f = rft()
        DBM[ w ] = ( w, k, f )
    DBV = DBM.values()

DBS contains 10000. So this makes a dict with 10,000 items, with keys of 8-char random words. The dict is sorted. The values view gives indexed access to the sequence of values. That is, DBV[j] is the jth value in key sequence.

This code, we will see later, takes about 0.7 seconds to complete on a 3.5GHz Core i5. Lotsa computation going on.

You might notice a suspicious redundancy. The primary key (in the SQL sense) is w, and it is being stored both as the dict key and again in the tuple that is is the dict value. There is a reason for this.

The rest of the code is very standard, and minimal, PyQt5. There is a table model implemented in part as follows:

class Model( QAbstractTableModel ):
    def __init__ ( self, parent=None ) :
        super().__init__( parent )
        self.access_counts = [0, 0, 0]
 
    def rowCount( self, index ) :
        global DBM
        if index.isValid() : return 0
        return len(DBM)
  
    def data(self, index, role ) :
        global DBV
        if role != Qt.DisplayRole :
            return None
        row = index.row()
        col = index.column()
        self.access_counts[col] += 1
        return DBV[ row ][ col ]

The data() method counts the number of DisplayRole calls for each column. Its last line demonstrates the usefulness of the redundant storing of the key word in the value. Otherwise there would have to be (and initially was, until I thought of this trick) a test in the code:

        if col : # is 1 or 2,
            return DBV[ row ][ col-1 ]
        else: return DBK[row]

Return either one of two values stored as the dict value, or return the key. But storing the key redundantly in the value allows the data() method to eliminate one test. And with a multi-gigabyte laptop, who cares about a little redundancy? 10,000 extra string values? What's that, 0.00001 of a gig? Pfffffttt!

Below the table model comes the lengthy code needed to create a basic window and populate it with stuff. The only interesting bit is the slot attached to the clicked signal of the Refresh button:

    def do_refresh( self ) :
        self.table_model.clear_counts()
        self.table_model.beginResetModel()
        self.times[0] = time.process_time()
        rebuild_the_db()
        self.times[1] = time.process_time()
        self.table_model.endResetModel()
        QApplication.processEvents()
        self.times[2] = time.process_time()
        self.update_labels()

The update_labels() routine displays the counts of accesses, and the elapsed times for rebuilding the database and for completing the update of the table, in labels at the bottom of the window. Here is what the window looks like, with some typical values.

The values at the bottom of the window are typical. It takes around 0.73 of a second to recreate the database. That's not long enough to justify putting up a progress-bar, although it's close. The time to complete the model reset and redisplay the table is trivial, imperceptible to the user.

The access counts are interesting. Initially they were zero. Then I put the call to QApplication.processEvents() in, and that forced the table display code to run. But even then it only called for 12 rows of data. My first thought was, why not 10,000? Then I realized, oh, it's being clever and only getting enough data to populate the visible rows in the window. And that was proven when I dragged the window taller. Then a Refresh caused it to fetch 29 rows, exactly the number that filled the larger window. Also, it always fetches only that many rows, even when I have scrolled the table to the end. It knows what row indices it needs, and calls the model only for those.

However, this table doesn't sort. The table view initializes itself with

        self.setSortingEnabled( True )
        self.sortByColumn( 0, Qt.AscendingOrder )

That is why the table shows a sort order indicator (small triangle in the Column 1 header). One can click the column headings and the sort order indicator will move around and change as if sorting were happening. And presumably the sort() method of the table model (inherited from QAbstractItemModel) is being called, but it does nothing.

In the next post, I will show what happens when we interpose a sort filter proxy.

No comments: