Thursday, September 11, 2014

Little performance pick-up

The character panel (which I start work on tomorrow) will feature a button named "Refresh", meaning: bring the census of characters in the book up to date. The census is implemented in the chardata module I worked on today. Initially I coded refresh() in the simplest way:

        editm = self.my_book.get_edit_model()
        c = self.census # save a few lookups
        self.k_view = None
        self.v_view = None
        c.clear()
        for line in editm.all_lines() :
            for char in line :
                n = c.setdefault(char,0)
                c[char] = n+1
        # Recreate the views used for fast access
        self.k_view = c.keys()
        self.v_view = c.values()

Get rid of the key- and value-views just in case sorteddict wants to try to update them as keys are added. Clear the sorteddict. Brute-force count all the characters. (editm.all_lines() is an iterator returning the lines of text in the document in order from first to last, as Python strings.) Recreate the views.
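For anyone wanting to play with the counting loop outside the app, here is a minimal standalone sketch of the same idea. It assumes the sorteddict behaves like sortedcontainers.SortedDict (my assumption; the post does not say which implementation chardata uses), and two literal strings stand in for editm.all_lines().

        # Standalone sketch only -- not the chardata code. Assumes a
        # sorteddict with the behavior of sortedcontainers.SortedDict.
        from sortedcontainers import SortedDict

        census = SortedDict()
        for line in ['Alice was beginning', 'to get very tired']:
            for char in line:
                census[char] = census.setdefault(char, 0) + 1

        k_view = census.keys()    # live views: they reflect later insertions
        v_view = census.values()
        census['!'] = 1           # the new key shows up in k_view, in sorted order
        print(list(k_view)[:5], list(v_view)[:5])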

When the document managed by the edit model is about 25K characters, calling timeit on this method for four iterations took 0.75 seconds.
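For reference, a figure like that can be taken with the standard timeit module; timeit.timeit() accepts a callable, so a bound method can be timed directly. The panel parameter below is hypothetical, standing in for whatever object owns refresh().

        # Sketch of the measurement, not the actual test harness.
        import timeit

        def time_refresh(panel, reps=4):
            # run panel.refresh() reps times; return total elapsed seconds
            return timeit.timeit(panel.refresh, number=reps)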

When the user opens a book for the first time, there is no metadata, and the character census sits empty until the user clicks Refresh. Then the above logic runs, loading the sorteddict. On a save, the list of characters and counts is written to the meta file, and it is reloaded when the book is opened again. The user clicks Refresh only after editing, to get an updated list of characters. Thus, almost every time Refresh is clicked, a dictionary already exists that is almost complete. Possibly the user has added or eliminated a few characters (converted some non-Latin-1 characters to entity notation, for example), and the counts will be different; but the dictionary exists.
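The save-and-reload step might look something like the following sketch. This is only an illustration of the idea, assuming a JSON-style representation and a sorteddict class supplied by the caller; it is not the actual PPQT meta-file code.

        # Illustration only: persist the census as a plain {character: count}
        # mapping, and rebuild the sorted dictionary when the book is reopened.
        import json

        def save_census(census, fobj):
            # single characters are already strings, so they serve as JSON keys
            json.dump(dict(census), fobj)

        def load_census(fobj, sorted_dict_class):
            # sorted_dict_class is whichever sorteddict implementation is in use
            return sorted_dict_class(json.load(fobj))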

So it occurred to me to wonder whether refresh() might benefit from a trick I used in the word-data Refresh method. If the dictionary exists, i.e. this is not the first time the document has been opened and a character census has previously been taken, don't throw the dictionary away. Go through it and zero all the counts; then take the census; then go through again and delete any entries whose count is still zero. Applying this idea yields the rather more complex method below:

        editm = self.my_book.get_edit_model()
        c = self.census # save a few lookups
        if len(c) : # something in the dict now
            for char in self.k_view:
                c[char] = 0
            for line in editm.all_lines() :
                for char in line :
                    n = c.setdefault(char,0)
                    c[char] = n+1
            mtc = [char for char in self.k_view if c[char] == 0 ]
            for char in mtc :
                del c[char]
        else : # empty dict; k_view and v_view are None
            for line in editm.all_lines() :
                for char in line :
                    n = c.setdefault(char,0)
                    c[char] = n+1
            # Restore the views for fast access
            self.k_view = c.keys()
            self.v_view = c.values()

Four iterations on the 25K book: 0.21 seconds, down from 0.75. Keeping the dictionary and its views intact rather than recreating them saved considerable time. The Refresh operation should cause only a barely perceptible delay even in a large book.
