Friday, February 6, 2015

help viewer, QWebEngine lacks, fuzzy matching

After yesterday's entry I had some more time and put it into starting the coding of helpview.py. This is a simple parentless widget (which makes it a modeless window) containing only one object, a web page. The main window creates one instance of it the first time the user selects File>Help.

The help widget initializes by getting the path to the "extras" folder, looking there for ppqt2help.html, and loading that file's contents into its web page. (If it doesn't find the file, it loads some default text that asks the user to use Preferences to set the proper Extras folder path. It also hooks the signal from the paths module that says path preferences have changed, so it can retry loading help whenever the user changes them.)

The only other features of this widget are how it handles the close event, and a simple Find function. The closeEvent() method ignores the close, so the widget is never destroyed once created (until the app ends, of course). It just hides itself. In the main window, when the user chooses File>Help a second time, it just calls the show() and raise() methods of the existing help widget (raise() is spelled raise_() in PyQt, because raise is a Python keyword), and it pops up nicely just where the user left it.

When I went to code the Find/Next feature I ran into yet another WebKit feature that is missing in QWebEngine. A QWebEnginePage object has a findText() method, just like QWebPage does. And the argument lists are the same: a string to find, and a selection from the FindFlags enum. However, that enum for QWebEnginePage has only two entries, FindBackward and FindCaseSensitively. Missing are 5 other find options that WebKit supports, including most importantly, FindWrapsAroundDocument.

Without the automatic wrap-around, the find is crippled: it can only find what follows the current position in the page, or what precedes it. Repeated find-next without wraparound is unintuitive and awkward for a user hunting around in a lengthy user manual.
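To make concrete what wrap-around buys you, here is a minimal sketch in plain Python, working on a string rather than on a web page. The function name find_with_wrap and its arguments are made up for illustration; this is not how WebKit implements FindWrapsAroundDocument, just the behavior the user expects from it.

```python
def find_with_wrap(text, needle, start=0, backward=False):
    """Return the index of the next match of needle, or -1 if there is none.

    A forward search looks from start to the end of text, then wraps
    to the top. A backward search looks for a match lying wholly before
    start, then wraps to the bottom.
    """
    if not needle:
        return -1
    if backward:
        pos = text.rfind(needle, 0, start)
        if pos == -1:
            # Nothing before the current position: wrap to the end.
            pos = text.rfind(needle)
        return pos
    pos = text.find(needle, start)
    if pos == -1:
        # Nothing after the current position: wrap to the beginning.
        pos = text.find(needle)
    return pos


manual = "find one, find two"
pos = find_with_wrap(manual, "find", 0)        # first occurrence
pos = find_with_wrap(manual, "find", pos + 1)  # second occurrence
pos = find_with_wrap(manual, "find", pos + 1)  # wraps back to the first
```

Without the two wrap branches, the third call would simply report "not found", which is the awkward behavior described above.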

Well, you know: my user manual HTML is very standard, very normal. No graphics, no HTML5, no plug-ins. So there's no benefit to using the WebEngine. So I just used the QWebView. Then I implemented a keyPressEvent handler that picked off ^f, ^g, and ^G and directed them to do_find, do_find_next, and do_find_prior. Those methods were only a few lines long. Find pops up a dialog to get some text to find, pre-loaded with any text that might be selected in the webview already. Next and Prior use the last-given find string to search forward or backward.

I got about 2/3 of that written yesterday evening and was able to finish it this morning, including modifying the main window code.

I also settled a niggling problem in wordview. A useful feature of the Word panel (lifted from good old Guiguts) is the option to see the "first harmonic" or "second harmonic" set for a given word. You right-click on a word and choose "first harmonic" from the context menu. The table then reduces to show only that word plus any words that are exactly one edit (insert, delete, or substitution) away from it. Second harmonic shows words that are one or two edits away. It's a very useful way to find important typos, like misspelled "Footnote" or "Illustration" keywords.

In version 1 I used a module that implemented the Levenshtein algorithm. But for V2 I knew I wanted to use the regex package, in part because it offers "fuzzy matching", which is basically Levenshtein implemented in C, generalized, and incorporated as a regex option.

In order to find the first harmonic to a word whose text is stored in word, I do this:

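        # Match the whole string against word, allowing exactly one error.
        # regex.escape() guards against any regex metacharacters in word.
        rex = regex.compile('^(' + regex.escape(word) + '){0<e<2}$')
        hits = set()
        for j in range(self.words.word_count()):
            wx = self.words.word_at(j)
            if rex.match(wx):
                hits.add(wx)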

Say the word is "page". The compiled expression is ^(page){0<e<2}$ which says, match a complete string (the caret and dollar at the ends) that matches "page" with exactly one error ("e" is greater than 0 and less than 2). (For some reason you are not allowed to write {e=1}.) Run through all words in the database and put the matching ones in a set: "rage", "pale", "pages" etc. At the end, if the set isn't empty, put it into the sortFilterProxy, where filterAcceptsRow() will only accept the rows containing those words.
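For anyone who wants to see what "exactly one error" means without installing the regex package, here is a rough pure-Python sketch of the same test using a textbook Levenshtein distance. It is not the code PPQT uses (that is the fuzzy regex above) and it is far slower than the C version; the names edit_distance and harmonic are made up for this illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: the number of single-character inserts,
    deletes, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def harmonic(word, vocabulary, max_edits=1):
    """Words that are between 1 and max_edits edits away from word.
    max_edits=1 is the first harmonic, max_edits=2 the second."""
    return {w for w in vocabulary
            if 0 < edit_distance(word, w) <= max_edits}


words = ["page", "rage", "pale", "pages", "pamela", "rag"]
first = harmonic("page", words)        # one edit away
second = harmonic("page", words, 2)    # one or two edits away
```

Because the whole candidate word is compared, "pamela" (three edits from "page") is in neither set, while "rag" (two edits) shows up only in the second harmonic.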

It works and it's very quick, but it took a little while to get the regex tweaked to get the correct matches. At first I didn't have the caret/dollar delimiters. Then it would match any first harmonic that was contained within a word. For example, it accepted "pamela" because it matched "page" to "pame" with one error.

With all that done I did a "push origin master". The PPQT2 code is just about ready for an alpha test. I need to write a skeleton version of the help file. And I need to solve the packaging or bundling issue. I've made some progress on the pyqtdeploy front that I'll write about tomorrow, I hope.
