Wednesday, December 31, 2014

Webkit, a never-failing source of laughs

So yesterday I wrote how the Qt5.4 Webkit seems to be behaving itself. And it is, compared to the 5.2 and 5.3 versions. It actually displays all the hard comics without screwing up the rendering. And so far it hasn't crashed in any of the common ways I summarized in the previous post.

However, the following is the slightly edited console output of running it today.

[08:10:09 CoBro] python ./cobro.py
2014-12-31 08:10:21.581 Python[68621:14259725] Cannot find executable for CFBundle 0x10858fd00
</Users/dcortesi/Library/Internet Plug-Ins/DjVu> (not loaded)
2014-12-31 08:10:21.589 Python[68621:14259725] Error loading /Library/Internet Plug-Ins/QuickTime Plugin.plugin/Contents/MacOS/QuickTime Plugin:  dlopen(/Library/Internet Plug-Ins/QuickTime Plugin.plugin/Contents/MacOS/QuickTime Plugin, 265): no suitable image found.  Did find:
 /Library/Internet Plug-Ins/QuickTime Plugin.plugin/Contents/MacOS/QuickTime Plugin: mach-o, but wrong architecture
GVA info: Successfully connected to the Intel plugin, offline Gen75 

That was just for starters. Then as I got down toward the end of the list, after viewing a dozen comics without further messages, I clicked on Gunnerkrigg Court and the console filled up with this.

[08:16:58.637] FigLimitedDiskCacheProvider_CopyProperty signalled err=-12784 (kFigBaseObjectError_PropertyNotFound) (no such property) at /SourceCache/CoreMedia/CoreMedia-1562.19/Prototypes/FigByteStreamPrototypes/FigLimitedDiskCacheProvider.c line 947
<<<< FigByteStream >>>> FigByteStreamStatsLogOneRead: ByteStream read of 8 bytes @ 4453661 took 0.573590 sec. to complete, 1 reads >= 0.5 sec.
Dec 31 08:17:18 Silver-streak-2.local rtcreporting[68621] : logging starts...
Dec 31 08:17:18 Silver-streak-2.local rtcreporting[68621] : setMessageLoggingBlock: called
[08:17:18.935] itemasync_GetDuration signalled err=-12785 (kFigBaseObjectError_Invalidated) (invalidated item) at /SourceCache/CoreMedia/CoreMedia-1562.19/Prototypes/Player/FigPlayer_Async.c line 2870
Dec 31 08:17:19 Silver-streak-2.local rtcreporting[68621] : startConfigurationWithCompletionHandler: Cached 0 enabled backends
Dec 31 08:17:19 Silver-streak-2.local rtcreporting[68621] : setUserInfoDict: enabled backends: (
 )
Dec 31 08:17:19 Silver-streak-2.local rtcreporting[68621] : initWithSessionInfo: XPC connection invalid
Dec 31 08:23:14 Silver-streak-2.local Python[68621] : CGContextSaveGState: invalid context 0x126544940. This is a serious error. This application, or a library it uses, is using an invalid context  and is thereby contributing to an overall degradation of system stability and reliability. This notice is a courtesy: please fix this problem. It will become a fatal error in an upcoming update.
Dec 31 08:23:14 Silver-streak-2.local Python[68621] : CGContextScaleCTM: invalid context 0x126544940. This is a serious error. This application, or a library it uses, is using an invalid context  and is thereby contributing to an overall degradation of system stability and reliability. This notice is a courtesy: please fix this problem. It will become a fatal error in an upcoming update.

The above message repeated 15 more times.

Do I need to analyze these things? Is there a single gorram thing I can do to prevent them?

Sigh. OK, one at a time. Inserting newlines so Blogger can display them.

2014-12-31 08:10:21.581 Python[68621:14259725] Cannot find executable for
CFBundle 0x10858fd00
</Users/dcortesi/Library/Internet Plug-Ins/DjVu> (not loaded)

Some webcomic wants a "DjVu" plugin. This is what I get for telling the browser, PluginsEnabled(True). Because so many comics rely on Flash, you see. What is "DjVu"? I suppose I could find out, but supposing I did, is there any way I could ensure it would always be available wherever Cobro runs? No. And why oh why cannot Webkit fail silently and just put up a broken-plugin icon? Why does it have to blabber about it on the console?

Then we have this beautiful message; let's break it down into parts.

Error loading /Library/Internet Plug-Ins/QuickTime Plugin.plugin/Contents/MacOS/QuickTime Plugin:

Another plugin the browser can't find. I don't care! Nobody cares. Just shut up.

dlopen(/Library/Internet Plug-Ins/QuickTime Plugin.plugin/Contents/MacOS/QuickTime Plugin,
265): no suitable image found.

My heart bleeds for you.

Did find: /Library/Internet Plug-Ins/QuickTime Plugin.plugin/Contents/MacOS/QuickTime
Plugin: mach-o, but wrong architecture

I'm still not caring.

GVA info: Successfully connected to the Intel plugin, offline Gen75 

Good job, Webkit! If I knew what "GVA info" was, I'd probably be very impressed. What do you suppose it means, "offline Gen75"?

Then there's this wonderful streak of completely opaque babble:

FigLimitedDiskCacheProvider_CopyProperty signalled err=-12784
(kFigBaseObjectError_PropertyNotFound) (no such property) at
/SourceCache/CoreMedia/CoreMedia-1562.19/Prototypes/FigByteStreamPrototypes
/FigLimitedDiskCacheProvider.c line 947
<<<< FigByteStream >>>> FigByteStreamStatsLogOneRead: ByteStream read of 8 bytes @ 4453661
 took 0.573590 sec. to complete, 1 reads >= 0.5 sec.

This should really be bronzed and hung in a place of honor in the Hall of Indecipherable and Utterly Useless Messages. It is telling me about an error -12784, apparently having to do with some undocumented cache not finding some arcane property. But instead of explaining what that error means or how to fix it, it natters on about how it took over half a second to read 8 bytes. I can personally read a lot faster than that, so I don't think a 4-core 3.5GHz machine should be bragging about this.

The next part is just poetry. Lightly edited for aesthetic enjoyment:

: logging starts...
: setMessageLoggingBlock: called
itemasync_GetDuration signalled err=-12785
   (kFigBaseObjectError_Invalidated)
   (invalidated item) at /SourceCache/CoreMedia/CoreMedia-1562.19/Prototypes/Player/FigPlayer_Async.c line 2870
: startConfigurationWithCompletionHandler: Cached 0 enabled backends
: setUserInfoDict: enabled backends: ( )
: initWithSessionInfo: XPC connection invalid

So helpful! So informative! So much not wanted by anybody... And then we have 16 repetitions of

: CGContextSaveGState: invalid context 0x126544940.
  This is a serious error.
  This application, or a library it uses, is using an invalid context
  and is thereby contributing to an overall degradation of
  system stability and reliability.
  This notice is a courtesy: please fix this problem.
  It will become a fatal error in an upcoming update.

Oh, it's a "courtesy" is it? And I am to "fix" the "problem" which is happening where and caused by what, and fix it how? Spare me your passive-aggressive manipulative bullshit and fix your own fucking bugs, thank you very much.


Maybe later today I'll google some of this horse-hockey. Or tomorrow; that'll be a good way to start a new year. Hope yours starts better.

Tuesday, December 30, 2014

Postponing WebEngine, trying Nuitka

I posted a polite & constructive note to the WebEngine development list asking for a timeline on when some of the missing features might appear. No replies as yet.

Trying Cobro under Qt5.4, I am very pleased to see that the Webkit version is now behaving itself. Under 5.2 and 5.3 there were several comics that it would display incorrectly. This failure was always that some small image file on the page would be shown at 1000% zoom and overlay the rest of the rendered page. It made Penny Arcade, for example, unreadable.

In addition there were at least four failure modes I've been documenting over the past few months of daily use. In no particular order,

  • Emitting a stream of a hundred or more messages "Critical failure: the LastResort font is unavailable", then a segfault.
  • "QEventDispatcherUNIXPrivate(): Unable to create thread pipe: Too many open files... Abort trap: 6"
  • Segfault 11 somewhere deep in the Mac OS innards, usually with QNetworkConfiguration in the stack trace.
  • Segfault nested 50 or more calls deep in Webkit rendering code.

It typically crashed in one of these ways every 3 or 4 times I ran it. That was clearly unacceptable. I didn't mind it much myself (just sigh and restart it), but I could never distribute a program that was so unreliable. I was really looking forward to changing to WebEngine and leaving all those nasty bugs behind. Unfortunately that appears not to be possible, because WebEngine doesn't support several functions that, although minor, I want to have for proper operation. The ability to command "private" (uncached) browsing, for example: it would be unacceptable to leave the residue of browsing comics in one's work-browser's cache. Or the ability to implement the custom context menu I wrote, which lets you right-click on a link and make it open in the default browser. Can't do that under WebEngine because it lacks the feature of Webkit that lets you find out what type of data was under the mouse and, if it is a URL, get its text.
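
For reference, here is how those two abilities look under Webkit, in PyQt5 terms. This is a sketch; neither attribute has any counterpart in QWebEngineSettings.

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebKit import QWebSettings
from PyQt5.QtWebKitWidgets import QWebView

app = QApplication(sys.argv)
view = QWebView()
# browse without leaving comic pages in any cache or history
view.settings().setAttribute(QWebSettings.PrivateBrowsingEnabled, True)
# plugins on, because so many comics rely on Flash
view.settings().setAttribute(QWebSettings.PluginsEnabled, True)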

However, after several uses, the 5.4 Webkit browser has displayed none of the above problems. It handles Penny Arcade and Stand Still Stay Silent perfectly, and they were unreadable with 5.3. And none of the crashes has happened yet either.

So, I'm going to shelve the conversion to WebEngine for now. I've gone ahead and put in a couple of minor feature enhancements that I'd had in mind for some time, and I will continue reading comics with it daily, and just maybe it will prove reliable.

Meanwhile, I downloaded Nuitka and tried to install it. Unfortunately, although it claims to support all of Python through 3.4, the code executed by its setup.py displayed a number of syntax errors that were clearly due to byte-compiling Python 2 code under Python 3.4—missing parens on a print statement, for example.

So I joined yet another goddam dev mailing list and sent a polite query about this. Maybe tomorrow I'll get a reply from one group or the other.

Monday, December 29, 2014

Tangled up in Git

Here's how I thought git worked. I start a branch (git checkout -b newbranch), edit a file, and save it. I have not committed anything yet but have changed plenty.

Suddenly realize, I need to try this thing out in its earlier version. So I thought I could just do git checkout master and any changes made on newbranch would be swept under the carpet and I'd be back to the master version.

Did that: branched cobro, made a shit-ton of changes to cobro.py to switch it to WebEngine, realized, wait, I need to check how something worked with WebKit. Did the "checkout master" thing and... cobro.py didn't change. It didn't go back to what it was. It still had all the WebEngine changes in it. I ended up doing a git revert HEAD, which lost all the changes, made a copy cobro-webkit.py, and now I get to make those changes again.

I have no idea what went wrong, or indeed if it was wrong or expected behavior. (Possibly it happened because the file is in the Dropbox folder, and my laptop was fighting the desktop? Or possibly it is just expected: checkout carries uncommitted working-tree changes along to the new branch when they don't conflict, and git stash is the tool for parking them first.)

Also I discovered two new things that QWebKit offers and QWebEngine does not. One is setting the web page non-editable. The other is "link delegation" where anytime a link is clicked, it emits a signal and lets you decide if it should go through. So now I have three posts on this theme at the qt-project forums. Sad to say, only one has drawn a reply. Although that reply points to the WebEngine dev mailing list, so maybe I will learn something from that.

Sunday, December 28, 2014

First steps with QWebEngine

Needing to kill some time this evening while my wife worked on her website, I sat down and began the process of converting Cobro from QWebView, QWebPage, QWebSettings to use their modern but more verbose counterparts, QWebEngineView, QWebEnginePage and QWebEngineSettings. Things look good but as usual there are some issues.

Thanks to my splendidly modular and well-organized coding style, there is really only one object affected in the whole program, a QWebView derived class. I changed its parent class from QWebView to QWebEngineView, and then began going through all the self-initializing lines I had used to set up the QWebView.

Many of them had direct equivalents. For example, self.page() returns the underlying QWebEnginePage just as it previously returned the QWebPage. One still uses self.settings().setAttribute(something, True/False) for setting properties. However, I quickly discovered that several of the settable properties of QWebSettings are not available in QWebEngineSettings.
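
For example, a settings line carries over almost verbatim; a sketch, with an invented class name standing in for my real one:

from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEngineSettings

class ComicView(QWebEngineView):    # stand-in for my QWebView-derived class
    def __init__(self, parent=None):
        super().__init__(parent)
        # these attribute names are spelled identically in both APIs
        self.settings().setAttribute(QWebEngineSettings.JavascriptEnabled, True)
        self.settings().setAttribute(QWebEngineSettings.AutoLoadImages, True)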

In particular, there is no JavaEnabled, no PluginsEnabled, and no PrivateBrowsingEnabled. I put a query about these three on the qt forum. It seems strange they would go to some trouble to make the new API compatible with the old, and yet leave out settings, and fairly significant settings at that.

Also my app implements a custom context menu. In it, I use the following to check whether the thing that was right-clicked upon is a URL. Call self.page().currentFrame() to access the QWebFrame in which the right click occurred. Ask that for hitTestContent(QPoint) to find out what was under the point of the mouse event. If it is a URL, then I can access the text of that URL and offer to open it in the default browser. This is very useful; Cobro is not a full-function browser, and if there is something on a comic page besides the comic, it is handy to be able to jump to a real browser.
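
In code, the check amounts to this (a sketch: the real handler builds a menu offering the choice, rather than opening the link outright):

from PyQt5.QtGui import QDesktopServices
from PyQt5.QtWebKitWidgets import QWebView

class ComicView(QWebView):    # stand-in name for the real class
    def contextMenuEvent(self, event):
        # ask the frame what sits under the right-click
        hit = self.page().currentFrame().hitTestContent(event.pos())
        url = hit.linkUrl()
        if not url.isEmpty():
            # hand the link to the user's default browser
            QDesktopServices.openUrl(url)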

Unfortunately it does not appear that QWebEnginePage offers access to the current frame. And there is no QWebEngineFrame class. So although QWebEnginePage specifically documents that you can implement a custom context menu, it is not clear how one can find out what was under the right-click. I also posted a query about this in the qt forum.

However, I just commented out all the parts that I couldn't immediately translate, and ran the app, and up it came. And displayed a comic very nicely. The progress bar signals worked, the titleChanged signal worked. So that was a useful couple of hours.

Saturday, December 27, 2014

Py/Qt5.4 at last

So Phil at Riverbank Computing dropped a Christmas present on us all, with PyQt5.4 on December 24th. Today I installed Qt5.4, the latest SIP, and PyQt5.4 on my new iMac and started working on Cobro. This all went very well. The installations went smoothly. I spent a couple of hours reviewing the code of Cobro in Wing IDE. I tidied up some comments. I removed some unnecessary global statements. And then ran it, and was pleased to find that this version can display some comics that it cannot when running in 5.2. Apparently some of the many bugs in QWebkit have been fixed. Doesn't matter; my next move (on Monday) will be to replace the webkit elements with the new QWebEngine ones. Following that, probably Tuesday, I have a couple of minor functional enhancements to add.

Toward the end of the week, or next week, the next step I think will be to try compiling Cobro with Nuitka. In theory, the result of compiling a program like Cobro in Nuitka should be a single, stand-alone, self-supporting executable. If so, that would bypass the need for a packager like pyinstaller.

If the Nuitka experiment doesn't work, I will, as previously planned, start trying to use pyqtdeploy on it.

Tuesday, December 23, 2014

Continuing to live-ish blog the hunspell hunt

For reference, here is where the experimental code is now.

import os
# set up path strings to a dictionary
dpath = '/Users/dcortes1/Desktop/scratch'
daff = os.path.join(dpath, 'en_US.aff')
ddic = os.path.join(dpath, 'en_US.dic')
print( os.access(daff,os.R_OK), os.access(ddic,os.R_OK) )
# Find the library -- I know it is in /usr/local/lib but let's use
# the platform-independent way.
import ctypes.util as CU
libpath = CU.find_library( 'hunspell-1.3.0' )
# Get an object that represents the library
import ctypes as C
hunlib = C.CDLL( libpath )
# Define the API to ctypes
hunlib.Hunspell_create.argtypes = [C.c_wchar_p, C.c_wchar_p]
hunlib.Hunspell_create.restype = C.c_void_p
hunlib.Hunspell_destroy.argtypes = [ C.c_void_p ]
hunlib.Hunspell_get_dic_encoding.argtypes = [C.c_voidp]
hunlib.Hunspell_get_dic_encoding.restype = C.c_char_p
hunlib.Hunspell_spell.argtypes = [C.c_void_p, C.c_char_p]
hunlib.Hunspell_spell.restype = C.c_uint
# Make the Hunspell object
hun_handle = hunlib.Hunspell_create( daff, ddic )
# Check encoding
print(hunlib.Hunspell_get_dic_encoding( hun_handle ))
# Check spelling
for s in [ 'a', 'the', 'asdfasdf' ] :
    b = bytes(s,'UTF-8','ignore')
    t = hunlib.Hunspell_spell( hun_handle, b )
    print(t, s)
# GCOLL the object
hunlib.Hunspell_destroy( hun_handle )

Let's see if changing the create argtypes makes a difference.

Bingo! Made the following changes. One, change the argtypes of create():

hunlib.Hunspell_create.argtypes = [C.c_char_p, C.c_char_p]

That caused a ctypes error on the call _create(daff,ddic), because a Python3 string is not compatible with c_char_p. So encode the strings:

baff = bytes(daff,'UTF-8','ignore')
bdic = bytes(ddic,'UTF-8','ignore')
hun_handle = hunlib.Hunspell_create( baff, bdic )

Et voila, the output is

b'UTF-8'
1 a
1 the
0 asdfasdf

Most excellent! I have achieved my goal of invoking Hunspell for spell-checking without use of the pyhunspell package. I am not sure if I want to change my existing dictionaries.py to do this in place of relying on the package. For sure, if I have even the slightest trouble installing the package on Windows, I will be quick to fall back on this.

It's like live-blogging, almost

Ok. I have figured out one issue. I looked at my own code that creates a dictionary and noticed that it presents the two path arguments to hunspell.HunSpell() with the .dic first and the .aff second. And that order is documented in the hunspell package doc. Making that change, the hunspell package works. It correctly notes the Greek dictionary is UTF-8 and spells a word.

Hgr = hunspell.HunSpell(pdic,paff)
Hgr.get_dic_encoding()
'UTF-8'
Hgr.spell('α')
True

And if I present it with my en_US dictionary saved in UTF-8, it opens it correctly also. This is rather bad, in that anyone looking at the Hunspell doc at the Hunspell sourceforge page will see "Hunspell(const char *affpath, const char *dpath);" which is exactly the reverse of the hunspell package. If you present the files in the reverse order (the correct order per the man page), the Hunspell object is created and no error is reported, but it can't check spelling; it calls any input misspelled.

What about my ctypes invocation? Well, that definitely uses the C-defined function, which should take the .aff first and the .dic second. The pyhunspell code likewise passes the aff-path first and the dic-path second, which is what my ctypes invocation is doing.

I do note a comment in the pyhunspell code, "Some versions of Hunspell_create() will succeed even if there are no dictionary files." So that's probably what's happening: for some reason it is not opening the path strings I am passing, and it silently fails and defaults to a rather useless, and undetectable, no-dictionary condition.

The likeliest cause of that is it is not getting the path strings in a form it expects. Maybe it can't handle c_wchar_p after all. Before I experiment with that, I am going to add a call to destroy the Hunspell object. Unlike a PyQt object, it isn't known to Python. I may be memory-leaking a Hunspell object every time I run my test code.

Continuing ctypes and hunspell

Right, so we have used ctypes to locate the Hunspell dylib and invoke the Hunspell_create() function returning a handle to a C++ object of class Hunspell. That demonstrated that Python 3 strings could be passed to a C function that expected const char * parameters.

Then we invoked a method of the Hunspell object by calling the C wrapper Hunspell_get_dic_encoding(), and huzzah! it returned what it was supposed to return: Hunspell's belief about the encoding of the dictionary's .dic and .aff files. It returned 'ISO8859-1', which may prove to be significant.

Next was to try to invoke the most important method, spell(). If this works, I can toss the whole hunspell.py package and just use fewer than 20 lines of ctypes code (maybe 30 lines, adding code for platform dependencies). Hunspell has a dozen other methods (suggestions, stemming, etc.), but all I need is spell(word) yielding 0 for bad and nonzero for good. The C header file says,

LIBHUNSPELL_DLL_EXPORTED int Hunspell_spell(Hunhandle *pHunspell, const char *);

Translating to Python,

hunlib.Hunspell_spell.argtypes = [C.c_void_p, C.c_wchar_p]
hunlib.Hunspell_spell.restype = C.c_uint

OK, let's do it!

for s in [ 'a', 'the', 'asdfasdf' ] :
    t = hunlib.Hunspell_spell( hun_handle, s )
    print(t, s)

Output:

b'ISO8859-1'
0 a
0 the
0 asdfasdf

Not good! Neither "a" nor "the" is, apparently, a valid word.

My diagnosis is this. I know the dictionary was opened successfully, and the words are in it. Either the word is not being passed correctly, or Hunspell is not comparing it correctly. I tried several variations on passing the argument: I changed the argtypes to show it as taking a c_char_p (no change); converted the word (b = bytes(s,'ISO-8859-1','ignore')) and passed the byte string (no change); and encoded it as UTF-8 instead (no change).

It stands out that Hunspell thinks the dictionary is encoded Latin-1. It lurks somewhere in my memory that I solved a similar problem by converting the dictionary to UTF-8 encoding. The encoding of the .dic file is specified in the .aff file in a SET statement. So I opened both files in BBEdit and saved them as UTF-8, also changing the .aff file to read SET UTF-8 (which is the same as the SET statement in a Greek dictionary). Tried again.

b'ISO8859-1'
0 a
0 the
0 asdfasdf

Wait, what? The SET statement says, and the actual file encodings are, UTF-8, but get_dic_encoding returns ISO8859-1? Just in case there's some kind of file caching going on, I copy the dictionary files to a different folder and change the path string to match. No change! I re-save the files as UTF-8 "with BOM". No change; it still returns ISO8859-1.

Now I doubt my prior diagnosis. Hunspell is ignoring the content of the dictionary, which possibly means it isn't reading it at all. Maybe it is failing to open the files, not reporting a failure, and returning some kind of default?

Does the hunspell package do the same?

import hunspell
import os
dpath = '/Users/dcortes1/Desktop/scratch'
daff = os.path.join(dpath, 'en_US.aff')
ddic = os.path.join(dpath, 'en_US.dic')
Hobj = hunspell.HunSpell(daff, ddic)
Hobj.get_dic_encoding()
'ISO8859-1'
Hobj.spell('the')
False

OK, I am officially flabbergasted. My flabber is gast. That dictionary is UTF-8 and defines "the". It is nice, in a way, that the package (which works fine inside PPQT 1 and 2) is failing exactly as my ctypes experiment fails, but what the heck am I doing wrong?

I'll post again when I understand more.

Update: this much is confirmed:

Hobj = hunspell.HunSpell('aasdf/asdf/asdf.aff','aasdf/asdf/asdf.dic')
Hobj.get_dic_encoding()
'ISO8859-1'

If the Hunspell object creation cannot open the .aff/.dic files, it fails silently and uses a null dictionary with a default encoding. And I don't see any way of testing whether this has happened or not!
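
The only guard I can think of is a spot-check: immediately after creating the object, spell a word that any real dictionary must contain, and treat failure as "no dictionary loaded." A sketch:

import hunspell

def checked_hunspell(path_1, path_2, probe='the'):
    hobj = hunspell.HunSpell(path_1, path_2)
    # a silently-null dictionary calls every input misspelled,
    # so one known-good word flushes it out
    if not hobj.spell(probe):
        raise ValueError('HunSpell appears to have loaded no dictionary')
    return hobj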

Exploring ctypes

Implementing spellcheck has been a constant problem for me. I solved it, awkwardly, using the pyhunspell package (see below for another link). This provides a Python interface to the Hunspell checker.

There is nothing at all wrong with Hunspell itself. It is complete, fast, and still supported. It is superior to Aspell and Myspell in several ways. Most importantly it supports Unicode (UTF-8 dictionaries), and so can be used to spellcheck German, Greek, and the like.

My issues are, or were, with the pyhunspell package. It went unsupported for a long time. It didn't support Python3 until a user posted the necessary small changes as a comment on an issue. And getting it compiled and working on Windows was, for me, a huge problem. So I wanted to experiment to see if I could access the hunspell library directly from Python using ctypes, eliminating the need to compile a wrapper.

Important note: I just discovered that pyhunspell was very recently picked up by a new owner, Benoît Latinier, and rehosted on github: here is its new home. Another user has posted a binary package for Windows on the old site; unfortunately it's for Python 2.7. So things are looking up for pyhunspell. Which is a good thing, because, as I will now finally get around to saying, the ctypes experiments are not going super-well.

We start with getting access to the library.

# Find the library -- I know it is in /usr/local/lib but let's use
# the platform-independent way.
import ctypes.util as CU
libpath = CU.find_library( 'hunspell-1.3.0' )
# Get an object that represents the library
import ctypes as C
hunlib = C.CDLL( libpath )

To do spell-checking, one must create a Hunspell object. The C header declares:

typedef struct Hunhandle Hunhandle;
LIBHUNSPELL_DLL_EXPORTED Hunhandle *Hunspell_create(const char * affpath, const char * dpath);

Converting that to ctypes, we have:

hunlib.Hunspell_create.argtypes = [C.c_wchar_p, C.c_wchar_p]
hunlib.Hunspell_create.restype = C.c_void_p

OK, let's call it!

import os
dpath = '/blah/blah...'
daff = os.path.join(dpath, 'en_US.aff')
ddic = os.path.join(dpath, 'en_US.dic')
hun_handle = hunlib.Hunspell_create( daff, ddic )

Well, nothing crashed. At this point we should be able to use methods of the Hunspell object. Back to the C header file:

LIBHUNSPELL_DLL_EXPORTED char *Hunspell_get_dic_encoding(Hunhandle *pHunspell);

In Python/ctypes:

hunlib.Hunspell_get_dic_encoding.argtypes = [C.c_voidp]
hunlib.Hunspell_get_dic_encoding.restype = C.c_char_p
print(hunlib.Hunspell_get_dic_encoding( hun_handle ))

And whoop-de-doo, it prints b'ISO8859-1'. To review: we have successfully loaded the library, created a Hunspell object, and invoked one of its methods. During creation, the object correctly loaded the dictionary that was passed. Ergo, we can pass a Python3 string into a const char * argument. This is looking great!

And this post is getting long. I will continue with actual spell-checking next time. Spoiler alert! It doesn't go well!

Monday, December 22, 2014

Preferences done, still waiting

I completed the Preferences dialog. It looks about the same as the test version shown in the prior post, plus the addition of buttons for Defaults, Cancel, Apply and OK.

Getting things to work smoothly with the "Apply" button took a bit of work. There are four highlight types the user can change: the editor current line, the text of a limited Find/Replace range, spelling error words, and scanno words. The first two use one mechanism, the second two use a completely different one.

As noted in some post I can't be arsed to look up now, the current-line highlight is done using an "extra selection", and the find-range highlight is done using another. An extra selection is a peculiar thing unlike any other class in Qt (that I know of; maybe the graphics area has similar things). It has no behaviors, no methods; it is basically a two-ple of a cursor and a format (QTextCursor and QTextCharFormat). You give your edit widget (via setExtraSelections) a list of your extra selection objects, and it applies each one's format to its cursor's selection.
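
Creating one looks about like this, a sketch of the current-line case (the function name and color are invented):

from PyQt5.QtGui import QColor, QTextCharFormat, QTextFormat
from PyQt5.QtWidgets import QTextEdit

def current_line_selection(editor):
    sel = QTextEdit.ExtraSelection()    # the cursor/format two-ple
    sel.format = QTextCharFormat()
    sel.format.setBackground(QColor('#FAFAE0'))
    sel.format.setProperty(QTextFormat.FullWidthSelection, True)
    sel.cursor = editor.textCursor()
    sel.cursor.clearSelection()
    return sel

# editor.setExtraSelections([line_sel, range_sel]) applies the formats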

The current line's cursor gets updated whenever the edit cursor moves. The find-range cursor gets updated only when the user toggles the In Selection switch in the Find panel.

Their formats get updated only when the Preferences dialog calls the colors module set_current_line_format or set_find_range_format. At that time, the colors module emits a signal, ColorsChanged, which is fielded by editview, and it refreshes the formats in its list of two extra selections.

Therein lies one problem. The user can choose a new format for current line or find range, and click the Apply button in Preferences, and the highlighting in the visible part of the Edit panel should change immediately. But it didn't, apparently because signals don't get processed while a modal dialog is up. However, I added a call to QCoreApplication.processEvents() in the Apply logic, and then, ta-daa, those two highlights changed instantly upon Apply.

The spellcheck and scanno highlights are created by a different mechanism. They are applied by a QSyntaxHighlighter. Syntax highlighting is turned on by assigning a document to the highlighter, and turned off by assigning a null document to it. But the highlights it applies don't change once they are set until the highlighted text is hidden and shown again, e.g. by paging the document in the editor.

So those two highlights didn't change even though the editor was getting control in its ColorsChanged signal slot. I had to add logic to this slot to ask, is highlighting of either scanno or spelling now active? If so, turn highlighting off, and turn it on again. That forces re-tagging of visible words. With that change, the visible highlighting changes instantly when Apply is clicked.
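
So the slot now does, in effect, this (a sketch; Qt also documents a rehighlight() call, but the off-and-on toggle is the mechanism described above):

def refresh_highlighter(highlighter, document):
    highlighter.setDocument(None)        # highlighting off
    highlighter.setDocument(document)    # on again, forcing re-tagging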

I am a bit concerned about this last, because in Version 1 there was a significant delay when you turned on the syntax highlighter in a large document. I assumed this was because at that time it would go over the whole document, passing every text block through the highlighter. If Qt5 behaves like Qt4, clicking the Apply button in Preferences might incur a significant (1-3 second) stall when the spelling or scanno highlight has changed. No such delay is perceptible now with a 25-page document. I hope that Qt5 is smarter and only invokes the highlighter for visible text. Even if there is a delay, these highlights are not something you'd change often.

For now it all works nicely. Doing Preferences was supposed to fill the time until PyQt5.4 was out. I'm done, and it isn't. Now what?

Friday, December 19, 2014

Fiddling with Fonts

This week, waiting for PyQt5.4 to be released, I've been working on the Preferences dialog, and it's coming out rather interesting. Alongside here is a test version.

It's a stack of items. I built a whole little O-O hierarchy to make this. There's a parent class, ChoiceWidget, that displays a title line and, on mouse-over, changes color and puts an explanation in the explainer box at the bottom. For the path-entry items, there's a PathChooser class that implements the path line-edit and a browse button; when focus leaves the line-edit, it checks that the given path is accessible according to some criterion (R_OK or X_OK) passed to its initializer, and if not, beeps and makes the line-edit pink. For the text-format items, there's a FormatChooser class that implements the color swatch and the sample highlight.

And then there's the font-chooser. Here I wanted to use QFontComboBox, which displays the available fonts using those fonts. Unfortunately it is rigidly designed. When told to display only monospaced fonts, it displays only the fonts the QFontDatabase thinks have that property.

Unfortunately, I am including two monospaced fonts with the program: Liberation Mono, and another that I only just discovered, Cousine from the Chrome Core set. Cousine is basically the same as Liberation Mono except it has even more Unicode glyphs.

Either of these is a better choice than the next-best font, Courier New. The latter has about as wide a Unicode repertoire but it has a poor contrast between 1/l and 0/O. Alas, the QFontDatabase will not recognize either of the fonts that I am loading (using QFontDatabase.addApplicationFont()) as being actually monospaced. It will not return their names when asked for a list of monospaced fonts.

Which means that if QFontComboBox is set to display only monospaced fonts, it will not display the two that I most want the user to have access to. And although it claims to support the methods of its parent QComboBox, it ignores a call to addItem(string). So I can't add them. If I don't tell it to show only monospaced fonts, it of course shows every font available in the system, and takes quite a while to open the first time.

I wasted quite a bit of time today trying to remedy this, first by trying to find some way to get QFontComboBox to show all but only the fonts I wanted it to show (the known monospaced fonts plus my two); then by trying to find a way to change my fonts so that QFontDatabase would recognize them as monospaced. That entailed a very lengthy search for a free or cheap TrueType font editor that would let me verify and set what I presume is a one-bit flag in the font file format that says "really, I'm monospaced." Didn't find one. Well, one, but it costs a bundle and its 30-day free trial version will not allow saving a modified font. So fuck them.

Actually, after all that, I'm not sure whether any amount of editing would help. The Mac OS Font-Book application does know these two are monospaced. It lists them when I make a "smart search" for monospaced fonts. So it may be that QFontDatabase is just prejudiced against added application fonts.

In the end I used a plain QComboBox loaded with the family names of monospaced fonts. I set it up so that when you make a selection from the list, the "explanation" box at the bottom changes to use that font.
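
Roughly like this (a sketch; it assumes a QApplication is already running and the bundled fonts have been registered with addApplicationFont):

from PyQt5.QtGui import QFontDatabase
from PyQt5.QtWidgets import QComboBox

def monospace_combo(bundled=('Liberation Mono', 'Cousine')):
    db = QFontDatabase()
    families = [f for f in db.families() if db.isFixedPitch(f)]
    for name in bundled:
        if name not in families:    # the two QFontDatabase won't admit to
            families.append(name)
    combo = QComboBox()
    combo.addItems(sorted(families))
    return combo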

I meant to finish the Preference dialog today, I really did. But the fiddling with fonts killed too much time. What's left is to implement the dialog widget itself, including the important "Set Defaults", "Apply", "Cancel" and "OK" buttons. A few hours more.

Friday, December 12, 2014

Power of missing parentheses

I've completed the Footnotes panel. Its testing was rather minimal, although I believe sufficient. The footnote "model" did receive a rather thorough fnotdata_test.py module that exercises every branch and error condition. With a solid model, the view/controller piece goes together very quickly. I set up a fnotview_runner.py that starts up the app, loads a document full of footnotes of various types, and sits waiting for interaction. I used this to exercise all the functions manually, and quickly ran into a vexing problem that took a couple of hours to track down. When I knew the cause, I was even more vexed. Look at the following code fragment. Hands up, all those who see the glaring, obvious, stupid error.

        self.model.beginResetModel()
        worktc = self.edit_view.get_cursor()
        worktc.beginEditBlock()
        try:
            for j in range(self.data.count()):
                # ...twenty lines of code computing a new key value
                # for footnote j...
                if new_key is not None :
                    self.data.set_key(j, new_key, worktc)
            # end of for j in range of keys
        except Exception as whatever:
            fnotview_logger.error(
                'Unexpected error renumbering footnotes: {}'.format(whatever.args)
                )
        worktc.endEditBlock
        self.model.endResetModel()

Here's what's going on. The user has clicked Renumber. The view/controller tells its QTableModel that data is changing. It obtains a QTextCursor on the document and starts an "edit macro" on it, so that all changes made using it will be a single Undo. For each existing footnote key it computes a new renumbered value, and calls the data model to set that as the new Key value in both the Anchor and Note of footnote j.

The model gets passed the working text cursor and uses it when it updates the Key value in the Anchor ([key]) and the Note ([Footnote key:...). At the end of the loop, even if there were errors, the edit macro is closed and the table model is told it can refresh itself.

When executed, this caused all sorts of flaky behavior in the editor. The new Key values would not appear unless I made the page scroll. Undo did not always undo the changes. Doing new changes compounded the problems.

I spent quite a bit of time over two days inserting print statements and tracing and... before I finally noticed that glaring error that you, dear reader, spotted five minutes ago. Fix that and suddenly everything worked "just swellegant" as my late father liked to say.

Well, live and learn.

Or not.

Looking ahead, again

Anyway, that's done, and also I fixed a serious, if obscure, bug in Version 1 and released new packages for it. A few posts back I put up the current to-do list. A month has passed; time to revisit it.

  • When Qt5.4 and the matching PyQt are available, install those on the new iMac and move development to there.
    • Qt5.4 is out, PyQt5.4 is expected any minute. So this should happen next week I hope.
  • Then, bring CoBro up to the Qt5.4 level and replace the execrable WebKit browser with the new WebEngine one.
  • Then, use Cobro as a test-bed for learning how to use pyqtdeploy to bundle an app. I am eager to find out if this is truly a way to make a self-contained executable on all 3 platforms, in place of pyinstaller.
  • Presuming that works (and that the new web engine fixes the frequent crashes induced by webkit), release CoBro on all three platforms.
  • Then, or right now to pass the time waiting for Qt5.4, code the footnotes module. That will go fast; most of the code can be lifted out of version 1...
    • It took a bit more work than that, but it's done.
  • Then, or right now if PyQt5.4 is delayed, implement the Preferences dialog.

At that point — which will in no way be reached in calendar 2014 — PPQT2 will be at what might be called an alpha state, that is, with adequate function that an experienced user could post-process a book with it. That user would have to run from source, however, until the pyqtdeploy work is complete.

The work to be done after that includes:

  • Writing the translation interface module, which includes figuring out how to dynamically load translator modules.
  • Writing the plain-ascii example translator
  • Writing the HTML example translator
  • Bringing the "keyboard palettes" of V1 forward to V2 and making them load dynamically (using the same scheme as the translators?)
  • Finally going back into the UI and making panels drag-out-able, applying the drag-drop research with which I began this series of posts many months ago.
  • Writing the Help file and adding the Help panel
  • Rewriting the "suggested workflow" document to reflect all the changes; for this I will want to actually post-process a book myself to make sure I know the best way to use the app.
  • Make some screencasts to explain PPQT and show its features.

Saturday, November 29, 2014

Early Morning Obsessions

So yesterday a user of PPQT V1 found a real bug, the first actual "this is a coding error that ought to be fixed" bug since I stopped development many months ago. It's something in the ASCII reflow code. Under a particular setting of parameters, reflowing a poem produces a stack trace,

Traceback (most recent call last):
  File "/Users/original/Desktop/scratch/build/ppqt/out00-PYZ.pyz/pqFlow", line 374, in reflowDocument
  File "/Users/original/Desktop/scratch/build/ppqt/out00-PYZ.pyz/pqFlow", line 591, in theRealReflow
  File "/Users/original/Desktop/scratch/build/ppqt/out00-PYZ.pyz/pqFlow", line 1350, in optimalWrap
IndexError: list index out of range

So it is something in the rather complex Knuth-Plass optimal rewrap logic. I haven't looked at the code yet to see what the problem really is, nor have I made up my mind what to do about fixing it. There appears to be a not-too-awkward work-around, so maybe I'll do nothing. I really do not want to have to rebuild the distribution bundles for V1. It would probably be doable. I just don't want to do it.

That's one of the thoughts I'm having here at 6am in Honolulu, where we are visiting for the Thanksgiving weekend: lying in bed obsessing about PPQT instead of sleeping for another hour.

But there's another and more serious thought, and that is about Qt and Edit Blocks, sometimes called edit macros. If you want to make a series of edit actions undoable with a single Undo, you create an edit block:

    work_tc = QTextCursor(my_edit_document)
    work_tc.beginEditBlock()
    # change the document in many ways via
    # work_tc, often in a loop
    work_tc.endEditBlock()

It works quite well, as long as the endEditBlock() call is executed. Well, why would it not be? It would not be, if somewhere between the Begin and the End your program raises an uncaught exception like "list index out of range".

That's what is happening above. Reflowing the text is done inside an edit block, and because of the exception, the block is never ended.

Normal Python programs, when they cause a stack trace like this, simply terminate. But not a Qt app! The top function in the stack trace, reflowDocument(), was called by the QApplication as a result of processing an event, in fact the event of clicking on the "Reflow Document" button. The QApplication doesn't care that this subroutine ended with an exception. It ended, that's all. The QApplication keeps running, processing other events, calling other methods of the various objects to respond to button clicks and menu choices and edit keystrokes.

I really don't know what happens to the open Undo block. I presume the variables created by reflowDocument() go out of scope. What happens when a QTextCursor with an open Undo Block goes out of scope and is garbage-collected? The user reports that the operation cannot be undone.

In fact the whole document is in an ambiguous state and probably should not be saved. Maybe it is alright but maybe there is some garbage in it, or some text is missing (because the reflow logic deleted it and had not put it back when the error occurred). But if the user calls Quit, there will be a prompt to save the modified document.

The very important lesson here is: An Undo Block should never be opened unless it can be guaranteed to close. The situation is just like modifying a file: the logic should always be

    work_tc = QTextCursor(my_edit_document)
    work_tc.beginEditBlock()
    try:
        # change the document in many ways via
        # work_tc, often in a loop
        ...
    finally:
        work_tc.endEditBlock()

Thus the Undo Block will always be closed no matter what goes wrong.

I did not do this at any of the several points where I have Undo Blocks in V1. And I realize (here in the dark, not sleeping on a Saturday morning in Hawaii) that I just wrote the first use of an Undo Block in V2, in the footnote code, and I did not do it there. So that's bad. I need to fix that.

But what shall I do about this V1 problem? I will probably have to try to fix it and make new distribution bundles. You cannot imagine my reluctance to do so.

Thursday, November 27, 2014

Thanksgiving note

I'm thankful for having the leisure to futz around with programming to my heart's content.

Just as a side note: although Qt 5.4 isn't out, the documentation for it is up at the shiny new Qt website, Qt.io, and looks very nice.

Oh, also, the footnotes panel is almost done. The data model is done and tested; the view/controller is completely coded and I'm confident a day of testing it will finish it. Tuesday next, hopefully.

When that's done, and if Py/Qt5.4 is still not out, I will do a long-neglected piece, the promised Preferences dialog, or at least a first draft of it.

Monday, November 17, 2014

Perhaps I Underestimated...

So the other day I happily wrote, "code the footnotes module. That will go fast; most of the code can be lifted out of version 1 and needs only a wash and brush-up to use the V2 APIs."

So, not quite. The version 1 module is 1200 lines of fairly complex code which I haven't looked at in a year. I want to break this up into a "model" module and a "view" module. So I have to figure out which bits to copy into the model and which into the view. That's pretty clear, thanks to my generally readable and structured coding style, but it still takes thought. So does defining an API between them, one that will be logically clean and maintainable, but also will not add needless overhead to slow operations down in a large book.

There will be big chunks of code that can be copy/pasted, but even those lines will need individual editing.

What does carry over is the general logical structure and methods. For example the code to find footnotes took some time to work out originally, but I can reuse the logic flow. It goes like this, approximately:

Scan all the lines in the document looking for footnote[*] "anchors"[B] like those.
    Save them as a list of QTextCursor objects that select just the Key strings ("*" and "B")
Scan all the lines in the document looking for "^\[Footnote (Key):..." 
    For each, find the end of the note, defined as the next line that ends with "]"
    Save them as a list of QTextCursor objects that select the line(s) of each Note
Merge the two lists matching the Key of each Anchor to the first matching Note after it
    Remove the cursors for matched notes from their lists and
        add the (anchor, note) pair of cursors to the database as an item
    Insert the remaining unmatched Anchors in the database as (anchor, None)
    Insert the remaining unmatched Notes in the database as (None, Note)

Everything the view displays to the user in a 6-column table can be derived from the two QTextCursors, and QTextDocument keeps the QTextCursors updated as the user edits.
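
As a toy sketch of the scan-and-merge over a plain string (the real code traffics in QTextCursors, and the key pattern here is a guess):

import re

ANCHOR_RE = re.compile(r'\[([A-Z0-9*]{1,2})\]')    # guessed Key pattern
NOTE_RE = re.compile(r'^\[Footnote ([^:]+):', re.MULTILINE)

def match_notes(text):
    anchors = [(m.group(1), m.start()) for m in ANCHOR_RE.finditer(text)]
    notes = [(m.group(1), m.start()) for m in NOTE_RE.finditer(text)]
    database = []
    for (key, a_pos) in anchors:
        # first Note with the same Key after this Anchor
        note = next((n for n in notes if n[0] == key and n[1] > a_pos), None)
        if note is not None:
            notes.remove(note)
            database.append((key, a_pos, note[1]))
        else:
            database.append((key, a_pos, None))    # unmatched Anchor
    database.extend((key, None, n_pos) for (key, n_pos) in notes)
    return database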

What takes time is that this code needs to be brought into a class definition, because each Book has its own Footnote database object. (Remember: allowing multiple books open changes everything.) So all of what are global variables in V1 become self.variables in V2. And I generally change camelCase names (other than Qt interface names) to under_score_names. So everything gets edited and moved.

So not really a "wash and brush-up" but more like a "strip it to the studs and put in all new wallboard and floor tile" remodel. 'twill take a few days. Satisfying work, though.

Friday, November 14, 2014

New Metadata Done; Looking Ahead

On branch new_meta, changed a number of modules to use the new JSON-based metadata system, and to store that metadata in a file suffixed .ppqt instead of .meta. Also added signals to the worddata and pagedata modules and slots to the corresponding view modules so that when the metadata is read in, the visible tables based on it update automatically. In the course of this I had to rewrite the test drivers for all the affected modules, and in that process made a number of improvements in how they were coded. Tested it all, and it seems to be working very well.

git checkout master; git merge --no-ff new_meta; git push origin master and done.

A couple of minor tweaks remain for things I noticed while going through the code; and I want to spend several hours tidying up the Tests folder and making sure that py.test runs things correctly. I would like to bring the Sikuli-based UI tests under the py.test umbrella but am not quite sure how to do that.

Here are the things to do after that.

  1. When Qt5.4 and the matching PyQt are available (which should have happened already but hasn't), install those on the new iMac and move development to there. There's no real "moving" involved other than my ass from one chair to another, as all the affected files are in Dropbox anyway.
  2. Then, bring CoBro up to the Qt5.4 level and replace the execrable WebKit browser with the new WebEngine one.
  3. Then, use Cobro as a test-bed for learning how to use pyqtdeploy to bundle an app. I am eager to find out if this is truly a way to make a self-contained executable on all 3 platforms, in place of pyinstaller.
  4. Presuming that works (and that the new web engine fixes the frequent crashes induced by webkit), release CoBro on all three platforms.
  5. Then, or right now to pass the time waiting for Qt5.4, code the footnotes module. That will go fast; most of the code can be lifted out of version 1 and needs only a wash and brush-up to use the V2 APIs.

At that point — which might be reached in calendar 2014, certainly in early 2015 — PPQT2 will be at what could well be called an alpha state, that is, with adequate function that an experienced user could post-process a book with it. That user would have to run from source, however, until the pyqtdeploy work is complete.

The work to be done after that includes:

  • Writing the translation interface module, which includes figuring out how to dynamically load translator modules.
  • Writing the plain-ascii example translator
  • Writing the HTML example translator
  • Bringing the "keyboard palettes" of V1 forward to V2 and making them load dynamically (using the same scheme as the translators?)
  • Finally going back into the UI and making panels drag-out-able, applying the drag-drop research with which I began this series of posts many months ago.
  • Writing the Help file and adding the Help panel
  • Rewriting the "suggested workflow" document to reflect all the changes; for this I will want to actually post-process a book myself to make sure I know the best way to use the app.
  • Make some screencasts to explain PPQT and show its features. The V1 screencast I made impressed a few people much more than any amount of words.

I would love to have this all done by mid-2015 but suspect it might drag on a bit longer.

Monday, November 3, 2014

JSON Metadata: sorted dicts and sordid ones

Continuing on git branch new_meta. Finding each module that calls the metadata manager and recoding it to save and load in the new JSON format. This usually results in a vast simplification. Previously, the "writer" method received a stream handle and was responsible for creating and writing formatted lines of text to encode its type of metadata; and the "reader" method got a stream handle and had to read the lines of formatted text and decode them. Under the new regime, the writer returns a single Python value (typically a list or dict), and the reader gets that single value as an argument. No more formatting data as lines and streaming them with << or >> operators. Just a blob of data out, a blob of data in.
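
The shape of a reader/writer pair is now about this simple (a sketch with made-up names, using the bookmark positions as the example):

# writer: hand the manager one JSON-able value; it does the serializing
def bookmarks_save(self):
    # JSON object keys must be strings, hence str(n)
    return {str(n): tc.position() for (n, tc) in self.bookmarks.items()}

# reader: receive that same value back, already decoded from JSON
def bookmarks_load(self, value):
    for (key, position) in value.items():
        self.set_bookmark(int(key), position)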

For each module there's a modname_test module that exercises it. These unit-test drivers used the metadata system heavily. They formatted metadata streams and pushed them in via the metadata manager, and then used the manager to suck the metadata back and check it. Or pushed in invalid metadata and checked the contents of the log for proper error messages. It was a handy way to exercise every branch.

Naturally, when the metadata readers and writers of a module change, so must the test code that prepares metadata and reads it back. So far there are about three times as many lines of code to alter in the test drivers as in the driven code. (Picture a frowny-face icon here.)

All went smoothly modifying and testing the four types of metadata handled by book.py (edit font size, edit cursor position, default dictionary tag, and user bookmark positions 1-9). Each of the reader/writer pairs became simpler, as expected.

Next up in alphabetic sequence is chardata.py. This is the module that maintains the census of characters in the document. Originally it did this using a sorteddict from the blist package, but recently I discovered the sortedcontainers package, which is as fast as blist and is pure Python.

Either way, the character census is in a SortedDict object with single unicode characters as keys and integer counts as values. So obviously, the metadata writer function could consist of just "return self.census", that is, return the dict of character counts. The reader would receive that dict as a single value. It had to be a bit more careful, because the user might have edited the metadata: the reader has to do basic sanity checks (are the keys single characters, are the counts greater than 0, and so on).

But this pretty scheme didn't work out well for the test driver. The test driver loaded the document with the contents of "ABBCCC" and then called the metadata manager to get the character census. Immediate error: "SortedDict cannot be serialized by JSON". Oh. Right. OK, change the writer to return dict(self.census). Convert the SortedDict to an ordinary dict. This worked in the sense that it could be serialized to JSON, but when the test driver pulled the metadata and compared it, it failed with:

expected: {"CHARCENSUS":{"A":1,"B":2,"C":3}}
received: {"CHARCENSUS":{"B":2,"C":3,"A":1}}

Oops. Obviously what's happening is that when json.dumps() serializes a dict, it writes the entries in the order returned by dict.items(), which is the order of the key hash table. That isn't predictable. Time to stop and think.

Ok, I can leave it this way, and write the test driver to basically do a set-wise comparison on two dicts, ensuring that the received dict has all, but only, the keys and values of the expected dict. Not fun. Also, if I leave it as-is, it pretty well screws the possibility of the user editing this part of the metadata file. How would you find the entry for "X" in a random-sequenced list of 150 or more characters? And think ahead to worddata, which has almost the same structure: if its 5000-10000 metadata values aren't in sorted order, what a sordid mess.

So better to change the metadata format to something that can be sequenced. I rewrote the metadata writer as:

    return [ [key,count] for (key, count) in self.census.items() ]

The items() method of a SortedDict returns them in sorted order by key. JSON serializes list items in the order given, so they land in the file in sequence. This required no more code in the reader, because the reader already had code to examine each (key, count) item for validity.
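
So the reader's sanity-check loop stays roughly like this (a sketch with hypothetical names):

import logging
chardata_logger = logging.getLogger('chardata')

def load_census(census, value):
    # value should arrive as a list of [char, count] pairs in key order
    for item in value:
        if (isinstance(item, list) and len(item) == 2
                and isinstance(item[0], str) and len(item[0]) == 1
                and isinstance(item[1], int) and item[1] > 0):
            census[item[0]] = item[1]
        else:
            chardata_logger.error('ignoring bad census item: {}'.format(item))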

Saturday, November 1, 2014

Parenthetically, the new iMac

I have done little coding this week, partly owing to taking more than a full day to complete the installation of a new computer.

This 27-inch "retina" iMac is sitting on a wall-mounted desk unit that has been the "office" section of the family bedroom since the 1980s. I got to wondering what other computers it has formerly supported. Here's the list as best as I can put it together:

  • An S-100 bus CP/M system with a home-assembled Heath/Zenith Z-19 terminal
  • A Zenith Z-89 CP/M system
  • A Mac SE/30
  • A Macintosh IIci with Radius Pivot monitor
  • A Power Macintosh (Blue and White)—can't remember what monitor that used
  • A Mac Pro with an Apple Cinema Display

I've owned other machines, such as a series of PCs while I was writing books about them, several different Mac portables, at one point even an Apple II with a Z80 CP/M card in it. But the ones listed above were the ones that sat on this office desk and got serious use for multiple years each.

The Mac Pro, bought within a month of its announcement in 2006, served the longest of any, more than eight years. It came with OS X 10.4 "Tiger" installed, shortly upgraded to 10.5 "Leopard", then to 10.6 "Snow Leopard".

Snow Leopard was a splendid OS, and I used it for nearly three years before I reluctantly "upgraded" (an upgrade it was not) to 10.7 "Lion". That was the end of the line, because this early-model machine had 32-bit EFI firmware and was cut off from the genuine upgrade to 10.8 "Mountain Lion".

I kept the machine for two more years as it fell farther and farther off the software state of the art, simply because Apple didn't offer an adequate replacement. The new "canister" Mac Pro didn't interest me because I was tired of piecing systems together out of components. I didn't want to have to figure out what kind of external disk to buy for it, and anyway the current Apple monitors were clearly lagging technologically. Sooner or later, I was sure, Apple would have to produce a nice, tidy all-in-one iMac with a big screen with "retina" pixel density. When they fiiiiiinally announced one, I jumped—just as I had leapt onto the Mac Pro in 2006 to replace the aging Blue and White. (I hope I don't find out in five years that I should have waited six months for a crucial upgrade, just as I would have been better off waiting six months for a Mac Pro with 64-bit EFI!)

"Installing" a new computer is more about housekeeping than computers. I pulled out everything from this corner of the room, disturbing dust-bunnies that had been accumulating since the Mac Pro was new. An armload of old, incompatible software CDs and manuals went in the trash. It took several hours to do the general housecleaning of the area and make it all fresh and neat.

So far, the iMac looks like a keeper. The "magic" mouse is slick: no cord, and I can scroll by gently caressing its back with my middle finger. I also got the bluetooth track-pad visible in the picture, and I alternate between that and the mouse. Each is comfortable. The display is excellent, but there's a tiny drawback to such a large one. The Mac OS menu bar is always in the upper left of the screen. When an app's main window is in the center or to the right, and I want to click on the File or Edit menu, it's a loooong way off to the side. I feel like a tennis spectator swiveling my head from left to right. (Finally, a reason to put the menu bar on the app window instead of the screen-top.)

The silver box at the lower left is a NewerTech "mini-stack" with a 2TB hard drive and a Blu-Ray burner. Before retiring the Mac Pro I used Carbon Copy Cloner to duplicate its two drives onto this drive, so every file and app I had accumulated before is still accessible (some of those files date to the 1980s...). Actually the Apple Migration Assistant has become really slick. I just had to give it the password to the household Time Capsule and it simply took over the Time Capsule backup of the Mac Pro and used it to get almost everything I wanted.

I spent some hours installing the latest Python 2 using "brew", to supersede the Apple one, and Python 3.4 from the Mac distribution at Python.org. I installed Wing IDE and spent a little time selecting larger fonts so I could read my code while sitting a couple of feet from the screen. I haven't installed PyQt yet; I want to wait for Qt 5.4 and the matching PyQt. In the meantime I will get back to developing on my laptop. But by the end of the month I expect I'll be doing most of my development work on the iMac, sitting up at the same desktop where I did PPQT version one and several books.

Saturday, October 25, 2014

Pleased with JSON diagnostics

Sneaking in development a half-hour at a time. The new metadata.py is coded. Changing it to read and write JSON instead of my own meta-format has greatly simplified the code here, and it will simplify the dozen or so modules that are clients of metadata.py when I get around to them. But now I'm rewriting its unit test. That involves feeding it bad stuff of various kinds and checking the output log messages. I'm pleased with how specific the diagnostics that come out of json.py are. Here's an example.

ERROR:metadata:Error decoding metadata:
    Unterminated string starting at: line 2 column 13 (char 13)
ERROR:metadata:Text reads: {"VERSION": "2}

The middle line is the text of the ValueError raised by json.py. The third line comes from my code, which knows the starting point within the string where JSON was looking, and shows from there up to 100 characters further. This should make debugging bad user edits quite easy.
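
For the record, here is roughly how that logging can be done. This is a minimal sketch, not the exact metadata.py code; the start index is whatever position the decoder was given:

import json
import logging
metadata_logger = logging.getLogger('metadata')

def decode_section(text, start=0):
    # sketch: decode one JSON object, logging a specific diagnostic
    # and the suspect text on failure
    try:
        return json.JSONDecoder().raw_decode(text, start)
    except ValueError as error:
        metadata_logger.error('Error decoding metadata:\n    {}'.format(error))
        metadata_logger.error('Text reads: {}'.format(text[start : start+100]))
        return None, start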

Monday, October 20, 2014

Fun with JSON

New post at PGDP forums

At a user's request I posted a discussion of PPQT and ppgen in the ppgen forum topic. It's the first time in a long time I've posted anything at PGDP.

JSON customization

In the last post I noted that the json.dump() could not deal with either byte data or set data. Long-time PPQT supporter Frank replied by email showing me how one could customize the default() method of the encoder to handle these cases, turning a set into a list and a bytes into a string. That automates the encoding process, but decoding back to bytes or set data, he said, had to be handled after the JSONDecoder had run.

Well, not quite. I think I have worked this out to make both encoding and decoding of these types automatic. I must say the standard library json module does not make this easy; the API is confusing and inconsistent, and the documentation, while accurate, is not exactly helpful. But here's what I have so far.

Custom Encoding

To customize encoding you define a class derived from json.JSONEncoder. In it you define just one method, default(obj). It receives a single Python object—could be number, string, dict, anything—and it returns an object that can be serialized by JSON. That can be the same object, or a different one. Or, if you don't want to handle it, call super().default(obj) which may or may not raise an error. So here's mine:

import json

class Extended_Encoder(json.JSONEncoder):
    def default(self, obj):
        # a bytes value is not serializable: encode as a hex string
        # under a marker key
        if isinstance(obj, bytes) :
            return { '<BYTES>' : "".join("{:02x}".format(c) for c in obj) }
        # a set is not serializable: encode as a list under a marker key
        if isinstance(obj, set) :
            return { '<SET>' : list(obj) }
        # anything else, let the base class deal with it (likely TypeError)
        return super().default(obj)

If obj is a bytes, return a dict with the key <BYTES> and a string value. If obj is a set, return a dict with the key <SET> and a list value.

You might think, if you are defining a custom class, that at some point you would create an instance of said class and use it. But nunh-unh. You just pass the class itself (not an instance) to the json.dumps() method:

tdict = {
    'version' : 2,
    'vocab' : [
        {'word' : 'foo', 'props' : set([1,3,5]) },
        {'word' : 'bar', 'props' : set([3,5,7]) } ],
    'hash' : b'\xde\xad\xbe\xef'
}
j_st = json.dumps(tdict, cls=Extended_Encoder)

What comes out, for the above test dict, is (with some newlines inserted)

{"vocab": [
  {"word": "foo", "props": {"<SET>": [1, 3, 5]}},
  {"word": "bar", "props": {"<SET>": [3, 5, 7]}}],
"version": 2,
"hash": {"<BYTES>": "deadbeef"}}

Custom Decoding

To customize JSON decoding, you don't make a custom class based on json.JSONDecoder. (Why would you want decoding to be consistent with encoding?) No, you write a function to act as an "object hook". You create a custom decoder object by calling json.JSONDecoder, passing the object_hook parameter:

def o_hook(d):
    # called with every decoded JSON object (always a dict)
    if 1 == len(d):
        [(key, value)] = d.items()
        if key == '<SET>' :
            d = set(value)
        if key == '<BYTES>' :
            d = bytes.fromhex(value)
    # return the original dict, or the converted set/bytes value
    return d

my_jdc = json.JSONDecoder(object_hook=o_hook)
decoded_python = my_jdc.decode(j_st)

You call the decode() or raw_decode() method of the custom decoder object. During decoding, it passes every object it decodes to the object-hook function. The object hook is always called with a dict. The dict results from some level of JSON decoding. Sometimes the dict has multiple items, when it represents a higher level of decoding. Sometimes it has just one item, a JSON key string and a Python value resulting from normal decoding, for example {'version': 2} from the earlier test data. Or d may be {'<SET>': [1, 3, 5]}.

The object hook does not have to return a dict. You can return any Python object and it will be used as if it were the result of decoding some JSON. So when the key is <SET> or <BYTES>, don't return a dict, just return the converted set or bytes value.

So, to review:

  • To customize JSON encoding, you make a custom class with an overriding default() method. Then you call json.dumps(), passing your class with the cls= parameter.
  • To customize JSON decoding, you define a function and create a custom decoder object by calling json.JSONDecoder(), passing it your function as the object_hook parameter; then you call the .decode() method of the custom object.

Yeah, that's clear.

Bullet-proofing Decode

The raw_decode() method takes a string and a starting index. It decodes one JSON object through its closing "}". It returns the decoded Python object and the string index of the character after the decoded object.

I believe I am going to use this to make the PPQT metadata file more error-resistant. My concern is that the user is allowed, even encouraged, to inspect and maybe edit the metadata. But if the user makes one little mistake (so easy to insert or delete a comma or "]" or "}" and so hard to see where) it makes that JSON object unreadable. If all the metadata is enclosed in one big object, a dict with one key for each section, then one little error means no metadata for the book at all. Not good.

So instead I will make each section its own top-level JSON object.

{"VERSION":2}
{"DOCHASH": {"<BYTES>":"deadbeef..."} }
{"VOCABULARY: {
   "able": {stuff},
   "baker": {stuff}...}
}

and so forth. Then if Joe User messes up the character census section, at least the pages and vocabulary and good-words and the other sections will still be readable. This might cause problems for somebody who wants to read or write the metadata in a program. But I think it is worthwhile to fool-proof the file.
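
Here is a sketch of the reading loop I have in mind. It is not final code, and the resynchronize-after-error strategy is just one plausible choice:

import json, re

def read_sections(text):
    # decode successive top-level JSON objects from one string;
    # if one section is damaged, skip ahead and try the next
    decoder = json.JSONDecoder()
    non_space = re.compile(r'\S')
    sections = []
    pos = 0
    while True:
        match = non_space.search(text, pos)
        if match is None:
            break # nothing but whitespace remains
        pos = match.start()
        try:
            obj, pos = decoder.raw_decode(text, pos)
            sections.append(obj)
        except ValueError:
            # damaged section: resync at the next line that opens an object
            pos = text.find('\n{', pos)
            if pos < 0:
                break
            pos += 1
    return sections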

Thursday, October 16, 2014

What can json.dump?

A few quick 'speriments to make sure that json.dumps() can handle all sorts of metadata. Two restrictions show up.

One, it rejects a bytes value with builtins.TypeError: b'\x00\x01\x03\xff' is not JSON serializable. This is an issue because one piece of metadata is an SHA hash signature of the document file. This lets me make sure that the metadata file is of the same generation as the document file. (If the user messed up a restore from backup, for example, restoring only the document but keeping a later metadata, all sorts of obscure failures would follow.) The output of QCryptographicHash(QCryptographicHash.Sha1) is a bytes value.

Two, it rejects a Python set value with the same error. The worddata module wants to store, for each vocabulary word, a set of a few integers encoding the word's properties, e.g. uppercase or mixedcase, contains an apostrophe, contains a hyphen, etc.

The solution in both cases is to ask json.dumps() to serialize, not the value, but the __repr__() of the value. On input, the inverse of __repr__() is to feed the string into ast.literal_eval(), and check the type of what comes out.
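
In miniature, the round trip looks like this (a sketch, using a set as the example):

import ast, json

props = {1, 3, 5}
j_st = json.dumps( { 'props' : repr(props) } ) # '{"props": "{1, 3, 5}"}'
value = ast.literal_eval( json.loads(j_st)['props'] )
assert isinstance(value, set) and value == props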

I put quite a bit of care into coding the input of the metadata, because I want to tell the user to feel free to hand-edit the metadata. If the user can edit the file, the user can screw it up. There's no point in telling the user not to edit the file, because she will anyway. Better to document it, and then be very leery of accepting any value it contains. Part of that is using literal_eval(), which checks the syntax of a presumed Python value and will not pass executable code (hence no code injection).

The old metadata format was quite simple. The JSON one, even if I tell it to indent prettily, will be less easy for a user to fiddle with.

Hmmm. Also in the old format, as long as the user didn't mess up one of the section boundaries, he lost at most one section's data. In fact, the error detections I coded into the current code reject only single lines (with a log message). But if the user edits a JSON file and mucks up a syntactic delimiter... Must think about how to contain JSON errors to single sections, and not allow a single deleted "}" or "," to cause the whole file to be unreadable.

Temporary distraction

When this arrives it will be a bit of a distraction while I move everything over from my oh-so-tired Mac Pro.

Today and tomorrow I hope to finish up the page table display. Next week the big JSON switch for metadata. Which, hopefully, I will be working on using a 27-inch, 5K-wide retina display...heh heh heh (rubs hands in gleeful anticipation)

Sunday, October 5, 2014

Thinking About JSON

So PPQT stores quite a bit of metadata about a book, saving it in a file bookname.meta. When I created this in version 1, I followed an example in Mark Summerfield's book and devised my own custom file format. (I don't have the book near me on the vacation trip, but skimming the contents online I note that same chapter has topics on reading XML. Probably I looked at that and said "Oh hell no," and quickly cobbled up a simpler "fenced" syntax for my files.).

Everything about reading and writing that format was in one place in V1. Having all knowledge of a special file format in one place sounds good in principle, but in practice it was a bad idea. It mandated a very long and complex routine that had to know how to pull out and format 8 or 10 different kinds of data on save, and parse and store those same several kinds of data on load. In hindsight it was more important to isolate knowledge of the particular types of data in the modules that deal with that data. So a key goal for the V2 rewrite was to distribute the job of reading and writing each type of metadata among the modules that manage those data.

Almost the first code I wrote for V2 was the metadata manager module. This handles reading or writing the top level of my metadata format. The various objects that comprise one open book "register" with the metadata manager, specifying the name of the section they handle, and giving it references to their reader and writer functions.

The metadata code recognizes the start of a fenced section on input and calls the registered reader for that section, passing the QTextStream for the metadata file. On save, it runs through its dictionary of registered section names and calls each writer in turn.

In this way, knowledge of the file organization is in the metadata manager, but all knowledge of how to format, or parse, or fetch or store the saved data is in the module that deals with that data. For example the worddata module stores the census of words in the book. It knows that each line of its metadata section is a word (with optional dictionary tag for words in a lang=tag span) and its count and some property flags. And it knows how it wants to store those words on input. None of that knowledge leaks out to become a dependency in the metadata file code.

As you can maybe tell, I was quite pleased with this scheme. There are a total of sixteen metadata sections in the V2 code, each managed by its own reader/writer pair, and all happily working right now. Some are one-liners, like {{EDITSIZE 13}} to remember the edit font size. Others like the vocabulary are hundreds of lines of formatted records. But all code complete and tested.

So obviously it must be time to rip it all up and re-do it!

A note from my longest and most communicative user asked, oh by the way, it would be really helpful if you could change your metadata format to something standard, like YAML or JSON.

Right after saying "Oh hell no," I had to stop and think and realize that really, this makes a lot of sense. Although I knew almost nothing about JSON, it is such a very simple syntax that I felt up to speed on it in about 20 minutes.

But how to generate it? Would I have to pip-install yet another module? Well, no. There's a standard library module for it. And it looks pretty simple to use. You basically feed a Python dict into json.dumps() and out comes JSON syntax as a string to write to the QTextStream. On input, readAll() the stream and drop it into json.loads() and out comes a Python dict.

This leads to a major revision of the V2 metadata scheme, but the basic structure remains.

The metadata file will consist of a single JSON object (i.e., a Python dict) whose keys are the names of the 16 sections. The value of each section name is a single JSON/Python value. For simple things like the edit point size, or the default dictionary tag, the value is a single number or string. For complicated things, the value is a dict (JSON object) or a list (JSON array).

As before, modules will register their reader/writer functions for their sections. But they will not be passed a text stream to read or write. Instead, each reader will receive the single Python value that represents its section. Each writer, instead of putting text into a stream, is expected to return the single Python value that encodes its section's data.

On input, the metadata manager will use json.loads() to get one big dict. It will run through the keys of the dict, find that key among its registered reader functions, and pass the key's value to the reader.

On output, the manager will run through the registered writers and form a dict with section-name keys and the values returned by their writers. Then just shove the json.dumps() of that dict into the text stream.
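
In skeleton form the manager might look like this. It's a sketch of the scheme just described, not actual metadata.py code, and the method names are invented for illustration:

import json

class MetaMgr(object):
    def __init__(self):
        self.readers = {} # section name -> reader function
        self.writers = {} # section name -> writer function

    def register(self, section, reader, writer):
        self.readers[section] = reader
        self.writers[section] = writer

    def load_meta(self, stream): # stream is a QTextStream
        meta = json.loads(stream.readAll())
        for section, value in meta.items():
            if section in self.readers:
                self.readers[section](value) # reader gets the Python value
            # else: log a warning about an unknown section

    def write_meta(self, stream):
        meta = { section : writer() for section, writer in self.writers.items() }
        stream << json.dumps(meta, indent=2)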

It all shapes up nicely in my imagination. Just a whole bunch of coding to implement it. And not just the metadata module and all the reader/writer functions, oh hell no, there are several unit test drivers that made liberal use of metadata calls to push test data into, and check the output of, their target modules. All those will have to be recoded to use JSON test data.

There are other implications of this change to be explored in later posts.

Thursday, October 2, 2014

Well, that was easy...

Wrote loupeview.py from scratch in three short sessions over three days while hanging out on the farm and being sociable.

First I had to install bookloupe. This was nontrivial owing to the readme not being as specific on Mac OS as it might be. Wrote a detailed description for the pgdp forum; maybe it will get picked up and integrated someday. For the moment, bookloupe development appears to be stalled.

I looked over the source and concluded it would need major surgery to turn it into a Python lib via either Cython, SWIG or manual coding. So instead, invoke it via subprocess, and apply it to a temporary file.

The temporary file part, which I had been uncertain about, turned out to be amazingly easy. Qt has already thought of it, and the QTemporaryFile class gives you a temporary file in a platform-independent way. You create the object and open() it. That actually creates the file in some suitable place. From that point, the QTemporaryFile object is an actual QFile, with all its methods. You write to the file and close() it. Then you can use QFileInfo to get the full path to the QFile, and that's your argument to some command line program. When eventually the Q(Temporary)File object is garbage-collected, the actual file is automatically deleted.
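
The gist, as a sketch (document_text stands in for the real document contents):

from PyQt5.QtCore import QTemporaryFile, QFileInfo

document_text = 'the book text...' # stand-in for the real document
tempfile = QTemporaryFile()
tempfile.open() # this creates the actual file somewhere suitable
tempfile.write( document_text.encode('utf-8') )
tempfile.close() # the file persists until the object is garbage-collected
temp_path = QFileInfo(tempfile).absoluteFilePath() # path to hand to the CLI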

I wanted to use subprocess.check_output(). The leading, positional argument to that is a list containing the elements of an executable command. The leading element of the list is the name of the command, or in this case the fully qualified pathname of the bookloupe executable. The user will have to provide that eventually through the still to be written Preferences dialog. But the paths module has a default that works for Mac OS and probably for Linux, so that works for now.

Last in the list comes the full path to the temporary file. Between the head and the tail come the parameters of the command. For bookloupe I wanted to pass a batch of dash-letter parameters. So this was my initial shot at coding this.

        command = [bl_path,'-e -s -l -m -v', fbts.fullpath()]
        # run it, capturing the output as a byte stream
        try:
            bytesout = subprocess.check_output( command, stderr=subprocess.STDOUT )
        except subprocess.CalledProcessError as CPE :
            # display a message with text from stderr
            return # leaving message_tuples empty

This provided a nice test of the code in the except clause, displaying "unknown option -e -s -l -m -v". Oh. Pretty clearly the underlying subprocess.Popen passes each list element as a single argument, so my whole option string arrived as one unrecognizable "option" instead of five separate ones. Oh sigh. With

command = [bl_path,'-e','-s','-l','-m','-v', fbts.fullpath()]

it ran perfectly. Here's a screenshot.

The user has double-clicked on the message line headed "461". The double-click signal slot tells the editor to jump to line 461, column 48, where indeed there is a "spaced quote".

Bookloupe, like the gutcheck program it was forked from, produces a great quantity of nit-picky diagnostic messages. The way Guiguts dealt with this was a special dialog with one checkbox for each possible message type—40 or more of them. It would display in its report only the messages of the types you'd checked. In practice, you hit the "clear all" button and then checked one box at a time, filtering the report to one message type at a time. I think my way of handling it is simpler and just as effective. You can sort the table on either the message text column or the line number column, ascending or descending. So if you prefer, you can deal with the diagnostics in sequence from the last line up (preserving the lineation if you make edits). Or you can sort by message text and deal with one group of messages at a time, just as with Guiguts but with a simpler UI.

Monday, September 22, 2014

New Timeline guessing and Semi-Hiatus

Back in August I made some guesses about what parts I would do in what sequence, and how long those would take. I think I'll review my timeline and see how it's going. Here's what I said I'd do:

  • ✓ Code review of several modules
  • ✓ Revise editview to get rid of Qt Creator glop code
  • ✓ Find panel
  • ✓ Chardata and charview panel
  • ✓ Wordview, including good-words drag'n'drop UI

So while that all took about 3 weeks longer than I planned, it is done. Yay me! Remaining, per the August plan, are

  • pageview, the Page table panel
  • fnotedata and fnoteview, the Footnotes panel
  • loupeview, integration with bookloupe
  • translate, code to parse a DP-formatted book and use it to drive the extremely clever API that I have devised for writing format-translation modules.

I believe I will rearrange the above sequence and do loupeview next. Footnote and pagination are fairly sophisticated proofing tools, while applying bookloupe or some such nit-picking tool is needed when post-processing even simple books. With Find, Word, Char and Loupe you have a toolkit adequate for many simple books.

I remarked in August that regarding bookloupe "there are many unknowns about how to integrate this hunk of somebody else's C code into Python and to display its output in a useful way." That's still very true, however there are at least two tools that might make integration of "somebody else's C" easier. One is SWIG and the other is Cython. There is also the easy way, which is really not easy at all: save the current file to a temp, and use subprocess to run the bookloupe as a command-line utility and capture its stdout stream. This is fraught with host dependency issues like, where to write a temp file?

Anyway, all of that will be a bit delayed as my spouse and I are about to head out on a 2-week run in the RV. I may get some coding, and even some blogging, done on this trip, but it will be spotty.

Another interfering factor: in late October (barely a month now), Qt 5.4 will be released, and it contains the new QWebEngine browser. I am looking forward to this because of my other project, the web-comic browser Cobro. I read web comics every day in it, and about 1 day in 3, it crashes somewhere in QWebKit. I don't mind; I just restart it and carry on. But I can't make it available to anyone else when it is so unstable. I have high hopes that by upgrading it to Qt 5.4 (and presumably, PyQt5.4 and an upgraded SIP by then), and dumping QWebKit for QWebEngine, I will get a faster and more stable program.

If so, then CoBro will also be the perfect test bed for me to learn how to use pyqtdeploy, and with it make stand-alone CoBro apps for three platforms. Those two challenges are likely to eat up at least a week, probably more.

But with that done, especially if I can master pyqtdeploy, I will be in position to release a stand-alone, multi-platform alpha of ppqt2 sometime quite early in 2015.

Saturday, September 20, 2014

How often does QTableView update cells?

The answer to the title question is, more often than you thought, probably. Here's how I found out.

So a PGDP thing is the "good-words" file. During the multi-stage proofing process, the proofers can nominate words that fail the online spellcheck for the good-words list, meaning they are correct. Proper nouns, technical terms, archaisms, etc. It's a familiar concept in spell-checking, sometimes called a local dictionary.

A book downloaded by a post-processor usually has a good-words.txt file. When PPQT opens a book for the first time (there's no metadata file to be found), it looks for the good-words file, reads it, and makes sure that every word in it gets a free pass around the spellchecker. The good-words list gets written into the metadata file for later.

A feature of the words (vocabulary) panel is that you can select a word or words and have them added to the good-words list. In V1, this was done via a context menu command. You selected a word or words and right-clicked, and selected "Add to good words" from the popup menu. Then you had to confirm you wanted to do this to an OK/Cancel dialog, because there was no way to undo this step.

It was somewhat tedious. Clearing out all the misspellings is a major task during post-processing. There are often a hundred or more, very few of them actual misspellings. Verifying that they are not, and adding them to good-words, can take a while. So for V2 I was determined there would be a simpler drag-and-drop interface for this. I am showing the actual good-words list as a one-column table alongside the vocabulary table, and the user can just drag a word (or words, complex selections are allowed) into good-words.

When this is done, the "X" in the vocabulary window should immediately disappear, indicating removal of the misspelled status. Here's a short video of how it looks currently.

The play-by-play is: the word LOWENHEIM is marked as misspelled (the X in the Features column). The user drags it to the Good Words list. Immediately the X should go away, but it doesn't. Why not? Because I have not added any code to tell the Words Table View it needs to update that word. I meant to, but hadn't got around to it. However, the instant the words table is scrolled—you can tell there's a scroll happening because OS X displays the scrollbar temporarily—the X does disappear. Why?

Then, the user clicks on LOWENHEIM in the good words list and hits the Delete key. It is deleted from the list. Immediately the X should reappear in the Words table, but again, I haven't yet added code to signal a change from the list widget to the table widget. But again, upon the tiniest scroll movement, the X does reappear. Is it magic?

I conclude that whenever there is even a tiny bit of scrolling, the QTableView calls the data() method of the table model for fresh data. It wouldn't know to do that for the LOWENHEIM row in particular, so it must be doing it for every visible cell! The call to data() fetches the latest info about the word, including its new status as correctly, or incorrectly, spelled.

I had not anticipated this. I had supposed that the X's would not change until the user clicked the Refresh button, causing a table model reset. This would not be a nice UI, so I had expected I would need to add a user-defined signal, "wordChanged" or such, to the Good Words list view widget, and catch that signal in the Words View widget. There it would have to look up the word, get its physical index, get the sort/filter proxy to translate that to the sorted row index, and then it could issue the dataChanged signal to get the table View to call the table Model for fresh data.
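
For reference, that machinery might look something like the following sketch, where row_of() is a hypothetical lookup from word to source-model row:

from PyQt5.QtCore import QModelIndex

def word_changed(words_model, word):
    # hypothetical: map the word to its row in the (unsorted) source model
    row = words_model.row_of(word)
    first = words_model.index(row, 0)
    last = words_model.index(row, words_model.columnCount(QModelIndex()) - 1)
    # the sort/filter proxy forwards this to the view, which then
    # calls data() again for the affected cells
    words_model.dataChanged.emit(first, last)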

Now, I dunno. Do I need to add that machinery? If the user can get the current status just by scrolling even a tiny bit (or by hiding the app and revealing it; that does it also), should I bother?

Hunspell dicts and encodings

I've almost completed testing the word view panel, which displays the vocabulary of words (and word-like tokens) in the document, with their counts and some "properties" such as, are they spelled correctly? The table can be filtered various ways, and in particular the user can opt to show only the misspelled words.

So my test document had some French phrases, which I'd marked with <span lang='fr_FR'>, to request spell-check using the fr_FR dictionary. And this was working beautifully; words from phrases like je suis jeune fille showed up in the vocabulary list as properly spelled.

Except for words with accents: était, majesté and so on were shown as misspelled. Why?

Well, the whole thing reeks of character encoding issues, dunnit? Somewhere in the interface between a call in Python 3 and the C++ wrapper around Hunspell, there has to be an encoding step to get from Python's however-many-bit Unicode (16? 32? variable?) character string, and a C++ char *.

I experimented with encoding the word that I passed, but that only caused more problems. The hunspell call wanted a string, and word.encode(encoding='ISO-8859-1',errors='replace') produces a bytes object, so an immediate TypeError happened.

Then I looked at the hunspell wrapper code, and it uses PyArg_ParseTuple() to receive the word-string from Python. And per its doc (at the link if you care) it says "Unicode objects are converted to C strings using 'utf-8' encoding..."

So my Unicode word était is being properly passed into Hunspell as a UTF-8 string, without effort on my part. Hmmm.

Oh.

I remembered (from the month or so I spent buried in spellcheck technology in 2012, struggling to get spellcheck working in version 1) that the .aff file of a dictionary includes an encoding declaration, specifying the encoding of the matching .dic file. I checked, and the fr_FR.aff I had picked up (sometime or other, from OpenOffice.org, I think) had this as its opening line: SET ISO8859-15.

Now, if I was writing a spellchecker these days, I imagine I would use that to decode the file but store the decoded words in full Unicode or UTF-8. But just maybe Hunspell wasn't that smart. So I opened the two files in BBEdit (which has a convenient UI for changing the file encoding), changed that line to SET UTF-8 and saved both files in UTF-8.
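
(BBEdit made that easy, but it is the sort of one-off job a few lines of Python could do as well. A sketch, assuming the ISO8859-15 source encoding:)

import re

for name in ('fr_FR.aff', 'fr_FR.dic'):
    with open(name, encoding='iso-8859-15') as f:
        text = f.read()
    if name.endswith('.aff'):
        # change the declared encoding to match the new file encoding
        text = re.sub(r'^SET .*$', 'SET UTF-8', text, count=1, flags=re.M)
    with open(name, 'w', encoding='utf-8') as f:
        f.write(text)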

Problem gone; now all French words from the test doc checked as correct, even those with accents.

So Hunspell was storing the dictionary words as ISO8859-15 strings, then comparing them to UTF-8 strings, and not surprisingly, getting mismatches. Making the dictionary file encoding match the Python wrapper interface fixed the problem.

Not quite! I can distribute some dictionaries with the program (which I also did with V1) but the user can get more or other dicts from anywhere. As long as they are Myspell/Hunspell compatible, they should work. Except, if they are not encoded UTF-8, they won't. I foresee problems here.

Saturday, September 13, 2014

Qt Unclear on the Concept

Qt makes a big deal of using the Model-View architecture for its tables. One creates a table model by customizing QAbstractTableModel. Then one makes the table visible by creating a QTableView and linking it to the customized model.

All well and good, but unfortunately their design leaks view considerations into the model, as I only realized while finishing the character panel. It occurred to me that I had implemented a little character database in chardata.py (as described in the preceding post); so why had I based that class on QObject instead of QAbstractTableModel? I actually started to change this, and then stopped.

You customize QAbstractTableModel by adding overriding definitions of these methods:

  • rowCount() to return the number of unique rows.
  • columnCount() to return the number of columns.
  • data(index, role) to return both data and metadata for one cell.
  • headerData(index, role) to return both data and metadata for one header cell.

There's no debate about rowCount(); it is certainly the job of the data model to know how many primary keys there are to show. The trouble starts with columnCount(). The number of columns is a matter of how the data are to be presented to the user. As it says in Wikipedia, "the model captures the application's behavior in terms of its problem domain, independent of the user interface." The data model can know how many items are in the tuple related to one key, but how many of those are to be shown in this table, and in what sequence? That's the view's domain.

Things get worse with the data(index, role) method. There's no issue when the "role" passed is Qt.DisplayRole; then the return is one datum from the row. (Although one might quibble that the same datum could be displayed different ways; and this design forces the model to decide how to format each datum.) The issue is that the "role" code passed to data() can also be Qt.ToolTipRole, Qt.StatusTipRole, Qt.TextAlignmentRole, Qt.ForegroundRole and several other "roles" all related strictly to the display of the data.

It is (in my humble opinion) no business of the data model to know whether a given datum should be shown in red or black, left- or right-aligned, or what its tooltip should say.

The real breakdown of MVC is in headerData(index, role). The name at the top of a table column has nothing to do with the data model. I am especially sensitive to this because I am trying to make sure that all user-visible strings pass through QCoreApplication.translate(), so there is some hope of a properly-localized UI. Column header titles like "Symbol", "Count", and "Value" need to be translated. Same for tooltip and statustip strings! But (again, in my so-humble opinion), nothing the data model knows about should ever need translation. Translation should only ever be needed by the user-facing View component.
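
Here is what that mingling looks like in a sketch, loosely patterned on the character table but not the actual charview code:

from PyQt5.QtCore import Qt, QAbstractTableModel, QCoreApplication

class CharModel(QAbstractTableModel):
    # NOT the real charview model: a sketch of how presentation
    # decisions end up inside the "data" layer.
    def __init__(self, census, parent=None):
        super().__init__(parent)
        self.rows = sorted(census.items()) # [(char, count), ...]
    def rowCount(self, parent):
        return len(self.rows)
    def columnCount(self, parent):
        return 2 # how many columns to show: a view decision, made here
    def data(self, index, role):
        char, count = self.rows[index.row()]
        if role == Qt.DisplayRole:
            return char if index.column() == 0 else count
        if role == Qt.TextAlignmentRole: # pure presentation
            return Qt.AlignRight if index.column() == 1 else Qt.AlignLeft
        if role == Qt.ToolTipRole: # user-facing string, needs translation
            return QCoreApplication.translate('charview', 'Character and its count')
        return None
    def headerData(self, col, orientation, role):
        if orientation == Qt.Horizontal and role == Qt.DisplayRole:
            return QCoreApplication.translate('charview', ('Symbol','Count')[col])
        return None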

Tl;Dr: The Qt Model-View architecture forces the table model to perform many view-related things: deciding how to display each datum, providing column header and tooltip texts, and knowing presentation attributes such as color and alignment. That's just wrong.

Not that anything can or should be done about it at this point. I implemented the table model in the charview module.

Thursday, September 11, 2014

Little performance pick-up

The character panel (that I start work on tomorrow) will feature a button named "Refresh" meaning, bring the census of characters in the book up to date. This is implemented in the chardata module I worked on today. Initially I coded refresh() in the simplest way:

        editm = self.my_book.get_edit_model()
        c = self.census # save a few lookups
        self.k_view = None
        self.v_view = None
        c.clear()
        for line in editm.all_lines() :
            for char in line :
                c[char] = c.setdefault(char,0) + 1
        # Recreate the views used for fast access
        self.k_view = c.keys()
        self.v_view = c.values()

Get rid of the key- and value-views just in case sorteddict wants to try to update them as keys are added. Clear the sorteddict. Brute-force count all the characters. (editm.all_lines() is an iterator returning the lines of text in the document in order from first to last, as Python strings.) Recreate the views.

When the document managed by the edit model is about 25K characters, calling timeit on this method for four iterations took 0.75 seconds.
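
(For the curious, the timing harness is just the standard timeit module; char_data here is a stand-in for the module's data object:)

import timeit
# time four calls of the refresh method shown above
elapsed = timeit.timeit( char_data.refresh, number=4 )
print( '{:0.2f} seconds'.format(elapsed) )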

When the user opens a book for the first time, there is no metadata, and the character census sits empty until the user clicks Refresh. Then the above logic runs, loading the sorteddict. On a save, the list of characters and counts is written to the meta file, and reloaded when the book is opened again. The user clicks Refresh only after editing, to get an updated list of characters. Thus, almost every time Refresh is clicked, a dictionary exists that is almost complete. Possibly the user has added or eliminated a few characters (converted some non-Latin-1 characters to entity notation, for example); and the counts will be different. But the dictionary exists.

So it occurred to me to wonder whether this might not benefit from a trick I used in the word data Refresh method. If the dictionary exists, i.e. this is not the first time the document has been opened and a character census has previously been taken, don't throw the dictionary away. Go through it and zero all the counts; then take the census; then go through and delete any entries with a zero count. Applying this results in the much more complex method here:

        editm = self.my_book.get_edit_model()
        c = self.census # save a few lookups
        if len(c) : # something in the dict now
            # zero every count, keeping the dict and its views intact
            for char in self.k_view :
                c[char] = 0
            for line in editm.all_lines() :
                for char in line :
                    c[char] = c.setdefault(char,0) + 1
            # delete any entries whose count remained zero
            mtc = [char for char in self.k_view if c[char] == 0 ]
            for char in mtc :
                del c[char]
        else : # empty dict; k_view and v_view are None
            for line in editm.all_lines() :
                for char in line :
                    c[char] = c.setdefault(char,0) + 1
            # Create the views used for fast access
            self.k_view = c.keys()
            self.v_view = c.values()

Four iterations on the 25K book: 0.21 seconds. Keeping the dictionary and its views intact rather than recreating them saved considerable time. The Refresh operation should take only a barely perceptible delay even in a large book.