PPQT presents:
- A table of the words in the document with their properties, such as uppercase, numeric, misspelled.
- A table of the characters in the document with their counts and Unicode categories.
- A table of the book pages, derived from the original PGDP page-separator lines.
Each of these tables is derived from a "census" in which every line, word-token, and character in the document is counted. In v.1 this census is done the first time a book is opened, and again whenever the user asks to "refresh" the display of word or character counts. It's very time-consuming: 5 to 20 seconds for a large book. Getting that time down for v.2 would be a good thing. So would avoiding a big delay the first time a new book is opened.
In v.1 the census is done in one rather massive block of code that fetches each line from the QTextDocument in turn as a QString and parses each, counting the characters and using a massive regex to pick out wordlike tokens. This process is complicated by:
- The need to handle the PG codes for non-Latin-1 chars such as [oe].
- The need to recognize HTML-like productions: some like <i> and <sc> are common from the start, and later in the book-production process there might be thousands of HTML codes; we count them for characters but not for "words".
- But also the need to spot the lang=code property embedded in HTML codes, to signal use of an alternate spelling dictionary.
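To make those three complications concrete, here is a minimal sketch of a tokenizer that copes with them. It is not the v.1 regex; the pattern and the names `TOKEN_RE`, `LANG_RE` and `scan_line` are made up for illustration, and it works on plain Python strings rather than QStrings.

```python
import re

# Hypothetical pattern, NOT the actual v.1 regex. It matches either an
# HTML-like tag, or a wordlike token made of letters, digits, apostrophes
# and two-letter PG bracket codes such as [oe].
TOKEN_RE = re.compile(
    r"(?P<tag></?\w+[^>]*>)"
    r"|(?P<word>(?:\[\w{2}\]|\w)(?:\[\w{2}\]|[\w'])*)"
)
# Picks the code out of lang=xx, lang="xx" or lang='xx' inside a tag.
LANG_RE = re.compile(r"""lang=['"]?([\w-]+)""")

def scan_line(line):
    """Return (words, lang_codes) found in one line of text."""
    words, langs = [], []
    for m in TOKEN_RE.finditer(line):
        tag = m.group("tag")
        if tag is not None:
            # Tags are never counted as words, but may carry a lang= code.
            lang = LANG_RE.search(tag)
            if lang:
                langs.append(lang.group(1))
        else:
            words.append(m.group("word"))
    return words, langs
```

With this sketch, `scan_line('<i lang="fr">c[oe]ur</i>')` yields the single word `c[oe]ur` and the language code `fr`; the tags themselves never show up as words.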
For v.2 I want to break up the management of all these metadata along MVC lines, with a "data" module and a "view" module for each type, so worddata.py manages the list of words while wordview.py contains the code to present that data using a QTableView and assorted buttons. Similarly for chardata/charview and pagedata/pageview. But will this complicate the census process? Will it slow it down?
Complicate it? Not exactly; more like "distribute" it. I will move each type of census to its data model: worddata will take a word census, chardata a char census, pagedata a page census. So a full census could potentially entail three passes over the document.
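As a rough illustration of how small one of these separated censuses can be, here is a sketch of a char census on its own. The function name `char_census` is hypothetical, and it assumes the document lines arrive as ordinary Python strings rather than being fetched from the QTextDocument.

```python
from collections import Counter
import unicodedata

def char_census(lines):
    """Map each character to (count, Unicode category) over all lines."""
    counts = Counter()
    for line in lines:
        # Counter.update on a string counts each character in it.
        counts.update(line)
    return {ch: (n, unicodedata.category(ch)) for ch, n in counts.items()}
```

Note there is no regex and no word logic anywhere in it, which is the point of the split.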
However, when this separation is done, it becomes clear that the only census that really needs to be done the first time a book is opened is the page census. That's because the module that displays the matching page scan image as the user moves through the text needs to know the position of each page's start. In other words, pagedata is the data model for both the page table and the image-display panel. Images need to be displayed immediately, so the page data needs to be censused the first time a book is opened.
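A page census can be sketched in a few lines. This assumes the separator lines look like `-----File: 001.png---\proofer\proofer-------`, and that positions are plain character offsets with one character per line ending; the names `PAGE_SEP_RE` and `page_census` are invented for the example, and the real pagedata would presumably work with positions in the QTextDocument.

```python
import re

# Assumed separator shape: dashes, "File: ", an image filename, more dashes.
PAGE_SEP_RE = re.compile(r"^-+File: (\w+\.(?:png|jpg))-")

def page_census(lines):
    """Return a list of (image_filename, offset_of_separator_line)."""
    pages = []
    offset = 0
    for line in lines:
        m = PAGE_SEP_RE.match(line)
        if m:
            pages.append((m.group(1), offset))
        # Advance past this line plus its (assumed single) newline.
        offset += len(line) + 1
    return pages
```

Only separator lines do any real work here, so this pass should be cheap compared to a word census, which fits with making it the one census done at first open.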
The word and char censuses, however, can wait. The char data is the model only for the Char panel. If that panel is showing an empty table, the user knows to click its "Refresh" button to make a char census happen, so the table updates.
The word data is the model for the Word panel, and again, if the user opens a new book and goes to the Word panel, and sees an empty table, it's a no-brainer to click Refresh and update the table. In either case, the user knows they've asked for something, and should be content to wait while the progress bar turns and the census finishes.
The word data is also the model, however, for the display of misspelled words with a red underline, and for the display of "scannos": document words that appear in a file of likely OCR errors, which get highlighted. These features of the editor are turned on with a menu choice (or perhaps a check box in v.2? TBS). If either highlighter is set ON when a new book is opened, the highlights won't happen, because the word data isn't known until a census is taken.
Easy solution: we know when we are opening a new book (we don't see a matching metadata file from a prior save), and in that case we force OFF the spellcheck and scanno highlight choices. Then if/when the user clicks spelling or scanno highlights ON, we can run a census at that time. Again the potentially slow process is initiated by an explicit user action.
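Here is a minimal sketch of that rule, with no Qt in it. Everything named here (`WordData`, `Book`, `saved_meta`, `set_highlight`) is hypothetical, and the "census" is just a whitespace split standing in for the real word-token parse.

```python
class WordData:
    """Stand-in for worddata: holds word counts once a census is taken."""
    def __init__(self):
        self.vocab = {}
        self.census_done = False

    def refresh(self, lines):
        # Placeholder census: whitespace split instead of the real tokenizer.
        self.vocab = {}
        for line in lines:
            for word in line.split():
                self.vocab[word] = self.vocab.get(word, 0) + 1
        self.census_done = True

class Book:
    def __init__(self, lines, saved_meta=None):
        self.lines = lines
        self.words = WordData()
        # Both highlights start OFF; a new book (no metadata) stays that way.
        self.highlights = {"spellcheck": False, "scanno": False}
        if saved_meta is not None:
            # Previously saved book: word data and choices come from metadata.
            self.words.vocab = dict(saved_meta["vocab"])
            self.words.census_done = True
            self.highlights.update(saved_meta["highlights"])

    def set_highlight(self, name, on):
        # The slow census runs only here, on an explicit user action.
        if on and not self.words.census_done:
            self.words.refresh(self.lines)
        self.highlights[name] = on
```

Opening `Book(lines)` costs nothing for word data; the census happens only when the user calls for it, for example via `set_highlight("spellcheck", True)`.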
What about (perceived) performance? It should be snappier. If you Refresh the Chars panel it will rip through the document counting characters, but not spend time on the big word-token regex and its HTML-skipping. Refresh the Words panel and its census will at least not be slowed by counting characters.
Great, but I already started coding worddata on the assumption it would be the base for both chars and words. Now I have to split it up.