While waiting for a response from the chap I hope will join me in making PyInstaller 3, I decided it was time to do a bit of a test of PPQT2 usability. That means, post-processing a book myself. So I downloaded an example of the kind of book I have enjoyed doing in the past, in this case, one volume of Hawkin's Guide to Electricity.
I also opened the "Suggested Workflow" document that gets distributed as an "extra". It outlines the steps of post-processing a book with specific guidance on using PPQT features at each step. I knew it needed updating for V2, and I'm keeping it open alongside the PPQT window (27-inch iMac heh heh heh) and referring to it as a very handy task list. And correcting it as I go.
One of the first things I ran into is the small usability problem posed by V2's policy on file encoding. In V1 I really tried to support every common file encoding. There was a File:Open With Encoding sub-menu that allowed opening Latin-1, Mac Roman, Windows CP 1252, and UTF-8. There was sneaky undocumented support for other encodings.
For V2 I decided, basta, the world has moved on, UTF-8 is it. So PPQT2 assumes that every input file (the book text and the good_words and bad_words files) is UTF-8. It will also accept ISO-8859-1 aka Latin-1, but there's no menu command for that; you have to tell it by renaming the files to, e.g., bookname-ltn.txt or good_words.ltn. Similarly for saving: if you save the file to a name ending in -ltn or .ltn, it will write ISO-8859-1. Otherwise, it writes UTF-8. (Obviously if you know the file should be pure ASCII, it doesn't matter on input or output.)
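The naming rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not PPQT2's actual code; the function name `codec_for` is made up for the example.

```python
from pathlib import Path

def codec_for(filename: str) -> str:
    """Pick a codec from the file name alone (hypothetical helper).

    Names like "bookname-ltn.txt" or "good_words.ltn" mean Latin-1;
    everything else is treated as UTF-8.
    """
    path = Path(filename)
    # "bookname-ltn.txt" has a stem ending in "-ltn";
    # "good_words.ltn" has the suffix ".ltn".
    if path.stem.endswith("-ltn") or path.suffix == ".ltn":
        return "iso-8859-1"
    return "utf-8"
```

Under this sketch, `codec_for("bookname-ltn.txt")` and `codec_for("good_words.ltn")` both give "iso-8859-1", while a plain DP download name such as `projectidxxxxxxxx.txt` gives "utf-8".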
So Bibimbop, my one tester, hit that right away, complaining that the good-words list was full of � replacement characters. Right, that's what happens when Latin-1 is read through the UTF-8 codec. But then I did the same thing to myself! The book I downloaded was, as almost every book from Distributed Proofreaders U.S. will be, Latin-1 with a filename of projectidxxxxxxxx.txt. I opened it without renaming it, and it had replacement chars.
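Here is a small Python demonstration of the effect. It assumes the reader decodes leniently, substituting U+FFFD for invalid bytes rather than raising an error; the sample string is invented for the example.

```python
# Accented letters in Latin-1 are single bytes >= 0x80,
# which are not valid UTF-8 sequences on their own.
latin1_bytes = "café naïve".encode("iso-8859-1")

# Lenient decoding swaps each invalid byte for U+FFFD.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the right codec recovers the text.
as_latin1 = latin1_bytes.decode("iso-8859-1")

print(repr(as_utf8))    # the accented letters are gone
print(repr(as_latin1))  # the original text
```

With Python's default strict error handling, the same UTF-8 decode would raise UnicodeDecodeError instead of quietly substituting.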
This was a good thing! Because I immediately realized how the user should handle this, and I put the following text into the Suggested Workflow as "Step 2: First Open".
Start PPQT2. Use File:Open to open the book. Its text appears in the Edit panel.
Select the Chars panel and click Refresh. Click the Symbol column heading to make the table sort from high numbers to low (if it is not that way already). Note if the highest-numbered symbol is � (0xfffd, the replacement character).
If the book has any of these, it was opened with the wrong encoding. Close the book immediately without saving! Perhaps it is a Latin-1 file opened as UTF-8, or it has some other encoding such as Windows CP-1252. Find out how it is encoded. If it is Latin-1, rename the file appropriately. Otherwise, use some other utility to convert it to UTF-8. Then start this step over.
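For the "some other utility" part, a few lines of Python will do. This is a minimal sketch, assuming you already know the source encoding; the function name and the cp1252 default are illustrative only. It works on bytes so that line endings pass through unchanged.

```python
from pathlib import Path

def convert_to_utf8(src: str, dst: str, src_encoding: str = "cp1252") -> None:
    """Re-encode the file at src as UTF-8 and write it to dst (hypothetical helper)."""
    raw = Path(src).read_bytes()
    # Strict decoding: a wrong guess about src_encoding is likely to
    # raise UnicodeDecodeError instead of silently producing garbage.
    text = raw.decode(src_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Writing to a new file name rather than over the original keeps the untouched source around in case the guess about its encoding was wrong.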
On the whole, this seems a bit squiffy to me. On one hand, I definitely do not want to support other encodings. People should use UTF-8, period. Also on that hand, this is a one-time thing; it only affects the user the very first time they open a book. Once they have resolved the encoding issue for this, it is not an issue again. (Well, unless they insist on Latin-1 and enter characters that Latin-1 doesn't support, but that's not my problem, and the Chars tab has a tool for finding those.)
On the other hand, it lays a bit of a trap for the user, a wee pitfall that I even fell into myself. And there's potential data loss. If you open a Latin-1 book as UTF-8, which PPQT2 will happily do, and then save it, you have just trashed your file. You've written a UTF-8 file full of replacement chars where you used to have nice accented characters.
So what to do? The app knows very well when a book is being opened for the first time (there's no metadata file). Should I put up some kind of warning dialog? Should I quickly run through the text looking for \ufffd chars and then put up a warning? Maybe that.
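The scan-then-warn idea might look something like the following. It is a rough sketch, not what PPQT2 does today; the function name, the `has_metadata` flag, and the message wording are all invented for illustration.

```python
from typing import Optional

REPLACEMENT_CHAR = "\ufffd"

def first_open_warning(text: str, has_metadata: bool) -> Optional[str]:
    """Return a warning message for a suspicious first open, else None."""
    # A metadata file means the book has been opened before,
    # so the encoding question was settled already.
    if has_metadata:
        return None
    count = text.count(REPLACEMENT_CHAR)
    if count == 0:
        return None
    return (
        f"This file contains {count} replacement character(s) (U+FFFD). "
        "It was probably opened with the wrong encoding. "
        "Close it without saving and check how it is encoded."
    )
```

One caveat: a file could legitimately contain U+FFFD, so this should stay a warning rather than a refusal to open. The scan itself is a single pass over the text, so it should be cheap even for a long book.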
This is what dogfood snacks are all about.
 
 