Saturday, August 15, 2015

Translators: Updating HTML, adding PPgen

By PPQT2 to the Woodshed

I used PPQT2 to post-process quite a bulky project, Hawkins Electrical Guide Vol. 3. This is the kind of PP project I've always enjoyed, with varied document structure (not just chapters of paragraphs) and many images. In this case, over 200 images, on which I spent many hours in Photoshop making clear, clean yet very compact .png files.

I had been working on this book ever since PPQT2 was at all usable, a year ago or so. After finishing the ASCII and HTML translators, I could finalize this book using PPQT2. Which I did, and uploaded it, and ran into an eagle-eyed and extremely conscientious PPVer, who kicked it back to me with a list of over 50 issues to correct. Properly taken to the woodshed, I was!

Some of the issues related to the generated HTML, and in the process of correcting them I realized some ways in which the HTML translator could do a better job. Also I had spent some time absorbing the various EPUB advice pages in the DP Wiki, and realized the impact that EPUB has on the post-processor's view of HTML.

EPUB Rant

Parenthetically, DP has a confused relationship to EPUB. Project Gutenberg now routinely does a batch conversion of the submitted HTML book using something called Epubmaker, and it does a number on one's HTML. In prior years I, like many PPers, spent lots of time tweaking the HTML to make the ebook look very much like the printed book. But Epubmaker ruthlessly throws away most of that, leaving a flat, boring, ugly etext.

Double-parenthetically, part of the problem is the many restrictions of the EPUB format itself. It doesn't allow floats, so forget about sidebars, side-notes, and running text around small images. It doesn't support pop-up title= text when you hover the mouse on an element, so forget about showing the original spelling of a typo, or showing the transliteration of a Greek or Cyrillic word. It imposes ridiculous constraints on images: nothing wider than 600px and no image files larger than 200K. Like other stupidly designed standards, it takes the historical limitations of the ebook readers of 2005 and codifies them for all time. Do you think a retina iPad can't display an image larger than 600px? Or a Kindle Fire? The EPUB standard is very much like the many state laws that codified the design of auto headlights in the 1950s, based on the then state of the art, the sealed-beam unit. So when European cars started using replaceable halogen bulbs, they could not be imported to the U.S. because their headlights were not sealed-beam units. It took decades to get the laws changed so imported cars didn't have to have inferior U.S. headlight units retrofitted before they could be sold. EPUB does exactly the same thing, locking us into an already-outmoded technology. Close inner parenthesis.

DP's response to EPUB has been scattered and slow. There are several different Wiki pages about it, giving conflicting advice and often referring to forum threads that are years old. But the bottom line is, the PPer today who spends any time on how the HTML looks is wasting her energy. The majority of PG downloads are for EPUB, not HTML, and all your pretty CSS will be stripped out by Epubmaker. Close outer parenthesis!

HTML changes

With all this in mind, I went back to the HTML translator and made changes. I simplified the CSS in the header block a lot, removing many options and comments on appearance. I changed the method of encoding visible page numbers from the Guiguts method to a method that was recommended in one of the EPUB Wiki pages, as possibly able to survive Epubmaker.

Another change was from percentage widths to fixed widths. PPQT lets the user specify margins in ASCII space units, for example /Q First:6 Left:4 Right:4. These translate nicely in the ASCII output. But for HTML, I had been converting them to percentages of a 75-character line, so that Right:4 became margin-right:5%. But percent widths are relative to the container, so 5% is less in a nested container than at the outer level.

There was already a historical conversion of 2 ASCII spaces == 1 HTML "em" unit; this had been in use for poetry line indents for years in the Guiguts HTML conversion, and my HTML Translator did the same thing for poetry. Well, why not for all widths? So I changed it to use em units for everything, and Right:4 becomes margin-right:2em; which is the same regardless of context.
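The difference between the two conversions can be sketched in a few lines of Python. This is only an illustration, not the HTML Translator's actual code: the function names are made up here, the whole-percent rounding is an assumption, and the 60-character nested container is an arbitrary example.

```python
LINE_WIDTH = 75  # reference ASCII line length mentioned in the text


def spaces_to_percent(spaces, line_width=LINE_WIDTH):
    """Old approach: express a margin as a percentage of a 75-character line.

    Rounding to a whole percent is an assumption made for this sketch.
    """
    return f"{round(100 * spaces / line_width)}%"


def spaces_to_em(spaces):
    """New approach: 2 ASCII spaces == 1em, independent of the container."""
    return f"{spaces / 2:g}em"


def percent_in_chars(percent, container_chars):
    """Width that a percentage margin occupies in a container of a given
    width, measured in characters (hypothetical helper for illustration)."""
    return container_chars * percent / 100


# Right:4 under each scheme
old_margin = spaces_to_percent(4)   # "5%"
new_margin = spaces_to_em(4)        # "2em"

# The same 5% shrinks once the block sits inside a narrower container,
# while 2em stays 2em.
at_outer_level = percent_in_chars(5, 75)  # 3.75 characters
when_nested = percent_in_chars(5, 60)     # 3.0 characters
```

With em units the margin depends only on the font size, so a nested block quote gets the same indent as one at the outer level.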

Ppgen

The afternoon after I posted the updated HTML Translator I was congratulating myself on the PPQT2 design that makes the Translators into separate files, and how easy it was to update just that file without having to repackage the whole app. And then I got to thinking about how nobody has expressed any interest in writing any other Translator. And how there really ought to be a Ppgen one.

I've had a Chrome window open for months, with about six tabs open pointing to different Ppgen docs. (Which, parenthetically, are badly organized and incomplete.) Well, crap, I said to myself, let's see how hard it would be. I pulled up a copy of my skeleton Translator file and started filling in the 30-odd entries in the "computed go-to" list of API "events". And it went very well. A majority of events are either null, or can be handled by a single literal string without any functional logic. For example, the OPEN_H2 event just squirts out .h2.
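To give a feel for that kind of event table, here is a rough Python sketch. It is not the real PPQT2 Translator API: only the OPEN_H2 event and its .h2 output come from the description above, while the CLOSE_H2 and LINE event names, the `ACTIONS` table, and the `translate` function are hypothetical stand-ins.

```python
def emit_line(text):
    """Handler for an event that needs a little logic: pass the text through."""
    return text + "\n"


# Each event maps to None (nothing to emit), a literal string,
# or a function of the event's text.
ACTIONS = {
    "OPEN_H2": ".h2\n",    # from the post: just emit the .h2 directive
    "CLOSE_H2": None,      # hypothetical event that needs no output
    "LINE": emit_line,     # hypothetical event carrying a line of text
}


def translate(events):
    """Turn a sequence of (event_name, text) pairs into output text."""
    out = []
    for name, text in events:
        action = ACTIONS.get(name)
        if action is None:
            continue                   # null event: emit nothing
        if callable(action):
            out.append(action(text))   # event with functional logic
        else:
            out.append(action)         # event handled by a literal string
    return "".join(out)


sample = translate([
    ("OPEN_H2", ""),
    ("LINE", "Chapter I"),
    ("CLOSE_H2", ""),
])
```

A table like this keeps the trivial cases down to one line each, so the only real code is in the handful of events that need it.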

By the end of the day I had almost all of it coded, lacking only the table-related events, and I can pretty well see how to implement them.

So early next week I reckon I will be able to announce a trial Ppgen Translator. I'll have to hedge the announcement with many caveats, mostly because I do not have the actual Ppgen batch tools installed so I can't actually test that my translation produces usable output. But if people don't like it, they can fix it. It's just a small Python source file; be my guest.

And when that's finalized, people will be able to use PPQT2 to complete Ppgen-based projects. Which might increase adoption.
