Sunday, March 1, 2015

Reinventing the wheel

For PPQT2, a "Translator" is, or will be, a single module of Python code. It will need to be self-contained or nearly so; the only things it can import are modules that are also imported by PPQT. Those are the modules that are bundled with PPQT for distribution. But the work it has to do is such that few if any imports are needed. Basically it is a markup-to-markup transformer: it receives the semantic elements of a book text as formatted to DP standards, and it emits those elements decorated as appropriate for its output markup style, for example, HTML. Or it might be the "fp" markup whose increasing use in the DP community is what made me come up with the whole idea of modular translators in the first place.

Well, code that takes designated elements of text in and writes marked-up text out is not a new concept. Probably the best current model is Pandoc. It consists of a large repertoire of document readers, each a module that reads some style of markup and produces a tree of semantic elements (which Pandoc can serialize as JSON), and an equally large repertoire of writers, each reading those elements and writing a document marked up in some style. In principle Pandoc can convert just about any of the dozens of current markup styles to any other markup style.

For some time I considered using Pandoc itself. I would write a single built-in translator that would emit the Pandoc intermediate form. The user could save that as a file and then process it through Pandoc to get any other style, including LaTeX and EPUB. Basically I would have added the "DP Formatting Guidelines style" as another input markup form to Pandoc.

I reluctantly decided against this approach for several reasons. First, I could not extract a clean, comprehensible specification of the Pandoc intermediate format from the Pan-documentation. Prof. MacFarlane seems to think that the code is self-documenting. The answer to any detailed question is, "read the code." Sorry, life is too short.

Second, Pandoc is written in Haskell. I certainly did not want to require my users to install Haskell, or to distribute Haskell myself as part of PPQT. It is not clear to me whether Pandoc binary executables are available for my target platforms. (Everyone on the Pandoc mailing list seems to have Haskell installed.) Even if there are binaries, I don't really want to distribute another bulky binary and then have to train my users to use it myself.

And, finally, there are no Pandoc writer modules for two of my target markup styles, "fp" and "fpgen" from DP Canada. If a translator is in Python, there is some faint hope I can get another person to write a translator (fp and fpgen are both implemented in Python). But if writing a translator means writing in Haskell, it would all be down to me. I have no problem with functional programming—the first four years of my programming career I was in a group writing an APL interpreter, after all!—but the idea of learning Haskell and supporting modules written in Haskell, combined with the distribution problems just mentioned, put the lid on that idea's coffin.

So I decided that PPQT would take the general concept of Pandoc but implement it internally. Back in Version 1, I had code to parse a well-formed book and reduce it to elements. It was the front end to the "ASCII reflow" and "Automatic HTML" features, which I now recognize as translators. In V1, the translation process was embedded in the main program, and the translation was in-place: its output replaced the contents of the current book. There are three conceptual differences in V2. First, the output of translation goes into a new file instead of replacing the old one. This is enabled by the V2 ability to have multiple open books. After a successful translation, the user finds a new tab in the Edit panel. It contains the translated text, which can be inspected and then saved or discarded.

Second, there will be a clear, arm's-length API between PPQT and the translators. PPQT hands a translator a file-like object to write into, and feeds it with document elements one at a time. At the end, the contents of the "file" (a MemoryStream instance) go to create the new document.
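
To make that hand-off concrete, here is a minimal sketch of what such an interface might look like. Everything in it is an assumption for illustration: the element kinds ("heading", "paragraph"), the `HtmlTranslator` class, its `element()` method, and `run_translation()` are hypothetical names, and `io.StringIO` stands in for the MemoryStream. The real PPQT API is not defined in this post.

```python
import html
import io

# Hypothetical element stream: (kind, text) pairs. The real PPQT element
# vocabulary is not specified here; these two kinds are placeholders.
ELEMENTS = [
    ("heading", "Chapter I"),
    ("paragraph", "It was a dark and stormy night."),
    ("paragraph", "The rain fell in torrents."),
]


class HtmlTranslator:
    """Hypothetical translator: receives elements one at a time, writes HTML."""

    def __init__(self, out):
        # 'out' is the file-like object supplied by the host program.
        self.out = out

    def element(self, kind, text):
        # Decorate each element for the output markup style.
        if kind == "heading":
            self.out.write("<h2>{}</h2>\n".format(html.escape(text)))
        elif kind == "paragraph":
            self.out.write("<p>{}</p>\n".format(html.escape(text)))
        else:
            raise ValueError("unknown element kind: {!r}".format(kind))


def run_translation(translator_class, elements):
    """Host side: create the 'file', feed the elements, return the new text."""
    out = io.StringIO()  # stand-in for the MemoryStream mentioned above
    translator = translator_class(out)
    for kind, text in elements:
        translator.element(kind, text)
    return out.getvalue()


if __name__ == "__main__":
    print(run_translation(HtmlTranslator, ELEMENTS), end="")
```

The point of the sketch is the separation: the translator never sees the book or the editor, only a stream to write into and a sequence of elements, so the host decides what to do with the finished text.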

Third, the translators will be dynamically loaded. I know how to do this, even from a bundled application. At startup, PPQT looks in, I think, the Extras folder to find translators, and populates a File > Translate sub-menu with their names. When the user selects one, the fun begins.
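
One way the discovery step could work is sketched below, using the standard library's `importlib`. The `load_translators()` function and the `MENU_NAME` attribute are assumptions made up for this example, not the actual PPQT convention; a bundled application would also need to work out where its Extras folder lives at run time, which this sketch leaves to the caller.

```python
import importlib.util
from pathlib import Path


def load_translators(folder):
    """Import every .py file in 'folder' and return {menu name: module}.

    Hypothetical convention: a module counts as a translator only if it
    defines a MENU_NAME string, which would label its sub-menu entry.
    """
    translators = {}
    for path in sorted(Path(folder).glob("*.py")):
        # Build a module from the file's location and execute its code.
        spec = importlib.util.spec_from_file_location(path.stem, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        name = getattr(module, "MENU_NAME", None)
        if name is not None:
            translators[name] = module
    return translators
```

The returned dictionary is enough to populate a menu: its keys are the entries, and the selected key leads back to the module to run. A real loader would probably also wrap the import in a try/except so that one broken translator file cannot stop the program from starting.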

This is all preamble to what I was going to write about. We can get to the real "reinventing the wheel" part tomorrow.
