Sunday, June 22, 2014

More Thoughts on Pandoc, etc. [Updated]

Some additional considerations that have occurred to me since writing the previous post.

Plain Text Output

One output format is absolutely required of the Post-Processor: the complete book as a plain text file. Formerly this had to be ASCII, not even Latin-1, with accented characters in an expanded format like [:u] or [c~]. Nowadays, PPers usually provide the ASCII, plus a Latin-1 or UTF-8 version with accented characters in place.
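For concreteness, here is roughly what expanding those old ASCII accent codes looks like. This is a sketch only: I'm assuming the usual DP transliteration convention ("[:u]" for u with diaeresis, "[~n]" for n with tilde), and the real table of codes is much longer than the three entries shown.

```python
# Illustrative pairs only; the full table of bracket codes is longer.
EXPANSIONS = {"[:u]": "ü", "[~n]": "ñ", "[:o]": "ö"}

def expand_accents(text):
    """Replace ASCII accent expansions with their UTF-8 characters."""
    for code, char in EXPANSIONS.items():
        text = text.replace(code, char)
    return text

print(expand_accents("Se[~n]or M[:u]ller"))  # Señor Müller
```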

Regardless of the encoding, the plain etext has no formatting markup. It is simply the text, with headings set off by newlines, paragraphs wrapped to a 72-character margin, and any other formatting, like poetry, tables, or centered text, implemented with spaces and newlines.
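The paragraph-wrapping part of that, at its simplest, is a few lines of Python. This greedy version (using the standard textwrap module) is a sketch of the basic requirement, not how any particular PP tool actually does it:

```python
import textwrap

def wrap_paragraph(text, width=72):
    """Collapse internal whitespace, then greedy-wrap to the margin."""
    return textwrap.fill(" ".join(text.split()), width=width)

para = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife.")
print(wrap_paragraph(para))
```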

PPQT V.1, like Guiguts before it, does quite a nice job of converting DPM to plain etext. I took considerable pride in implementing the Knuth-Plass algorithm for optimal paragraph reflow as a point of differentiation from Guiguts. And PPQT also reflows tables (coded with the unique PPQT markup, of which more later) nicely.
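The core idea of optimal reflow, as opposed to the greedy wrap above, is to pick line breaks that minimize the total squared slack over the whole paragraph. The dynamic program below is a stripped-down sketch of that idea (no hyphenation, no stretch/shrink glue, and it assumes no single word exceeds the margin), not the code in PPQT:

```python
def optimal_wrap(words, width=72):
    """Break words into lines minimizing the sum of squared slack on
    all lines but the last -- the heart of Knuth-Plass line breaking."""
    n = len(words)
    INF = float("inf")
    best = [INF] * n + [0.0]   # best[i]: min cost of wrapping words[i:]
    brk = [n] * (n + 1)        # brk[i]: where the line starting at i ends
    for i in range(n - 1, -1, -1):
        line_len = -1          # joined length of words[i:j+1]
        for j in range(i, n):
            line_len += len(words[j]) + 1
            if line_len > width:
                break
            # The last line incurs no penalty for trailing slack.
            cost = (0.0 if j == n - 1 else (width - line_len) ** 2) + best[j + 1]
            if cost < best[i]:
                best[i] = cost
                brk[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:brk[i]]))
        i = brk[i]
    return "\n".join(lines)
```

Greedy wrapping fills each line as full as possible and can leave one very loose line near the end; the dynamic program trades a little slack on early lines to avoid that.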

It occurred to me to wonder how well Markdown, or any other markup accepted by Pandoc, handles this task. And to my surprise, it appears that none of them does it at all!

The venerable Markdown, for example, plainly states that "Markdown is a text-to-HTML conversion tool." Not a plain-text-generating tool but an HTML one, which means it hands off all responsibility for paragraph reflow to the web browser. Similarly, AsciiDoc and reStructuredText mention only output to HTML, PDF, EPUB, and the like. (Well, rST mentions output to Python docstrings, but doesn't say whether it reflows paragraphs for them.)

It seems quite likely—although I would love to be corrected on this!—that it is not possible to go into Pandoc with any markup and come out with a plain UTF-8 text file acceptable to Project Gutenberg!

Edit: According to someone on the Pandoc mailing list, there is indeed a plain-text output "writer", and pandoc -t plain my-input.txt should produce what I want. I haven't installed pandoc yet, so I can't try this; for example, I don't know what the paragraph reflow is like, or whether there is any way to control line widths. So I'm still not certain that plain text is really feasible from pandoc. To be investigated later this year.
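If the plain writer works as advertised, the invocation would presumably look something like the following. Untested here, since I don't have pandoc installed; the option names (-t plain to select the plain-text writer, --columns to set the wrap width) come from pandoc's documentation:

```shell
# Convert a Markdown source to plain text wrapped at 72 columns.
pandoc -f markdown -t plain --columns=72 -o mybook.txt mybook.md
```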

Further edit: my innocent question on the Pandoc list has produced this interesting thread with some knowledgeable comments about the history of PG and its format (especially its ambiguities, which make it very hard to back-convert PG text to some markup; one example is the use of CAPS for emphasis), and this from John MacFarlane (Pandoc's author): "I think a pg writer is a nice idea. It would be fairly easy, I think, to do.... [it] would involve a new option and a few different behaviors." From this I deduce that the existing "plain" output mode is not fully PG-compliant.

And another edit: On the same message thread linked above, John MacFarlane now says, "I've started a gutenberg branch on github. It should be fairly easy to add a writer that uses PG conventions." One Fred Zimmerman (presumably a PG or DP contributor?) adds, "i'm very interested in the gutenberg branch -- great idea." So this may develop into something good, and soon!

This considerably reduces the value of Pandoc to DP. The plain etext is not a negotiable requirement. Not surprisingly, the Python programs that process fpn and fpgen do promise plain-text output in addition to EPUB, HTML, etc.

The question now is: should PPQT V.2 retain the ability to convert DPM to etext? There's a fair amount of code and GUI widgetry behind it. Or should I just assume everyone will convert to fp[ge]n markup and get their etext from the batch programs that support those markups?

Translation UI

The relationship between PPQT and the three competing markups (DPM, fpn, fpgen) is quite unclear to me, as should be apparent. I badly need to know what the potential user community actually needs (indeed, whether a user community exists at all!). If I don't get some useful comments on this blog, I'll need to go to the forums and make a nuisance of myself to get some answers.

But supposing a community exists, here's what I think I know. First, DPM will not go away. What I'm calling DPM is the sum of the rules in the Distributed Proofreaders Formatting Guidelines. It is deeply embedded in the whole DP infrastructure. I don't think DP will ever rewrite their guidelines to make the volunteer proofers in the Formatting rounds insert fp[ge]n syntax instead.

If I'm right about that, then DPM is what the PPers will continue to receive as their input. The initial stages of PP work—fixing up things separated by page breaks, double-checking bold and italic markups, renumbering and moving footnotes, and running spellcheck—will be done in the context of a DPM text.

Then, most likely, the PPer will want to do a one-time bulk conversion to one of the fp[ge]n markups. PPQT could facilitate this in the following ways.

First, provide "File > Export to" command options. As I noted in the prior post, it would not be difficult to convert DPM to either fp* markup. This command would act like File > Save As, bringing up a file-save dialog to pick a name, and writing a new or replacement file consisting of the active (DPM) document translated to another markup: mybook.txt is saved as mybook.fpn.

Second, give the user a way to opt to save metadata with the translated file, as mybook.fpn.meta. The meta file would include pointers to where the page boundaries were (adjusted to remain accurate in the translated source file), as well as the notes, bookmarks, vocabulary, etc.
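The tricky part of that is keeping the recorded character offsets accurate while the markup around them changes length. The sketch below shows one way: apply literal replacement rules left to right, tracking the cumulative length change and shifting each offset by the delta accumulated before its position. The rules used here ("*" for italic delimiters) are placeholders for illustration, not actual fpn or fpgen syntax:

```python
import re

def translate_with_offsets(text, rules, offsets):
    """Translate markup via literal (old -> new) replacement rules while
    remapping a sorted list of character offsets (page boundaries, note
    anchors) so they stay accurate in the translated output."""
    pattern = re.compile("|".join("(%s)" % re.escape(old) for old, _ in rules))
    repls = [new for _, new in rules]
    out, new_offsets = [], []
    pos = 0      # scan position in the source text
    delta = 0    # cumulative change in length so far
    oi = 0       # index of the next offset to remap
    for m in pattern.finditer(text):
        # Remap every offset that falls before this match.
        while oi < len(offsets) and offsets[oi] <= m.start():
            new_offsets.append(offsets[oi] + delta)
            oi += 1
        repl = repls[m.lastindex - 1]   # which alternative matched
        out.append(text[pos:m.start()])
        out.append(repl)
        delta += len(repl) - (m.end() - m.start())
        pos = m.end()
    while oi < len(offsets):
        new_offsets.append(offsets[oi] + delta)
        oi += 1
    out.append(text[pos:])
    return "".join(out), new_offsets
```

A real translator would need more than literal replacements, of course, but the offset bookkeeping would be the same shape: every page-boundary pointer shifts by however much the text before it grew or shrank.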

This pretty much means you could open the fp* markup file and still have your scan images, your notes, your word and character tables... I'm not sure about the footnotes. But you could go on editing as before.
