Saturday, June 21, 2014

Looking ahead to Pandoc

This is to order my thoughts about output formats and markup systems, and to gather links to these in one convenient place.

History

PPQT is intended to support the work of volunteers finishing etexts for Distributed Proofreaders, aka PGDP. PGDP was one of the first "crowd-sourced" volunteer sites on the internet, organizing thousands of volunteers to find the typos in OCR images of public-domain texts, one page at a time. At the end of the process, a different set of volunteers, the "post-processors" or PPers, have the job of splicing together the individually-proofed pages of each book to make one smooth etext. That's the task that PPQT aimed to assist.

The original PGDP workflow ended with an ASCII etext, no more. There are hundreds (thousands?) of PGDP-proofed etexts at Project Gutenberg. By 2002 or so, most PPers also prepared HTML versions of their texts. And in recent years there's been demand for other formats such as EPUB.

Markup Systems

A text passing through PGDP gets formatted with a particular markup style documented in the Formatting Guidelines. Although PGDP did not label the guidelines as a "markup system", that is what they constitute: a set of rules for representing a book's typography and layout in a plain text document. PGDP never gave their markup system a catchy name; let's call it DPM.

dpm

DPM can be compared to other plain-text markups such as Markdown and reStructuredText. It comes off quite well in these comparisons. The other markups were devised by (mostly) programmers for use in (mostly) documenting code; they don't support typography beyond emphasis, or layout beyond code blocks. Some of the things that DPM supports and the others do not include footnotes, poetry (in the sense of being able to specify line breaks and indentation), and simple right-alignment of text, as in a citation within a block quote.

fpn

In recent years, PGDP volunteers, motivated in part by the need to auto-convert etexts to new formats such as EPUB (and in part, I'm sure, by simple N.I.H. syndrome), have devised new markup styles. One is fpn, devised by Robert Frank (rfrank at PGDP) and announced in February 2014. This markup uses different syntax to support the features of DPM, and adds a number of minor features. In general, Robert Frank favored a terse syntax reminiscent of 1980s TROFF. It would not be difficult to convert a DPM-marked text to one marked up with basic fpn using search and replace; for example, chapter heads in DPM are marked with four newlines, and in fpn with a leading .h2.

fpgen

A bit earlier, in July 2013, the independent volunteers of PGDP Canada announced their own new markup style, fpgen. Documented in the DP-canada WIKI, fpgen is also the work of an "rfrank", in this case Roger Frank, who tended to favor an XML-like bracketed syntax. Again, it would not be difficult to convert a DPM text to an fpgen one; for example, a DPM chapter head marked with four newlines would become <heading level='1' id="ch01">Head Text</heading>.
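
As a rough sketch of that kind of conversion in Python: the function below treats any non-blank line that follows four or more blank lines as a chapter head and wraps it in the heading element quoted above. The function name, the ch01-style id numbering, and the choice to leave the blank lines in place are assumptions made for illustration; they are not taken from the fpgen documentation.

```python
def dpm_heads_to_fpgen(text):
    """Wrap each DPM chapter head in an fpgen-style heading element.

    Assumption: a chapter head is a non-blank line that follows four or
    more blank lines. The id values (ch01, ch02, ...) are hypothetical.
    """
    out = []
    blank_run = 0   # consecutive blank lines seen just before this line
    chapter = 0     # running chapter count, used to build the id
    for line in text.split("\n"):
        if line.strip() == "":
            blank_run += 1
            out.append(line)
            continue
        if blank_run >= 4:
            chapter += 1
            line = "<heading level='1' id=\"ch%02d\">%s</heading>" % (
                chapter, line.strip())
        blank_run = 0
        out.append(line)
    return "\n".join(out)
```

A real converter would also have to decide what to do with the blank lines around the head and with multi-line heads, which this sketch ignores.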

my-dpm (blush)

I am not immune to N.I.H. and the temptation to define markup syntax. In PPQT version 1 I supported a number of extensions to DPM, including right-aligned text and a syntax for tables. I based these features on PPQT's model, Guiguts, which had its own simple extensions of DPM. For example, in DPM a block quote is

/Q
Quote text...
Q/

Guiguts extended this to allow specifying the first, left, and right indents so:

/Q[8,4,12]
Quote text with 8-char first indent, 4-char left indent, 
and 12-char right indent...
Q/

My version supported in PPQT V.1 allowed instead,

/Q F:8 L:4 R:12
Quote text with 8-char first indent, 4-char left indent, 
and 12-char right indent...
Q/
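
To make the two opening-line syntaxes concrete, here is a small Python sketch that reads either form and returns the three indents. The function name is hypothetical, the colon is treated as optional, and the zero defaults for a bare /Q are a placeholder assumption, not the values either program actually used.

```python
import re

# Guiguts form: /Q[8,4,12] meaning first, left, right indents.
GUIGUTS_OPEN = re.compile(r"^/Q\[(\d+),(\d+),(\d+)\]$")
# PPQT V.1 form: /Q F:8 L:4 R:12 (the colon is treated as optional here).
PPQT_ITEM = re.compile(r"\b([FLR]):?(\d+)\b")

def parse_quote_open(line):
    """Return (first, left, right) for a block-quote opening line, else None."""
    line = line.rstrip()
    if not line.startswith("/Q"):
        return None
    match = GUIGUTS_OPEN.match(line)
    if match:
        return tuple(int(n) for n in match.groups())
    # Placeholder defaults (an assumption) for indents that are not given.
    indents = {"F": 0, "L": 0, "R": 0}
    for key, value in PPQT_ITEM.findall(line[2:]):
        indents[key] = int(value)
    return (indents["F"], indents["L"], indents["R"])
```

Both `/Q[8,4,12]` and the keyword form with F, L and R values of 8, 4 and 12 come back as the same tuple, which is the point: the two syntaxes carry identical information.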

Guiguts had a simple ASCII table markup; I extended it with additional syntax for column alignments and widths. I also added /R..R/ for right-aligned text and /C..C/ for centered text.

What to support with PPQT?

In a way, the choice of markup hardly matters, because the markup disappears before the book reaches its destination at Gutenberg.org. A marked-up document is a transient state between the original OCR text and the final etext/html/EPUB files. So the choice of markup is merely a convenience for the PPer. It is a way for her to encode decisions about how the book should be formatted: these lines are a poem, these lines are a table; this is emphasized text, etc.

However, the choice of markup is controlled by the software used for creating the final output. Robert Frank has a Python program to convert an fpn text to EPUB and HTML. Roger Frank of PGDP Canada has a, guess what, Python 3 program to convert fpgen to EPUB and HTML.

And both Guiguts and PPQT V.1 have code to convert DPM to HTML.

What should PPQT V.2 do? Should it contain code to convert fpn or fpgen to some other output format? Should it retain the V.1 HTML converter? Or should it be markup-agnostic?

Agnosticism

By markup-agnostic I mean, have only the features needed to make a clean job of finalizing an etext, including:

  • Support for image display alongside text,
  • proofer's notes saved in the metadata,
  • an extensive find/replace,
  • the character and word tables (so important for spell-check and finding other missed errors),
  • the Footnote panel with essential aids for cleaning and renumbering footnotes,
  • automatic calling of gutcheck or (better) bookloupe and a tabular display of the resulting diagnostics.

And just stop there, and say: ok, PPer, now you have a smooth DPM text, you can go on to use the editor and regex find/replace to convert this to any markup you like, and save the file, and process it using software from whomever.

Translators

Another option would be to offer automated translation from DPM to fpn and/or fpgen. And further, it would be possible to foist the job of coding those translators off on the people who want those markups. I've already floated this as an idea to DP Canada: that they could write the fpgen-erator to some API that I could provide.

Enter Pandoc

And then, there's Pandoc. Pandoc is a universal markup-translator. It reads texts in a variety of markup styles, and it writes output in an even wider variety including EPUB, LaTeX, and PDF. It is widely used and widely praised and, in principle, could completely replace the programs written by both rfranks, generating any desired output format from code that is supported by an active community.

All that is needed is a way to get a post-processed etext into Pandoc. Unfortunately, although Pandoc accepts a number of markups, that list does not include dpm, fpgen or fpn.

Pandoc does offer two general input formats. One is its own internal document format, represented as JSON. The other is its own extended Markdown. This markup (let's call it pem, for Pandoc's Extended Markdown) supports everything that dpm supports. It does not support quite everything that fpn supports, and I'm not sure whether it is a superset of fpgen.

I can picture PPQT supporting a batch conversion of dpm to pem, for example as a command under the File menu: File > Save to Pandoc, and this as a replacement for the old HTML conversion step. This would convert a dpm text to a pem text and write it. However, that wouldn't be much use to a PPer who didn't have access to Pandoc, because only Pandoc supports pem.
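
To give a feel for one small piece of a dpm-to-pem conversion, here is a Python sketch that turns /Q..Q/ block quotes into Markdown block quotes, where each quoted line starts with ">". It is only an illustration: the function name is hypothetical, any indent arguments on the opening line are simply dropped, and a real converter would also have to handle footnotes, poetry, tables and the rest.

```python
def dpm_quotes_to_pem(text):
    """Rewrite /Q ... Q/ block quotes as Markdown '>' block quotes."""
    out = []
    in_quote = False
    for line in text.split("\n"):
        stripped = line.strip()
        if not in_quote and stripped.startswith("/Q"):
            in_quote = True    # opening marker; indent arguments are dropped
            continue
        if in_quote and stripped == "Q/":
            in_quote = False   # closing marker
            continue
        if in_quote:
            # A bare '>' keeps blank lines inside the same quote.
            out.append("> " + line if stripped else ">")
        else:
            out.append(line)
    return "\n".join(out)
```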

If I can work out how to distribute a Pandoc executable with PPQT on any platform, I can also imagine having automated "Save to EPUB" and "Save to HTML" commands that would generate a pem stream and feed it down a pipe to a Pandoc command, with the output going to the designated file.
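
The pipe itself could be quite small. The Python sketch below assumes a pandoc executable is on the PATH and that the pem text is already in a string; the function names are hypothetical, and error handling is reduced to the bare minimum. It relies only on Pandoc's standard --from, --to, --standalone and --output options and on Pandoc reading standard input when no input file is named.

```python
import shutil
import subprocess

def pandoc_command(out_path, out_format):
    """Build the argument list for one Pandoc run (e.g. 'epub' or 'html')."""
    return [
        "pandoc",
        "--from=markdown",       # Pandoc's extended Markdown as input
        "--to=" + out_format,
        "--standalone",          # emit a complete document, not a fragment
        "--output=" + out_path,
    ]

def save_via_pandoc(pem_text, out_path, out_format):
    """Feed pem_text to Pandoc on stdin and write the result to out_path."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    subprocess.run(
        pandoc_command(out_path, out_format),
        input=pem_text.encode("utf-8"),
        check=True,              # raise if Pandoc exits with an error
    )
```

A "Save to EPUB" command would then amount to something like save_via_pandoc(text, "book.epub", "epub"), with the file name coming from a save dialog.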

What About HTML?

PPQT V.1 has only one aid for HTML, the HTML Preview panel. When editing an HTML document, you can get a rendering of it in a QWebFrame. It's only a bit more convenient than saving the file and opening it in a separate browser (you avoid having to do a ctl-s in PPQT, click in the browser, click the reload button, then click back to PPQT to edit).

But should V.2 have any HTML support at all? The point of editing HTML inside PPQT was that the auto-converted HTML was awful to look at and needed a lot of hand-tweaking and customizing. But when HTML conversion is pushed off to an external program, whether that is Pandoc or an effort by one of the rfranks, is there any point in editing the resulting HTML output? Or is one supposed to confine one's tweaking to the marked-up fp[ge]n file, and treat the HTML as a write-only output?

And supposing PPQT retained its HTML preview panel, should it also have a preview panel for EPUB? (Is that possible?) It would be kind of slick if, when you opened a file with an .html suffix, you automatically got an HTML preview panel on the right, while if you opened a file with an .epub suffix, you got an EPUB preview panel...

Also, given HTML support of any kind, what about W3C Validation? Back when I assumed V.2 would have its own HTML conversion, I also speculated that it should support an automatic upload to the validation site, automatic download of the error list, and display of that in a panel such that you could click on a diagnostic and jump the editor to the referenced line. Is that still useful when HTML is being generated by an external program? Or are validation and W3C conformity now the responsibility of the external program?
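
For what it's worth, the upload-and-list part is not much code. The Python sketch below assumes the W3C Nu HTML Checker web service, which accepts a POSTed document and returns its messages as JSON; whether that is the right service to target is an assumption, and the function names and the User-Agent string are made up for illustration. The second function reduces the response to (line, type, message) rows, which is what a clickable diagnostics panel would need.

```python
import json
import urllib.request

# Assumed endpoint: the W3C Nu HTML Checker, asked for JSON output.
VALIDATOR_URL = "https://validator.w3.org/nu/?out=json"

def fetch_validation(html_text):
    """POST an HTML document to the checker and return the decoded JSON."""
    request = urllib.request.Request(
        VALIDATOR_URL,
        data=html_text.encode("utf-8"),
        headers={
            "Content-Type": "text/html; charset=utf-8",
            "User-Agent": "ppqt-sketch/0.1",   # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

def diagnostics(result):
    """Reduce a checker response to (line, type, message) rows, sorted by line."""
    rows = []
    for msg in result.get("messages", []):
        rows.append((msg.get("lastLine"), msg.get("type", ""),
                     msg.get("message", "")))
    # Messages that carry no line number sort to the top.
    rows.sort(key=lambda row: row[0] or 0)
    return rows
```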

I welcome the comments of any of my readers on these issues.
