Monday, May 25, 2015

Parsing a document: 1, a document structure syntax

First thing today I sat down and worked out a BNF-style notation for the document structure I want to support. This is the structure that is only implicit in the DP "Formatting Guidelines", augmented with some additional block structure.

As a parsing problem this is unusual in that most parsers and parser documentation are focused on parsing tokens that are substrings in a line, for example, the tokens within a statement like foo=6+bar*3. In defining the structure of the DP document the tokens are not characters but entire lines. For example one "terminal" production—comparable to the id "foo" in the preceding statement—is a no-reflow section defined as

/*
...any amount of lines ...
*/

My first thoughts were based on the idea that I could, for purposes of validating the structure, just reduce the entire document to a string of letters, one letter per line. Suppose for example that

  • an empty line --> E
  • /* --> X
  • */ --> x (note, lowercase)
  • etc for others
  • a nonempty line not otherwise classified --> L

Given that, a rule for a no-fill production would be X(L|E)+x. The only other such structure that the guidelines allow is /#...#/ for a block quote. This means "reflowed text that is indented". The guidelines never seem to have envisaged what else might appear inside a block quote.
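To make the one-letter-per-line idea concrete, here is a minimal Python sketch. The function name encode_lines and the MARKUP_CODES table are my own placeholders, not anything from the guidelines, and only the four codes listed above (E, X, x, L) are handled.

```python
import re

# Hypothetical table of whole-line markup and its one-letter code.
# Only /* and */ are listed here; other markup would be added the same way.
MARKUP_CODES = {"/*": "X", "*/": "x"}

def encode_lines(lines):
    """Reduce a list of text lines to a string with one letter per line."""
    codes = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            codes.append("E")                      # empty line
        elif stripped in MARKUP_CODES:
            codes.append(MARKUP_CODES[stripped])   # markup open or close
        else:
            codes.append("L")                      # any other nonempty line
    return "".join(codes)

# The no-fill rule from the text, applied to the whole encoded string.
NOFILL = re.compile(r"X(L|E)+x")

sample = ["/*", "first line", "", "second line", "*/"]
encoded = encode_lines(sample)
print(encoded)                            # XLELx
print(bool(NOFILL.fullmatch(encoded)))    # True
```

Once the document is a string of letters, validating a section is just a regular expression match on that string.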

I would add support for /C...C/, no-reflow but centered; /R...R/, no-reflow but right-aligned; /T...T/ for tables, really just no-reflow sections but needing special handling by a Translator; and /P...P/, Poetry, in which each line is treated as a separate paragraph with its leading spaces retained. A poetry line that is too long for the current width can be reflowed, but then with a deep indent for the "folded" lines. Now: can any of these appear inside a block quote? Inside each other?

An additional problem arises with three block sections that the guidelines treat in a different way: Illustrations, Sidenotes, and Footnotes. In each case the block begins with left-bracket and a keyword. The block can end on the same line or on a later line; the end of the block is a line that terminates with a right-bracket. But the content of that line before the right bracket is part of the text.

[Illustration: Fig 32: A short and snappy caption.]
[Illustration: Fig 33: A ponderous and lengthy and
especially, long caption that might even include...
...wait for it...
/#
Yes! A Block Quote!
#/
And who knows what else?]

This causes a problem as compared to the other block sections: their delimiters are whole lines, whereas these blocks are delimited by markers that share a line with text. It turns out that for easiest processing, one would like to treat them as if they were broken out on separate lines with an extra empty line. For example, the one line [Footnote B: Content of this note.] is best encoded as if it were

[Footnote B:
Content of this note.

]

And of course if I were scanning the document and building this string, I could do just that: put out at least four characters for a Footnote: perhaps F to start it, then characters representing its content line(s), then an E and a right-bracket.
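A rough sketch of that encoding step in Python follows. The opener pattern (a key word after "Footnote", then a colon) and the name encode_footnote are assumptions of mine, and for simplicity it classifies interior lines only as L or E, ignoring any markup nested inside the note.

```python
import re

# Assumed opener shape: "[Footnote <key>:" followed by optional text.
FOOTNOTE_OPEN = re.compile(r"^\[Footnote\s+\w+:\s*(.*)$")

def encode_footnote(lines):
    """Encode the lines of one footnote block as F, content codes, E, ]."""
    match = FOOTNOTE_OPEN.match(lines[0])
    if match is None:
        raise ValueError("not a footnote opener: %r" % lines[0])
    # Treat the text after the opener as if it began on its own line.
    content = [match.group(1)] + list(lines[1:])
    if not content[-1].rstrip().endswith("]"):
        raise ValueError("footnote is not closed by a right-bracket")
    # Drop the closing bracket; whatever precedes it is still text.
    content[-1] = content[-1].rstrip()[:-1]
    codes = ["F"]
    for i, line in enumerate(content):
        if line.strip():
            codes.append("L")
        elif 0 < i < len(content) - 1:
            codes.append("E")      # keep empty lines inside the note
    if codes[-1] != "E":
        codes.append("E")          # the synthetic empty line before the bracket
    codes.append("]")
    return "".join(codes)

print(encode_footnote(["[Footnote B: Content of this note.]"]))   # FLE]
```

The one-line form and the broken-out four-line form shown above both come out as FLE], which is the point of the exercise.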

Empty lines cause some concern because, unlike the usual computer grammar that treats newlines as just more whitespace, they are semantically meaningful. A paragraph is one or more non-empty lines that terminates with an empty line (or the end of the file, or the right-bracket of a Footnote or Illustration...). A level-2 head (a.k.a. Chapter title) begins with four empty lines, may contain multiple paragraphs and ends with two empty lines. A level-3 or subhead begins with two empty lines and terminates with one.

Also, users are instructed to precede markup openings like /# with an empty line, and to follow a markup close like #/ with one. But that means that both the paragraph and any markup section "eat" the following empty line, so that in fact a Head-2 is signaled by three (not four) empty lines, one surely having been eaten by the preceding construct whatever it was.
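The empty-line arithmetic is easy to get wrong, so here is a small Python check of it, using the Para, Head2 and Head3 rules as they appear in the draft syntax in this post. The splitter function and its name are hypothetical, and it knows only these three rules.

```python
import re

PARA = r"(?:L+E|L+$)"    # a paragraph eats its following empty line

# Alternatives are tried left to right, so the longer head rule comes first.
ELEMENT = re.compile(
    rf"(?P<Head2>EEE{PARA}+E)|(?P<Head3>E{PARA})|(?P<Para>{PARA})"
)

def split_elements(encoded):
    """Split an encoded document into (rule name, matched letters) pairs."""
    elements = []
    pos = 0
    while pos < len(encoded):
        match = ELEMENT.match(encoded, pos)
        if match is None:
            raise ValueError("no rule matches at offset %d" % pos)
        elements.append((match.lastgroup, match.group()))
        pos = match.end()
    return elements

# A paragraph, four empty lines, a chapter title, two empty lines, a paragraph.
print(split_elements("LEEEELEEL"))
# [('Para', 'LE'), ('Head2', 'EEELEE'), ('Para', 'L')]
```

The first paragraph takes one of the four empty lines, leaving exactly three to announce the Head-2.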

Well, that said, with all the above caveats, here is a draft document syntax.

# The nofill/right/center/table sections may only contain L and E;
# any other letter like P in them is an error.
# All the multiline sections absorb a following empty line but
# don't insist on it.

rule Nofill : X[LE]+xE?
rule Right  : R[LE]+rE?
rule Center : C[LE]+cE?
rule Table  : T[LE]+tE? # Table cells can't have Poetry, etc

# Poems can only have lines and blank lines, no /C etc.
# If you want a centered Canto number or right-aligned attribution,
# insert P/ and restart the poem on the next stanza.

rule Poem   : P[LE]+pE?

# A paragraph absorbs the following empty line

rule Para   : (L+E) | (L+$) # $ == end of file

# Assert: every Head2/3 is preceded by some other element that
# eats a terminal E.

rule Head2  : EEE(Para)+E
rule Head3  : E(Para)

# A block quote is allowed to contain text, right/center aligns,
# Poetry or A NESTED QUOTE. Arbitrarily ruling out no-fill and Tables.

rule Quote  : Q(Para|Right|Center|Poem|Quote)+qE?

# A side-note should be just a phrase but who knows? Anyway,
# only Paras.

rule SNote  :  S(Para)+]E?

# Figure captions may contain Quotes, Poems, or Tables. No other
# figures or Footnotes.

rule Figure :  I(Para|Poem|Table|Quote)+]E?

# Footnotes same.

rule FNote  :  F(Para|Poem|Table|Quote)+]E?

# A footnote "landing zone" can have a Head3 and FNotes, or nothing

rule NoteLZ :  N(Head3|FNote)*nE?
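
Most of these rules are plain regular expressions over the letter string, but Quote refers to itself, which Python's re module cannot express directly. As a sanity check on the draft, here is a hand-written recursive-descent sketch for the Quote rule only. This is not a chosen parser design; the names SIMPLE, match_rule and match_quote are hypothetical.

```python
import re

# Non-recursive rules, copied from the draft syntax above.
SIMPLE = {
    "Para":   re.compile(r"L+E|L+$"),
    "Right":  re.compile(r"R[LE]+rE?"),
    "Center": re.compile(r"C[LE]+cE?"),
    "Poem":   re.compile(r"P[LE]+pE?"),
}

def match_rule(name, text, pos):
    """Return the end offset if the named rule matches at pos, else None."""
    match = SIMPLE[name].match(text, pos)
    return match.end() if match else None

def match_quote(text, pos=0):
    """Quote : Q(Para|Right|Center|Poem|Quote)+qE?  Returns end offset or None."""
    if not text.startswith("Q", pos):
        return None
    pos += 1
    count = 0
    while True:
        end = match_quote(text, pos)          # a nested quote
        if end is None:
            for name in ("Para", "Right", "Center", "Poem"):
                end = match_rule(name, text, pos)
                if end is not None:
                    break
        if end is None:
            break                             # no further content item
        pos, count = end, count + 1
    if count == 0 or not text.startswith("q", pos):
        return None
    pos += 1
    if text.startswith("E", pos):             # absorb a following empty line
        pos += 1
    return pos

print(match_quote("QLEQLEqEqE"))   # 10, a quote nested inside a quote
print(match_quote("QXLxq"))        # None, no-fill is ruled out inside a quote
```

One thing this shows about the draft as written: because Para must end in E or at end of file, a quote whose last paragraph runs straight into the closing line (QLq) is rejected, so the last paragraph still needs its empty line before the close.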

With this more or less nailed down I started reading up on parser generators in Python. There is a helpful table of them in the Python wiki and by the end of the day I'd gotten through reading the docs on maybe half of them. More on that tomorrow.
