Thursday, May 28, 2015

Parsing a document: 4, yapps2 results

I succeeded in defining a grammar for an extended form of DP document structure in YAPPS2, compiling it, and running it. The output is a fairly simple program of about 200 LOC. The grammar is based on the idea that I will "tokenize" the document to one character per line, with each character representing the structural value of the line: start of a section, end of a section, empty, or text (and an extra "E]" at the end of a footnote, illustration or sidenote, as discussed a couple days back).
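To make the one-character-per-line idea concrete, here is a minimal sketch of such a tokenizer. The marker strings (`/X`, `X/` and so on) and the names `OPEN_CODES`, `CLOSE_CODES` and `tokenize` are my assumptions for illustration only; the real markup and tokenizer may differ. Bracketed footnotes, illustrations and sidenotes (the F, I, S and "]" codes) are left out to keep it short.

```python
# Hypothetical mapping from block-marker lines to structure codes.
# The marker spellings are an assumption, not the actual DP markup.
OPEN_CODES = {"/X": "X", "/R": "R", "/C": "C", "/T": "T",
              "/P": "P", "/Q": "Q", "/N": "N"}
CLOSE_CODES = {"X/": "x", "R/": "r", "C/": "c", "T/": "t",
               "P/": "p", "Q/": "q", "N/": "n"}

def tokenize(lines):
    """Reduce each line to one character naming its structural role."""
    codes = []
    for line in lines:
        text = line.strip()
        if not text:
            codes.append("E")                # empty line
        elif text in OPEN_CODES:
            codes.append(OPEN_CODES[text])   # start of a section
        elif text in CLOSE_CODES:
            codes.append(CLOSE_CODES[text])  # end of a section
        else:
            codes.append("L")                # ordinary text line
    return "".join(codes)

sample = ["A paragraph line", "", "/P", "a line of verse", "P/", ""]
print(tokenize(sample))
```

Because each character stands for exactly one line here, an offset into the resulting string doubles as a line index into the document.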

Then I feed the string of characters to the generated parser. Either it "recognizes" the string as valid, or it produces a syntax error with enough information that I can tell the user approximately where things go wrong.

Normally a generated parser needs added code to process the string being parsed. But in this case all I want is validation that the structure is correct. Then I can feed the lines to a Translator in sequence. The Translator coder does not need to worry about structure; the Translator can be nearly stateless.

OK, so here is the complete syntax.

%%
parser dpdoc:

    token END:      "$"
    token LINE:     "L"
    token EMPTY:    "E"
    token XOPEN:    "X"
    token XCLOSE:   "x"
    token ROPEN:    "R"
    token RCLOSE:   "r"
    token COPEN:    "C"
    token CCLOSE:   "c"
    token TOPEN:    "T"
    token TCLOSE:   "t"
    token POPEN:    "P"
    token PCLOSE:   "p"
    token FOPEN:    "F"
    token IOPEN:    "I"
    token SOPEN:    "S"
    token BCLOSE:   "\\]"
    token QOPEN:    "Q"
    token QCLOSE:   "q"
    token NOPEN:    "N"
    token NCLOSE:   "n"

    rule NOFILL:    XOPEN ( LINE | EMPTY )* XCLOSE EMPTY? {{ print("nofill") }}
    rule RIGHT:     ROPEN ( LINE | EMPTY )* RCLOSE EMPTY? {{ print("right") }}
    rule CENTER:    COPEN ( LINE | EMPTY )* CCLOSE EMPTY? {{ print("center") }}
    rule TABLE:     TOPEN ( LINE | EMPTY )* TCLOSE EMPTY? {{ print("table") }}
    rule POEM:      POPEN ( LINE | EMPTY )* PCLOSE EMPTY? {{ print("poem") }}
    rule PARA:      LINE+ ( EMPTY | END )  {{ print("para") }}

    rule HEAD:      EMPTY {{print('head...')}} ( PARA {{ print( "...3") }}
                            | EMPTY EMPTY PARA+ EMPTY {{ print("...2") }}
                            )

    rule QUOTE:     QOPEN ( PARA | POEM | RIGHT | CENTER | QUOTE )+ QCLOSE EMPTY? {{ print("quote") }}
    rule FIGURE:    IOPEN ( PARA | POEM | TABLE | QUOTE )+ BCLOSE EMPTY? {{ print("figure") }}
    rule SNOTE:     SOPEN PARA+ BCLOSE EMPTY? {{ print("sidenote") }}
    rule FNOTE:     FOPEN ( PARA | POEM | TABLE | QUOTE )+ BCLOSE EMPTY? {{ print("fnote") }}
    rule FZONE:     NOPEN ( HEAD3 | FNOTE )* NCLOSE EMPTY? {{ print("zone") }}

    rule NOFILLS:   ( NOFILL | RIGHT | CENTER | TABLE | POEM )

    rule goal: EMPTY* ( NOFILLS | PARA | HEAD | QUOTE | FIGURE | SNOTE | FNOTE | FZONE )+ END

The print statements in double braces are for debugging; they go away in the final version. They could be replaced with other Python statements to actually do something at those points in the parse, but as I said, all I want is to know that the parse completes.

The generated code defines a class "dpdoc". When instantiated it makes a parser for these rules. One passes a "scanner" object when making the parser object. By default it is a text scanner defined in the YAPPS2 runtime module, but mine will be quite different.

The HEAD rule caused some issues. Initially I wrote,

rule HEAD2: EMPTY EMPTY EMPTY PARA+ EMPTY
rule HEAD3: EMPTY PARA

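But YAPPS rejected that as ambiguous. It looks ahead only one token, so when it sees EMPTY it can't tell whether it is starting a HEAD2 or a HEAD3. Eventually I found the solution in the Yapps documentation: merge the two into a single HEAD rule that consumes the shared leading EMPTY first and then branches on the next token, as shown above.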
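For anyone who wants to see why the merged rule works with one token of lookahead, here is a hand-written sketch of the same decision. It is not the code YAPPS generates; `match_para` and `match_head` are names I made up, and "END" is modeled simply as running off the end of the string.

```python
def match_para(codes, pos):
    """PARA: LINE+ ( EMPTY | END ). Return the position after the match, or None."""
    start = pos
    while pos < len(codes) and codes[pos] == "L":
        pos += 1
    if pos == start:
        return None        # no LINE at all, so not a paragraph
    if pos == len(codes):
        return pos         # end of input closes the paragraph
    if codes[pos] == "E":
        return pos + 1     # an EMPTY closes the paragraph
    return None

def match_head(codes, pos=0):
    """HEAD: EMPTY ( PARA | EMPTY EMPTY PARA+ EMPTY ).

    Return (kind, end_position) or None if no head starts at pos.
    """
    if pos >= len(codes) or codes[pos] != "E":
        return None
    pos += 1                                   # consume the shared EMPTY
    nxt = codes[pos] if pos < len(codes) else ""
    if nxt == "L":                             # one-token lookahead: PARA branch
        end = match_para(codes, pos)
        return None if end is None else ("head3", end)
    if nxt == "E":                             # lookahead: EMPTY EMPTY PARA+ EMPTY
        if codes[pos:pos + 2] != "EE":
            return None
        end = match_para(codes, pos + 2)
        if end is None:
            return None
        while True:                            # remaining paragraphs of PARA+
            more = match_para(codes, end)
            if more is None:
                break
            end = more
        if end < len(codes) and codes[end] == "E":
            return ("head2", end + 1)
    return None

print(match_head("ELE"))      # one EMPTY, then a paragraph
print(match_head("EEELEE"))   # three EMPTYs, a paragraph, a closing EMPTY
```

The point is that after the first EMPTY is consumed, the very next token (LINE versus EMPTY) is enough to pick the branch, which the two separate rules did not allow.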

Tomorrow I have to figure out how to organize the text line information as I "tokenize" it, so as to have it ready to feed into a Translator. And start to implement that tokenizer.

Edit: actually, now I see an error: the HEAD3 reference in the FZONE rule. That's a relic; there has been no HEAD3 production since the HEAD ambiguity was resolved. Bad test coverage! I'm not sure how to resolve this. I may need to actually rely on output from the parser.
