Monday, June 1, 2015

Parsing a document, 7: testing (and ranting at Enum)

I am now beginning to test my document-parsing code and things are going very well. It amounted to about 250 LOC (plus the code generated by YAPPS, another 200 or so). For a first execution of brand-new code, it went smoothly. After I picked off 6 or 8 stupid coding errors (like defining some regexes as class members and forgetting to use "self." to reference them) and a couple of small logic errors, it is happily parsing a simple document.

One problem I ran into that took a bit of finagling was this. The generated parser comes from a "grammar" file. I've shown some preliminary grammar code in previous posts. One tricky production is the one for heads:

rule HEAD:      EMPTY {{print('head...')}} ( PARA {{ print( "...3") }}
                            | EMPTY EMPTY PARA+ EMPTY {{ print("...2") }}
                            )

The items in {{double braces}} are Python statements which YAPPS will insert into the generated parser code at the point where parsing reaches that part of the production. In that code the statements are print() calls. But what I really needed was this:

rule HEAD:      EMPTY {{ open_head() }} ( PARA {{ close_head(3) }}
                            | EMPTY EMPTY PARA+ EMPTY {{ close_head(2) }}
                            )

In other words, call functions of mine that will set up a start-heading work unit, and, when the type of heading is known—only after processing the paragraph(s) of text within the heading—back-patch the open-head unit with the type of head it turned out to be, and append the close-head unit.

Well, that code died with an exception because "function open_head() not found." Wut? I was importing the parser with:

from dpdocumentsyntax import DPDOC

which I expected to make the parser class part of the active namespace where functions like open_head() and open_para() were defined. But no: code in an imported module looks up names in that module's own globals, not in the namespace of whoever imports it. I tried several ways to work around this. You can include blocks of code in the generated parser, but if I defined the helpers like open_para() there, they could not see the globals like the WORK_UNITS list they had to modify. Eventually I had to do it in a not very pretty way:

import dpdocumentsyntax
dpdocumentsyntax.open_para = open_para

That is, manually inserting those definitions into the imported namespace.
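The effect can be reproduced without YAPPS at all. In this sketch, a throwaway module built with types.ModuleType stands in for the generated dpdocumentsyntax module, and parse_para() is a made-up function playing the role of generated parser code; both are assumptions for illustration only.

```python
import types

# Stand-in for the generated parser module (contents are hypothetical).
fake_parser = types.ModuleType("fake_parser")
exec("def parse_para():\n    open_para()\n", fake_parser.__dict__)

WORK_UNITS = []


def open_para():
    # Defined in *this* module, so it sees this module's WORK_UNITS.
    WORK_UNITS.append("OPEN_PARA")


# First attempt: the module's code cannot see our open_para().
try:
    fake_parser.parse_para()
    first_error = None
except NameError as err:
    first_error = str(err)

# The workaround: insert the helper into the other module's namespace.
fake_parser.open_para = open_para
fake_parser.parse_para()
```

The helper still runs with the globals of the module where it was defined, which is why it can reach WORK_UNITS even though it is called from the other module.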

Anyway, as it parses, the code builds a list of "work unit" objects that will eventually be fed to a Translator as "events". A typical sequence of work units, or events, would be,

  • Open head(2) (Chapter head)
  • Open paragraph
  • Line (text: "CHAPTER ONE")
  • Close paragraph
  • Close head(2)
  • Open paragraph
  • Line (text)
  • Line (text)
  • Close paragraph

And so forth. There are all told 30 different possible "events" and I expect to pass each to a Translator with a code signifying what kind of event it is, e.g. Open Paragraph, close BlockQuote, or Open Illustration Caption, etc. So how should these codes be defined? Obviously there must be names for them, like OPEN_PARA, CLOSE_FNOTE and so forth. And obviously these will be in a module the Translator can include, perhaps so:

from xlate_utils import EVENTS

Then the coder can make decisions by comparing to EVENTS.OPEN_PARA and the like.

Looks like a job for an Enum, right? The Enum "type" (it is a class in the standard library's enum module, not a built-in) was added to Python in version 3.4, and having played with it, I cannot fathom why they bothered. It has to be the most useless addition to the library ever. But check this out.

>>> from enum import Enum
>>> class ECode( Enum ):
...   VAL1 = '1'
...   VAL2 = '2'
...
>>> ECode.VAL1
<ECode.VAL1: '1'>
>>> '1' == ECode.VAL1
False
>>> edict = { ECode.VAL1: 1, ECode.VAL2: 2 }
>>> edict
{<ECode.VAL2: '2'>: 2, <ECode.VAL1: '1'>: 1}
>>> edict['1']
Traceback (most recent call last):
  File "<string>", line 1, in <fragment>
builtins.KeyError: '1'
>>> ECode.VAL1 < ECode.VAL2
Traceback (most recent call last):
  File "<string>", line 1, in <fragment>
builtins.TypeError: unorderable types: ECode() < ECode()

Now, for something completely different:

>>> class HCode( object ) :
...   VAL1 = '1'
...   VAL2 = '2'
...
>>> HCode.VAL1
'1'
>>> HCode.VAL1 == '1'
True
>>> hdict = { HCode.VAL1 : 1, HCode.VAL2 : 'to' }
>>> hdict
{'1': 1, '2': 'to'}
>>> hdict[ '2' ]
'to'
>>> hdict[ HCode.VAL1 ]
1
>>> HCode.VAL1 < HCode.VAL2
True

What I'm saying is that, for this job, a simple class definition accomplishes everything I wanted from the "Enum" class, and also has "real" values that can be compared and ordered. There is the one tiny drawback that a user could assign to HCode.VAL1, but that can, I believe, be prevented, for instance by a decorator or a metaclass.
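On that last point, here is one possible way to block reassignment, sketched with a metaclass rather than a bare decorator. An assignment like HCode.VAL1 = x is handled by the class's metaclass, so a decorator would most likely have to rebuild the class with something like this anyway. The name ReadOnlyMeta is my own invention, and this is an untuned sketch, not a recommendation.

```python
class ReadOnlyMeta(type):
    """Metaclass that refuses assignment to or deletion of class attributes."""

    def __setattr__(cls, name, value):
        raise AttributeError(
            "cannot assign to constant %s.%s" % (cls.__name__, name))

    def __delattr__(cls, name):
        raise AttributeError(
            "cannot delete constant %s.%s" % (cls.__name__, name))


class HCode(metaclass=ReadOnlyMeta):
    # Attributes set in the class body are not routed through __setattr__,
    # so the definitions themselves still work.
    VAL1 = '1'
    VAL2 = '2'


try:
    HCode.VAL1 = 'oops'
    assignment_blocked = False
except AttributeError:
    assignment_blocked = True
```

The values stay plain strings, so comparisons such as HCode.VAL1 == '1' and HCode.VAL1 < HCode.VAL2 behave exactly as in the session above.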

So I will be providing the 30 event codes as a class EVENTS that is really a class and performs what a C header file does: give names to arbitrary literals.
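A rough sketch of how that might look from the Translator's side follows. Only five of the 30 codes are shown, and the literal values, the (code, data) tuple layout, and the HTML output are all placeholders I made up for the example, not the real design.

```python
class EVENTS(object):
    # Names follow the post; the literal values are arbitrary placeholders.
    OPEN_PARA = 'open_para'
    CLOSE_PARA = 'close_para'
    OPEN_HEAD = 'open_head'
    CLOSE_HEAD = 'close_head'
    LINE = 'line'


def translate(work_units):
    """Toy Translator: turn (code, data) events into an HTML-ish string."""
    out = []
    for code, data in work_units:
        if code == EVENTS.OPEN_HEAD:
            out.append('<h%d>' % data)
        elif code == EVENTS.CLOSE_HEAD:
            out.append('</h%d>' % data)
        elif code == EVENTS.OPEN_PARA:
            out.append('<p>')
        elif code == EVENTS.CLOSE_PARA:
            out.append('</p>')
        elif code == EVENTS.LINE:
            out.append(data)
    return ''.join(out)


# The chapter-head sequence from the list earlier in the post.
units = [
    (EVENTS.OPEN_HEAD, 2),
    (EVENTS.OPEN_PARA, None),
    (EVENTS.LINE, 'CHAPTER ONE'),
    (EVENTS.CLOSE_PARA, None),
    (EVENTS.CLOSE_HEAD, 2),
]
result = translate(units)
```

Since the codes are ordinary strings, they can also be used directly as dict keys if a dispatch table turns out to be tidier than the if/elif chain.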
