Friday, May 29, 2015

Parsing a document: 5, first coding bits

I have started the process of integrating a generated parser into my still-growing translators.py module. The big task is to override YAPPS's default "scanner" class with one of my own. The default scanner takes a string or a file and feeds characters to its associated parser on request. But in my case, the characters are distillations of the lines of the document. It turns out I only need to override one method, grab_input().

The interface between the parser and the scanner is not exactly a clean one. The scanner maintains a member "input" which is a string, and a member "pos" which is the index of the next unused char of the string. The parser increments the scanner's pos member as it matches tokens. Once that has pushed pos to the point where pos>=len(input), the parser calls grab_input(). That method is supposed to adjust pos and input so that pos<len(input) again.

In my case, I will usually set pos=0 and set input to a single character, the code for the current line's contents. There are a few cases where I put more than one code in input.
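
As a rough sketch of that idea (not the actual translators.py code): BaseScanner below is only a stand-in for the generated parser's scanner class, with just the "input" and "pos" members and the grab_input() method described above. The one-letter line codes, the classify() helper, the rule that a heading starts with "#", and the drain() loop that plays the part of the parser are all assumptions made up for illustration. The sketch also assumes that leaving input empty is an acceptable way to signal that the lines have run out.

```python
class BaseScanner:
    """Stand-in for the parser's scanner; NOT the real YAPPS class."""

    def __init__(self):
        self.input = ''   # characters not yet consumed by the parser
        self.pos = 0      # index of the next unused character

    def grab_input(self):
        raise NotImplementedError


class LineCodeScanner(BaseScanner):
    """Feeds the parser one code character per document line."""

    def __init__(self, lines):
        super().__init__()
        self._lines = iter(lines)

    @staticmethod
    def classify(line):
        # Hypothetical codes: E = empty line, H = heading, T = text.
        if not line.strip():
            return 'E'
        if line.startswith('#'):
            return 'H'
        return 'T'

    def grab_input(self):
        line = next(self._lines, None)
        self.pos = 0
        if line is None:
            # Assumption: an empty input tells the caller we are done.
            self.input = ''
        else:
            self.input = self.classify(line)


def drain(scanner):
    """Mimic the parser side: consume characters, refilling on demand."""
    codes = []
    while True:
        if scanner.pos >= len(scanner.input):
            scanner.grab_input()
            if scanner.pos >= len(scanner.input):
                break
        codes.append(scanner.input[scanner.pos])
        scanner.pos += 1
    return ''.join(codes)
```

Calling drain(LineCodeScanner(['# Title', '', 'Some text'])) walks the three lines and hands back one code per line, which is all the parser ever gets to see of the document.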

I have this about 3/4 coded, including the code to save the non-empty lines as "work units" ready to hand to a Translator. When the parse of the document is complete, there will be work units for all the lines in a list. The parse having succeeded, I can take the work units and shove them at the translator one at a time.
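
The work-unit idea can be sketched like this. The WorkUnit fields, the collect_work_units() function and the translate() method on the translator are hypothetical names chosen for the example, since the Translator API is not settled yet.

```python
class WorkUnit:
    """One non-empty document line, ready to hand to a translator."""

    def __init__(self, lnum, text):
        self.lnum = lnum   # 1-based line number in the document
        self.text = text   # the line's contents


def collect_work_units(lines):
    """Save every non-empty line as a work unit, in document order."""
    units = []
    for lnum, line in enumerate(lines, start=1):
        if line.strip():
            units.append(WorkUnit(lnum, line))
    return units


class RecordingTranslator:
    """Toy translator that just remembers what it was given."""

    def __init__(self):
        self.seen = []

    def translate(self, unit):
        self.seen.append((unit.lnum, unit.text))


def feed_translator(units, translator):
    """Only called once the parse has succeeded: one unit at a time."""
    for unit in units:
        translator.translate(unit)
```

The point of building the whole list first is that nothing reaches the translator unless the parse of the entire document went through.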

I was slowed down a bit today. I started adding an enum to the code, and discovered that my laptop was still on Python 3.3, so "import enum" didn't work. So I had to stop and install Python 3.4. But then I realized, oh doggone it, now I don't have any of my third-party libs like regex or hunspell, so I had to install them. Or mostly just copy them from the 3.3 site-packages to the 3.4 one. But it took some time.

I still need to fiddle with the document syntax, mostly in order to insert bits of Python code at significant transitions. Then I can let the parser discover things for me. For example, when the parser knows it is starting a heading, I can generate an "Open Heading" work unit, and when the parser finds out which kind of heading it is, I can update that work unit.
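
A minimal sketch of that open-then-update pattern, using the enum module that needs Python 3.4. Every name here (Event, HeadingTracker, open_heading(), set_heading_kind(), the 'chapter' kind) is invented for the example; in the real thing these calls would sit in the bits of Python code embedded in the grammar.

```python
import enum


class Event(enum.Enum):
    OPEN_HEADING = 'Open Heading'


class HeadingTracker:
    """Collects work units as the parser reports transitions."""

    def __init__(self):
        self.units = []
        self._open = None   # the heading unit still waiting for its kind

    def open_heading(self, lnum):
        # Called when the parser knows a heading is starting,
        # before it knows which kind of heading it is.
        unit = {'event': Event.OPEN_HEADING, 'lnum': lnum, 'kind': None}
        self.units.append(unit)
        self._open = unit

    def set_heading_kind(self, kind):
        # Called later, once the parser has matched enough to decide.
        if self._open is None:
            raise ValueError('no heading is open')
        self._open['kind'] = kind
        self._open = None
```

Because the unit is already in the list when it gets updated, the order of work units stays the same as the order of the lines that produced them.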

Anyway, tomorrow or Monday I will have this to a state where I can actually execute it. Hopefully by the end of the week I'll be able to finalize the Translator API and start coding a test Translator.
