Tuesday, May 26, 2015

Parsing a document: 2, parser generator or own code?

So the question still open is, should I use a parser generator to make a dedicated parser that could validate the document structure according to the syntax I described yesterday? Or, should I just hack out my own recursive-descent parser?

Yesterday's post has a link to a table with 20 or so different Python-based parser generators. I'm still going through them one by one, reading their documentation, trying to decide if (and how) I could use them.

A parser generator is basically a stand-alone Python program that reads a syntax description and writes a module of code that can parse that syntax. One hopes the parsing would be fast, and when an error is found, it is reported with some accuracy. Another requirement for me, is that the generated parser be pure Python with no dependencies on C modules. The generator itself can have such dependencies because it would not be distributed; only the generated parser would be part of the product.

There are some advantages to using a generated parser. First, accuracy. One would expect that the parser would correctly parse exactly the specified syntax. Any errors would be in the syntax itself, not the parser. Second, modifiability. If for some reason the syntax needs to be changed in any way, it's just a matter of editing a syntax file and generating a new parser, and that's that. No code changes. Finally, error reporting. In principle the parser can report the location of an error, or at least the location it finds the error, accurately.

The alternative is to write my own parser, guided by the syntax. It would loop over the lines of the document, pushing things onto a stack when it sees an opening tag like /* and popping them off when it sees a closing tag. It would implement the rules of what can appear inside what by testing against some sort of table. It would probably implement things like "swallowing" the E? optional empty line after certain productions using some hack or other to save time.

This approach is tempting just because I know from experience that the huge majority of documents are nearly flat with very little nested structure and few errors. So an actual parse is in a way, overkill. But a hand-made parser is also the mirror image of the generated parser: the syntax rules are distributed through 500-odd lines of code and hard to change, as well as impossible to certainly verify. Error reporting might or might not be as good.

I've got to finish surveying the parser generators, pick the likeliest one, and do maybe a few small experiments to understand how to use it, before I decide.

No comments: