Friday, May 29, 2015

Parsing a document: 5, first coding bits

I have started the process of integrating a generated parser into my still-growing translators.py module. The big task is to override YAPPS's default "scanner" class with one of my own. The default scanner takes a string or a file and gives its associated parser characters on request. But in my case, the characters are distillations of the lines of the document. It turns out I only need to override one method, grab_input().

The interface between the parser and the scanner is not exactly a clean one. The scanner maintains a member "input" which is a string, and a member "pos" which is an index to the next unused char of the string. The parser increments the scanner's pos member as it matches tokens. When it has caused pos>=len(input), it calls grab_input(). That method is supposed to adjust pos and input so that pos<len(input).

In my case, I will usually set pos=0 and set input to a single character, the code for the current line's contents. There are a few cases where I put more than one code in input.
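To make that concrete, here is a minimal sketch of the override. The members input and pos and the method grab_input() are as described above; the import path of the YAPPS runtime and the base-class constructor arguments are assumptions on my part, and the shape of the line-code iterator is purely illustrative.

from yapps import runtime    # import path varies between YAPPS2 packagings

class LineScanner( runtime.Scanner ):
    ''' Feed the generated parser one structure code per document line. '''
    def __init__( self, line_codes ):
        # Base-class arguments (patterns, ignore, input) are a guess at the
        # runtime signature; the real tokens arrive through grab_input().
        super().__init__( [], [], '' )
        self.line_codes = iter( line_codes )   # e.g. iter( 'ELLEXLLx...' )

    def grab_input( self ):
        # Called when the parser has used up self.input (pos >= len(input)):
        # supply the code(s) for the next line and reset pos.
        self.input = next( self.line_codes, '$' )   # '$' == END token at EOF
        self.pos = 0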

I have this about 3/4 coded, including the code to save the non-empty lines as "work units" ready to hand to a Translator. When the parse of the document is complete, there will be work units for all the lines in a list. The parse having succeeded, I can take the work units and shove them at the translator one at a time.

I was slowed down a bit today. I started adding an enum to the code, and discovered that my laptop was still on Python 3.3, so "import enum" didn't work. So I had to stop and install Python 3.4. But then I realized, oh doggone it, now I don't have any of my third-party libs like regex or hunspell, so I had to install them. Or mostly just copy them from the 3.3 site-packages to the 3.4 one. But it took some time.

I still need to fiddle with the document syntax, mostly in order to insert bits of Python code at significant transitions. Then I can let the parser discover things for me. For example, when the parser knows it is starting a heading, I can generate an "Open Heading" work unit, and when the parser finds out which kind of heading it is, I can update that work unit.

Anyway, tomorrow or Monday I will have this to a state where I can actually execute it. Hopefully by the end of the week I'll be able to finalize the Translator API and start coding a test Translator.

Thursday, May 28, 2015

Parsing a document: 4, yapps2 results

I succeeded in defining a grammar for an extended form of DP document structure in YAPPS2, compiling it, and running it. The output is a fairly simple program of about 200 LOC. The grammar is based on the idea that I will "tokenize" the document to one character per line, with each character representing the structural value of the line: start of a section, end of a section, empty, or text (and an extra "E]" at the end of a footnote, illustration or sidenote, as discussed a couple days back).

Then I feed the string of characters to this generated parser. Either it will "recognize" the string as valid, or it will produce a syntax error with enough information that I can tell the user the approximate location where the document goes wrong.

Normally a generated parser needs added code to process the string being parsed. But in this case all I want is validation that the structure is correct. Then I can feed the lines to a Translator in sequence. The Translator coder does not need to worry about structure; the Translator can be nearly stateless.

OK, so here is the complete syntax.

%%
parser dpdoc:

    token END: "$"
    token LINE:     "L"
    token EMPTY:    "E"
    token XOPEN:    "X"
    token XCLOSE:   "x"
    token ROPEN:    "R"
    token RCLOSE:   "r"
    token COPEN:    "C"
    token CCLOSE:   "c"
    token TOPEN:    "T"
    token TCLOSE:   "t"
    token POPEN:    "P"
    token PCLOSE:   "p"
    token FOPEN:    "F"
    token IOPEN:    "I"
    token SOPEN:    "S"
    token BCLOSE:   "\\]"
    token QOPEN:    "Q"
    token QCLOSE:   "q"
    token NOPEN:    "N"
    token NCLOSE:   "n"

    rule NOFILL:    XOPEN ( LINE | EMPTY )* XCLOSE EMPTY? {{ print("nofill") }}
    rule RIGHT:     ROPEN ( LINE | EMPTY )* RCLOSE EMPTY? {{ print("right") }}
    rule CENTER:    COPEN ( LINE | EMPTY )* CCLOSE EMPTY? {{ print("center") }}
    rule TABLE:     TOPEN ( LINE | EMPTY )* TCLOSE EMPTY? {{ print("table") }}
    rule POEM:      POPEN ( LINE | EMPTY )* PCLOSE EMPTY? {{ print("poem") }}
    rule PARA:      LINE+ ( EMPTY | END )  {{ print("para") }}

    rule HEAD:      EMPTY {{print('head...')}} ( PARA {{ print( "...3") }}
                            | EMPTY EMPTY PARA+ EMPTY {{ print("...2") }}
                            )

    rule QUOTE:     QOPEN ( PARA | POEM | RIGHT | CENTER | QUOTE )+ QCLOSE EMPTY? {{ print("quote") }}
    rule FIGURE:    IOPEN ( PARA | POEM | TABLE | QUOTE )+ BCLOSE EMPTY? {{ print("figure") }}
    rule SNOTE:     SOPEN PARA+ BCLOSE EMPTY? {{ print("sidenote") }}
    rule FNOTE:     FOPEN ( PARA | POEM | TABLE | QUOTE )+ BCLOSE EMPTY? {{ print("fnote") }}
    rule FZONE:     NOPEN ( HEAD3 | FNOTE )* NCLOSE EMPTY? {{ print("zone") }}

    rule NOFILLS:   ( NOFILL | RIGHT | CENTER | TABLE | POEM )

    rule goal: EMPTY* ( NOFILLS | PARA | HEAD | QUOTE | FIGURE | SNOTE | FNOTE | FZONE )+ END

The print statements in double braces are for debugging; they go away in the final. They could be replaced with other Python statements to actually do something at those points in the parse, but as I said, all I want is to know that the parse completes.

The generated code defines a class "dpdoc". When instantiated it makes a parser for these rules. One passes a "scanner" object when making the parser object. By default it is a text scanner defined in the YAPPS2 runtime module, but mine will be quite different.
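So invoking it should look roughly like this. Every name here except dpdoc is a placeholder, and I am assuming, from the calc example, that each grammar rule becomes a method of the generated class:

import dpdoc                        # the module yapps2 generated (name assumed)

scanner = MyLineScanner( codes )    # my custom scanner (hypothetical name)
parser = dpdoc.dpdoc( scanner )     # the generated class, named for the grammar
parser.goal()                       # run the start rule; a structural error
                                    # surfaces as a SyntaxError from the runtime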

The HEAD rule caused some issues. Initially I wrote,

rule HEAD2: EMPTY EMPTY EMPTY PARA+ EMPTY
rule HEAD3: EMPTY PARA

But YAPPS rejected that as ambiguous. It only looks ahead one token. When it sees EMPTY it can't tell if it is starting a HEAD2 or a HEAD3. Eventually I found the solution in the Yapps doc, as shown above.

Tomorrow I have to figure out how to organize the text line information as I "tokenize" it, so as to have it ready to feed into a Translator. And start to implement that tokenizer.

Edit: actually now I see an error, the HEAD3 call in the FZONE rule. That's a relic; there is no HEAD3 production since the HEAD ambiguity was resolved. Bad test coverage! Not sure how to resolve this. May need to actually rely on output from the parser.

Tuesday, May 26, 2015

Parsing a document: 3, trying out YAPPS2

I have winnowed down the list of 34 (yes, you read that correctly) parser generators to a very short list of ones that (a) are pure Python, (b) are documented in a readable and complete fashion, and (c) appear to allow the user to tinker with the tokenizer—as opposed to being locked in to parsing strings or files. That group is:

I've spent the last two afternoons reading parser docs until my eyes bleed, and their features are starting to run together. It would be a fascinating exercise, and useful to the Python community, to spend a few weeks really sorting out those offerings and putting together a paper with comparative code examples, timings, and such. I don't have time to do it well even for the short list above. Maybe someday.

Anyway, to commence I thought I'd generate a parser using YAPPS2, and trace through the code of the generated parser and really get a handle on what it does. So I downloaded it. First thing to note is that the download link in PyPi doesn't really go to a download page, but to a page that points, in a confused manner, in several directions: to a Debian package, another Debian package "with improvements", and to a Github repo that is supposedly Python 3 compatible. But it isn't. But there's a link to a set of patches for the Github code that fixes quite a few Python 3 issues, notably print statements. But it wasn't complete; very shortly after applying it I ran into an unfixed "except Exception, e" and soon after, another unfixed print statement. So it's an adventure getting it going.

But I got it to where I could begin to try the first example in the manual. Which is clearly very old, because this is supposedly the YAPPS2 manual, but the example has you "import yapps"—it has to be "import yapps2" now. And that did not work either; it immediately stopped with an undefined name. Exploring, it turns out that the code is such that the hand-execution shown in the manual (start Python and type "import yapps2; yapps2.generate('filename')") cannot possibly work. A critical statement "from yapps import grammar" is only executed when yapps2.py is run from the command line.

OK, the generate step now reads the "calc" (basic calculator) example definition and writes a small and readable Python program. Which upon execution reveals several more Python 2/3 issues, including use of raw_input and some more print statements. But when I manually fixed those, it actually worked, reading expressions, parsing them, and printing the results.

My brain is a bit fried at this point; gonna take a nap now; tomorrow is Museum work day; resume this on Thursday.

Parsing a document: 2, parser generator or own code?

So the question still open is, should I use a parser generator to make a dedicated parser that could validate the document structure according to the syntax I described yesterday? Or, should I just hack out my own recursive-descent parser?

Yesterday's post has a link to a table with 20 or so different Python-based parser generators. I'm still going through them one by one, reading their documentation, trying to decide if (and how) I could use them.

A parser generator is basically a stand-alone Python program that reads a syntax description and writes a module of code that can parse that syntax. One hopes the parsing would be fast, and when an error is found, it is reported with some accuracy. Another requirement for me, is that the generated parser be pure Python with no dependencies on C modules. The generator itself can have such dependencies because it would not be distributed; only the generated parser would be part of the product.

There are some advantages to using a generated parser. First, accuracy. One would expect the parser to correctly parse exactly the specified syntax. Any errors would be in the syntax itself, not the parser. Second, modifiability. If for some reason the syntax needs to change in any way, it's just a matter of editing a syntax file and generating a new parser, and that's that. No code changes. Finally, error reporting. In principle the parser can report the location of an error, or at least the location where it finds the error, accurately.

The alternative is to write my own parser, guided by the syntax. It would loop over the lines of the document, pushing things onto a stack when it sees an opening tag like /* and popping them off when it sees a closing tag. It would implement the rules of what can appear inside what by testing against some sort of table. It would probably implement things like "swallowing" the E? optional empty line after certain productions using some hack or other to save time.
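The core of such a hand-made parser might look something like the sketch below. It is purely illustrative: the tags and the containment table are placeholders, not the real Guidelines rules.

# What may open inside what; None stands for the document top level.
ALLOWED_INSIDE = {
    None : { '/*', '/#', '/C', '/R', '/T', '/P' },
    '/#' : { '/#', '/C', '/R', '/P' },            # block quote
}
CLOSER_FOR = { '/*':'*/', '/#':'#/', '/C':'C/', '/R':'R/', '/T':'T/', '/P':'P/' }

def validate( lines ):
    stack = []                                    # (open tag, line number)
    for number, line in enumerate( lines, start=1 ):
        tag = line.strip()
        if tag in CLOSER_FOR :                    # an opening tag
            parent = stack[-1][0] if stack else None
            if tag not in ALLOWED_INSIDE.get( parent, set() ) :
                return 'line {}: {} not allowed here'.format( number, tag )
            stack.append( (tag, number) )
        elif tag in CLOSER_FOR.values() :         # a closing tag
            if stack and tag == CLOSER_FOR[ stack[-1][0] ] :
                stack.pop()
            else :
                return 'line {}: unexpected {}'.format( number, tag )
    if stack :
        return 'line {}: {} never closed'.format( stack[-1][1], stack[-1][0] )
    return None                                   # structure looks valid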

This approach is tempting just because I know from experience that the huge majority of documents are nearly flat, with very little nested structure and few errors. So an actual parse is, in a way, overkill. But a hand-made parser is also the mirror image of the generated parser: the syntax rules are distributed through 500-odd lines of code, hard to change, and impossible to verify with certainty. Error reporting might or might not be as good.

I've got to finish surveying the parser generators, pick the likeliest one, and do maybe a few small experiments to understand how to use it, before I decide.

Monday, May 25, 2015

Parsing a document: 1, a document structure syntax

First thing today I sat down and worked out a BNF-style notation for the document structure I want to support. This is the structure that is only implicit in the DP "Formatting Guidelines", but with the block structure augmented.

As a parsing problem this is unusual in that most parsers and parser documentation are focused on parsing tokens that are substrings in a line, for example, the tokens within a statement like foo=6+bar*3. In defining the structure of the DP document the tokens are not characters but entire lines. For example one "terminal" production—comparable to the id "foo" in the preceding statement—is a no-reflow section defined as

/*
...any amount of lines ...
*/

My first thoughts were based on the idea that I could, for purposes of validating the structure, just reduce the entire document to a string of letters, one letter per line. Suppose for example that

  • an empty line --> E
  • /* --> X
  • */ --> x (note, lowercase)
  • etc for others
  • a nonempty line not otherwise classified --> L

Given that, a rule for a no-fill production would be X(L|E)+x. The only other such structure that the guidelines allow is /#...#/ for a block quote. This means "reflowed text that is indented". The guidelines never seem to have envisaged what else might appear inside a block quote.
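A sketch of that reduction, strictly following the (deliberately incomplete) mapping listed above; the names are mine, and the bracket sections and extra codes come later:

LINE_CODES = { '' : 'E', '/*' : 'X', '*/' : 'x' }

def reduce_to_codes( lines ):
    # One code character per document line; any unrecognized line is text.
    return ''.join( LINE_CODES.get( line.strip(), 'L' ) for line in lines )

# reduce_to_codes( ['/*', 'no-reflow text', '', '*/'] ) --> 'XLEx'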

I would add support for /C...C/, no-reflow but centered; /R...R/, no-reflow but right-aligned; /T...T/ for tables, really just no-reflow sections but needing special handling by a Translator; and /P...P/, Poetry, in which each line is treated as a separate paragraph, but leading spaces are retained and a line can be reflowed if it is too long for the current width, but then with a deep indent for the "folded" lines. Now: can any of these appear inside a block quote? Inside each other?

An additional problem arises with three block sections that the guidelines treat in a different way: Illustrations, Sidenotes, and Footnotes. In each case the block begins with a left bracket and a keyword. The block can end on the same line or on a later line; the end of the block is a line that terminates with a right bracket. But the content of that line before the right bracket is part of the text.

[Illustration: Fig 32: A short and snappy caption.]
[Illustration: Fig 33: A ponderous and lengthy and
especially, long caption that might even include...
...wait for it...
/#
Yes! A Block Quote!
#/
And who knows what else?]

This causes a problem compared to the other block sections: their delimiters are whole lines, whereas these blocks are delimited by markers that share their line(s) with text. It turns out that for easiest processing, one would like to treat them as if they were broken out on separate lines with an extra empty line. For example, the one line [Footnote B: Content of this note.] is best encoded as if it were

[Footnote B:
Content of this note.

]

And of course if I were scanning the document and building this string, I could do just that: put out at least four characters for a Footnote: perhaps F to start it, then characters representing its content line(s), then an E and a right bracket.
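A hedged illustration of that expansion for the single-line case; the regular expression and the names are mine, not the real tokenizer:

import re

ONE_LINE_NOTE = re.compile( r'^\[Footnote\s+[^:]+:\s*(?P<text>.*)\]$' )

def codes_for( line ):
    m = ONE_LINE_NOTE.match( line )
    if m :
        # F to open, L for the caption text, then the synthetic E and ]
        return 'F' + ( 'L' if m.group('text') else '' ) + 'E' + ']'
    # ...other cases: multi-line notes, Illustrations, Sidenotes...
    return 'L'

# codes_for( '[Footnote B: Content of this note.]' ) --> 'FLE]'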

Empty lines cause some concern because, unlike the usual computer grammar that treats newlines as just more whitespace, they are semantically meaningful. A paragraph is one or more non-empty lines that terminates with an empty line (or the end of the file, or the right-bracket of a Footnote or Illustration...). A level-2 head (a.k.a. Chapter title) begins with four empty lines, may contain multiple paragraphs and ends with two empty lines. A level-3 or subhead begins with two empty lines and terminates with one.

Also, users are instructed to precede markup openings like /# with an empty line, and to follow a markup close like #/ with one. But that means that both the paragraph and any markup section "eat" the following empty line, so that in fact a Head-2 is signaled by three (not four) empty lines, one having surely been eaten by the preceding construct, whatever it was.

Well, that said, with all the above caveats, here is a draft document syntax.

# The nofill/right/center/table sections may only contain L and E;
# any other letter (like P) inside them is an error.
# All the multiline sections absorb a following empty line but
# don't insist on it.

rule Nofill : X[LE]+xE?
rule Right  : R[LE]+rE?
rule Center : C[LE]+cE?
rule Table  : T[LE]+tE? # Table cells can't have Poetry, etc

# Poems can only have lines and blank lines, no /C etc.
# If you want a centered Canto number or right-aligned attribution,
# insert P/ and restart the poem on the next stanza.

rule Poem   : P[LE]+pE?

# A paragraph absorbs the following empty line

rule Para   : (L+E) | (L+$) # $ == end of file

# Assert: every Head2/3 is preceded by some other element that
# eats a terminal E.

rule Head2  : EEE(Para)+E
rule Head3  : E(Para)

# A block quote is allowed to contain text, right/center aligns,
# Poetry or A NESTED QUOTE. Arbitrarily ruling out no-fill and Tables.

rule Quote  : Q(Para|Right|Center|Poem|Quote)+qE?

# A side-note should be just a phrase but who knows? Anyway,
# only Paras.

rule SNote  :  S(Para)+]E?

# Figure captions may contain Quotes, Poems, or Tables. No other
# figures or Footnotes.

rule Figure :  I(Para|Poem|Table|Quote)+]E?

# Footnotes same.

rule FNote  :  F(Para|Poem|Table|Quote)+]E?

# A footnote "landing zone" can have a Head3 and FNotes, or nothing

rule NoteLZ :  N(Head3|FNote)*nE?

With this more or less nailed down I started reading up on parser generators in Python. There is a helpful table of them in the Python wiki and by the end of the day I'd gotten through reading the docs on maybe half of them. More on that tomorrow.

Friday, May 22, 2015

Translator options dialog working

So I took my own advice and created a class to represent a dialog option. The author of a Translator codes some of these in his module as global variables, e.g.

import xlate_utils as XU
MAX_LINE_QUERY = XU.Dialog_Item( kind='number', label='Max Line',
                                tooltip='Maximum line width, default is 72',
                                minimum=50, maximum=90, result=72 )
...
OPTIONS_DIALOG = [ MAX_LINE_QUERY, ... ]

The Dialog_Item class definition has only one method, __init__, and it has about 120 lines of code that is mostly validation that the parameters are of the proper types. Errors get logged, but some kind of object is always created that the rest of the code can display, if only as a QLabel reading "Bad Definition Here".
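A much-trimmed sketch of the idea; the real __init__ runs to about 120 lines, and the set of kind names here is only my guess based on the usage example above:

import logging

class Dialog_Item( object ):
    KINDS = ( 'checkbox', 'string', 'number' )    # assumed set of kinds

    def __init__( self, kind='string', label='', tooltip='',
                  minimum=0, maximum=0, result=None ):
        if kind not in self.KINDS :
            logging.error( 'unknown dialog item kind %r', kind )
            kind, label = 'string', 'Bad Definition Here'
        if not isinstance( label, str ) or not label :
            logging.error( 'dialog item label missing or not a string' )
            label = 'Bad Definition Here'
        # ...many more checks of the same flavor...
        self.kind, self.label, self.tooltip = kind, label, str( tooltip )
        self.minimum, self.maximum, self.result = minimum, maximum, result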

When the Translator is requested, the first thing that happens is to look into the loaded translator module for OPTIONS_DIALOG as a list of Dialog_Item objects. If it's there, a dialog is prepared. Here's one of the test cases in action.

So that all went together nicely. The code to build the dialog, display it, and capture the values set by the user is all quite compact.
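Roughly, the builder does something like the following. This is a compressed sketch assuming PyQt5 widgets and the Dialog_Item attributes shown above; the real code handles more item kinds and edge cases.

from PyQt5.QtWidgets import ( QDialog, QFormLayout, QCheckBox, QSpinBox,
                              QLineEdit, QDialogButtonBox )

def run_options_dialog( items, parent=None ):
    dlg = QDialog( parent )
    form = QFormLayout( dlg )
    widgets = []
    for item in items :
        if item.kind == 'checkbox' :
            w = QCheckBox()
            w.setChecked( bool( item.result ) )
        elif item.kind == 'number' :
            w = QSpinBox()
            w.setRange( item.minimum, item.maximum )
            w.setValue( item.result )
        else :                                   # 'string'
            w = QLineEdit( str( item.result ) )
        w.setToolTip( item.tooltip )
        form.addRow( item.label, w )
        widgets.append( ( item, w ) )
    buttons = QDialogButtonBox( QDialogButtonBox.Ok | QDialogButtonBox.Cancel )
    buttons.accepted.connect( dlg.accept )
    buttons.rejected.connect( dlg.reject )
    form.addRow( buttons )
    if dlg.exec_() == QDialog.Accepted :
        # stow the user's choices back into the item objects
        for item, w in widgets :
            if isinstance( w, QCheckBox ) :
                item.result = w.isChecked()
            elif isinstance( w, QSpinBox ) :
                item.result = w.value()
            else :
                item.result = w.text()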

Well, that was the fun part. The next part, which is really central to all Translators, is code to parse a document in DP format and reduce it to elemental parts, which can then be fed to any Translator for conversion. "Here's a Chapter Head, convert it. Here's a sub-head, convert it." And so on. But it goes much deeper than that, especially since I want to support nesting of, for example, a Poetry section inside a Footnote or Illustration, and so on. It's non-trivial.

I did this in V1 in a kind of hand-crafted semi-intuitive way. But the only consumer of the output of that parser was my own code for ASCII reflow and HTML conversion. Now I have to look to a generic consumer, a coder who is not me. And I do not want to expose any of the PPQT internals, like the way a document is stored, to the Translator. It needs a very clean, arms-length API. And I would like the parser to have a better foundation in computer science instead of just a big hacky loop with a bunch of state variables, as in V1.

I've been pointed to a partial DP document parser that I need to read. And I have some ideas of how to go at it. But it's a big and critical chunk of work that I need to do next.

Tuesday, May 19, 2015

A dict has no attrs -- or actually, it does

So I'm starting to code the Translator interface. I've been planning this and making notes on it for months and it's fun to start making it real. Also fun to be back into code-and-test mode after a long spell just flogging installation and bundling problems.

I mean to offer each Translator a simple way to query the user for options. The Translator module uses a simple, static, declarative API to describe what it needs to know from the user. (I talked about this earlier but I've made the API simpler and nicer since.) When the user calls for that translation, I'll whomp up a QDialog on the fly with the necessary widgets—QCheckbox, QSpinbox, QLineEdit—show them to the user, and stow the user's input back where the Translator can refer to it. I'm almost ready to start testing this support, except I ran into something that's making me rethink the details of the API. And realize the frustrating limitations of the Python namedtuple class.

What I'm currently asking the coder to do is to describe each dialog item as a dict, for example to ask for a yes/no choice,

OMIT_TABLES = {
    "Type" : "Checkbox",
    "Label" : "Omit tables?",
    "Tooltip" : "Check this if the translation should skip /T tables",
    "Result" : False
}

That's a nice enough API. Except for the visual effect of all the quotes, which make it look like it needs a shave! So I was writing the code to validate one of these. I dare not assume my client Python coder has done it right, so I need to check everything. Does it have all, but only, the keys it should have? Is everything that is supposed to be a string, a string?

I'd been writing code to interrogate the imported Translator module, which is a Python namespace. You use hasattr() and getattr() for this. So in writing the code to check that one of these dicts was all correct, I wrote things like if hasattr(item, 'Result')... which seemed very natural, but darn, it didn't work.

The hasattr() call didn't throw an error and complain that it wasn't applicable to dictionaries. It just returned False on every test I wrote. OK, I understand that the right way to know if a dict has a certain key is to write if 'Result' in item.... But why didn't hasattr() complain, if it didn't "do" dictionaries?

The answer, I think, is that everything in Python is an object of a class. The hasattr() function interrogates objects, and a dict is an object of class dict. It actually has some attributes such as __repr__ and the like. But its keys are not attributes. It's just that I am trying to use a dict as if it were a "record" in the Pascal sense, which isn't its intended use.
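The behavior in miniature:

item = { 'Type' : 'Checkbox', 'Result' : False }
hasattr( item, 'Result' )     # False: 'Result' is a key, not an attribute
'Result' in item              # True: the proper membership test for a dict
hasattr( item, '__repr__' )   # True: the attributes of class dict are there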

So I thought to myself, is there something that is more like a record, or some way I could make this API more like an old assembler macro call? Well, there is the namedtuple. Using a namedtuple I could do something like this:

## the translator-coder would import a support module with...
from collections import namedtuple
dialog_item = namedtuple('dialog_item',['type','label','tooltip','result'])
## in the Translator module the coder could then write,
OMIT_TABLES = dialog_item(type = 'checkbox',
                          label = 'Omit tables?',
                          tooltip = 'Check this if the translation should skip /T tables',
                          result = False)

Which is an even nicer interface, has less of the fuzzy look. Bonus for me, there's less to check, as there's no question of wrongly spelled keys. But a big problem, there's no way to omit any keys, either. A namedtuple "factory" like dialog_item above will throw an error if it is called with one of the defined keys not supplied. That's not good, because for example, the tooltip should be optional. And some dialog types need additional fields (like "min" and "max" on the Number type) that should be omitted for the other types.
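A quick illustration (as of Python 3.4 there is no way to declare defaults for namedtuple fields):

OMIT_TABLES = dialog_item( type = 'checkbox',
                           label = 'Omit tables?',
                           result = False )
# raises TypeError because the 'tooltip' field was not supplied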

Well, heck. All that namedtuple() is doing is declaring a class. It's a class factory: it generates class definitions. So I could, with a bit of thought, just code up a class definition with its own initializer that did allow attributes to be optional. Which would, one, be cleaner than making the coder write all those quote-marks; and two, allow me to do validity checking at declaration time.

So back to the drawing board on this API.

Sunday, May 17, 2015

"Watson" reads this blog...

Following a link from a thread on Reddit, I found out I could put a sample of my writing into a linguistic analyzer powered by the IBM "Watson" computer (or is it an algorithm?).

Upon reading 500 or so words from the previous post, "Watson" concludes that:

You are inner-directed, shrewd and can be perceived as critical. You are authority-challenging: you prefer to challenge authority and traditional values to help bring about positive changes. You are independent: you have a strong desire to have time to yourself. And you are reserved: you are a private person and don't let many people in. Your choices are driven by a desire for revelry. You consider achieving success to guide a large part of what you do: you seek out opportunities to improve yourself and demonstrate that you are a capable person. You are relatively unconcerned with tradition: you care more about making your own path than following what others have done.

Not sure where that "desire for revelry" comment comes from. Otherwise—fair enough...

Also: I put in samples from a couple of other blogs (J.T. Eberhardt's and Dana Hunter's) and the results were extremely different from the above. So its analysis may not be "true" but it is certainly non-trivial.

Friday, May 15, 2015

Finally! All platforms bundled.

Thursday, my Elance contractor delivered hunspell for both 32- and 64-bit Windows and Python 3.4. That dropped in and worked fine, and PPQT ran nicely from the command line. So then I could begin working with PyInstaller on Windows 7. Yesterday and today I discovered and circumvented four bugs in it, all unique to running under Windows, or the combination of Windows and Python 3.

First off, two of the "hook" files used a wrong path to import the PyInstaller Windows "utils" module. Clearly at some recent point that module was moved within the PyInstaller folder, and somebody forgot to update all the hook files.

Next, it ran but the bundled app couldn't start, "module SIP not found". Now, every PyQt5 module needs SIP (the C++ shim that PyQt uses to cross from Python to the Qt binaries), and every one of the several PyQt5 "hook" modules that was being called named "SIP" as a hidden-import. Why wasn't it being bundled? I did not resolve this question, but I did circumvent the problem simply enough: I just added --hidden-import=sip to the PyInstaller invocation line. That was all it took to make the bundled app run, and wasn't that a lovely sight?

While investigating that, I tried to use the pyi-archive_viewer script that is included with PyInstaller. It lets you examine a bundled app to see what was actually included in it. Or it should; but I quickly found that it couldn't execute one of its basic functions, because it was trying to compare a user input string against a class member that was in bytes format. In Python 2, that worked. In Python 3 it doesn't, because Python 3 requires a clear distinction between strings of bytes and strings of characters, which are Unicode. It's one of the most common issues when converting from Python 2 to Python 3, and this comparison had been overlooked. I reported it and applied a quick one-line source change to get around it.
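The underlying change in miniature (the value here is just illustrative):

# In Python 2 this comparison is True; in Python 3 it is quietly False,
# because a str (Unicode) value never equals a bytes value:
'NAME' == b'NAME'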

Once I patched that point in the archive viewer, it immediately turned up another error: when it tried to open a sub-archive it threw a run-time error exception because some "magic number" that it used as a signature didn't match. I traced this far enough to see that the magic number calculation had a three-level if statement, in principle saying "if this is Python 2, do it this way; elif this is Python 3 and the version is less than or equal to 3.3, do it that way; else it's Python 3.4 or above and do it this other way." I'm pretty confident I'm the first person to try this code on Windows and Python 3.4, so I just opened an Issue pointing to that code. Having gotten around the missing SIP problem, I no longer needed the archive viewer so I moved on.

One more step, then. I could bundle to a folder; but could I bundle to a single file .exe? Preferably one using my cute little Marvin icon? So I ran PyInstaller with that option—and it threw an exception. Oh, pooh. The exception was another very typical Python 3 compatibility problem, "str type does not support buffer protocol". This error gets thrown whenever you try to feed a string type to a file that has been opened with the "b" raw-bytes mode. In Python 2 you could do that because both str and byte types were aliases for a C char *. In Python 3, bytes still means that, but str means "16- or 32-bit Unicode characters" and you can't just feed them into a bytes file. You have to tell Python how to encode the characters into a byte-stream, for example by coding bytes(str_var.encode('UTF-8')).

I traced the error to a call to the win32api module. PyInstaller was trying to update a portion of the Windows "manifest" (whatever that is) using the UpdateResource Win API call. The win32api module is another open-source project; PyInstaller is just using it. And that module was accepting a string type as an argument to this UpdateResource method, and then (it appears) trying to feed that string into a file opened as bytes, and causing an exception. The bug is in that module. But I circumvented it by changing the code of PyInstaller so that it passed the string encoded to bytes.

And with that monkey-patch thrown on, it ran and produced a lovely single-file PPQT2.exe with its cute little Marvin icon!

So now I have successfully bundled the app for all three platforms and put them up on my Public dropbox folder. There is nothing but nerves standing between me and announcing the availability of the alpha test publicly. I will probably wait until Monday to do that.

Tuesday, May 12, 2015

Importing translators!

Ran up a little test app to make sure I know as much as I thought I knew about dynamic importing. Everything I know about this, I learned from working on PyInstaller. It has a library of "hooks", small modules that modify the loading process for a specific module. When it finds an import for modname, PyInstaller looks in its hooks folder for a file hook-modname.py. If there is such a hook, it loads that file of Python code into a namespace and looks at the namespace for certain things, such as a global "datas" that can be a list of data files to be loaded when any bundled app imports modname, or a function "hook" that it can call to edit the importation of modname.
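The general shape of a hook file, as I understand it from reading them (illustrative only; the exact set of attributes PyInstaller honors is in its docs):

# hook-modname.py
datas = [ ( 'path/to/needed/datafile', 'folder_in_bundle' ) ]    # extra files
hiddenimports = [ 'module_imported_dynamically_by_modname' ]

def hook( mod ):
    # called with the module object during analysis; may adjust and return it
    return mod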

It was knowing about this general pattern — load source into namespace, interrogate namespace attributes, call functions in namespace — that made me confident that I could support a variable number of "translator" modules, and even permit users to add new translators in the field.

In the actual app, the File menu will have a sub-menu "Translators". This sub-menu will be prepared at startup. The main window will call a function that populates a QMenu with names of translators. Here is the approximate code of that process.

The outer function will get the Extras path (as set in the Preferences) and look in it for a folder "Translators". It makes a list of all items in that folder and passes each to the following.

    # os and importlib.machinery are imported at module level;
    # submenu and run_xlator are defined in the enclosing function.
    def add_xlt_source( fpath ) :
        if not os.path.exists( fpath ) : return
        if not os.access( fpath, os.R_OK ) : return
        fname = os.path.basename( fpath )
        if not ( fname.endswith( '.py' ) or fname.endswith( '.pyc' ) ) : return
        # It exists, is readable, and ends in .py[c]. Try to load it into
        # a Python namespace.
        xlt_loader = importlib.machinery.SourceFileLoader( fname, fpath )
        print( 'getting namespace', fname)
        xlt_namespace = xlt_loader.load_module()
        # if it is a proper Translator, it has a global MENU_NAME
        if hasattr( xlt_namespace, 'MENU_NAME' ) :
            act = submenu.addAction( xlt_namespace.MENU_NAME )
            act.setData( xlt_namespace )
            act.triggered.connect( run_xlator )
            submenu.setEnabled( True )

The key is the one statement xlt_namespace = xlt_loader.load_module(). This performs an import. It executes all the statements in that source file. (Some of those might raise exceptions, so probably that statement should be in a try/except block.) The returned value is a Python namespace that represents everything declared in that module: its global variables, its classes, and its defined functions.

One can interrogate the namespace with hasattr(). In this case, a Translator has to define a global that is a string (this should be tested!) to use as the menu choice that invokes that translator.

If the module passes this test, the code makes a QAction with the name from the module and adds that action to the sub-menu. The menu action's "triggered" signal is pointed at a function to handle invocation of that translator, and the namespace itself is stored in the action as arbitrary data.

Here's the current stub of the run_xlator function.

    def run_xlator( self ) :
        space = self.sender().data()
        print( space.MENU_NAME, getattr( space, 'DATA', '(no data)' ) )

This is a "slot" invoked from the "triggered" signal that is generated when the user selects that item on the menu. It must be part of a QObject-derived object. It can call QObject.sender() to get a reference to the object that created the signal, which in this case can only be the QAction from the sub-menu. The QAction has a data() method that returns the namespace that was stored in it with setData(). That's everything defined in the module that was loaded, so here we print two global values, one we know exists, and one that is optional.

For test purposes I've set up two files in the Translators folder. One is not a Translator,

'''
Test module that is NOT a translator.
'''
print('Non-translator module executing anyway!')

The other one is.

'''
test translator module
'''
MENU_NAME = 'Wahoo!'
DATA = 'Some Data'
print('Wahoo executing!')

Here's the output of a test.

getting namespace not_a_translator.py
Non-translator module executing anyway!
getting namespace xlt_wahoo.py
Wahoo executing!
Wahoo! Some Data  (from run_xlator)

Monday, May 11, 2015

Keep the blog alive...

I follow Carl Claunch's Rescue1130 blog. Besides enjoying the technology he writes about, I have admired his persistence in blogging daily, often 7 days a week. He has a full-time job that involves frequent travel, plus volunteer work at the Computer History Museum, and this absorbing hobby of restoring the 1130, yet he has been posting something every day.

Well, this week he has suddenly gone silent. He posted Friday, but not Saturday or Sunday or... wait, that's it. Today is Monday. I found myself actually worried; is he well? Which is stupid; the man has probably just taken a weekend off (no doubt to the relief of his long-suffering family).

Then I realized with some guilt, that after a fair spell of posting most weekdays, I went silent. What about my loyal readers? Are they concerned for my health, or irked at my laziness? Blogging is a responsibility!

Right, so the most significant thing I've been doing the past week is contributing, in a clumsy and halting way, to the PyInstaller project. After a long period of relative quiescence, it suddenly sprang to vigorous life in the past month. Several contributors began posting issues and pull requests to fix their issues. The lead maintainer, Hartmut, became extremely active in response, commenting on the issues and pull requests, rejecting some, accepting others. Most importantly, he did the job of rebasing the dormant Python3 branch onto the current Develop head, so it picked up the maintenance it had missed.

So I tried to use it, found several minor bugs which I fixed, and got a complete working PPQT2 bundle for both Mac OS and Ubuntu 14.10 (32-bit and 64-bit).

Then I put those fixes into a pull request, but it wasn't right, so Hartmut very patiently directed me in how to make it right, and after maybe three tries, it worked and those fixes are now in the official Python3 branch of PyInstaller, yay me.

Only bundling for Windows remains before I can announce an alpha version of PPQT2. The Windows bundle was a major hurdle for V1, mainly because PPQT requires the Hunspell spell checker. Unlike the other packages PPQT needs (regex, natsort, sortedcontainers), which are pure Python, pyhunspell is a Python-to-C++ wrapper over the API to the Hunspell library. Which means its main component is a C++ source that has to differ between Python 2 and Python 3, because the Python-to-C API changed between versions. There's a user-patch that supposedly does that; but it is not at all clear whether that patch applies to the current source. And then the source has to be compiled with MSVC at a particular level (2010, 64-bit) to match the level used to compile the official Python 3.4 release. I understand all these words but have zero experience doing anything like this.

What I did for V1 was to go on elance and hire it done. Money very well spent. And I'm doing exactly that again; I posted the elance job this morning and have one inquiry already. This time I will make sure that the updated code and the DLL get sent to the maintainer so others can use them.

When that's all wrapped up, hopefully by mid-week, I should be able to run PyInstaller on Windows 7 and get a clean bundle of PPQT2.

Meantime, I need to get to work on the remaining functionality. That means, for comfort and convenience, being able to code on my laptop. Much as I love my 27-inch iMac, I can't take it to a coffee shop. The laptop had fallen behind the desktop system in Qt and PyQt versions. So what else I've been doing this morning is installing Qt5.4.1, and the latest SIP, and PyQt5.4.1, on this laptop. The very lengthy PyQt make is chugging away as I write.

Last week the task of editing the V2 "suggested workflow" document brought me face to face with a flock of usage issues I had been postponing. My intent is that PPQT2 will be a front end that works smoothly with the "Ppgen" markup system that is becoming popular at the U.S. PGDP site, and with the "Fpgen" markup convention that is used by DP Canada. And that means looking ahead and asking myself, how do those markup systems handle things like block quotes, right-aligned strings, blocks of centered lines, and tables? Because I don't want to direct my users into doing things that would cause conflicts, if they decide to move to one of those markups. OTOH, I want to get the users to use a syntax that will be easy to code in the Translators that will convert to those markups -- or directly to ASCII or HTML, if they use the direct Translators I will provide.

That's been an interesting exercise and is not complete yet. It's kind of an annoying task for a couple of reasons. One is it forces me to deal with these systems that are accomplishing exactly the same damn results, but could they use any kind of standard markup to do it? Oh no of course not; they had to invent their own bloody markup language.

Another is that those markups are not superbly well documented. I was a professional technical writer and naturally have high standards for this. So I have to bind and gag my inner editor so I can just read the damn stuff and not waste time trying to rewrite it as I go.

Monday, May 4, 2015

So where are we, exactly?

Been quiet for a while, sorry. I spent several days doing the initial stages of post-proofing a moderately complex book, using the current PPQT2. In the process I found and fixed several minor bugs. By the end it was working quite smoothly, fully the equal of V1. I can use this tool. And will, if I can just get it finished and shipped.

Concurrently with using the program, I was documenting it. I already set up the help file, but that is a summary organized around the UI: menu by menu, panel by panel. An equally important, perhaps more important document is the task-oriented "suggested workflow" document. That takes the reader through a step-by-step process of post-processing a book, showing how to apply the features of PPQT to perform each.

There was a suggested-workflow for V1, of course. For V2 the initial work steps are the same with only minor changes due to changes in the menu structure. But being an obsessive scribbler I had to rewrite them anyway to be easier to read and more terse.

But a number of changes come in the sequence of later steps. V1 has its own ASCII reflow and HTML conversion built in. For V2, both of those functions are handed off to Translators (that have yet to be written to an API yet to be implemented). I expect that the most-used Translators will be ones that convert to Ppgen and to Fpgen, markup languages unique to DP and DP Canada respectively. Those markups are used to feed into batch conversion apps that produce the ASCII, the HTML, the EPUB. So the documentation has to assume that the user is aiming toward a smooth conversion to another markup -- not aiming toward generating the final product within PPQT2. (Of course, I expect the user will continue to edit the Ppgen/Fpgen document in PPQT2. There are several advantages to doing so. But the responsibility for generating HTML, or for properly formatting an ASCII etext, falls on that package and on its documentation.)

Anyway, this changed my approach in documenting the tasks of post-processing. But I got that pretty well done.

On the shipping front, which means bundling a Python app as a self-contained package, there has been some progress. Hartmut, the top maintainer of PyInstaller, stepped up and took over the task of rebasing the Python3 code branch onto the current Develop branch. Or so it seemed...

Today I applied the latest PyInstaller to generating a Mac app, and it worked splendidly! If you want to try it, here's the link. So that was good.

Then I moved to Ubuntu. Downloaded the Python3 branch and installed it, ran it, and hit exactly the same problem (the bundled app dies looking for "orig-prefix.txt"). If I hack around that, it hits another problem. Both come out of the same little stretch of code in compat.py and hook-site.py. When I compare these two source files between the Develop and Python3 branches, they are very different. Why? I thought when Python3 had been rebased onto Develop, these issues would disappear. I put a query on the PyInstaller list and stopped for the day.

Dang, so close.