Saturday, October 25, 2014

Pleased with JSON diagnostics

Sneaking in development a half-hour at a time. The new metadata.py is coded. Changing it to read and write JSON instead of my own meta-format has greatly simplified the code here, and it will simplify the dozen or so modules that are clients of metadata.py, when I get around to them. But now I'm rewriting its unit test. That involves feeding it bad stuff of various kinds and checking the output log messages. I'm pleased with how specific the diagnostics that come out of json.py are. Here's an example.

ERROR:metadata:Error decoding metadata:
    Unterminated string starting at: line 2 column 13 (char 13)
ERROR:metadata:Text reads: {"VERSION": "2}

The middle line is the text from the ValueError object produced by json.py. The third line comes from my code, which knows the starting point within the string where JSON was looking, and shows from there up to 100 characters further. This should make debugging bad user edits quite easy.
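The shape of the code that produces those messages is roughly this (a sketch, not the exact metadata.py code; in Python 3.4 the json module signals errors with a plain ValueError):

import json
import logging

metadata_logger = logging.getLogger('metadata')

def decode_section(text, start=0):
    try:
        return json.JSONDecoder().raw_decode(text, start)
    except ValueError as error:
        metadata_logger.error('Error decoding metadata:\n    {}'.format(error))
        # show the text JSON was chewing on, up to 100 characters of it
        metadata_logger.error('Text reads: {}'.format(text[start : start+100]))
        return None, start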

Monday, October 20, 2014

Fun with JSON

New post at PGDP forums

At a user's request I posted a discussion of PPQT and ppgen in the ppgen forum topic. It's the first time in a long time I've posted anything at PGDP.

JSON customization

In the last post I noted that json.dump() could not deal with either byte data or set data. Long-time PPQT supporter Frank replied by email showing me how one could customize the default() method of the encoder to handle these cases, turning a set into a list and a bytes into a string. That automates the encoding process, but decoding back to bytes or set data, he said, had to be handled after the JSONDecoder had run.

Well, not quite. I think I have worked this out to make both encoding and decoding of these types automatic. I must say that the standard library json module does not make this easy; the API is confusing and inconsistent, and the documentation, while accurate, is not exactly helpful. But here's what I have so far.

Custom Encoding

To customize encoding you define a class derived from json.JSONEncoder. In it you define just one method, default(obj). It is called with any Python object the standard encoder cannot serialize on its own, and it returns a replacement object that JSON can serialize. That can be a simple transformation of the object or something else entirely. Or, if you don't want to handle the object, call super().default(obj), which raises a TypeError. So here's mine:

import json

class Extended_Encoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, bytes) :
            # encode bytes as a one-key dict flagging a hex string
            return { '<BYTES>' : "".join("{:02x}".format(c) for c in obj) }
        if isinstance(obj, set) :
            # encode a set as a one-key dict flagging a list
            return { '<SET>' : list(obj) }
        # anything else: let the base class raise its TypeError
        return super().default(obj)

If obj is a bytes, return a dict with the key <BYTES> and a string value. If obj is a set, return a dict with the key <SET> and a list value.

You might think, if you are defining a custom class, that at some point you would create an instance of said class and use it. But nuh-uh. You just pass the class itself to json.dumps() as the cls= parameter:

tdict = {
    'version' : 2,
    'vocab' : [
        {'word' : 'foo', 'props' : set([1,3,5]) },
        {'word' : 'bar', 'props' : set([3,5,7]) } ],
    'hash' : b'\xde\xad\xbe\xef'
}
j_st = json.dumps(tdict, cls=Extended_Encoder)

What comes out, for the above test dict, is (with some newlines inserted)

{"vocab": [
  {"word": "foo", "props": {"<SET>": [1, 3, 5]}},
  {"word": "bar", "props": {"<SET>": [3, 5, 7]}}],
"version": 2,
"hash": {"<BYTES>": "deadbeef"}}

Custom Decoding

To customize JSON decoding, you don't make a custom class based on json.JSONDecoder. (Why would you want decoding to be consistent with encoding?) No, you write a function to act as an "object hook". You create a custom decoder object by calling json.JSONDecoder(), passing it the object_hook parameter:

def o_hook(d):
    #print('object in ',d)
    # a one-item dict might be one of the special markers from the encoder
    if 1 == len(d):
        [(key, value)] = d.items()
        if key == '<SET>' :
            d = set(value)
        if key == '<BYTES>' :
            d = bytes.fromhex(value)
    #print('object out',d)
    return d

my_jdc = json.JSONDecoder(object_hook=o_hook)
decoded_python = my_jdc.decode(j_st)

You call the decode() or raw_decode method of the custom decoder object. During decoding, it passes every object it decodes to the object hook function. The object hook is always called with a dict. The dict results from some level of JSON decoding. Sometimes the dict has multiple items, when it represents a higher level of decoding. Sometimes it has just one item, a JSON key string and a Python value resulting from normal decode, for example {'version':2} from the earlier test data. Or d may be {'<SET>':[1,3,5]}.

The object hook does not have to return a dict. You can return any Python object and it will be used as if it were the result of decoding some JSON. So when the key is <SET> or <BYTES>, don't return a dict, just return the converted set or bytes value.

So, to review:

  • To customize JSON encoding, you make a custom class with a modified default() method. Then you call json.dumps() passing your class as the cls= argument.
  • To customize JSON decoding, you define a function and create a custom decoder object by calling json.JSONDecoder() passing your function as the object_hook parameter, and then you call the .decode() method of that custom object.

Yeah, that's clear.
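For what it's worth, json.loads() also accepts object_hook= directly, so the whole round trip can be sanity-checked in a couple of lines (using the tdict, j_st and o_hook from above):

# no explicit decoder object needed when using the module-level loads()
round_trip = json.loads(j_st, object_hook=o_hook)
assert round_trip == tdict   # the sets and the bytes value come back intact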

Bullet-proofing Decode

The raw_decode() method takes a string and a starting index. It decodes one JSON object through its closing "}". It returns the decoded Python object and the string index of the character after the decoded object.

I believe I am going to use this to make the PPQT metadata file more error-resistant. My concern is that the user is allowed, even encouraged, to inspect and maybe edit the metadata. But if the user makes one little mistake (so easy to insert or delete a comma or "]" or "}" and so hard to see where) it makes that JSON object unreadable. If all the metadata is enclosed in one big object, a dict with one key for each section, then one little error means no metadata for the book at all. Not good.

So instead I will make each section its own top-level JSON object.

{"VERSION":2}
{"DOCHASH": {"<BYTES>":"deadbeef..."} }
{"VOCABULARY: {
   "able": {stuff},
   "baker": {stuff}...}
}

and so forth. Then if Joe User messes up the character census section, at least the pages and vocabulary and good-words and the other sections will still be readable. This might cause problems for somebody who wants to read or write the metadata in a program. But I think it is worthwhile to fool-proof the file.
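Here, very roughly, is the loop I have in mind for reading such a file; the names are invented and the error recovery (skip ahead to something that looks like the start of the next section) is only a first guess:

import json
import logging

log = logging.getLogger('metadata')

def load_sections(text, readers):
    # readers: dict of section name -> reader function (hypothetical)
    decoder = json.JSONDecoder(object_hook=o_hook)   # o_hook as defined above
    position = 0
    while position < len(text):
        while position < len(text) and text[position].isspace():
            position += 1    # skip whitespace between top-level objects
        if position >= len(text):
            break
        try:
            section, position = decoder.raw_decode(text, position)
        except ValueError as error:
            log.error('Error decoding metadata:\n    {}'.format(error))
            log.error('Text reads: {}'.format(text[position : position+100]))
            # give up on this section: resume at the next line that starts with "{"
            position = text.find('\n{', position)
            if position < 0:
                break
            continue
        for name, value in section.items():
            if name in readers:
                readers[name](value)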

Thursday, October 16, 2014

What can json.dump?

A few quick 'speriments to make sure that json.dumps() can handle all sorts of metadata. Two restrictions show up.

One, it rejects a bytes value with builtins.TypeError: b'\x00\x01\x03\xff' is not JSON serializable. This is an issue because one piece of metadata is an SHA hash signature of the document file. This lets me make sure that the metadata file is of the same generation as the document file. (If the user messed up a restore from backup, for example, restoring only the document but keeping a later metadata, all sorts of obscure failures would follow.) The output of QCryptographicHash(QCryptographicHash.Sha1) is a bytes value.

Two, it rejects a Python set value with the same error. The worddata module wants to store, for each vocabulary word, a set of a few integers encoding the word's properties, e.g. uppercase or mixedcase, contains an apostrophe, contains a hyphen, etc.
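As an aside, the hash value in restriction one comes out of Qt along these lines (a sketch assuming PyQt5; document_bytes is a stand-in for the document text encoded to bytes):

from PyQt5.QtCore import QCryptographicHash

hasher = QCryptographicHash(QCryptographicHash.Sha1)
hasher.addData(document_bytes)         # feed the document to the hash
signature = bytes(hasher.result())     # QByteArray converted to a plain bytes value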

The solution in both cases is to ask json.dumps() to serialize, not the value, but the __repr__() of the value. On input, the inverse of __repr__() is to feed the string into ast.literal_eval(), and check the type of what comes out.

I put quite a bit of care into coding the input of the metadata, because I want to tell the user to feel free to hand-edit the metadata. If the user can edit the file, the user can screw it up. There's no point in telling the user not to edit the file, because she will anyway. Better to document it, and then be very leery of accepting any value it contains. Part of that is using literal_eval(), which checks the syntax of a presumed Python value and will not pass executable code (hence no code injection).
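A minimal sketch of that round trip, with the kind of type check I mean (illustrative values, not the real worddata code):

import ast

props = {1, 3, 5}                     # a word's property-flag set
saved = repr(props)                   # "{1, 3, 5}" -- plain text, safe to store
value = ast.literal_eval(saved)       # parses literals only, never executes code
if not isinstance(value, set):        # a hand-edit could have changed the type
    value = set()

sig = b'\xde\xad\xbe\xef'             # the document hash
saved = repr(sig)                     # "b'\\xde\\xad\\xbe\\xef'"
value = ast.literal_eval(saved)
if not isinstance(value, bytes):
    value = b''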

The old metadata format was quite simple. The JSON one, even if I tell it to indent prettily, will be less easy for a user to fiddle with.

Hmmm. Also in the old format, as long as the user didn't mess up one of the section boundaries, he lost at most one section's data. In fact, the error detections I coded into the current code reject only single lines (with a log message). But if the user edits a JSON file and mucks up a syntactic delimiter... Must think about how to contain JSON errors to single sections, and not allow a single deleted "}" or "," to cause the whole file to be unreadable.

Temporary distraction

When this arrives it will be a bit of a distraction while I move everything over from my oh-so-tired Mac Pro.

Today and tomorrow I hope to finish up the page table display. Next week the big JSON switch for metadata. Which, hopefully, I will be working on using a 27-inch, 5K-wide retina display...heh heh heh (rubs hands in gleeful anticipation)

Sunday, October 5, 2014

Thinking About JSON

So PPQT stores quite a bit of metadata about a book, saving it in a file bookname.meta. When I created this in version 1, I followed an example in Mark Summerfield's book and devised my own custom file format. (I don't have the book near me on the vacation trip, but skimming the contents online I note that same chapter has topics on reading XML. Probably I looked at that and said "Oh hell no," and quickly cobbled up a simpler "fenced" syntax for my files.)

Everything about reading and writing that format was in one place in V1. From one angle it seems like a good idea to have all knowledge of a special file format in one place, but here it really was a bad idea. It mandated a very long and complex routine that had to know how to pull out and format 8 or 10 different kinds of data on save, and to parse and put back those same kinds of data on load. In hindsight it was more important to isolate knowledge of the particular types of data in the modules that deal with that data. So a key goal for the V2 rewrite was to distribute the job of reading and writing each type of metadata among the modules that manage those data.

Almost the first code I wrote for V2 was the metadata manager module. This handles reading or writing the top level of my metadata format. The various objects that comprise one open book "register" with the metadata manager, specifying the name of the section they handle, and giving it references to their reader and writer functions.

The metadata code recognizes the start of a fenced section on input and calls the registered reader for that section, passing the QTextStream for the metadata file. On save, it runs through its dictionary of registered section names and calls each writer in turn.

In this way, knowledge of the file organization is in the metadata manager, but all knowledge of how to format, or parse, or fetch or store the saved data is in the module that deals with that data. For example the worddata module stores the census of words in the book. It knows that each line of its metadata section is a word (with optional dictionary tag for words in a lang=tag span) and its count and some property flags. And it knows how it wants to store those words on input. None of that knowledge leaks out to become a dependency in the metadata file code.

As you can maybe tell, I was quite pleased with this scheme. There are a total of sixteen metadata sections in the V2 code, each managed by its own reader/writer pair, and all happily working right now. Some are one-liners, like {{EDITSIZE 13}} to remember the edit font size. Others like the vocabulary are hundreds of lines of formatted records. But all code complete and tested.

So obviously it must be time to rip it all up and re-do it!

A note from my longest and most communicative user asked, oh by the way, it would be really helpful if you could change your metadata format to something standard, like YAML or JSON.

Right after saying "Oh hell no," I had to stop and think and realize that really, this makes a lot of sense. Although I knew almost nothing about JSON, it is such a very simple syntax that I felt up to speed on it in about 20 minutes.

But how to generate it? Would I have to pip-install yet another module? Well, no. There's a standard library module for it. And it looks pretty simple to use. You basically feed a Python dict into json.dumps() and out comes JSON syntax as a string to write to the QTextStream. On input, readAll() the stream and drop it into json.loads() and out comes a Python dict.

This leads to a major revision of the V2 metadata scheme, but the basic structure remains.

The metadata file will consist of a single JSON object (i.e., a Python dict) whose keys are the names of the 16 sections. The value of each section name is a single JSON/Python value. For simple things like the edit point size, or the default dictionary tag, the value is a single number or string. For complicated things, the value is a dict (JSON object) or a list (JSON array).

As before, modules will register their reader/writer functions for their sections. But they will not be passed a text stream to read or write. Instead, each reader will receive the single Python value that represents its section. Each writer, instead of putting text into a stream, is expected to return the single Python value that encodes its section's data.

On input, the metadata manager will use json.loads() to get one big dict. It will run through the keys of the dict, find that key among its registered reader functions, and pass the key's value to the reader.

On output, the manager will run through the registered writers and form a dict with section-name keys and the values returned by their writers. Then just shove the json.dumps() of that dict into the text stream.
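In outline, then (hypothetical names, and I'm waving my hands about the QTextStream plumbing; the real module will surely differ):

import json

_readers = {}   # section name -> reader function
_writers = {}   # section name -> writer function

def register(section, reader, writer):
    _readers[section] = reader
    _writers[section] = writer

def load_meta(json_text):
    # json_text is the readAll() of the metadata file's text stream
    meta = json.loads(json_text)
    for section, value in meta.items():
        if section in _readers:
            _readers[section](value)

def save_meta():
    # returns the string to shove into the text stream
    meta = { section : writer() for section, writer in _writers.items() }
    return json.dumps(meta, indent=2)

A client module such as worddata would then call something like register('VOCABULARY', vocab_reader, vocab_writer) when the book is opened.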

It all shapes up nicely in my imagination. Just a whole bunch of coding to implement it. And not just the metadata module and all the reader/writer functions, oh hell no, there are several unit test drivers that made liberal use of metadata calls to push test data into, and check the output of, their target modules. All those will have to be recoded to use JSON test data.

There are other implications of this change to be explored in later posts.

Thursday, October 2, 2014

Well, that was easy...

Wrote loupeview.py from scratch in three short sessions over three days while hanging out on the farm and being sociable.

First I had to install bookloupe. This was nontrivial owing to the readme not being as specific on Mac OS as it might be. Wrote a detailed description for the pgdp forum; maybe it will get picked up and integrated someday. For the moment, bookloupe development appears to be stalled.

I looked over the source and concluded it would need major surgery to turn it into a Python lib via either Cython, SWIG or manual coding. So instead, invoke it via subprocess, and apply it to a temporary file.

The temporary file part, which I had been uncertain about, turned out to be amazingly easy. Qt has already thought of it, and the QTemporaryFile class gives you a temporary file in a platform-independent way. You create the object and open() it. That actually creates the file in some suitable place. From that point, the QTemporaryFile object is an actual QFile, with all its methods. You write to the file and close() it. Then you can use QFileInfo to get the full path to the QFile, and that's your argument to some command line program. When eventually the Q(Temporary)File object is garbage-collected, the actual file is automatically deleted.
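In sketch form (assuming PyQt5; document_text stands in for whatever text is being fed to bookloupe):

from PyQt5.QtCore import QTemporaryFile, QFileInfo

temp_file = QTemporaryFile()
temp_file.open()                                  # this actually creates the file
temp_file.write(document_text.encode('UTF-8'))    # it is a QFile, so write bytes
temp_file.close()
temp_path = QFileInfo(temp_file).absoluteFilePath()  # the path to pass on the command line
# when temp_file is garbage-collected, the actual file is deleted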

I wanted to use subprocess.check_output(). The leading, positional argument to that is a list containing the elements of an executable command. The leading element of the list is the name of the command, or in this case the fully qualified pathname of the bookloupe executable. The user will have to provide that eventually through the still to be written Preferences dialog. But the paths module has a default that works for Mac OS and probably for Linux, so that works for now.

Last in the list comes the full path to the temporary file. Between the head and the tail come the parameters of the command. For bookloupe I wanted to pass a batch of dash-letter options, and this was my initial shot at coding them.

        command = [bl_path,'-e -s -l -m -v', fbts.fullpath()]
        # run it, capturing the output as a byte stream
        try:
            bytesout = subprocess.check_output( command, stderr=subprocess.STDOUT )
        except subprocess.CalledProcessError as CPE :
            # display a message with text from stderr
            return # leaving message_tuples empty

This provided a nice test of the code in the except clause, displaying "unknown option -e -s -l -m -v". Oh. Pretty clearly the underlying subprocess.Popen was handing bookloupe my whole option string as a single argument instead of splitting it into separate options. Oh sigh. With

command = [bl_path,'-e','-s','-l','-m','-v', fbts.fullpath()]

it ran perfectly. Here's a screenshot.

The user has double-clicked on the message line headed "461". The double-click signal's slot tells the editor to jump to line 461, column 48, where indeed there is a "spaced quote".

Bookloupe, like the gutcheck program it was forked from, produces a great quantity of nit-picky diagnostic messages. The way Guiguts dealt with this was with a special dialog that had one checkbox for each possible message type (40 or more of them). It would display in its report only the messages of the types you'd checked. In practice, you hit the "clear all" button and then checked one box at a time, filtering the report to one message type at a time. I think my way of handling it is simpler and just as effective. You can sort the table on either the message text column or the line number column, ascending or descending. So if you prefer, you can deal with the diagnostics in sequence from the last line up (preserving the lineation if you make edits). Or you can sort by message text and deal with one group of messages at a time, just as with Guiguts but with a simpler UI.