Monday, October 20, 2014

Fun with JSON

New post at PGDP forums

At a user's request I posted a discussion of PPQT and ppgen in the ppgen forum topic. It's the first time in a long time I've posted anything at PGDP.

JSON customization

In the last post I noted that json.dump() could not deal with either bytes data or set data. Long-time PPQT supporter Frank replied by email showing me how one could customize the default() method of the encoder to handle these cases, turning a set into a list and a bytes into a string. That automates the encoding process, but decoding back to bytes or set data, he said, had to be handled after the JSONDecoder had run.

Well, not quite. I think I have worked this out to make both encoding and decoding of these types automatic. I must say that the standard library json module does not make this easy; the API is confusing and inconsistent, and the documentation, while accurate, is not exactly helpful. But here's what I have so far.

Custom Encoding

To customize encoding you define a class derived from json.JSONEncoder. In it you override just one method, default(obj). The encoder calls it only for an object it does not know how to serialize (a set or a bytes, say, never a plain number, string, list, or dict), and it must return a substitute that JSON can serialize. If you don't want to handle the object, call super().default(obj), which raises TypeError. So here's mine:

import json

class Extended_Encoder(json.JSONEncoder):
    def default(self,obj):
        if isinstance(obj, bytes) :
            return { '<BYTES>' : "".join("{:02x}".format(c) for c in obj) }
        if isinstance(obj, set) :
            return { '<SET>' : list(obj) }
        return super().default(obj)

If obj is a bytes, return a dict with the key <BYTES> and a string value. If obj is a set, return a dict with the key <SET> and a list value.

You might think, if you are defining a custom class, that at some point you would create an instance of said class and use it. But nunh-unh. You just pass the class itself to the json.dumps() function as the cls argument:

tdict = {
    'version' : 2,
    'vocab' : [
        {'word' : 'foo', 'props' : set([1,3,5]) },
        {'word' : 'bar', 'props' : set([3,5,7]) } ],
    'hash' : b'\xde\xad\xbe\xef'
}
j_st = json.dumps(tdict, cls=Extended_Encoder)

What comes out, for the above test dict, is (with some newlines inserted)

{"vocab": [
  {"word": "foo", "props": {"<SET>": [1, 3, 5]}},
  {"word": "bar", "props": {"<SET>": [3, 5, 7]}}],
"version": 2,
"hash": {"<BYTES>": "deadbeef"}}

Custom Decoding

To customize JSON decoding, you don't make a custom class based on json.JSONDecoder. (Why would you want decoding to be consistent with encoding?) No, you write a function to act as an "object hook". You create a custom decoder object by calling json.JSONDecoder(), passing your function as the object_hook parameter:

def o_hook(d):
    #print('object in ',d)
    if 1 == len(d):
        [(key, value)] = d.items()
        if key == '<SET>' :
            d = set(value)
        if key == '<BYTES>' :
            d = bytes.fromhex(value)
    #print('object out',d)
    return d
my_jdc = json.JSONDecoder(object_hook=o_hook)
decoded_python = my_jdc.decode(j_st)

You call the decode() or raw_decode() method of the custom decoder object. During decoding, it passes every JSON object it decodes, that is, everything written between "{" and "}", to the object hook function. The object hook is always called with a dict, and inner objects are passed before the objects that contain them. Sometimes the dict has multiple items, when it represents a higher level of decoding, for example {'word':'foo', 'props':{1,3,5}} from the earlier test data, where the props value has already been converted. Sometimes it has just one item, a JSON key string and a Python value resulting from normal decode, for example {'<SET>':[1,3,5]} or {'<BYTES>':'deadbeef'}.

The object hook does not have to return a dict. You can return any Python object and it will be used as if it were the result of decoding some JSON. So when the key is <SET> or <BYTES>, don't return a dict, just return the converted set or bytes value.
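Here is a minimal round-trip sketch that puts the two halves together, restating the encoder and hook from above so it runs on its own. It also uses json.loads() with its object_hook parameter, which is a shorter route to the same result as building a JSONDecoder by hand. The test dict is a cut-down version of the one used earlier.

```python
import json

class Extended_Encoder(json.JSONEncoder):
    """Encode bytes and set values as one-item marker dicts."""
    def default(self, obj):
        if isinstance(obj, bytes):
            return {'<BYTES>': "".join("{:02x}".format(c) for c in obj)}
        if isinstance(obj, set):
            return {'<SET>': list(obj)}
        return super().default(obj)  # raises TypeError for anything else

def o_hook(d):
    """Turn the one-item marker dicts back into set or bytes values."""
    if len(d) == 1:
        [(key, value)] = d.items()
        if key == '<SET>':
            return set(value)
        if key == '<BYTES>':
            return bytes.fromhex(value)
    return d

original = {
    'version': 2,
    'vocab': [{'word': 'foo', 'props': {1, 3, 5}}],
    'hash': b'\xde\xad\xbe\xef',
}

j_st = json.dumps(original, cls=Extended_Encoder)
restored = json.loads(j_st, object_hook=o_hook)

print(restored == original)
```

One caveat with this scheme: any genuine one-item dict whose key happens to be <SET> or <BYTES> would be converted too, so those key names have to be reserved.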

So, to review:

  • To customize JSON encoding, you make a custom class with a modified default() method. Then you call json.dumps() passing it the name of your class.
  • To customize JSON decoding, you define a function and create a custom object by calling json.JSONDecoder() passing it your function as an optional parameter, and you call the .decode() method of the custom object.

Yeah, that's clear.

Bullet-proofing Decode

The raw_decode() method takes a string and a starting index. It decodes one JSON object through its closing "}". It returns the decoded Python object and the string index of the character after the decoded object.
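A small sketch of what that looks like in practice. The two-object test string and the PAGES key are made up for illustration. One detail worth knowing: unlike decode(), raw_decode() does not skip leading whitespace, so the caller has to step over the newline between objects.

```python
import json

decoder = json.JSONDecoder()
text = '{"VERSION": 2}\n{"PAGES": [1, 2, 3]}'

# Decode the first object; `end` is the index just past its closing "}".
first, end = decoder.raw_decode(text, 0)

# raw_decode() will not skip whitespace itself, so move past the newline.
while end < len(text) and text[end].isspace():
    end += 1

# Decode the second object starting where the first one left off.
second, end2 = decoder.raw_decode(text, end)

print(first, second)
```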

I believe I am going to use this to make the PPQT metadata file more error-resistant. My concern is that the user is allowed, even encouraged, to inspect and maybe edit the metadata. But if the user makes one little mistake (so easy to insert or delete a comma or "]" or "}" and so hard to see where) it makes that JSON object unreadable. If all the metadata is enclosed in one big object, a dict with one key for each section, then one little error means no metadata for the book at all. Not good.

So instead I will make each section its own top-level JSON object.

{"VERSION":2}
{"DOCHASH": {"<BYTES>":"deadbeef..."} }
{"VOCABULARY": {
   "able": {stuff},
   "baker": {stuff}...}
}

and so forth. Then if Joe User messes up the character census section, at least the pages and vocabulary and good-words and the other sections will still be readable. This might cause problems for somebody who wants to read or write the metadata in a program. But I think it is worthwhile to fool-proof the file.
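A rough sketch of how such a reader might work, not the actual PPQT code. The function name read_sections, the section names in the test string, and the resync rule are all assumptions: after a decoding error it skips ahead to the next line that begins with "{", which only works if every section starts at the left margin and nested objects are indented.

```python
import json

def read_sections(text, object_hook=None):
    """Decode consecutive top-level JSON objects, skipping damaged ones.

    Returns (sections, errors): a dict merging every readable section,
    and a list of (index, message) pairs for the unreadable ones.
    """
    decoder = json.JSONDecoder(object_hook=object_hook)
    sections, errors = {}, []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # raw_decode() does not skip whitespace itself
            continue
        try:
            obj, pos = decoder.raw_decode(text, pos)
        except ValueError as err:  # json.JSONDecodeError is a ValueError
            errors.append((pos, str(err)))
            # Assumed resync rule: next section starts with "{" at line start.
            nxt = text.find('\n{', pos)
            if nxt == -1:
                break
            pos = nxt + 1
            continue
        if isinstance(obj, dict):
            sections.update(obj)
    return sections, errors

# The middle section has a stray comma, as a careless edit might leave it.
sample = '{"VERSION": 2}\n{"CENSUS": [1, 2,}\n{"GOODWORDS": ["foo"]}\n'
sections, errors = read_sections(sample)
print(sorted(sections), len(errors))
```

The damaged section is reported and dropped, while the sections before and after it still load.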
