New post at PGDP forums
At a user's request I posted a discussion of PPQT and ppgen in the ppgen forum topic. It's the first time in a long time I've posted anything at PGDP.
JSON customization
In the last post I noted that json.dump() could not deal with either byte data or set data. Long-time PPQT supporter Frank replied by email, showing me how one could customize the default() method of the encoder to handle these cases, turning a set into a list and a bytes into a string. That automates the encoding process, but decoding back to bytes or set data, he said, had to be handled after the JSONDecoder had run.
Well, not quite. I think I have worked this out to make both encoding and decoding of these types automatic. I must say that the standard library json module does not make this easy; the API is confusing and inconsistent, and the documentation, while accurate, is not exactly helpful. But here's what I have so far.
Custom Encoding
To customize encoding you define a class derived from json.JSONEncoder. In it you define just one method, default(obj). It receives a single Python object, which could be a number, string, dict, anything, and it returns an object that can be serialized by JSON. That can be the same object, or a different one. Or, if you don't want to handle it, call super().default(obj), which raises a TypeError for anything the base encoder can't serialize. So here's mine:
```python
class Extended_Encoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, bytes):
            return {'<BYTES>': "".join("{:02x}".format(c) for c in obj)}
        if isinstance(obj, set):
            return {'<SET>': list(obj)}
        return super().default(obj)
```
If obj is a bytes, return a dict with the key <BYTES> and a string value. If obj is a set, return a dict with the key <SET> and a list value.
You might think, if you are defining a custom class, that at some point you would create an instance of said class and use it. But nunh-unh. You just pass the class itself as the cls= parameter of json.dumps():
```python
tdict = {
    'version': 2,
    'vocab': [
        {'word': 'foo', 'props': set([1, 3, 5])},
        {'word': 'bar', 'props': set([3, 5, 7])}
    ],
    'hash': b'\xde\xad\xbe\xef'
}
j_st = json.dumps(tdict, cls=Extended_Encoder)
```
What comes out, for the above test dict, is (with some newlines inserted)
```json
{"vocab": [
    {"word": "foo", "props": {"<SET>": [1, 3, 5]}},
    {"word": "bar", "props": {"<SET>": [3, 5, 7]}}],
 "version": 2,
 "hash": {"<BYTES>": "deadbeef"}}
```
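Note that the standard loader has no idea these marker dicts are special. Feed that string to a plain json.loads() and the markers come back as ordinary dicts, which is why the decoding side needs its own customization:

```python
import json

j_st = '{"props": {"<SET>": [1, 3, 5]}}'
# Plain loads() treats <SET> as just another key; no set is rebuilt.
plain = json.loads(j_st)
assert plain == {'props': {'<SET>': [1, 3, 5]}}
assert not isinstance(plain['props'], set)
```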
Custom Decoding
To customize JSON decoding, you don't make a custom class based on json.JSONDecoder. (Why would you want decoding to be consistent with encoding?) No, you write a function to act as an "object hook". You create a custom decoder object by calling json.JSONDecoder() passing the object_hook parameter:
```python
def o_hook(d):
    #print('object in ', d)
    if 1 == len(d):
        [(key, value)] = d.items()
        if key == '<SET>':
            d = set(value)
        if key == '<BYTES>':
            d = bytes.fromhex(value)
    #print('object out', d)
    return d

my_jdc = json.JSONDecoder(object_hook=o_hook)
decoded_python = my_jdc.decode(j_st)
```
You call the decode() or raw_decode() method of the custom decoder object. During decoding, it passes every object it decodes to the object hook function. The object hook is always called with a dict, the result of some level of JSON decoding. Sometimes the dict has multiple items, when it represents a higher level of decoding. Sometimes it has just one item, a JSON key string and a Python value resulting from normal decoding, for example {'version': 2} from the earlier test data. Or d may be {'<SET>': [1, 3, 5]}.
The object hook does not have to return a dict. You can return any Python object and it will be used as if it were the result of decoding some JSON. So when the key is <SET> or <BYTES>, don't return a dict, just return the converted set or bytes value.
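Putting the two halves together, here is a minimal self-contained round trip (the same Extended_Encoder and o_hook as above, with the hook returning the converted value directly) showing that the set and bytes values come back intact:

```python
import json

class Extended_Encoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, bytes):
            return {'<BYTES>': "".join("{:02x}".format(c) for c in obj)}
        if isinstance(obj, set):
            return {'<SET>': list(obj)}
        return super().default(obj)

def o_hook(d):
    if 1 == len(d):
        [(key, value)] = d.items()
        if key == '<SET>':
            return set(value)
        if key == '<BYTES>':
            return bytes.fromhex(value)
    return d

original = {'props': set([1, 3, 5]), 'hash': b'\xde\xad\xbe\xef'}
j_st = json.dumps(original, cls=Extended_Encoder)
back = json.JSONDecoder(object_hook=o_hook).decode(j_st)
assert back == original   # the set and the bytes survive the round trip
```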
So, to review:
- To customize JSON encoding, you make a custom class with a modified default() method. Then you call json.dumps() passing your class as the cls= parameter.
- To customize JSON decoding, you define a function and create a custom object by calling json.JSONDecoder() passing it your function as the object_hook parameter, and you call the .decode() method of the custom object.
Yeah, that's clear.
Bullet-proofing Decode
The raw_decode() method takes a string and a starting index. It decodes one JSON object through its closing "}". It returns the decoded Python object and the string index of the character after the decoded object.
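A quick sketch of raw_decode() stepping through a string that holds several consecutive top-level objects. One wrinkle worth knowing: raw_decode() does not skip leading whitespace itself, so the loop has to do that before each call.

```python
import json

dec = json.JSONDecoder()
text = '{"VERSION": 2}\n{"DOCHASH": "deadbeef"}\n'
objs = []
pos = 0
while pos < len(text):
    # raw_decode() raises on leading whitespace, so step past it first
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        break
    obj, pos = dec.raw_decode(text, pos)   # pos moves past the object's "}"
    objs.append(obj)
# objs is now [{'VERSION': 2}, {'DOCHASH': 'deadbeef'}]
```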
I believe I am going to use this to make the PPQT metadata file more error-resistant. My concern is that the user is allowed, even encouraged, to inspect and maybe edit the metadata. But if the user makes one little mistake (so easy to insert or delete a comma or "]" or "}" and so hard to see where) it makes that JSON object unreadable. If all the metadata is enclosed in one big object, a dict with one key for each section, then one little error means no metadata for the book at all. Not good.
So instead I will make each section its own top-level JSON object.
```json
{"VERSION": 2}
{"DOCHASH": {"<BYTES>": "deadbeef..."}}
{"VOCABULARY": {"able": {stuff}, "baker": {stuff}...}}
```
and so forth. Then if Joe User messes up the character census section, at least the pages and vocabulary and good-words and the other sections will still be readable. This might cause problems for somebody who wants to read or write the metadata in a program. But I think it is worthwhile to fool-proof the file.
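As a sketch of the idea (read_sections is my name for illustration, not anything in PPQT): decode one top-level object at a time with raw_decode(), and when a section turns out to be damaged, skip ahead to the next line that opens an object and carry on.

```python
import json

def read_sections(text):
    """Decode consecutive top-level JSON objects; skip any damaged section."""
    dec = json.JSONDecoder()
    sections = []
    pos = 0
    while pos < len(text):
        while pos < len(text) and text[pos].isspace():
            pos += 1   # raw_decode() will not skip whitespace itself
        if pos >= len(text):
            break
        try:
            obj, pos = dec.raw_decode(text, pos)
            sections.append(obj)
        except ValueError:
            # damaged section: resume at the next line that starts an object
            nxt = text.find('\n{', pos)
            if nxt < 0:
                break
            pos = nxt + 1
    return sections

# One garbled section ("BAD") is lost; the sections around it survive.
meta = '{"VERSION": 2}\n{"BAD": }\n{"GOOD": 1}\n'
assert read_sections(meta) == [{'VERSION': 2}, {'GOOD': 1}]
```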