New post at PGDP forums
At a user's request I posted a discussion of PPQT and ppgen in the ppgen forum topic. It's the first time in a long time I've posted anything at PGDP.
JSON customization
In the last post I noted that json.dump() could not deal with either byte data or set data. Long-time PPQT supporter Frank replied by email, showing me how one can customize the default() method of the encoder to handle these cases, turning a set into a list and a bytes into a string. That automates the encoding process, but decoding back to bytes or set data, he said, had to be handled after the JSONDecoder had run.
Well, not quite. I think I have worked out how to make both encoding and decoding of these types automatic. I must say that the standard library json module does not make this easy; the API is confusing and inconsistent, and the documentation, while accurate, is not exactly helpful. But here's what I have so far.
Custom Encoding
To customize encoding you define a class derived from json.JSONEncoder. In it you override just one method, default(obj). It receives a single Python object (a number, string, dict, anything) and returns an object that can be serialized by JSON. That can be the same object, or a different one. Or, if you don't want to handle it, call super().default(obj), which unconditionally raises a TypeError.
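For reference, that TypeError is exactly what you get today when json.dumps() meets a bytes or a set with no custom encoder; a quick check:

```python
import json

# With no custom encoder, bytes and set both fall through to
# JSONEncoder.default(), which raises TypeError unconditionally.
errors = []
for value in (b'\xde\xad', {1, 3, 5}):
    try:
        json.dumps(value)
    except TypeError as err:
        errors.append(str(err))
print(errors)
```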
So here's mine:
class Extended_Encoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, bytes) :
            return { '<BYTES>' : "".join("{:02x}".format(c) for c in obj) }
        if isinstance(obj, set) :
            return { '<SET>' : list(obj) }
        return super().default(obj)
If obj is a bytes, return a dict with the key <BYTES> and a string value.
If obj is a set, return a dict with the key <SET> and a list value.
You might think, if you are defining a custom class, that at some point you would create an instance of said class and use it. But nuh-uh. You just pass the class itself to json.dumps() as the cls argument:
tdict = {
    'version' : 2,
    'vocab' : [
        {'word' : 'foo', 'props' : set([1,3,5]) },
        {'word' : 'bar', 'props' : set([3,5,7]) } ],
    'hash' : b'\xde\xad\xbe\xef'
    }
j_st = json.dumps(tdict, cls=Extended_Encoder)
What comes out, for the above test dict, is (with some newlines inserted)
{"vocab": [
{"word": "foo", "props": {"<SET>": [1, 3, 5]}},
{"word": "bar", "props": {"<SET>": [3, 5, 7]}}],
"version": 2,
"hash": {"<BYTES>": "deadbeef"}}
Custom Decoding
To customize JSON decoding, you don't make a custom class based on json.JSONDecoder. (Why would you want decoding to be consistent with encoding?) No, you write a function to act as an "object hook". You create a custom decoder object by calling json.JSONDecoder() with the object_hook parameter:
def o_hook(d):
    #print('object in ',d)
    if 1 == len(d):
        [(key, value)] = d.items()
        if key == '<SET>' :
            d = set(value)
        if key == '<BYTES>' :
            d = bytes.fromhex(value)
    #print('object out',d)
    return d
my_jdc = json.JSONDecoder(object_hook=o_hook)
decoded_python = my_jdc.decode(j_st)
You call the decode() or raw_decode method of the custom decoder object.
During decoding, it passes every object it decodes to the object hook function.
The object hook is always called with a dict. The dict results from some level of JSON decoding. Sometimes the dict has multiple items, when it represents a higher level of decoding. Sometimes it has just one item, a JSON key string and a Python value resulting from normal decode: for example {'version': 2} from the earlier test data, or {'<SET>': [1, 3, 5]}.
The object hook does not have to return a dict. You can return any Python object and it will be used as if it were the result of decoding some JSON. So when the key is <SET> or <BYTES>, don't return a dict, just return the converted set or bytes value.
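Putting the two halves together, here is the whole round trip as one self-contained sketch (the same encoder class and object hook as above, with a slightly smaller test dict):

```python
import json

class Extended_Encoder(json.JSONEncoder):
    def default(self, obj):
        # encode the two types JSON can't handle as tagged dicts
        if isinstance(obj, bytes):
            return {'<BYTES>': "".join("{:02x}".format(c) for c in obj)}
        if isinstance(obj, set):
            return {'<SET>': list(obj)}
        return super().default(obj)

def o_hook(d):
    # turn the tagged dicts back into their original types
    if 1 == len(d):
        [(key, value)] = d.items()
        if key == '<SET>':
            d = set(value)
        if key == '<BYTES>':
            d = bytes.fromhex(value)
    return d

tdict = {
    'version': 2,
    'vocab': [{'word': 'foo', 'props': set([1, 3, 5])}],
    'hash': b'\xde\xad\xbe\xef'
    }
j_st = json.dumps(tdict, cls=Extended_Encoder)
back = json.JSONDecoder(object_hook=o_hook).decode(j_st)
assert back == tdict  # sets and bytes both survive the round trip
```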
So, to review:
- To customize JSON encoding, you make a custom class with a modified default() method, then call json.dumps() passing your class as the cls argument.
- To customize JSON decoding, you define a function, create a custom decoder object by calling json.JSONDecoder() with your function as the object_hook argument, and call the .decode() method of that custom object.
Yeah, that's clear.
Bullet-proofing Decode
The raw_decode() method takes a string and a starting index. It decodes one JSON object through its closing "}".
It returns the decoded Python object and the string index of the character after the decoded object.
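A quick illustration of that return value; the trailing text here is only there to show where decoding stops:

```python
import json

dec = json.JSONDecoder()
s = '{"VERSION": 2} leftover text'
# raw_decode returns the decoded object and the index of the
# first character after the closing "}"
obj, end = dec.raw_decode(s, 0)
print(obj, end, repr(s[end:]))
```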
I believe I am going to use this to make the PPQT metadata file more error-resistant.
My concern is that the user is allowed, even encouraged, to inspect and maybe edit the metadata.
But if the user makes one little mistake (so easy to insert or delete a comma or "]" or "}" and so hard to see where) it makes that JSON object unreadable. If all the metadata is enclosed in one big object, a dict with one key for each section, then one little error means no metadata for the book at all. Not good.
So instead I will make each section its own top-level JSON object.
{"VERSION":2}
{"DOCHASH": {"<BYTES>":"deadbeef..."} }
{"VOCABULARY: {
"able": {stuff},
"baker": {stuff}...}
}
and so forth. Then if Joe User messes up the character census section, at least the pages and vocabulary and good-words and the other sections will still be readable. This might cause problems for somebody who wants to read or write the metadata in a program. But I think it is worthwhile to fool-proof the file.
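Here is a sketch of what that error-resistant reader might look like: loop calling raw_decode(), and when a section fails to parse, resynchronize at the next line that opens a new object. The recovery heuristic (scanning for a newline followed by "{") is my assumption about the file layout, not anything the json module provides.

```python
import json

def read_sections(text):
    # Decode successive top-level JSON objects from text, skipping any
    # section that fails to parse. Assumes each section starts on a new
    # line with "{" -- a layout convention, not a json module feature.
    decoder = json.JSONDecoder()
    sections = []
    idx = 0
    while idx < len(text):
        while idx < len(text) and text[idx].isspace():
            idx += 1  # skip whitespace between sections
        if idx >= len(text):
            break
        try:
            obj, end = decoder.raw_decode(text, idx)
            sections.append(obj)
            idx = end
        except ValueError:
            nxt = text.find('\n{', idx)  # resync at the next section
            if nxt == -1:
                break
            idx = nxt + 1
    return sections

meta = '{"VERSION": 2}\n{"BROKEN": [1, 2,}\n{"VOCAB": {"able": 1}}\n'
good = read_sections(meta)
# the damaged middle section is dropped; the other two survive
```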