Monday, November 3, 2014

JSON Metadata: sorted dicts and sordid ones

Continuing on git branch new_meta. Finding each module that calls the metadata manager and recoding it to save and load in the new JSON format. This usually results in a vast simplification. Previously, the "writer" method received a stream handle and was responsible for creating and writing formatted lines of text to encode its type of metadata; and the "reader" method got a stream handle and had to read the lines of formatted text and decode them. Under the new regime, the writer returns a single Python value (typically a list or dict), and the reader gets that single value as an argument. No more formatting data as lines and streaming them with << or >> operators. Just a blob of data out, a blob of data in.
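To make the "blob out, blob in" idea concrete, here is a minimal sketch of how such a scheme could be wired up. The MetaManager class, its register/save/load methods, the Book class, and the "EDITSIZE" section name are all hypothetical stand-ins invented for illustration; they are not the actual manager API.

```python
import json

class MetaManager:
    """Hypothetical stand-in for the metadata manager, not the real class."""

    def __init__(self):
        self._handlers = {}  # section name -> (reader, writer)

    def register(self, section, reader, writer):
        self._handlers[section] = (reader, writer)

    def save(self):
        # Each writer returns one JSON-serializable value (a blob out).
        blob = {name: writer() for name, (_, writer) in self._handlers.items()}
        return json.dumps(blob)

    def load(self, text):
        # Each reader receives the single value stored for its section (a blob in).
        for name, value in json.loads(text).items():
            if name in self._handlers:
                reader, _ = self._handlers[name]
                reader(value)

class Book:
    """Hypothetical module with one piece of metadata: the edit font size."""

    def __init__(self):
        self.font_size = 12

    def write_font_size(self):
        return self.font_size

    def read_font_size(self, value):
        # The file may have been hand-edited, so check before accepting.
        if isinstance(value, int) and value > 0:
            self.font_size = value
```

Registering book.read_font_size and book.write_font_size under one section name is then all a module would need to do; no line formatting or stream handling is involved on either side.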

For each module there's a modname_test module that exercises it. These unit-test drivers used the metadata system heavily. They formatted metadata streams and pushed them in via the metadata manager, and then used the manager to suck the metadata back and check it. Or pushed in invalid metadata and checked the contents of the log for proper error messages. It was a handy way to exercise every branch.

Naturally, when the metadata readers and writers of a module change, the test code that prepares metadata and reads it back must change too. So far there are about three times as many lines of code to alter in the test drivers as in the driven code. (Picture a frowny-face icon here.)

All went smoothly modifying and testing the four types of metadata handled by book.py (edit font size, edit cursor position, default dictionary tag, and user bookmark positions 1-9). Each of the reader/writer pairs became simpler, as expected.

Next up in alphabetic sequence is chardata.py. This is the module that maintains the census of characters in the document. Originally it did so using a sorteddict from the blist package, but recently I discovered the sortedcontainers package, which is as fast as blist and is pure Python.

Either way, the character census is in a SortedDict object with single unicode characters as keys, and integer counts as values. So obviously, the metadata writer function could consist of just:

    return self.census

that is, return the value of the dict of character counts. The reader would receive that dict as a single value. The reader has to be a bit more careful because the user might have edited the metadata, so it has to do basic sanity checks: are the keys single characters, are the counts greater than 0, etc.

But this pretty scheme didn't work out well for the test driver. The test driver loaded the document with the contents of "ABBCCC" and then called the metadata manager to get the character census. Immediate error: "SortedDict cannot be serialized by JSON". Oh. Right. OK, change the writer to return dict(self.census). Convert the SortedDict to an ordinary dict. This worked in the sense that it could be serialized to JSON, but when the test driver pulled the metadata and compared it, it failed with:

expected: {"CHARCENSUS":{"A":1,"B":2,"C":3}}
received: {"CHARCENSUS":{"B":2,"C":3,"A":1}}

Oops. Obviously what's happening is that when json.dumps() serializes a dict, it writes the items in the order returned by dict.items(), which is the order of the key hash table. That isn't predictable. Time to stop and think.

OK, I can leave it this way, and write the test driver to basically do a set-wise comparison on two dicts, ensuring that the received dict has all, but only, the keys and values of the expected dict. Not fun. Also, if I leave it as-is, it pretty well screws the possibility of the user editing this part of the metadata file. How would you find the entry for "X" in a random-sequenced list of 150 or more characters? And think ahead to worddata, which has almost the same structure: if its 5000-10000 metadata values aren't in sorted order, what a sordid mess.
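For the record, the order-insensitive comparison would not need to be hand-rolled if the test driver parsed both strings first, since Python compares dicts by keys and values and ignores order. A rough sketch, assuming the driver is comparing JSON text; same_metadata is a hypothetical helper name:

```python
import json

expected = '{"CHARCENSUS":{"A":1,"B":2,"C":3}}'
received = '{"CHARCENSUS":{"B":2,"C":3,"A":1}}'

def same_metadata(text_a, text_b):
    """Compare two JSON documents by value rather than by text."""
    # json.loads turns each object into a dict, and dict equality
    # does not depend on the order of the keys.
    return json.loads(text_a) == json.loads(text_b)
```

That would rescue the test, but it does nothing for the user who has to find one character in an unsorted file.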

So better to change the metadata format to something that can be sequenced. I rewrote the metadata writer as:

    return [ [key,count] for (key, count) in self.census.items() ]

The items() method of a SortedDict returns the items in sorted order by key. JSON serializes list items in the order given, so they are in the file in sequence. This took no more code in the reader, because the reader already had code to examine each (key, count) item for validity.
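Here is a small self-contained sketch of the round trip through the list-of-pairs format. It is an approximation, not the chardata.py code: a plain dict passed through sorted() stands in for the SortedDict, and write_census and read_census are hypothetical names. The checks in the reader are the ones described above (single-character keys, counts greater than 0).

```python
import json

def write_census(census):
    # sorted() stands in for SortedDict.items(), which already yields
    # the items in key order.
    return [[key, count] for key, count in sorted(census.items())]

def read_census(value):
    """Rebuild a {char: count} dict, skipping entries that fail the checks."""
    census = {}
    if not isinstance(value, list):
        return census
    for item in value:
        if not (isinstance(item, list) and len(item) == 2):
            continue
        key, count = item
        if not (isinstance(key, str) and len(key) == 1):
            continue  # key must be a single character
        if isinstance(count, bool) or not isinstance(count, int) or count <= 0:
            continue  # count must be an integer greater than 0
        census[key] = count
    return census

# Insertion order here is deliberately scrambled.
text = json.dumps({"CHARCENSUS": write_census({"B": 2, "C": 3, "A": 1})})
print(text)  # {"CHARCENSUS": [["A", 1], ["B", 2], ["C", 3]]}
```

Because a JSON array keeps its element order, the serialized text is the same no matter how the census was built, so the test driver can go back to a plain string comparison.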
