Sunday, October 5, 2014

Thinking About JSON

So PPQT stores quite a bit of metadata about a book, saving it in a file bookname.meta. When I created this in version 1, I followed an example in Mark Summerfield's book and devised my own custom file format. (I don't have the book with me on this vacation trip, but skimming the contents online I note that the same chapter has topics on reading XML. Probably I looked at that, said "Oh hell no," and quickly cobbled up a simpler "fenced" syntax for my files.)

Everything about reading and writing that format was in one place in V1. From one angle, keeping all knowledge of a special file format in one place is a good idea, but here it was a bad one. It mandated a very long and complex routine that had to know how to fetch and format eight or ten different kinds of data on save, and parse and restore those same kinds of data on load. In hindsight it was more important to isolate knowledge of each particular type of data in the module that dealt with that data. So a key goal for the V2 rewrite was to distribute the job of reading and writing metadata among the modules that manage those data.

Almost the first code I wrote for V2 was the metadata manager module, which handles reading and writing the top level of my metadata format. The various objects that make up one open book "register" with the metadata manager, specifying the name of the section they handle and giving it references to their reader and writer functions.

The metadata code recognizes the start of a fenced section on input and calls the registered reader for that section, passing the QTextStream for the metadata file. On save, it runs through its dictionary of registered section names and calls each writer in turn.
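In outline, the scheme amounts to something like the following sketch. The class and function names here are my shorthand for this post, not the literal PPQT identifiers, and the fence-matching regex is an assumption:

    import re
    SECTION_START = re.compile(r'\{\{(\w+)')  # start of a fenced section

    class MetaMgr:
        def __init__(self):
            self.readers = {}  # section name -> reader function
            self.writers = {}  # section name -> writer function

        def register(self, section, reader, writer):
            self.readers[section] = reader
            self.writers[section] = writer

        def load_meta(self, qts):
            # scan the QTextStream; at each fence, let the registered
            # reader consume the lines of its own section
            while not qts.atEnd():
                match = SECTION_START.match(qts.readLine())
                if match and match.group(1) in self.readers:
                    self.readers[match.group(1)](qts)

        def save_meta(self, qts):
            # each registered writer formats and writes its own section
            for section, writer in self.writers.items():
                writer(qts, section)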

In this way, knowledge of the file organization is in the metadata manager, but all knowledge of how to format, parse, fetch, or store the saved data is in the module that deals with that data. For example, the worddata module stores the census of words in the book. It knows that each line of its metadata section is a word (with an optional dictionary tag for words in a lang=tag span) followed by its count and some property flags. And it knows how it wants to store those words when they are read back in. None of that knowledge leaks out to become a dependency in the metadata file code.
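(To invent an example: a census line might read madame/fr_FR 12 p, that is, the word with its dictionary tag, then a count, then a string of property flags. The exact layout here is my guess from the description above, not necessarily the real thing.)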

As you can maybe tell, I was quite pleased with this scheme. There are a total of sixteen metadata sections in the V2 code, each managed by its own reader/writer pair, and all happily working right now. Some are one-liners, like {{EDITSIZE 13}} to remember the edit font size. Others like the vocabulary are hundreds of lines of formatted records. But all code complete and tested.

So obviously it must be time to rip it all up and re-do it!

A note from my longest-standing and most communicative user asked: oh, by the way, it would be really helpful if you could change your metadata format to something standard, like YAML or JSON.

Right after saying "Oh hell no," I had to stop and think and realize that really, this makes a lot of sense. Although I knew almost nothing about JSON, it is such a very simple syntax that I felt up to speed on it in about 20 minutes.

But how to generate it? Would I have to pip-install yet another module? Well, no. There's a standard library module for it. And it looks pretty simple to use. You basically feed a Python dict into json.dumps() and out comes JSON syntax as a string to write to the QTextStream. On input, readAll() the stream and drop it into json.loads() and out comes a Python dict.
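The round trip is about as simple as library code gets. A quick sketch in plain Python, with the QTextStream plumbing left out (EDITSIZE is a real section name from this post; the other is just a placeholder):

    import json

    # a dict of section names to values
    meta = {'EDITSIZE': 13, 'SOMESECTION': ['a', 'list', 'of', 'values']}

    # on save, a Python dict goes in and JSON text comes out...
    text = json.dumps(meta, indent=2)

    # ...and on load, the JSON text parses back into an equal dict
    assert json.loads(text) == meta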

This leads to a major revision of the V2 metadata scheme, but the basic structure remains.

The metadata file will consist of a single JSON object (i.e., a Python dict) whose keys are the names of the 16 sections. The value of each section name is a single JSON/Python value. For simple things like the edit point size, or the default dictionary tag, the value is a single number or string. For complicated things, the value is a dict (JSON object) or a list (JSON array).
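Concretely, the file might look something like this (apart from EDITSIZE, the section names and values here are invented for illustration):

    {
      "EDITSIZE": 13,
      "DEFAULTDICT": "en_US",
      "WORDCENSUS": [
        ["madame", "fr_FR", 12, "p"],
        ["abbot", "", 4, ""]
      ]
    }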

As before, modules will register their reader/writer functions for their sections. But they will not be passed a text stream to read or write. Instead, each reader will receive the single Python value that represents its section. Each writer, instead of putting text into a stream, is expected to return the single Python value that encodes its section's data.
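For a one-value section like the edit font size, the pair shrinks to almost nothing. A sketch, with invented names throughout and assuming the editor is a Qt text widget in scope:

    # the writer returns the single Python value encoding its section
    def editsize_writer():
        return editor.font().pointSize()

    # the reader receives that value back, already decoded from JSON
    def editsize_reader(value):
        font = editor.font()
        font.setPointSize(int(value))
        editor.setFont(font)

    # meta_mgr.register('EDITSIZE', editsize_reader, editsize_writer)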

On input, the metadata manager will use json.loads() to get one big dict. It will run through the keys of that dict, look each key up among its registered reader functions, and pass the key's value to that reader.

On output, the manager will run through the registered writers and form a dict with section-name keys and the values returned by their writers. Then just shove the json.dumps() of that dict into the text stream.
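And the manager itself reduces to a few lines (same invented names as the earlier sketch; qts is a PyQt QTextStream):

    import json

    class MetaMgr:
        def __init__(self):
            self.readers = {}
            self.writers = {}

        def register(self, section, reader, writer):
            self.readers[section] = reader
            self.writers[section] = writer

        def load_meta(self, qts):
            # readAll() the stream, parse it, and fan the values out
            for section, value in json.loads(qts.readAll()).items():
                if section in self.readers:
                    self.readers[section](value)

        def save_meta(self, qts):
            # collect one value per registered writer, then dump the lot
            meta = {name: writer() for name, writer in self.writers.items()}
            qts << json.dumps(meta, indent=2)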

It all shapes up nicely in my imagination. Just a whole bunch of coding to implement it. And not just the metadata module and all the reader/writer functions, oh hell no, there are several unit test drivers that made liberal use of metadata calls to push test data into, and check the output of, their target modules. All those will have to be recoded to use JSON test data.

There are other implications of this change to be explored in later posts.
