Thursday, October 16, 2014

What can json.dump?

A few quick 'speriments to make sure that json.dumps() can handle all sorts of metadata. Two restrictions show up.

One, it rejects a bytes value with builtins.TypeError: b'\x00\x01\x03\xff' is not JSON serializable. This is an issue because one piece of metadata is an SHA hash signature of the document file. This lets me make sure that the metadata file is of the same generation as the document file. (If the user messed up a restore from backup, for example, restoring only the document but keeping a later metadata, all sorts of obscure failures would follow.) The output of QCryptographicHash(QCryptographicHash.Sha1) is a bytes value.

Two, it rejects a Python set value with the same error. The worddata module wants to store, for each vocabulary word, a set of a few integers encoding the word's properties, e.g. uppercase or mixedcase, contains an apostrophe, contains a hyphen, etc.

The solution in both cases is to ask json.dumps() to serialize, not the value, but the __repr__() of the value. On input, the inverse of __repr__() is to feed the string into ast.literal_eval(), and check the type of what comes out.

I put quite a bit of care into coding the input of the metadata, because I want tell the user to feel free to hand-edit the metadata. If the user can edit the file, the user can screw it up. There's no point in telling the user not to edit the file because she will anyway. Better to document it, and then be very leery of accepting any value it contains. Part of that is using literal_eval(), which checks the syntax of a presumed Python value and will not pass executable code (hence no code injection).

The old metadata format was quite simple. The JSON one, even if I tell it to indent prettily, will be less easy for a user to fiddle with.

Hmmm. Also in the old format, as long as the user didn't mess up one of the section boundaries, he lost at most one section's data. In fact, the error detections I coded into the current code reject only single lines (with a log message). But if the user edits a JSON file and mucks up a syntactic delimiter... Must think about how to contain JSON errors to single sections, and not allow a single deleted "}" or "," to cause the whole file to be unreadable.

No comments: