Friday, May 23, 2014

First Python2 to Python3 Problem

PPQT, like its inspiration Guiguts, maintains a bunch of metadata in a separate file. When saving bookname.txt it also saves bookname.meta with things like the user's notes, the file positions of all page breaks, and much else. This is an inherently fragile scheme because nothing links the two files except their names. The operating system, for example, has no idea the two files should always be copied together. It works adequately because the users are schooled in the habit of keeping everything about one project in one folder, so there's rarely a problem.

However, in the early days of PPQT, one early adopter exposed the weakness when he had to restore a project from backup and restored the book file but not the meta file. As a result all the metadata were wrong: page breaks in the wrong places, and so on.

In response, I added a simple hash signature. On saving a file, PPQT takes an SHA-1 hash of the document text, and writes the hash signature into the meta file. On opening a file, the text is hashed and the signature compared to the metadata. If they differ, the user gets a strongly-worded warning message.
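As a rough, Qt-free sketch of that save/open check: the version below assumes Python's standard hashlib in place of QCryptographicHash, assumes UTF-8 for turning text into bytes, and stores a hex digest rather than the repr format PPQT actually writes. The names signature_of and matches are hypothetical helpers, not PPQT functions.

```python
import hashlib

def signature_of(text):
    """Return the SHA-1 of the document text as a hex string (hypothetical helper)."""
    # Hash functions consume bytes, so the text must be encoded first (UTF-8 assumed).
    return hashlib.sha1(text.encode('utf-8')).hexdigest()

def matches(text, saved_signature):
    """True when the text still hashes to the signature saved with the metadata."""
    return signature_of(text) == saved_signature

book = 'It was a dark and stormy night.\n'
saved = signature_of(book)           # on save, this would go into the .meta file

print(matches(book, saved))          # the same text that was saved
print(matches(book + 'x', saved))    # a book file that no longer matches its metadata
```

When the second kind of comparison fails on opening a file, that is the moment to show the warning.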

The code to output the signature is basically this:

    cuisineart = QCryptographicHash(QCryptographicHash.Sha1)
    ...
    cuisineart.addData(the_document.toPlainText())
    meta_stream << '{{DOCHASH '
    meta_stream << bytes(cuisineart.result()).__repr__()
    meta_stream << ' }}\n'

("Cuisinart" is a brand of food processors, hence the variable name. Ha ha.) So the whole text of the book file is poured into the blender. The signature comes out as cuisineart.result(), which is a QByteArray. In PyQt4, one needed to coerce that to a Python bytes value, and that value's .__repr__() converted it to something printable. The result was a line in the .meta file like this:

{{DOCHASH '\x9a\x9fjG\x99\xd0\x1b\xea\x84\xdeT\x8f:\xb8\xfb\xd5\x06\x82\x10|' }}

Does anyone see the problem with this? In Python 2, bytes is just another name for str, so the __repr__() of a bytes value looks exactly like that of any character string. In Python 2, we still used the old comfortable, sloppy assumption that a char was a byte was a char.

In Python 3, a bytes value is a type of its own and its __repr__() carries a b prefix, because Python 3 draws a sharp distinction between characters (numeric tokens that stand for letters, with a bit precision somewhere between 8 and 32) and bytes, which are 8-bit unsigned numbers with no defined character representation.

Practical result? Code identical to the above, executed by Python 3, produces this:

{{DOCHASH "b'\\x9a\\x9fjG\\x99\\xd0\\x1b\\xea\\x84\\xdeT\\x8f:\\xb8\\xfb\\xd5\\x06\\x82\\x10|'" }}

Even when the byte values started out identical, '\xde\xad' != "b'\\xde\\xad'".
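The mismatch is easy to reproduce in a plain Python 3 session without Qt. In this small sketch, digest is a made-up two-byte value standing in for a real signature, and py2_style is written out by hand to show what Python 2 prints as the repr of the same bytes.

```python
digest = b'\xde\xad'              # made-up stand-in for a real hash value

py3_repr = digest.__repr__()      # the text Python 3 produces
py2_style = "'\\xde\\xad'"        # the text Python 2 produces for the same bytes

print(type(py3_repr).__name__)    # the repr itself is an ordinary str
print(py3_repr)
print(py3_repr == py2_style)        # False: the Python 3 text has a leading b
print(py3_repr == 'b' + py2_style)  # True: the prefix is the only difference
```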

In PPQT2, metadata reading is distributed. Various modules "register" to read and write different metadata sections. A reader function is called when a line starts with {{SECTION. It is passed the SECTION value, a version number, and a "parm" containing whatever string followed the SECTION on that line. In the case of the DOCHASH section, the parm is the signature string. The reader for the DOCHASH section began like this:

    def _read_hash(self, stream, sentinel, v, parm) :
        cuisineart = QCryptographicHash(QCryptographicHash.Sha1)
        cuisineart.addData(the_document.toPlainText())
        if parm != cuisineart.result().__repr__() :
            '''issue horrible warning to user'''

And of course, being executed in Python 3, it didn't work; the repr of the new signature has a b in it. Fortunately there is the version parameter. That is read earlier in the metadata stream. In early files it's omitted and defaulted to 0; later files have it as {{VERSION 0}}. So the DOCHASH reader can do this:

    def _read_hash(self, stream, sentinel, v, parm) :
        if v < 2 :
            '''do something to make an old signature compatible'''

But, um, what? Experimenting on the command line, we find this:

    >>> b = b'\xde\xad\xbe\xef'
    >>> c = '\xde\xad\xbe\xef'
    >>> b == c
    False
    >>> # very well, coerce c into bytes
    >>> bytes(c)
    Traceback (most recent call last):
      File "", line 1, in 
    builtins.TypeError: string argument without an encoding
    >>> bytes(c,'Latin-1','ignore')
    b'\xde\xad\xbe\xef'
    >>> bytes(c,'Latin-1','ignore') == b
    True
    >>> b.__repr__() == bytes(c,'Latin-1','ignore').__repr__()
    True

So the answer is to convert the old char value into a bytes value. The bytes() function insists that I say how the chars are encoded. This is reasonable for the general case, but I happen to know these chars aren't chars; they're just bytes. I looked for an encoding name that would convey that, but didn't see one. So I use 'Latin-1', which maps each code point from 0 through 255 to the byte with the same value. The 'ignore' argument is only a safety net: it tells the encoder to silently drop any character that doesn't fit in Latin-1 rather than raise an error.
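A quick check of what that call does, as a sketch in plain Python 3. The euro sign is just an arbitrary example of a character outside Latin-1; nothing like it should turn up in a real signature.

```python
all_bytes = bytes(range(256))

# Decoding as Latin-1 yields one character per byte, code point == byte value.
as_chars = all_bytes.decode('latin-1')
print(len(as_chars))

# Encoding those characters again recovers the original bytes exactly.
print(bytes(as_chars, 'latin-1') == all_bytes)

# A character above U+00FF has no Latin-1 byte; 'ignore' drops it silently.
print(bytes('\xde\u20ac\xad', 'latin-1', 'ignore'))
```

So for characters in the 0 to 255 range the conversion loses nothing, and anything outside that range simply disappears.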

I think this'll work:

    def _read_hash(self, stream, sentinel, v, parm) :
        if v < 2 :
            parm = bytes(parm,'Latin-1','ignore').__repr__()
        cuisineart = QCryptographicHash(QCryptographicHash.Sha1)
        cuisineart.addData(the_document.toPlainText())
        # coerce the QByteArray to bytes, as the writer does, so the reprs are comparable
        if parm != bytes(cuisineart.result()).__repr__() :
            '''issue horrible warning to user'''
