Saturday, September 20, 2014

Hunspell dicts and encodings

I've almost completed testing the word view panel, which displays the vocabulary of words (and word-like tokens) in the document, with their counts and some "properties" such as: are they spelled correctly? The table can be filtered in various ways; in particular, the user can opt to show only the misspelled words.

So my test document had some French phrases, which I'd marked with <span lang='fr_FR'>, to request spell-check using the fr_FR dictionary. And this was working beautifully; words from phrases like je suis jeune fille showed up in the vocabulary list as properly spelled.
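The lookup itself is simple enough; roughly speaking it amounts to something like the following sketch. This is not the actual panel code — it uses a pyhunspell-style HunSpell object, and the dictionary paths are just examples.

    import hunspell

    # One checker per language tag; the paths here are hypothetical.
    checkers = {
        'en_US': hunspell.HunSpell('dicts/en_US.dic', 'dicts/en_US.aff'),
        'fr_FR': hunspell.HunSpell('dicts/fr_FR.dic', 'dicts/fr_FR.aff'),
    }

    def check_word(word, lang='en_US'):
        # Words from a lang='fr_FR' span get the French dictionary;
        # everything else falls back to the default dictionary.
        checker = checkers.get(lang, checkers['en_US'])
        return checker.spell(word)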

Except for words with accents: était, majesté and so on were shown as misspelled. Why?

Well, the whole thing reeks of character encoding issues, dunnit? Somewhere in the interface between a call in Python 3 and the C++ wrapper around Hunspell, there has to be an encoding step to get from Python's however-many-bit Unicode (16? 32? variable?) character string to a C++ char *.

I experimented with encoding the word that I passed, but that only caused more problems. The hunspell call wanted a string, and word.encode(encoding='ISO-8859-1',errors='replace') produces a bytes object. So an immediate TypeError happened.
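That is, roughly this (again using the hypothetical pyhunspell-style checker and made-up paths):

    import hunspell
    checker = hunspell.HunSpell('dicts/fr_FR.dic', 'dicts/fr_FR.aff')  # hypothetical paths

    word = 'était'
    encoded = word.encode(encoding='ISO-8859-1', errors='replace')  # this is bytes, not str
    checker.spell(encoded)  # raises TypeError: the wrapper only accepts a str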

Then I looked at the hunspell wrapper code, and it uses PyArg_ParseTuple() to receive the word-string from Python. And per the Python/C API documentation, "Unicode objects are converted to C strings using 'utf-8' encoding..."

So my Unicode word était is being properly passed into Hunspell as a UTF-8 string, without effort on my part. Hmmm.

Oh.

I remembered (from the month or so I spent buried in spellcheck technology in 2012, struggling to get spellcheck working in version 1) that the .aff file of a dictionary includes an encoding, specifying the encoding of the matching .dic file. I checked, and the fr_FR.aff I had picked up (sometime or other, from OpenOffice.org, I think) had this as its opening line: SET ISO8859-15.

Now, if I were writing a spellchecker these days, I imagine I would use that to decode the file but store the decoded words in full Unicode or UTF-8. But just maybe Hunspell wasn't that smart. So I opened the two files in BBEdit (which has a convenient UI for changing the file encoding), changed that line to SET UTF-8, and saved both files in UTF-8.

Problem gone; now all French words from the test doc checked as correct, even those with accents.

So Hunspell was storing the dictionary words as Latin-1 strings, then comparing them to UTF-8 strings, and not surprisingly, getting mismatches. Making the dictionary file encoding match the Python wrapper interface fixed the problem.
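The mismatch is easy to see at the Python prompt: the dictionary held Latin-9 bytes while the lookup word arrived as UTF-8 bytes, and the two can never compare equal:

    >>> 'était'.encode('utf-8')
    b'\xc3\xa9tait'
    >>> 'était'.encode('iso-8859-15')
    b'\xe9tait'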

Not quite! I can distribute some dictionaries with the program (which I also did with V1), but the user can get more or other dicts from anywhere. As long as they are MySpell/Hunspell compatible, they should work. Except that if they are not encoded in UTF-8, they won't. I foresee problems here.
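One way out might be to detect the declared encoding and convert the files when a dictionary is loaded. Something along these lines — only a sketch, not what the program does today, and convert_dict_to_utf8 is a made-up name:

    import re

    def convert_dict_to_utf8(aff_path, dic_path):
        # Find the SET line in the .aff file to learn the declared encoding,
        # e.g. "SET ISO8859-15". Assume Latin-1 if there is no SET line.
        with open(aff_path, 'rb') as f:
            raw = f.read()
        match = re.search(rb'^SET\s+(\S+)', raw, re.MULTILINE)
        encoding = match.group(1).decode('ascii') if match else 'ISO8859-1'
        if encoding.upper() in ('UTF-8', 'UTF8'):
            return  # already fine, nothing to do
        for path in (aff_path, dic_path):
            with open(path, 'r', encoding=encoding) as f:
                text = f.read()
            if path == aff_path:
                # Rewrite the SET line so it matches the new file encoding.
                text = re.sub(r'^SET\s+\S+', 'SET UTF-8', text, count=1, flags=re.MULTILINE)
            with open(path, 'w', encoding='utf-8') as f:
                f.write(text)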
