Tuesday, December 23, 2014

Continuing ctypes and hunspell

Right, so we have used ctypes to locate the Hunspell dylib and invoke the Hunspell_create() function, getting back a handle to a C++ object of class Hunspell. That demonstrated that Python 3 strings can be passed to a C function that expects const char * parameters.
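
For the record, the gist of that setup was something like this. The dylib name and the dictionary paths are from my machine, so treat them as placeholders:

import ctypes as C
# Load the Hunspell shared library; this path is Mac-specific and the
# exact library name varies with the installed Hunspell version.
hunlib = C.CDLL('/usr/local/lib/libhunspell-1.3.0.dylib')
# Hunhandle *Hunspell_create(const char *affpath, const char *dpath);
# Declared as c_wchar_p, ctypes passes each Python str as a
# wide-character pointer, although the header says const char *.
hunlib.Hunspell_create.argtypes = [C.c_wchar_p, C.c_wchar_p]
hunlib.Hunspell_create.restype = C.c_void_p
hun_handle = hunlib.Hunspell_create(
    '/Users/dcortes1/Desktop/scratch/en_US.aff',
    '/Users/dcortes1/Desktop/scratch/en_US.dic' )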

Then we invoked a method of the Hunspell object by calling the C wrapper Hunspell_get_dic_encoding(), and huzzah! It returned what it was supposed to return: Hunspell's belief about the encoding of the dictionary's .dic and .aff files. It returned 'ISO8859-1', which may prove to be significant.
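
In ctypes terms that call looked roughly like this (continuing from the sketch above):

# char *Hunspell_get_dic_encoding(Hunhandle *pHunspell);
hunlib.Hunspell_get_dic_encoding.argtypes = [C.c_void_p]
hunlib.Hunspell_get_dic_encoding.restype = C.c_char_p
print( hunlib.Hunspell_get_dic_encoding( hun_handle ) )  # prints b'ISO8859-1'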

Next was to try to invoke the most important method, spell(). If this works, I can toss the whole hunspell.py package and just use fewer than 20 lines of ctypes code (maybe 30, adding code for platform dependencies). Hunspell has a dozen other methods (suggestions, stemming, etc.), but all I need is spell(word), yielding 0 for a bad word and nonzero for a good one. The C header file says,

LIBHUNSPELL_DLL_EXPORTED int Hunspell_spell(Hunhandle *pHunspell, const char *);

Translating to Python,

hunlib.Hunspell_spell.argtypes = [C.c_void_p, C.c_wchar_p]
hunlib.Hunspell_spell.restype = C.c_uint

OK, let's do it!

for s in [ 'a', 'the', 'asdfasdf' ] :
    t = hunlib.Hunspell_spell( hun_handle, s )
    print(t, s)

Output:

b'ISO8859-1'
0 a
0 the
0 asdfasdf

Not good! Hunspell is claiming that neither "a" nor "the" is a valid word.

My diagnosis is this. I know the dictionary was opened successfully, and the words are in it. So either the word is not being passed correctly, or Hunspell is not comparing it correctly. I tried several variations on passing the argument, as sketched below: I changed the argtypes to declare the word as c_char_p (no change), then converted the word with b = bytes(s, 'ISO-8859-1', 'ignore') and passed the byte string (no change), and tried again encoding as UTF-8 (no change).
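
For example, the c_char_p variation looked like this:

# Variation: declare the word parameter as char * and pass an
# explicitly encoded byte string instead of a Python str.
hunlib.Hunspell_spell.argtypes = [C.c_void_p, C.c_char_p]
for s in [ 'a', 'the', 'asdfasdf' ] :
    b = bytes(s, 'ISO-8859-1', 'ignore')   # also tried 'UTF-8'
    t = hunlib.Hunspell_spell( hun_handle, b )
    print(t, s)
# Output is identical: 0 for every word.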

It stands out that Hunspell thinks the dictionary is encoded Latin-1. It lurks somewhere in my memory that I solved a similar problem by converting the dictionary to UTF-8 encoding. The encoding of the .dic file is specified in the .aff file in a SET statement. So I opened both files in BBEdit and saved them as UTF-8, also changing the .aff file to read SET UTF-8 (which is the same as the SET statement in a Greek dictionary). Tried again.
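
Concretely, the edit is one line near the top of the .aff file. Presumably it read

SET ISO8859-1

and now it reads

SET UTF-8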

b'ISO8859-1'
0 a
0 the
0 asdfasdf

Wait, what? The SET statement says, and the actual file encodings are, UTF-8, but get_dic_encoding returns ISO8859-1? Just in case there's some kind of file caching going on, I copy the dictionary files to a different folder and change the path string to match. No change! I re-save the files as UTF-8 "with BOM". No change; it still returns ISO8859-1.

Now I doubt my prior diagnosis. Hunspell is ignoring the content of the dictionary, which possibly means it isn't reading it at all. Maybe it is failing to open the files, not reporting a failure, and returning some kind of default?

Does the hunspell package do the same?

>>> import hunspell
>>> import os
>>> dpath = '/Users/dcortes1/Desktop/scratch'
>>> daff = os.path.join(dpath, 'en_US.aff')
>>> ddic = os.path.join(dpath, 'en_US.dic')
>>> Hobj = hunspell.HunSpell(daff, ddic)
>>> Hobj.get_dic_encoding()
'ISO8859-1'
>>> Hobj.spell('the')
False

OK, I am officially flabbergasted. My flabber is gast. That dictionary is UTF-8 and defines "the". It is nice, in a way, that the package (which works fine inside PPQT1 and 2) is failing exactly as my ctypes experiment fails, but what the heck am I doing wrong?

I'll post again when I understand more.

Update: this much is confirmed:

>>> Hobj = hunspell.HunSpell('aasdf/asdf/asdf.aff', 'aasdf/asdf/asdf.dic')
>>> Hobj.get_dic_encoding()
'ISO8859-1'

If the Hunspell object creation cannot open the .aff/.dic files, it fails silently and uses a null dictionary with a default encoding. And I don't see any way of testing whether this has happened or not!
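
Pending a better answer, the only defense I can think of is to probe from the Python side before trusting the object. A rough sketch, assuming the same pyhunspell binding; the function name and the choice of "the" as a sanity word (which presumes an English dictionary) are my own:

import os
import hunspell

def open_dictionary(aff_path, dic_path):
    # HunSpell() fails silently on unreadable files, so check first.
    for p in (aff_path, dic_path):
        if not os.access(p, os.R_OK):
            raise IOError('cannot read dictionary file: ' + p)
    hobj = hunspell.HunSpell(aff_path, dic_path)
    # Probe with a word any English dictionary should accept;
    # a null dictionary rejects everything.
    if not hobj.spell('the'):
        raise ValueError('dictionary loaded but rejects "the"')
    return hobj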
