Re: spell checking (long, sorry...)



On Thu,  5 July 20:33 Albrecht Dreß wrote:

> But now for the bad news... for me it does not work for words with "umlauts"
> (German national characters). Looking into src/spell-check.c, line 1200, I
> found the following regexp to isolate words:
> 
>     const gchar *new_word_regex = "\\<[[:alpha:]']*\\>";
> 
> Apparently, my glibc implementation (yes, LANG/LC_ALL are de_DE.ISO-8859-1)
> recognises umlauts neither in the regexp nor in a call to isalpha().
> Not sure if this changed in glibc 2.2. Changing the expression to
> "\\<[[:alpha:]äöüÄÖÜß']*\\>" helps a little, as most words are now recognised.
> The exceptions are those *starting* with an umlaut (like "ähnlich")...

Maybe the RE library is buggy.  AFAIK [:alpha:] is supposed to match alphabetic
characters with or without diacritical marks.  OTOH [a-z] merely enumerates
the characters between 'a' and 'z'; not quite the same thing.
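
To make the [:alpha:] versus [a-z] distinction concrete, here is a minimal
sketch using the POSIX regcomp()/regexec() calls.  The helper name
re_matches and the anchored test patterns are made up for this illustration;
they are not taken from Balsa.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if `text` matches the POSIX extended
   regex `pattern`, 0 if it does not match or the pattern fails to
   compile.  Character classes are interpreted in the current locale. */
int re_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Even in the plain "C" locale the two forms differ: "^[[:alpha:]']+$" accepts
"Don't", while "^[a-z']+$" rejects it because of the capital D.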

> Another problem might be the "empty word separator expression" (\< and \>).
> During the discussions about the URL regexps it emerged that there are
> probably more people around whose regexp implementation does not support this
> feature.

Just a thought ... why not use PCRE in Balsa?  It has a POSIX API as well as
its own, so no code changes are necessary.  RE syntax is the same as Perl's, so
you can rely on \b as marking a word boundary.  Unfortunately its character
class tables are generated at compile time, so it may not solve the [:alpha:]
thing.

> So I guess we should think about rewriting this part of the code, and
> maybe replace the regexec stuff with something hard-coded. However, if the
> isalpha implementation was not changed in recent glibcs, then we have the
> problem that we would have to hand-code all national character sets... Opinions?

isalpha() and friends are supposed to be affected by the LANG environment.
The same is supposed to be true for [:alpha:].  I suspect a hard-coded parser
using isalpha() might have the same problem given the same libc.
Maybe a program *must* call setlocale() for this to happen; I can't remember
offhand.
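
For what it's worth, ISO C says a program starts in the "C" locale and only
picks up LANG/LC_ALL once it calls setlocale() with an empty locale name.
A minimal sketch of the effect on isalpha() follows; count_alpha and
use_ctype_locale are names invented for this illustration, not anything
in Balsa.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Count the bytes of `s` that isalpha() accepts in the current LC_CTYPE
   locale.  The cast to unsigned char matters: passing a negative char
   (e.g. Latin-1 0xE4 where char is signed) to isalpha() is undefined. */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}

/* Hypothetical helper: switch LC_CTYPE to `name` ("" means "take it from
   the environment").  Returns 1 on success, 0 if the locale is not
   available on this system. */
int use_ctype_locale(const char *name)
{
    assert(name != NULL);
    return setlocale(LC_CTYPE, name) != NULL;
}
```

In the "C" locale count_alpha("\xe4hnlich") sees only the six ASCII letters.
After a successful use_ctype_locale("") with de_DE.ISO-8859-1 selected, the
leading 0xE4 should be counted as well, assuming that locale is installed.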

I dislike American spelling on my desktop so I set LANG to en_GB.  From
time to time I get irritating and unexpected side effects from this too,
compared to the C locale (e.g. sorting drives me mental).  Presumably
the POSIX committee saw fit to punish the world for having the temerity
not to speak American English.

Regards,
Brian Stafford



