Re: Unicode?



Hi Ken,

On 5/27/07, Ken Harris <kengruven gmail com> wrote:
> For my Python/GTK/libbeagle program, I want to support Unicode fully
> (ha), so I spent some time learning how to work with Unicode in C#
> (where 'char' is only 16 bits -- d'oh!) for my Beagle filter.  I
> thought I had it all figured out...

So I am definitely not an expert in these matters.  But my
understanding is that Mono internally uses UTF-16 as its Unicode
representation.

When working with native libraries, by default Mono converts to UTF-8
when passing strings.  GTK, which is the widget toolkit that
beagle-search uses, requires that strings be in UTF-8.
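To make the difference concrete, here's a small sketch (Python 3 shown just for convenience) of what a non-BMP character looks like in each encoding:

```python
# U+10000 (LINEAR B SYLLABLE B008 A) -- the first character outside the BMP.
ch = "\U00010000"

# In UTF-16 (Mono's internal representation) it takes two 16-bit code
# units, the surrogate pair D800 DC00:
assert ch.encode("utf-16-be").hex() == "d800dc00"

# In UTF-8 (what GTK requires) it takes four bytes:
assert ch.encode("utf-8") == b"\xf0\x90\x80\x80"

# A BMP character like Georgian U+10D0 is one UTF-16 code unit (2 bytes)
# but three UTF-8 bytes:
ka = "\u10d0"
assert len(ka.encode("utf-16-be")) == 2
assert len(ka.encode("utf-8")) == 3
```

So the conversion itself is well-defined either way; the question is whether every layer in between does it correctly.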

> When I couldn't make it work, I just made a plain text file with 3
> Latin characters, 3 Georgian characters, and 3 Linear B (i.e.,
> non-BMP) characters, and saved it as UTF-8.  Then I fired up "Desktop
> Search" / "beagle-search" (every app under GNOME seems to have two
> names!) and tried searching by each triple.  As I feared, Latin and
> Georgian worked, but Linear B didn't.  (From Python, it looks like
> U+10000 is coming out as 2 ASCII spaces.)
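(As an aside, a test file like that can be regenerated in a couple of lines of Python, which also rules out any editor/save issues; the filename and character choices here are just examples:)

```python
# Three Latin, three Georgian, and three Linear B (non-BMP) characters,
# space-separated, saved as UTF-8.
latin = "abc"
georgian = "\u10d0\u10d1\u10d2"              # U+10D0..U+10D2
linear_b = "\U00010000\U00010001\U00010002"  # U+10000..U+10002

with open("unicode-test.txt", "w", encoding="utf-8") as f:
    f.write(" ".join([latin, georgian, linear_b]) + "\n")
```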

The big question here is: what part of the search is failing?  There
are lots of places this could be failing: in trying to analyze the
characters into words, in the conversion to UTF-8 for sending it over
the wire, in displaying the results, etc.  Also, I have no idea how
Python handles Unicode data (the last time I used it heavily -- in
2004 or so -- it didn't handle it very well).
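The "2 ASCII spaces" symptom smells like the surrogate pair being mishandled somewhere along the way.  On a UTF-16-based runtime (and on "narrow" Python builds), U+10000 is stored as two code units, so any code that treats each code unit as a character sees two bogus "characters" instead of one.  A quick sketch of that arithmetic (Python 3):

```python
# UTF-16 encodes U+10000 as the surrogate pair D800 DC00:
cp = 0x10000
v = cp - 0x10000
high = 0xD800 + (v >> 10)     # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)    # low (trail) surrogate
assert (high, low) == (0xD800, 0xDC00)

# Code-unit-by-code-unit processing sees two units, not one character:
units = "\U00010000".encode("utf-16-be")
assert len(units) // 2 == 2
```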

If you search using the command-line program beagle-query, do you find
the files?

> Does Beagle not support Unicode >3.0 yet?  Is somebody working on it
> already?  Do Beagle's dependencies (like Lucene or Gtk#) handle newer
> Unicode versions?  (Hopefully it can be upgraded piecemeal, and not
> one-huge-change-all-at-once.)

As far as Beagle is concerned, it doesn't deal with character
encodings at all by itself.  As for the underlying libraries: GTK
requires UTF-8, and underneath it GLib is what tracks the different
Unicode versions.  Looking at the ChangeLog, GLib has had Unicode 3.0
support since 2.0.  (Unicode 4.1 support was added in October 2005 and
shipped in 2.10.0; 5.0 was added in July 2006 and went into 2.12.2.)
So we should be fine there.

It's definitely possible that Lucene doesn't have any special handling
of these characters.  You might want to try running
beagle-extract-content on the file to see if the data is extracted
reasonably.

Joe
