Re: Unicode?

Hi Joe,

> So I am definitely not an expert in these matters.  But my
> understanding is that Mono internally uses UTF-16 as its Unicode

Well, yeah, kind of.  I'm no expert with C#, but it seems to mean
"here's a 16-bit type, have fun".  I'm hesitant to call that
"internal".  :-)  I think it's only slightly more true than saying "C
uses UTF-8 internally" (here's an 8-bit type, have fun).
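
To make the "16-bit type" point concrete, here is a rough Python 3 sketch of how one character from each script in my test file comes out in UTF-16 and UTF-8.  The helper name utf16_units and the three sample characters are my own picks for illustration, not anything from Mono or Beagle.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


samples = {
    "Latin": "a",
    "Georgian": "\u10d0",      # GEORGIAN LETTER AN, inside the BMP
    "Linear B": "\U00010000",  # LINEAR B SYLLABLE B008 A, outside the BMP
}

for name, ch in samples.items():
    units = [hex(u) for u in utf16_units(ch)]
    # Columns: script, code point, UTF-16 code units, UTF-8 byte count
    print(name, hex(ord(ch)), units, len(ch.encode("utf-8")))
```

The Latin and Georgian characters each fit in one 16-bit unit; the Linear B one needs two (a surrogate pair), so a single 16-bit "char" can't hold it.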

> the wire, in displaying the results, etc.  Also, I have no idea how
> Python handles Unicode data (the last time I used it heavily -- in
> 2004 or so -- it didn't handle it very well).

That's about the time I started using Python, and it works great for
me.  It has a 'unicode' (string only -- no chars in Python) type,
which can hold any Unicode string; you don't deal with encodings until
you want to do I/O.  So I'm confident that I'll be able to get Unicode
data from Beagle into Python, without too much trouble.
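
A minimal sketch of the "encodings only at I/O" idea.  It is written for Python 3, where the built-in str plays the role of the 'unicode' type; read_text and write_text are hypothetical helper names, and the sample bytes are just UTF-8 I made up for the example.

```python
def read_text(raw_bytes, encoding="utf-8"):
    """Decode once, at the input boundary."""
    return raw_bytes.decode(encoding)


def write_text(text, encoding="utf-8"):
    """Encode once, at the output boundary."""
    return text.encode(encoding)


# "abc", a Georgian letter (U+10D0) and a Linear B syllable (U+10000),
# as UTF-8 bytes such as might arrive from a file or a pipe.
raw = b"abc \xe1\x83\x90 \xf0\x90\x80\x80"

text = read_text(raw)

# In between, the program only sees characters, never bytes.
print(len(text))                     # number of characters, not bytes
print([hex(ord(ch)) for ch in text])

# Encoding again at the output boundary gives back the same bytes.
print(write_text(text) == raw)
```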

> If you search using the command-line program beagle-query, do you find
> the files?

I get the same result as using "Desktop Search"/beagle-search: works
for the Latin and Georgian, but no hits for the Linear B.

> As far as Beagle is concerned, by itself it doesn't deal with
> character encodings at all.  As far as underlying libs: GTK requires
> UTF-8; underneath it GLib deals with different Unicode versions.

Since C# doesn't really provide a "unicode character" type (only a
16-bit type for stuffing with UTF-16), a program that wants to fully
support Unicode might need to deal a little bit with one encoding
(UTF-16) itself.  But I'm new to Mono, and I'm not sure my previous
sentence is true.  :-)
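
For what "dealing with UTF-16 itself" would involve, here is a sketch (in Python, since that's what I know) of the standard surrogate-pair arithmetic: turning a list of 16-bit units back into whole code points.  The function name is made up, and this is only the general UTF-16 rule, not a claim about what Mono or Beagle actually do.

```python
def code_points_from_utf16(units):
    """Combine UTF-16 code units (ints) into Unicode code points (ints)."""
    points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # High surrogate followed by low surrogate: one non-BMP code point.
            low = units[i + 1]
            points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit (or an unpaired surrogate, passed through as-is).
            points.append(unit)
            i += 1
    return points


# 'a' followed by the surrogate pair for U+10000 (a Linear B syllable)
print([hex(p) for p in code_points_from_utf16([0x61, 0xD800, 0xDC00])])
```

Code that indexes or iterates a string one 16-bit unit at a time skips this step, and then sees the two halves of the pair as two separate "characters".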

> It's definitely possible that Lucene doesn't have any special handling
> of these characters.  You might want to try running
> beagle-extract-content on the file to see if the data is extracted

This extracts (as "Content:") the entire text of the file, in all 3
languages -- great!  So I would say the plain text filter (at least)
passes my characters correctly.

Also, in the "Desktop Search"/beagle-search window, at the bottom it
shows a preview of the text from the file; here it shows non-BMP
characters as "  " (2 spaces), just as I saw from a Python program.
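
One guess (and it is only a guess; I haven't looked at any Beagle code) at why it would be exactly 2 spaces: if something works on 16-bit units and blanks out each surrogate it doesn't recognize, then every non-BMP character turns into two spaces.  A sketch, with a hypothetical blank_surrogates function standing in for whatever is really doing it:

```python
def blank_surrogates(units):
    """Replace every surrogate code unit with a space (hypothetical sanitizer)."""
    return [0x20 if 0xD800 <= u <= 0xDFFF else u for u in units]


def units_to_text(units):
    """Render a list of BMP code units as a Python string."""
    return "".join(chr(u) for u in units)


# 'a', the surrogate pair for U+10000, then 'b'
units = [0x61, 0xD800, 0xDC00, 0x62]
preview = units_to_text(blank_surrogates(units))
print(repr(preview))
```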

I've never done any debugging of Beagle itself, but when I get home
tonight I'll try to narrow down how far my characters are getting
before getting converted to spaces.

- Ken
