Re: [Tracker] Automatic Language Detection



On 3/5/07, jamie <jamiemcc blueyonder co uk> wrote:
On Mon, 2007-03-05 at 18:19 -0500, Edward Duffy wrote:
> Hi Guys -
>
> I just wrote a patch for #377891[1]. Could some of you test it?
> I ran some PDFs I found with google.fr and google.it through it, and
> it seems to be working correctly... but the more eyes the better.


Both from http://software.wise-guys.nl/libtextcat/languages.html
Great stuff, but we only support UTF-8. Are all those language modules
UTF-8 based?

"""Our main focus will be on compiling a list of fingerprints of UTF-8
encoded languages, since Unicode is clearly the way to go and UTF-8 is
usually the best way to do Unicode."""

It works (for my tests) if I encode the buffer to UTF-8 first, and
I've been able to get away with just sending the first 1K of the file.
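
A minimal sketch of the "encode to UTF-8, send only the first 1K" step
described above, not the code from the patch. The function name
utf8_sample, the source_encoding parameter and the 1024-byte default are
assumptions made for illustration. One detail worth noting: cutting a
UTF-8 buffer at a fixed byte count can split a multi-byte character, so
the sketch trims back to the last complete character.

```python
def utf8_sample(raw: bytes, source_encoding: str, limit: int = 1024) -> bytes:
    """Return at most `limit` bytes of UTF-8, cut on a character boundary.

    Hypothetical helper: `raw` is the extracted text in `source_encoding`.
    """
    utf8 = raw.decode(source_encoding).encode("utf-8")
    if len(utf8) <= limit:
        return utf8
    # A plain slice may end in the middle of a multi-byte character;
    # decoding with errors="ignore" drops that incomplete tail.
    return utf8[:limit].decode("utf-8", errors="ignore").encode("utf-8")
```

The returned bytes would then be what gets handed to the language
classifier instead of the whole file.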

Also of interest is detecting CJK languages so we can automatically use
Pango to word-break them.

After running about a dozen (supposedly) Japanese PDFs through it with
no luck, I saw this:

"""We were told that the East Asian language models (notably Chinese,
Korean, Japanese) may be less than adequate because of white space
issues. If you are a native speaker, you might be able to shed some
light on this issue."""

So... no, for now.
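
One possible fallback, sketched here as an assumption and not as
anything the patch or libtextcat does: since the n-gram models struggle
with text that has no whitespace, CJK could instead be flagged by
counting characters in the relevant Unicode blocks. The names cjk_ratio
and looks_cjk and the 0.3 threshold are hypothetical, and the block list
is deliberately small (Hiragana, Katakana, CJK Unified Ideographs,
Hangul Syllables).

```python
# Unicode blocks treated as CJK in this sketch (inclusive code point ranges).
CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul Syllables
)


def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in CJK_RANGES."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for c in chars
        if any(lo <= ord(c) <= hi for lo, hi in CJK_RANGES)
    )
    return hits / len(chars)


def looks_cjk(text: str, threshold: float = 0.3) -> bool:
    """Guess whether text is mostly CJK; the threshold is an arbitrary choice."""
    return cjk_ratio(text) >= threshold
```

This only says "some CJK script is present", not which language it is,
but that may be enough to decide whether to use Pango for word breaking.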



jamie.






