Re: [Tracker] Automatic Language Detection
- From: jamie <jamiemcc blueyonder co uk>
- To: Edward Duffy <eduffy gmail com>
- Cc: Tracker List <tracker-list gnome org>
- Subject: Re: [Tracker] Automatic Language Detection
- Date: Tue, 06 Mar 2007 11:51:47 +0000
On Mon, 2007-03-05 at 21:55 -0500, Edward Duffy wrote:
On 3/5/07, jamie <jamiemcc blueyonder co uk> wrote:
On Mon, 2007-03-05 at 18:19 -0500, Edward Duffy wrote:
Hi Guys -
I just wrote a patch for #377891[1], could I get some of you to test
it. I ran some pdfs I found with google.fr and google.it, and it
seems to be working correctly...but more eyes the better.
Both from http://software.wise-guys.nl/libtextcat/languages.html
great stuff but we only support utf-8 - are all those language modules
utf-8 based?
"""Our main focus will be on compiling a list of fingerprints of UTF-8
encoded languages, since Unicode is clearly the way to go and UTF-8 is
usually the best way to do Unicode."""
It works (for my tests) if I encode the buffer to UTF-8 first, and
I've been able to get away with just sending the first 1K of the file.
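A rough sketch of the "encode to UTF-8, send only the first 1K" step described above, in Python rather than the patch's own code. The function name, the `source_encoding` parameter, and the use of an incremental decoder are assumptions for illustration; the patch itself may do this differently. The incremental decoder is there because a hard cut at 1024 bytes can land in the middle of a multi-byte character, and the partial trailing bytes should be dropped rather than raise an error.

```python
import codecs

SAMPLE_SIZE = 1024  # "the first 1K of the file", per the message above


def utf8_sample(raw: bytes, source_encoding: str, size: int = SAMPLE_SIZE) -> bytes:
    """Return the first `size` bytes of `raw`, re-encoded as UTF-8.

    `source_encoding` is assumed to be known by the caller (hypothetical
    parameter); nothing here guesses the charset.
    """
    chunk = raw[:size]
    # final=False makes the decoder hold back an incomplete trailing
    # multi-byte sequence instead of failing on it.
    decoder = codecs.getincrementaldecoder(source_encoding)()
    text = decoder.decode(chunk, final=False)
    return text.encode("utf-8")
```

The returned buffer would then be what gets handed to the language classifier.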
Before I accept the patch, can you:
1) include only the languages we have stopwords/stemmers for
2) check and verify each language we support with UTF-8 content
3) if (2) fails, use g_convert to convert the UTF-8 to the necessary charset
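One possible reading of the three points above, sketched in Python rather than C/GLib. Everything named here is hypothetical: `classify` stands in for whatever the language-detection library exposes, `supported` is the set of languages with stopwords/stemmers, and `model_charsets` maps a language to the charset its fingerprint was built from. Where this sketch uses `str.encode`, the C code would call g_convert.

```python
def detect_language(utf8_buf, classify, supported, model_charsets):
    """Return a supported language name for `utf8_buf`, or None.

    classify(buf) -> language name or None   (hypothetical callable)
    supported      -> set of language names  (point 1)
    model_charsets -> {language: charset}    (assumed lookup table)
    """
    # Point 2: try the UTF-8 content as-is first.
    lang = classify(utf8_buf)
    if lang in supported:
        return lang

    # Point 3: fall back to converting the UTF-8 content into each
    # charset that a supported language's model needs.
    text = utf8_buf.decode("utf-8")
    tried = {"utf-8"}
    for name in sorted(supported):
        charset = model_charsets.get(name, "utf-8").lower()
        if charset in tried:
            continue
        tried.add(charset)
        try:
            converted = text.encode(charset)
        except UnicodeEncodeError:
            continue  # content cannot be represented in this charset
        lang = classify(converted)
        if lang in supported:
            return lang
    return None
```

If every supported language turns out to have a UTF-8 fingerprint, the fallback loop never runs and the conversion step is not needed at all.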
I will fiddle with configure.ac once you have done the above,
jamie.