Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

From: Martyn Russell <martyn lanedo com>
To: jamie mccrack gmail com
Cc: "Tracker (devel)" <tracker-list gnome org>, Jamie McCracken <jamie mccrack googlemail com>
Subject: Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
Date: Mon, 26 Apr 2010 09:54:33 +0100

On 25/04/10 21:59, Jamie McCracken wrote:

On Sun, 2010-04-25 at 22:34 +0200, Aleksander Morgado wrote:

Hi Jamie,

I think it makes sense to fix this. Just to be clear, does this mean we
don't need Pango in libtracker-fts/tracker-parser.c to determine word
breaks for CJK?


That's not broken, so I would not recommend trying to "fix" that

Well, given the details Aleksander demonstrated previously in this thread, word breaking for Chinese symbols is broken and yes, that should be fixed.

I think it is silly to use 2 different libraries to do the same thing and if one does things better than another...

IMHO, the tracker_text_normalize() in the extractor should just do UTF-8
validation. It should not attempt word breaking as that's CPU expensive
and being done by the parser already

Well, extraction already is pretty expensive. I see your point there, but also it doesn't make sense to send n bytes over d-bus that won't be used either. So really it is the lesser of two evils. Currently we do push a lot of data over d-bus.
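
For illustration, a minimal sketch of what "just UTF-8 validation" in the extractor could look like. This is not the actual tracker_text_normalize() code; the function name valid_utf8_prefix is hypothetical, and Python is used only to keep the example short. It keeps the longest valid UTF-8 prefix of a buffer, similar in spirit to what GLib's g_utf8_validate() reports through its end pointer, and does no word breaking at all.

```python
def valid_utf8_prefix(data: bytes) -> bytes:
    """Return the longest prefix of `data` that is valid UTF-8.

    Hypothetical helper: it only validates, it never looks for word breaks.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return data[:err.start]


if __name__ == "__main__":
    raw = "h\u00e9llo".encode("utf-8") + b"\xff" + b"trailing"
    print(valid_utf8_prefix(raw).decode("utf-8"))
```

The cost is a single linear pass over the bytes, which is the reason given in the quoted text for keeping the extractor to validation only.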

But then how can we limit the extracted text based on the number of
words?


Well, IMHO it should be limited by bytes in the extractor, not words (as
per 0.6.x) - this is cheap and works well

This can also be way off if you consider that the average length of a word can be ~50:


  http://blogamundo.net/lab/wordlengths/

The parser will do the word limits when it breaks/normalizes them

So really we just need to guesstimate the bytes to extract if a word limit is
specified - the extractor does not need to be precise here and if you
assumed, say, the average byte count of a word was 20 bytes then you will
probably be ok. If the extractor extracts too many words the parser will
still limit it to the precise number of words so no harm is done

Apart from sending a lot more information over d-bus and spending more time extracting it for each file.
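
The byte-budget idea quoted above can be sketched roughly as follows. All names here (byte_budget, truncate_utf8, limit_words) are hypothetical and are not Tracker functions; the 20 bytes per word figure is the guess from the quoted text, and the whitespace split is only a stand-in for the parser's real word breaking.

```python
ASSUMED_BYTES_PER_WORD = 20  # guess taken from the thread, not a measured value


def byte_budget(word_limit, bytes_per_word=ASSUMED_BYTES_PER_WORD):
    """Extractor side: estimate how many bytes to extract for a word limit."""
    return word_limit * bytes_per_word


def truncate_utf8(data, max_bytes):
    """Cut `data` to at most `max_bytes` without splitting a UTF-8 character."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # If the first excluded byte is a continuation byte (0b10xxxxxx), the cut
    # falls inside a character, so back up to that character's lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


def limit_words(text, word_limit):
    """Parser side stand-in: enforce the exact word limit after the fact."""
    return text.split()[:word_limit]


if __name__ == "__main__":
    document = ("na\u00efve caf\u00e9 " * 100).encode("utf-8")
    chunk = truncate_utf8(document, byte_budget(5))
    print(len(chunk), limit_words(chunk.decode("utf-8"), 5))
```

The trade-off discussed in the thread shows up directly: the extractor over-extracts whenever real words are shorter than the assumed average, and the extra bytes cross d-bus before the parser discards them.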

Of course others may have other ideas but it does sound daft to me to
word break everything twice


That I agree with.

--
Regards,
Martyn

