Re: [Tracker] Full Text Search and find word at offset fts:offsets



Hi Mildred,


There is the special function fts:offsets that returns the offsets
where the query was found in a text property.

I was wondering, how can I later on take this offset and select an
extract to display where the text is found? I first thought that the
offset was a character offset and tried to split using that, but then
I noticed that didn't work. It's probably a word offset.

Do you know how I could split the text into words? Is there a
known algorithm (a regular expression?), or is there a special SPARQL
function that can return the list of words in the text, with all
punctuation and non-words removed?

Or better yet, is it possible to have an offset in bytes or characters?


I think you have asked *all* the right questions :-)

FTS, even though it is probably the best-known feature of Tracker on the
desktop, has been just a side feature for the core developers. It can even
be disabled at build time, which is what we do in our MeeGo
Harmattan builds. This paragraph is the excuse to justify why we haven't
given it more support lately :-)

As you say, and IIRC, fts:offsets returns the index of the words in the
text, without considering words that are not indexed (i.e. shorter than
the minimum), and without considering punctuation. That is definitely
not a good thing if you want to use fts:offsets, as you need to get
exactly the same list of words as parsed by the Unicode FTS parsers in
order to get the words to match. Currently there is no way of retrieving
that list of words, and anyway it would be quite costly to expose an API
which returns it (costly in performance if we build the list each time
it is requested; or costly memory-wise if we pregenerate that list and
store it in memory).
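
To illustrate the problem, here is a minimal Python sketch of what a
client would have to do today: re-tokenize the property value itself and
map a word offset back to a character span. Everything in it is an
assumption made for illustration: the regular expression, the
MIN_WORD_LENGTH value, the zero-based offset, and the helper names are
hypothetical and are not Tracker's parser or API. Whenever this
tokenization differs from what the real FTS parser does, the excerpt will
point at the wrong word.

```python
import re

# Assumption: stand-ins for the real parser's rules, which are Unicode-aware
# and configurable. These values are illustrative only.
MIN_WORD_LENGTH = 3
WORD_RE = re.compile(r"\w+")


def indexed_word_spans(text, min_len=MIN_WORD_LENGTH):
    """Return (start, end) character spans of the words that get counted.

    Punctuation is skipped and words shorter than min_len are dropped,
    mirroring the behaviour described in the message above.
    """
    return [m.span() for m in WORD_RE.finditer(text)
            if len(m.group()) >= min_len]


def excerpt_for_word_offset(text, word_offset, context=20,
                            min_len=MIN_WORD_LENGTH):
    """Return the text around the word at word_offset (assumed zero-based)."""
    spans = indexed_word_spans(text, min_len)
    if not 0 <= word_offset < len(spans):
        return None  # offset does not match our approximate word list
    start, end = spans[word_offset]
    return text[max(0, start - context):end + context]
```

Calling excerpt_for_word_offset(value, offset) with the property value and
one offset returns a short extract, or None when the offset falls outside
the approximated word list.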

For now, the best option would be for fts:offsets to return the byte
index of the words as found in the value of the properties. Note that
this byte index will not be the byte index of the word in the original
file, as the extraction depends on the file type.
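
If byte offsets were available, building the extract would be simple. The
Python sketch below assumes a hypothetical byte offset and word length
(fts:offsets does not return these today) and assumes the property value
is UTF-8 encoded. It also shows why byte and character offsets differ as
soon as the text contains non-ASCII characters.

```python
def excerpt_at_byte_offset(value, byte_offset, word_byte_len, context=16):
    """Return the text around a word located by its UTF-8 byte offset.

    byte_offset and word_byte_len are hypothetical inputs; they refer to
    the property value, not to the original file.
    """
    data = value.encode("utf-8")
    start = max(0, byte_offset - context)
    end = min(len(data), byte_offset + word_byte_len + context)
    # errors="ignore" drops a multi-byte character cut in half at an edge.
    return data[start:end].decode("utf-8", errors="ignore")


value = "café tracker"
char_index = value.index("tracker")                   # counts characters
byte_index = value.encode("utf-8").index(b"tracker")  # counts bytes
```

Here char_index and byte_index differ by one because "é" takes two bytes
in UTF-8, which is why splitting the string at a byte offset as if it were
a character offset gives the wrong extract.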

I believe Ottela also had some other concerns regarding fts:offsets, for
example when working with multi-valued properties.

Any help in improving FTS to make fts:offsets work better would be
highly appreciated.

-- 
Aleksander



