Re: [Tracker] Text extraction on text formats

From: Laurent Aguerreche <laurent aguerreche free fr>
To: Jamie McCracken <jamiemcc blueyonder co uk>
Cc: Tracker List <tracker-list gnome org>
Subject: Re: [Tracker] Text extraction on text formats
Date: Thu, 16 Nov 2006 22:23:01 +0100

On Thursday 16 November 2006 at 18:55 +0000, Jamie McCracken wrote:

Luca Ferretti wrote:

I'm trying to check and, where needed, expand the info at
http://live.gnome.org/Tracker/SupportedFormats.

So I'm planning to create files of various formats, then search for text
inside them. 

############ Test Procedure ###

I used the "stable" version (0.5.1); I have the CVS version installed
too and will test it later.

So far I have tested some word processor document formats: I wrote a
one-line document in OO.o Writer (the one in Ubuntu Edgy) and saved it in
various formats. The file has content and some metadata (the kind you
can add in File->Properties).

The exact procedure is:
     1. create the ODT file
     2. save it and close OO.o
     3. open the ODT file
     4. use File -> Save As ..
     5. choose a different format
     6. save the file in new "alien" format
     7. close the file and OO.o
     8. restart from #3

Then I searched with `tracker-search` at least twice for each file:
once for a word that appears only in the file content ("potenzialità"),
once for a word that appears only in the file metadata ("particolare").
Of course, I wrote this file in Italian.
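
The search step above could be scripted along these lines. This is only a
sketch: the file names are hypothetical, and it assumes tracker-search
prints one matching path per line, which may not match its real output.

```python
import subprocess

# Hypothetical test files, one per format saved from OO.o Writer.
TEST_FILES = ["test.odt", "test.ott", "test.sxw", "test.stw", "test.doc", "test.rtf"]

# One word that appears only in the content, one only in the metadata.
WORDS = {"content": "potenzialità", "metadata": "particolare"}


def file_in_results(output, filename):
    """Return True if some output line is a path whose last component is filename."""
    return any(line.strip().rsplit("/", 1)[-1] == filename
               for line in output.splitlines())


def run_search(word):
    """Run tracker-search for one word and return its standard output."""
    result = subprocess.run(["tracker-search", word],
                            capture_output=True, text=True)
    return result.stdout


def check_all(search=run_search):
    """Return {filename: {"content": bool, "metadata": bool}}."""
    # Search once per word, then look for every test file in each result.
    outputs = {kind: search(word) for kind, word in WORDS.items()}
    return {
        name: {kind: file_in_results(out, name) for kind, out in outputs.items()}
        for name in TEST_FILES
    }
```

Passing a different `search` callable to `check_all` makes it possible to
try the matching logic without a running trackerd.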

############# Test Results ###

ODT (OpenDocument Text)
  content:                  yes
  metadata:                 yes [1]
  extra:                    keywords metadata are auto-tagged

OTT (OpenDocument Text Template)
  content:                  no (????)
  metadata:                 yes [1]
  extra:                    as above

SXW (OpenOffice 1.x Text)
  content:                  yes
  metadata:                 no

STW (OpenOffice 1.x Text Template)
  content:                  no
  metadata:                 no

DOC (Word 97/2000/XP | Word 95 | Word 6.0)
  content:                  yes [2]
  metadata:                 no  [3]

RTF (Rich Text Format)
  content:                  no  [4]
  metadata:                 no  [4]

########### Conclusions ###

I suspect that the RTF format is currently not handled by tracker. We
should handle it, because it's the only format supported by all word
processors. See note [4] about metadata and non-ASCII characters.


the unrtf package in debian/ubuntu universe might help with this - it has
a command line tool to convert to plain text - anyone wanna write a filter
for this?
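
A filter around unrtf could start from something like the sketch below. It
is an illustration, not Tracker's filter interface: the function names are
made up, and the header stripping assumes unrtf's text output opens with
"###" comment lines followed by a dashed separator.

```python
import subprocess


def build_unrtf_command(path):
    """Command line asking unrtf for plain-text output on stdout."""
    return ["unrtf", "--text", path]


def strip_unrtf_header(output):
    """Drop leading '###' comment lines, blank lines and dashed separators."""
    lines = output.splitlines()
    i = 0
    while i < len(lines) and (
        lines[i].startswith("###")
        or not lines[i].strip()
        or set(lines[i].strip()) == {"-"}
    ):
        i += 1
    return "\n".join(lines[i:])


def extract_rtf_text(path):
    """Run unrtf on path and return the extracted text."""
    result = subprocess.run(build_unrtf_command(path),
                            capture_output=True, check=True)
    # errors="replace" keeps the filter from failing on unexpected bytes.
    return strip_unrtf_header(result.stdout.decode("utf-8", errors="replace"))
```

Note that a document whose own text begins with a line of dashes would lose
that line too; a real filter would need a stricter way to find the end of
the header.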


Extraction of metadata seems to work only for ODT and OTT formats.


that might be a libgsf limitation


Moreover I don't understand why Tracker doesn't extract contents for
OpenDocument and OpenOffice templates. Is it a format design choice or a
tracker issue?


dunno - I guess they have different mime types?

could be an easy matter to sort out (just copy the filter)
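
To illustrate the guess: the templates do have their own mime types, so a
lookup keyed on the document types alone would miss them. The filter table
and filter names below are hypothetical and only show the idea of reusing
the same filter; they are not how Tracker actually registers filters.

```python
# Registered mime types for the four OO.o / OpenDocument text formats.
MIME_TYPES = {
    "odt": "application/vnd.oasis.opendocument.text",
    "ott": "application/vnd.oasis.opendocument.text-template",
    "sxw": "application/vnd.sun.xml.writer",
    "stw": "application/vnd.sun.xml.writer.template",
}

# Hypothetical filter table: only the document types are registered.
FILTERS = {
    MIME_TYPES["odt"]: "opendocument-text-filter",
    MIME_TYPES["sxw"]: "openoffice1-text-filter",
}


def with_template_aliases(filters):
    """Return a copy where each template type reuses its document filter."""
    aliased = dict(filters)
    for doc, template in (("odt", "ott"), ("sxw", "stw")):
        if MIME_TYPES[doc] in filters:
            # setdefault keeps any filter already registered for the template.
            aliased.setdefault(MIME_TYPES[template], filters[MIME_TYPES[doc]])
    return aliased
```

With the original table, a lookup for the OTT mime type finds nothing; after
`with_template_aliases(FILTERS)` it resolves to the same filter as ODT.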


Finally, it could be interesting to investigate the core dump (read:
crash) that occurs every time DOC files are created or touch-ed. Is it
possible to execute tracker-extract directly with some debug options?


no but the errors are probably occurring in libgsf if that's any help?

run tracker-extract under gdb and make sure the last param (mime) is
"application/msword"
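
One way to get a backtrace non-interactively is sketched below. It assumes
tracker-extract takes the file path followed by the mime type as its last
argument, as suggested above; the helper name is made up.

```python
import subprocess


def gdb_backtrace_command(path, mime="application/msword"):
    """Build a gdb command that runs tracker-extract and prints a backtrace."""
    return [
        "gdb", "--batch",      # exit once the commands below have run
        "-ex", "run",          # start the program
        "-ex", "bt",           # print the backtrace after it stops
        "--args", "tracker-extract", path, mime,
    ]


def run_under_gdb(path):
    """Run the command and return gdb's combined output as text."""
    result = subprocess.run(gdb_backtrace_command(path),
                            capture_output=True, text=True)
    return result.stdout + result.stderr
```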


tracker-extract-msoffice misses a call to g_type_init()... but I still
get an assertion error. I am still investigating.


I'll update the page on the gnome.org wiki, but first I'd like someone
to run the same test using my file (attached) or a custom one (maybe in
another language).

It could also be interesting to try indexing DOC files created directly
in MS Word (not on my computer).

########### Notes ###

[1] In File->Properties there are 2 tabs for metadata. Metadata in the
User tab is ignored by tracker.


we only extract specific metadata not any old junk. We have File.Other 
as a dumping ground though for misc crap that needs to be indexed


[2] tail-ing ~/.Tracker/tracker.lo I see that a core file is created when
the doc file is saved. This could be caused by tracker-extract


most likely

Attachment: signature.asc
Description: This is a digitally signed message part

References:
- [Tracker] Text extraction on text formats
  - From: Luca Ferretti
- Re: [Tracker] Text extraction on text formats
  - From: Jamie McCracken
