[Tracker] Text extraction on text formats



I'm trying to check, and possibly expand, the info in
http://live.gnome.org/Tracker/SupportedFormats.

So I'm planning to create files in various formats, then search for text
inside them.

############ Test Procedure ###

I used the "stable" version (0.5.1). I have the CVS version installed
too, and I'll test it later.

So far I have tested some word processor document formats: I wrote a
one-line document in OO.o Writer (the one in Ubuntu Edgy) and saved it
in various formats. The file has some content and some metadata (the
kind you can add in File->Properties).

The exact procedure is:
     1. create the ODT file
     2. save it and close OO.o
     3. open the ODT file
     4. use File -> Save As ..
     5. choose a different format
     6. save the file in the new "alien" format
     7. close the file and OO.o
     8. repeat from step 3 for the next format

Then I searched with `tracker-search` at least twice for each file:
once for a word that appears only in the file content ("potenzialità"),
and once for a word that appears only in the file metadata
("particolare"). Of course, I wrote this file in Italian.
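
The two searches per file can be scripted. This is only a rough sketch:
it assumes that `tracker-search TERM` prints the paths of the matching
files on stdout (I have not checked the exact output format), and the
helper names and the file name "test.odt" are made up for illustration.

```python
import subprocess


def default_runner(term):
    """Run `tracker-search TERM` and return whatever it prints on stdout."""
    result = subprocess.run(
        ["tracker-search", term], capture_output=True, text=True, check=False
    )
    return result.stdout


def check_file(filename, content_word, metadata_word, runner=default_runner):
    """Report whether `filename` appears in the hits for each search word.

    Assumption: the search output contains the file name of every hit.
    """
    return {
        "content": filename in runner(content_word),
        "metadata": filename in runner(metadata_word),
    }


if __name__ == "__main__":
    # "test.odt" is a hypothetical name for the test document.
    print(check_file("test.odt", "potenzialità", "particolare"))
```

Passing a different `runner` makes it possible to try the logic without
a running tracker daemon.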

############# Test Results ###

ODT (OpenDocument Text)
  content:              yes
  metadata:             yes [1]
  extra:                keywords metadata are auto-tagged

OTT (OpenDocument Text Template)
  content:              no (????)
  metadata:             yes [1]
  extra:                as above

SXW (OpenOffice 1.x Text)
  content:              yes
  metadata:             no

STW (OpenOffice 1.x Text Template)
  content:              no
  metadata:             no

DOC (Word 97/2000/XP | Word 95 | Word 6.0)
  content:              yes [2]
  metadata:             no  [3]

RTF (Rich Text Format)
  content:              no  [4]
  metadata:             no  [4]

########### Conclusions ###

I suspect that the RTF format is currently not handled by Tracker. We
should handle it, because it's the only format supported by all word
processors. See note [4] about metadata and non-ASCII characters.

Extraction of metadata seems to work only for ODT and OTT formats.

Moreover, I don't understand why Tracker doesn't extract content from
OpenDocument and OpenOffice templates. Is it a format design choice or a
Tracker issue?

Finally, it could be interesting to investigate the core dump (read:
crash) that occurs every time DOC files are created or touch-ed. Is it
possible to run tracker-extract directly with some debug options?

I'll update the page on the gnome.org wiki, but first I'd like someone
to perform the same test, using my file (attached) or a custom one
(maybe in another language).

It could also be interesting to try indexing DOC files created directly
in MS Word (which I don't have on my computer).

########### Notes ###

[1] In File->Properties there are 2 tabs for metadata. Metadata in the
User tab is ignored by Tracker.

[2] Tail-ing ~/.Tracker/tracker.lo, I see that a core file is created
when the DOC file is saved. This could be caused by tracker-extract.

[3] Note that it seems that metadata in the User tab is not saved in DOC.

[4] Note that a) after saving the original ODT as RTF, the metadata is
no longer available in File->Properties, although it seems to be saved
in the RTF file (\info -> \subject, \keywords and \doccomm), and b) the
accented letters used in Italian (such as à, è, ì, ò, ù) are replaced
in RTF with some \XXX code.
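
To illustrate what an RTF extractor would have to do for note [4], here
is a minimal sketch. It assumes the escapes are the common \'hh hex form
and that the file uses the Windows-1252 code page; a real file may
declare another code page or use \uN escapes, which this does not
handle. The sample string and the function names are invented for the
example, it is not taken from my attached file.

```python
import re

# Matches RTF hex escapes such as \'e0 (a backslash, a quote, two hex digits).
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_rtf_escapes(text, codepage="cp1252"):
    r"""Replace every \'hh escape with the character it encodes."""
    return HEX_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 16)]).decode(codepage, errors="replace"),
        text,
    )


def info_field(rtf, name):
    r"""Return the decoded text of a flat {\name ...} group, or None."""
    match = re.search(r"\{\\" + re.escape(name) + r" ([^{}]*)\}", rtf)
    return decode_rtf_escapes(match.group(1)) if match else None


# Hypothetical, heavily simplified RTF fragment.
sample = (
    r"{\rtf1\ansi{\info{\subject particolare}{\keywords qualit\'e0}}"
    r"potenzialit\'e0}"
)

if __name__ == "__main__":
    print(info_field(sample, "subject"))
    print(info_field(sample, "keywords"))
    print(decode_rtf_escapes(r"potenzialit\'e0"))
```

The regular expression only copes with flat groups; nested groups inside
\info would need a real RTF parser.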



