[Tracker] Text extraction on text formats
- From: Luca Ferretti <elle uca libero it>
- To: Tracker List <tracker-list gnome org>
- Subject: [Tracker] Text extraction on text formats
- Date: Thu, 16 Nov 2006 16:57:40 +0100
I'm trying to check, and possibly expand, the information at
http://live.gnome.org/Tracker/SupportedFormats.
To do so, I plan to create files in various formats and then search for
text inside them.
############ Test Procedure ###
I used the "stable" version (0.5.1); I also have the CVS version
installed, and I'll test it later.
So far I have tested some word processor document formats: I wrote a
one-line document in OO.o Writer (the one in Ubuntu Edgy) and saved it in
various formats. The file has some content and some metadata (the kind
you can add in File->Properties).
The exact procedure is:
1. create the ODT file
2. save it and close OO.o
3. open the ODT file
4. use File -> Save As ..
5. choose a different format
6. save the file in the new "alien" format
7. close the file and OO.o
8. restart from #3
Then I searched with `tracker-search` at least twice for each file:
once for a word that appears only in the file content ("potenzialità"),
and once for a word that appears only in the file metadata
("particolare"). The file is, of course, written in Italian.
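For anyone who wants to repeat the test, the two searches per file can be scripted. The following is only a rough sketch: it assumes that `tracker-search <word>` prints one matching path per line (not verified for every Tracker version), and the file names in `TEST_FILES` are hypothetical placeholders for the files saved in each format.

```python
import subprocess

# Words from the test above: one appears only in the document body,
# the other only in the metadata.
QUERIES = {"content": "potenzialità", "metadata": "particolare"}

# Hypothetical file names, one per saved format.
TEST_FILES = ["test.odt", "test.ott", "test.sxw", "test.stw", "test.doc", "test.rtf"]


def found_in_output(output, filename):
    """Return True if any line of the search output ends with the file name."""
    return any(line.strip().endswith(filename) for line in output.splitlines())


def build_matrix(outputs, filenames):
    """Map each file name to {'content': bool, 'metadata': bool}.

    `outputs` maps a query kind to the raw text printed by the search tool.
    """
    return {
        name: {kind: found_in_output(text, name) for kind, text in outputs.items()}
        for name in filenames
    }


def run_searches():
    """Run `tracker-search <word>` once per query and collect its output.

    Assumption: the tool takes the search term as its only argument and
    prints one result per line.
    """
    outputs = {}
    for kind, word in QUERIES.items():
        result = subprocess.run(
            ["tracker-search", word], capture_output=True, text=True
        )
        outputs[kind] = result.stdout
    return outputs
```

With Tracker installed and the files indexed, `build_matrix(run_searches(), TEST_FILES)` would give a yes/no table comparable to the results below.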
############# Test Results ###
ODT (OpenDocument Text)
content: yes
metadata: yes [1]
extra: keywords metadata are auto-tagged
OTT (OpenDocument Text Template)
content: no (????)
metadata: yes [1]
extra: as above
SXW (OpenOffice 1.x Text)
content: yes
metadata: no
STW (OpenOffice 1.x Text Template)
content: no
metadata: no
DOC (Word 97/2000/XP | Word 95 | Word 6.0)
content: yes [2]
metadata: no [3]
RTF (Rich Text Format)
content: no [4]
metadata: no [4]
########### Conclusions ###
I suspect that the RTF format is currently not handled by Tracker. We
should handle it, because it's the only format supported by all word
processors. See note [4] about metadata and non-ASCII characters.
Extraction of metadata seems to work only for the ODT and OTT formats.
Moreover, I don't understand why Tracker doesn't extract content from
OpenDocument and OpenOffice templates. Is it a format design choice or a
Tracker issue?
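One way to narrow down the template question is to look inside the files directly. OpenDocument files are zip packages containing a `mimetype` entry and a `content.xml`, and this holds for templates as well; ODT declares `application/vnd.oasis.opendocument.text`, while OTT declares `application/vnd.oasis.opendocument.text-template`. The sketch below only checks that the body text is present in a package. It says nothing about how Tracker itself extracts text, and `package_info` is a name made up for this example.

```python
import io
import re
import zipfile


def package_info(data):
    """Return (mimetype, body_text) for an OpenDocument package given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        mimetype = package.read("mimetype").decode("ascii").strip()
        xml = package.read("content.xml").decode("utf-8")
    # Crude tag stripping: enough to check whether a word is present.
    text = re.sub(r"<[^>]+>", " ", xml)
    return mimetype, " ".join(text.split())
```

Calling it with the bytes of the ODT and OTT test files, e.g. `package_info(open("test.ott", "rb").read())` with a hypothetical file name, shows whether the two differ only in the declared mimetype. If the word is present in the template's `content.xml`, the missing content is more likely a Tracker issue than a property of the format.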
Finally, it could be interesting to investigate the core dump (read:
crash) that occurs every time a DOC file is created or touch-ed. Is it
possible to run tracker-extract directly with some debug options?
I'll update the page on the gnome.org wiki, but first I'd like someone
to perform the same test using my file (attached) or a custom one
(maybe in another language).
It could also be interesting to try indexing DOC files created directly
in MS Word (which I don't have on my computer).
########### Notes ###
[1] In File->Properties there are two tabs for metadata. Metadata in the
User tab is ignored by Tracker.
[2] Tail-ing ~/.Tracker/tracker.lo, I see that a core file is created
when the DOC file is saved. This could depend on tracker-extract.
[3] Note that it seems that metadata in the User tab is not saved in DOC.
[4] Note that a) after saving the original ODT as RTF, the metadata is no
longer available in File->Properties, although it seems to be saved in
the RTF file (\info -> \subject, \keywords and \doccomm), and b) the
accented letters (àèìòù) used in Italian are replaced in RTF with some
\XXX code.
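To illustrate note [4], here is a minimal sketch of reading those fields back out of an RTF file. It assumes that the escapes are the standard RTF `\'hh` form (a backslash, an apostrophe and two hex digits), that the bytes are in the Windows-1252 code page, and that each field is a simple group such as `{\subject ...}` with no nested groups. The output of OO.o may differ from these assumptions, and the function names are invented for this example.

```python
import re

# Matches the RTF hex escape: backslash, apostrophe, two hex digits.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_rtf_text(raw, codepage="cp1252"):
    """Replace RTF hex escapes with the characters they encode."""
    return HEX_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 16)]).decode(codepage, errors="replace"),
        raw,
    )


def rtf_info(rtf, fields=("subject", "keywords", "doccomm")):
    """Return the values of the requested info fields found in the RTF text."""
    found = {}
    for field in fields:
        # Look for a flat group: open brace, backslash, field name, space, text.
        match = re.search(r"\{\\%s ([^{}]*)\}" % field, rtf)
        if match:
            found[field] = decode_rtf_text(match.group(1))
    return found
```

A search word containing "à" would only match if the extractor performs a decoding step like `decode_rtf_text` before indexing.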