Re: [Tracker] Text extraction on text formats

Laurent Aguerreche wrote:
On Thursday 16 November 2006 at 22:53 +0100, Laurent Aguerreche wrote:
On Thursday 16 November 2006 at 22:46 +0100, Luca Ferretti wrote:
On Thu, 16/11/2006 at 21.36 +0100, Laurent Aguerreche wrote:
On Thursday 16 November 2006 at 18:55 +0000, Jamie McCracken wrote:
Luca Ferretti wrote:
I suspect that the RTF format is currently not handled by Tracker. We
should handle it, 'cause it's the only format supported by all word
processors. Read note [4] about metadata and non-ASCII characters.
The unrtf package in Debian/Ubuntu universe might help with this: it has a command-line tool that converts RTF to plain text. Anyone want to write a filter for this?

$ unrtf --text pooooo.rtf
This is UnRTF, version 0.19.2
By Dave Davey and Marcos Serrou do Amaral
Original Author: Zach T. Smith
Processing pooooo.rtf...
### Translation from RTF performed by UnRTF, version 0.19.2
### For information about this marvellous program,
### please go to
### document uses ANSI character set
### font table contains 4 fonts total
modello, ,schema,
AUTHOR: Luca Ferretti
### creaton date: 16 November 2006 15:29
### revision date: 1 January 1601
### last printed: 1 January 1601
### comments: StarWriter

Questo ?? un semplice esempio delle potenzialit?? di OO.o
       ^^ it was "è"                           ^^ it was "à"
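
A rough sketch of what such a filter could look like, not anything from Tracker's tree: call `unrtf --text` (the invocation shown above) and drop the `###` comment lines before passing the text on. The names `strip_unrtf_comments` and `rtf_to_text` are made up for illustration. Treating every line that starts with `###` as header noise is an assumption based only on the sample output above, and so is decoding the output as UTF-8.

```python
import subprocess


def strip_unrtf_comments(output: str) -> str:
    """Drop the '###' comment lines seen in the sample `unrtf --text` output."""
    kept = [line for line in output.splitlines() if not line.startswith("###")]
    return "\n".join(kept).strip()


def rtf_to_text(path: str) -> str:
    """Run `unrtf --text` on `path` and return the text without '###' lines."""
    result = subprocess.run(
        ["unrtf", "--text", path],
        capture_output=True,
        check=True,  # raise CalledProcessError if unrtf exits with an error
    )
    # Assumption: decode as UTF-8 and replace undecodable bytes instead of failing.
    text = result.stdout.decode("utf-8", errors="replace")
    return strip_unrtf_comments(text)
```

Lines without the `###` prefix, such as `AUTHOR:`, are left in place here; a real filter would probably want to route those to metadata instead of the indexed text.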

A question: what is the encoding of this string? UTF-8, ISO-something,
Win-something, etc.? With libgsf and an RTF file, I now see this:

Doc.Comment="Questo file altro non \303\250 che un esempio di modello di
file per OO.o Writer per testare l'indicizzazione di Tracker";

It is really strange... I have the same problem with some of my DOC
files, but I do not know whether it affects every DOC file.
I do not see any way to resolve the problem except to contact the authors
of libgsf. I tried "wv" with the wvSummary command to print the data that we
are looking for, but wv has the same problem and it also uses libgsf...
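
As a possible way to check the encoding question: the `\303\250` in the Doc.Comment value above looks like two bytes written as backslash-octal escapes. 0o303 and 0o250 are 0xC3 and 0xA8, which is the UTF-8 encoding of "è". The sketch below turns such escapes back into bytes and decodes them. The helper name `decode_octal_escapes` is hypothetical, and it assumes the escapes really are raw bytes printed in octal, which the thread does not confirm.

```python
import re


def decode_octal_escapes(text: str, encoding: str = "utf-8") -> str:
    """Replace backslash-octal escapes (e.g. \\303\\250) with the bytes they
    name, then decode the whole string with `encoding`."""
    raw = re.sub(
        rb"\\([0-3][0-7]{2})",                  # \000 .. \377, one byte each
        lambda m: bytes([int(m.group(1), 8)]),  # octal digits -> single byte
        text.encode(encoding),
    )
    return raw.decode(encoding)


print(decode_octal_escapes(r"altro non \303\250 che un esempio"))
# prints: altro non è che un esempio
```

Decoding the same two bytes as ISO-8859-1 instead gives "Ã¨", so garbage of that shape usually means UTF-8 bytes were read with a single-byte encoding.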

Nevertheless, I am sending a patch that makes tracker-extract print
non-empty metadata (yeah!) and improves its memory management.

I have applied it to CVS, thanks.

Mr Jamie McCracken
