Re: [Tracker] Text extraction on text formats
- From: Luca Ferretti <elle uca libero it>
- To: Laurent Aguerreche <laurent aguerreche free fr>
- Cc: Tracker List <tracker-list gnome org>
- Subject: Re: [Tracker] Text extraction on text formats
- Date: Thu, 16 Nov 2006 22:46:23 +0100
On Thu, 16/11/2006 at 21.36 +0100, Laurent Aguerreche wrote:
On Thursday 16 November 2006 at 18:55 +0000, Jamie McCracken wrote:
Luca Ferretti wrote:
I suspect that the RTF format is currently not handled by Tracker. We
should handle it, because it's the only format supported by all word
processors. See note [4] about metadata and non-ASCII characters.
The unrtf package in Debian/Ubuntu universe might help with this: it has
a command-line tool to convert to plain text. Anyone want to write a
filter for this?
hum,
$ unrtf --text pooooo.rtf
This is UnRTF, version 0.19.2
By Dave Davey and Marcos Serrou do Amaral
Original Author: Zach T. Smith
Processing pooooo.rtf...
### Translation from RTF performed by UnRTF, version 0.19.2
### For information about this marvellous program,
### please go to http://www.gnu.org/software/unrtf/unrtf.html
### document uses ANSI character set
### font table contains 4 fonts total
modello, ,schema,
AUTHOR: Luca Ferretti
### creaton date: 16 November 2006 15:29
### revision date: 1 January 1601
### last printed: 1 January 1601
### comments: StarWriter
-----------------
Questo ?? un semplice esempio delle potenzialit?? di OO.o
       ^^ it was "è"                           ^^ it was "à"
It is not possible to remove extra output (except with shell commands of
course)... Perhaps we should extract some parts of its source code.
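As a rough illustration of the "shell commands" route, here is a minimal Python sketch of such a filter. It assumes the text output always has the layout shown above for 0.19.2: "###" comment lines and metadata, then a line made only of dashes, then the document body. The function names are hypothetical, and the layout may differ in other unrtf versions.

```python
import subprocess


def strip_unrtf_preamble(output):
    """Return only the document body from `unrtf --text` output.

    Assumption: the body starts after the first line made only of dashes,
    as in the 0.19.2 output quoted in this thread.
    """
    lines = output.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if len(stripped) >= 5 and set(stripped) == {"-"}:
            return "\n".join(lines[i + 1:]).strip()
    # Fallback when no separator is found: drop the "###" comment lines only.
    return "\n".join(l for l in lines if not l.startswith("###")).strip()


def rtf_to_text(path):
    """Run `unrtf --text` on a file and return the body text (hypothetical helper)."""
    result = subprocess.run(
        ["unrtf", "--text", path], capture_output=True, check=True
    )
    # The output encoding is an assumption; undecodable bytes are replaced.
    return strip_unrtf_preamble(result.stdout.decode("utf-8", errors="replace"))
```

This only hides the extra output; it does nothing about the accented characters that unrtf already turned into "??".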
From the unrtf manual: "All output formats except HTML are 'alpha' i.e.
limited and development has just begun" (see `man unrtf`). The program
was previously known as rtf2htm.
It seems that the HTML output is better (the unrtf info is kept inside
<!-- --> comments), but I have an old version (0.19.2 on edgy) and there
are still issues with accented characters and with \keyword content
(\subject and \doccomm seem to be missing).
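If the HTML output were used instead, a filter could drop the comments and tags and keep only the text. A minimal sketch with Python's standard html.parser follows; the class and function names are hypothetical, and the exact HTML that unrtf emits is an assumption here, not something checked against a real run.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, ignoring comments and anything inside head/script/style."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities become characters
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments go to handle_comment, which is not overridden, so they are dropped.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks apart, at the cost of
    # splitting a word that has inline markup in the middle of it.
    return " ".join(" ".join(parser.parts).split())
```

Entities are decoded by the parser, so accented characters survive only if unrtf wrote them correctly in the first place.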
The latest version (0.20.2) seems to support Unicode; maybe we should
contact the unrtf maintainers and send them a bug report.