---------------------------- Original Message ----------------------------
Subject: Code to convert all types of Word Processing files to plain text.
From: msevior physics unimelb edu au
Date: Mon, December 15, 2003 10:43 pm
To: dashboard-hackers gnome org
--------------------------------------------------------------------------
Hi folks,
I've been in contact with wmealing about an idea to convert word
processor files to plain text files for indexing purposes. I
believe that one of the aims of the dashboard project is to locate
files by content. A large fraction of users generate word processor
files, and getting at the content of these files is difficult,
hence this code.
The attached code defines a simple C interface to code that controls a
remote AbiWord process, which converts files of type doc, rtf, abw, wpd, sdx
(plus loads more types I can't remember) to plain text files.
The conversion is very fast as a view of the document is not generated. A
190 page RTF document full of tables is converted to plain text in about 7
seconds on my 1 GHz laptop.
There are just two simple functions you need to call.
/*
* The function is fault tolerant and will report if a requested
* file cannot be converted.
*
* If AbiWord crashes, it is restarted on the next invocation
* of convertFileToText.
*
* Inputs: inFile  - The path to the word processor file to be converted.
*         outFile - The path to the file that will contain the plain text
*                   of the word processor file.
*
* The function blocks until the conversion is complete or until
* the conversion fails.
*
* It returns 0 upon a successful conversion
* and -1 if the conversion fails.
*/
int convertFileToText(const char * inFile, const char * outFile);
and...
/*
* Call this method after all conversions have completed. Otherwise you'll
* have a runaway AbiWord-2.2 process.
*/
int finalizeConversions(void);
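As a rough sketch of how the two calls might fit together in an indexer (this is not taken from the attached files): convert each file in turn, check the 0 / -1 result, and call finalizeConversions once at the end. The convertAll helper, the ".txt" output naming and the two stub bodies are assumptions made only so the example compiles without AbiWord; in real use you would include ConvertToText.h and link against ConvertToText.c instead of the stubs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * ASSUMPTION: stand-in stubs so that this sketch compiles without AbiWord.
 * In real use, remove both definitions, #include "ConvertToText.h" and
 * link against ConvertToText.c. The stubs only mimic the documented
 * return convention of convertFileToText (0 on success, -1 on failure).
 */
int convertFileToText(const char *inFile, const char *outFile)
{
    /* Made-up failure rule for the demo: missing or empty paths fail. */
    if (inFile == NULL || outFile == NULL || inFile[0] == '\0')
        return -1;
    return 0;
}

int finalizeConversions(void)
{
    return 0; /* return value is an assumption; the post does not document it */
}

/*
 * Hypothetical helper: converts each inFiles[i] to "<inFiles[i]>.txt",
 * then shuts the converter down once. Returns the number of failures.
 */
size_t convertAll(const char *const *inFiles, size_t n)
{
    char outFile[4096];
    size_t failures = 0;

    assert(n == 0 || inFiles != NULL);

    for (size_t i = 0; i < n; i++) {
        int len;

        if (inFiles[i] == NULL) {
            failures++;
            continue;
        }
        len = snprintf(outFile, sizeof outFile, "%s.txt", inFiles[i]);
        if (len < 0 || (size_t)len >= sizeof outFile) {
            failures++; /* output path would not fit in the buffer */
            continue;
        }
        /* Blocks until this conversion completes or fails. */
        if (convertFileToText(inFiles[i], outFile) != 0) {
            fprintf(stderr, "could not convert %s\n", inFiles[i]);
            failures++;
        }
    }

    /* Call once, after all conversions, so no AbiWord process is left running. */
    finalizeConversions();
    return failures;
}
```

A caller would pass an array of input paths and treat a non-zero result as "some files were skipped". Finalizing once at the end, rather than after each file, fits the note above that it should be called after all conversions have completed.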
You need CVS HEAD AbiWord (the 2.1.0 release we'll do in a week or two
will be fine) as well as the AbiCommand plugin to provide the command-line
interface which enables the conversions. The ConvertToText.c and
ConvertToText.h files are maintained in the AbiWord CVS archive, in the
abiword-plugins/tools/abicommand directory of the abiword-plugins CVS
module.
I will actively maintain the code and add new features (for example, a
thumbnailer for word processing documents).
Best wishes,
Martin Sevior
--
Andrew Ruthven
Senior Systems Engineer, Actrix Networks Ltd --> www.actrix.gen.nz
At Actrix puck actrix gen nz
At Home: andrew etc gen nz
Attachment: ConvertToText.c
Attachment: ConvertToText.h