[Fwd: Code to convert all types of Word Processing files to plain text.]



---------------------------- Original Message ----------------------------
Subject: Code to convert all types of Word Processing files to plain text.
From:    msevior physics unimelb edu au
Date:    Mon, December 15, 2003 10:43 pm
To:      dashboard-hackers gnome org
--------------------------------------------------------------------------

Hi folks,
         I've been in contact with wmealing about an idea to convert word
processor files to plain text files for indexing purposes. I
believe that one of the aims of the dashboard project is to locate
files by content. A large fraction of users generate word processor
files, and getting at the content of these files is difficult,
hence this code.

The attached code defines a simple C-interface into code that controls a
remote AbiWord process that converts files of doc,rtf,abw,wpd,sdx (plus
loads more types I can't remember) to plain text files.

The conversion is very fast because no view of the document is generated.
A 190 page RTF document full of tables is converted to plain text in about
7 seconds on my 1 GHz laptop.

There are just two simple functions you need to call.

/*
 * The function is fault tolerant and will report if a requested
 * file is not capable of being converted.
 *
 * If AbiWord does crash, it will be restarted on the next invocation
 * of convertFileToText.
 *
 * Inputs: inFile  - The path to the word processor file to be converted.
 *         outFile - The path to the file containing the plain text of the
 *                   word processor file.
 *
 * The function blocks until the conversion is complete or until
 * the conversion fails.
 *
 * It returns 0 upon a successful conversion
 *    and -1 if the conversion fails.
 */
int convertFileToText(const char * inFile, const char * outFile)

and...

/*
 * Call this function after all conversions have completed. Otherwise
 * you'll have a runaway AbiWord-2.2 process.
 */
int finalizeConversions(void)
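
As a rough sketch of how the two calls might fit together in an indexer:
convert a batch of files, count the failures, and shut AbiWord down once at
the end. The helper convertBatch and the HAVE_CONVERTTOTEXT macro are
hypothetical names made up for this example, and the stub functions are only
stand-ins that mimic the documented return values so the sketch compiles
without AbiWord. They are not the real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#ifdef HAVE_CONVERTTOTEXT
#include "ConvertToText.h"   /* real declarations from abiword-plugins */
#else
/* Stand-in stubs (assumption): they only mimic the documented contract,
 * 0 on success and -1 on failure, and do not talk to AbiWord at all. */
static int convertFileToText(const char *inFile, const char *outFile)
{
    return (inFile && *inFile && outFile && *outFile) ? 0 : -1;
}

static int finalizeConversions(void)
{
    return 0;
}
#endif

/* Hypothetical helper: convert inputs[i] to outputs[i] for i in [0, n).
 * Returns the number of files that could not be converted. */
size_t convertBatch(const char *const *inputs,
                    const char *const *outputs, size_t n)
{
    size_t failures = 0;

    assert(n == 0 || (inputs != NULL && outputs != NULL));

    for (size_t i = 0; i < n; i++) {
        /* Blocks until this file is converted or the conversion fails. */
        if (convertFileToText(inputs[i], outputs[i]) != 0) {
            fprintf(stderr, "could not convert %s\n",
                    inputs[i] ? inputs[i] : "(null)");
            failures++;
        }
    }

    /* One call after the whole batch, so no AbiWord process is left
     * running. Its return value is not described above, so it is
     * ignored here. */
    (void)finalizeConversions();
    return failures;
}
```

A caller would pass two parallel arrays of paths, for example
{"report.doc", "notes.rtf"} and {"report.txt", "notes.txt"}, and treat a
non-zero result as the number of documents to skip when indexing.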

You need CVS HEAD AbiWord (the 2.1.0 release we'll do in a week or two
will be fine) as well as the AbiCommand plugin, which provides the
command-line interface that enables the conversions. The ConvertToText.c
and ConvertToText.h files are maintained in the AbiWord CVS archive, in the
abiword-plugins CVS module under the abiword-plugins/tools/abicommand
directory.

I will actively maintain the code and add new features (for example, a
thumbnailer for word processing documents).

Best wishes,

Martin Sevior
-- 
Andrew Ruthven
Senior Systems Engineer, Actrix Networks Ltd   -->   www.actrix.gen.nz
At Actrix puck actrix gen nz
At Home:  andrew etc gen nz

Attachment: ConvertToText.c
Description: Binary data

Attachment: ConvertToText.h
Description: Binary data


