Code to convert all types of Word Processing files to plain text.



Hi folks,
         I've been in contact with wmealing about the idea of converting word
processor files to plain text files for indexing purposes. I
believe that one of the aims of the dashboard project is to locate
files by content. A large fraction of users generate
word processor files, and getting at the content of these files is
difficult, hence this code.

The attached code defines a simple C interface to code that controls a
remote AbiWord process, which converts doc, rtf, abw, wpd and sdx files
(plus loads more types I can't remember) to plain text files.

The conversion is very fast because no view of the document is generated.  A
190-page RTF document full of tables is converted to plain text in about 7
seconds on my 1 GHz laptop.

There are just two simple functions you need to call.

/*
 * The function is fault tolerant and will report if a requested
 * file cannot be converted.
 *
 * If AbiWord crashes, it is restarted on the next invocation
 * of convertFileToText.
 *
 * Inputs: inFile - The path to the word processor file to be converted.
 *         outFile - The path to the file containing the plain text of the
 *                   word processor file.
 * The function blocks until the conversion is complete or until
 * the conversion fails.
 *
 * It returns 0 upon a successful conversion
 *    and -1 if the conversion fails.
 */
int convertFileToText(const char * inFile, const char * outFile);

and...

/*
 * Call this function after all conversions have completed. Otherwise you'll
 * have a runaway AbiWord-2.2 process.
 */
int finalizeConversions(void);
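
To show how the two calls fit together, here is a minimal sketch of a batch
conversion. The helper convertBatch, the ".txt" output naming and the two stub
bodies are assumptions made up for illustration; they are not part of
ConvertToText.c. The stubs only exist so the sketch builds without AbiWord
installed.

```c
#include <stddef.h>
#include <stdio.h>

/* Prototypes as described above; normally they come from ConvertToText.h. */
int convertFileToText(const char *inFile, const char *outFile);
int finalizeConversions(void);

/*
 * ASSUMPTION: stand-in stubs so this sketch compiles on its own.
 * In real use, delete them, include ConvertToText.h and link
 * against ConvertToText.c instead.
 */
static int stubFinalizeCalls = 0;

int convertFileToText(const char *inFile, const char *outFile)
{
    /* The stub treats a missing or empty path as a failed conversion. */
    if (inFile == NULL || outFile == NULL || inFile[0] == '\0')
        return -1;
    return 0;
}

int finalizeConversions(void)
{
    stubFinalizeCalls++;
    return 0;
}

/*
 * Hypothetical helper: converts each input file to "<input>.txt" and
 * returns the number of files that could not be converted.
 */
int convertBatch(const char *const *inFiles, size_t count)
{
    char outFile[4096];
    int failures = 0;

    for (size_t i = 0; i < count; i++) {
        int n = snprintf(outFile, sizeof outFile, "%s.txt", inFiles[i]);

        /* Count an over-long path or a -1 return as a failure and carry on. */
        if (n < 0 || (size_t)n >= sizeof outFile ||
            convertFileToText(inFiles[i], outFile) != 0) {
            fprintf(stderr, "could not convert %s\n", inFiles[i]);
            failures++;
        }
    }

    /* Called once, after the last conversion, so no AbiWord is left running. */
    (void)finalizeConversions();
    return failures;
}
```

An indexer would call convertBatch once per batch of files. Because each
conversion blocks until it finishes or fails, a plain loop is enough, and a
failed file does not stop the remaining conversions.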

You need CVS HEAD AbiWord (the 2.1.0 release we'll do in a week or two
will be fine) as well as the AbiCommand plugin, which provides the
command-line interface that enables the conversions. The ConvertToText.c and
ConvertToText.h files are maintained in the AbiWord CVS archive, in the
abiword-plugins CVS module under the abiword-plugins/tools/abicommand
directory.

I will actively maintain the code and add new features (for example, a
thumbnailer for word processing documents).

Best wishes,

Martin Sevior

Attachment: ConvertToText.c
Description: Binary data

Attachment: ConvertToText.h
Description: Binary data


