---------------------------- Original Message ----------------------------
Subject: Code to convert all types of Word Processing files to plain text.
From: msevior physics unimelb edu au
Date: Mon, December 15, 2003 10:43 pm
To: dashboard-hackers gnome org
--------------------------------------------------------------------------

Hi folks,

I've been in contact with wmealing about the idea of converting word
processor files to plain text files for indexing purposes. I believe that
one of the aims of the dashboard project is to locate files by content.
A large fraction of users generate word processor files, and getting at the
content of these files is difficult - hence this code.

The attached code defines a simple C interface to code that controls a
remote AbiWord process, which converts files of type doc, rtf, abw, wpd, sdx
(plus loads more types I can't remember) to plain text files.

The conversion is very fast because no view of the document is generated.
A 190 page RTF document full of tables is converted to plain text in about
7 seconds on my 1 GHz laptop.

There are just two simple functions you need to call.

/*
 * The function is fault tolerant and will report if a requested
 * file is not capable of being converted.
 *
 * If AbiWord crashes, it is restarted on the next invocation
 * of convertFileToText.
 *
 * Inputs: inFile  - The path to the word processor file to be converted.
 *         outFile - The path to the file that will contain the plain text
 *                   of the word processor file.
 *
 * The function blocks until the conversion is complete or until
 * the conversion fails.
 *
 * It returns 0 upon a successful conversion
 * and -1 if the conversion fails.
 */
int convertFileToText(const char * inFile, const char * outFile);

and...

/*
 * Call this function after all conversions have completed. Otherwise you'll
 * have a runaway AbiWord-2.2 process.
 */
int finalizeConversions(void);

You need CVS HEAD AbiWord (the 2.1.0 release we'll do in a week or two will
be fine) as well as the AbiCommand plugin, which provides the command-line
interface that enables the conversions.

The ConvertToText.c and ConvertToText.h files are maintained in the AbiWord
CVS archive, in the abiword-plugins CVS module under the
abiword-plugins/tools/abicommand directory.

I will actively maintain the code and add new features (for example, a
thumbnailer for word processing documents).

Best wishes,

Martin Sevior

--
Andrew Ruthven
Senior Systems Engineer, Actrix Networks Ltd --> www.actrix.gen.nz
At Actrix puck actrix gen nz
At Home: andrew etc gen nz
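
For anyone wiring this into an indexer, here is a rough sketch of how the two functions described in the message might be called over a batch of files. It is only an illustration under stated assumptions: convert_batch, the USE_REAL_ABIWORD macro, the ".txt" output naming, and the stand-in stubs are all hypothetical and are not part of the attached code. The stubs exist only so the sketch compiles without AbiWord; they mimic nothing beyond the documented return values (0 on success, -1 on failure).

```c
#include <stddef.h>
#include <stdio.h>

#ifdef USE_REAL_ABIWORD
#include "ConvertToText.h"  /* the real interface from abiword-plugins */
#else
/* Hypothetical stand-ins so this sketch builds without AbiWord.
 * They only imitate the documented return values. */
static int finalize_calls = 0;

int convertFileToText(const char *inFile, const char *outFile)
{
    if (inFile == NULL || outFile == NULL || inFile[0] == '\0')
        return -1;              /* pretend the file could not be converted */
    return 0;                   /* pretend the conversion succeeded */
}

int finalizeConversions(void)
{
    finalize_calls++;           /* the real call shuts down remote AbiWord */
    return 0;
}
#endif

/*
 * Hypothetical helper: convert every inputs[i] to "<inputs[i]>.txt".
 * Each conversion blocks until it finishes or fails. A failure is
 * counted and the batch carries on with the next file.
 * Returns the number of files that could not be converted.
 */
int convert_batch(const char *const *inputs, size_t count)
{
    char outPath[4096];
    int failures = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        int n = snprintf(outPath, sizeof outPath, "%s.txt", inputs[i]);

        if (n < 0 || (size_t)n >= sizeof outPath
            || convertFileToText(inputs[i], outPath) != 0) {
            fprintf(stderr, "could not convert: %s\n", inputs[i]);
            failures++;
        }
    }

    /* Always finalize once at the end, so that no AbiWord process
     * is left running after the batch. */
    finalizeConversions();
    return failures;
}
```

Calling finalizeConversions once per batch, rather than once per file, matches the description above: the remote AbiWord process is reused across conversions and only shut down when all of them are done.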
Attachment:
ConvertToText.c
Description: Binary data
Attachment:
ConvertToText.h
Description: Binary data