Re: Code to convert all types of Word Processing files to plaintext.



> On Mon, 2003-12-15 at 06:43, msevior physics unimelb edu au wrote:
>> Hi folks,
>>          I've been in contact with wmealing about an idea to convert
>> Word Processor files to plain text files for indexing purposes.
>
> We'll also want to get out the metadata that's contained in those Word
> documents and feed those to the indexer as well.  Is there any rich
> semantic data in the content itself that might be useful to index?  If
> so, we might want to teach the indexer how to read that out as well,
> instead of just converting it to plain text.
>

Yes, it is very easy to extract the metadata for those formats where we
support metadata. At present these are abw, doc, rtf, and wpd.

How about also returning a flat file with name/attribute pairs?

e.g.

Name: Frad Smith
Document title: My CV
Creation Date: 9th December 2001
....

So the C interface becomes:

ConvertToTextWithMetaData(const char *infile, const char *outfile,
                          const char *metaFile)
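
On the indexer side, reading such a file back could look roughly like the
sketch below. This is only an illustration: parse_meta_line is a
hypothetical helper, not an existing function, and it assumes one
name/attribute pair per line, split at the first colon.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any existing API): split one
 * "Name: value" line from the metadata file at the first colon.
 * Returns 0 on success, -1 if the line has no colon or a field
 * does not fit in its buffer. */
int parse_meta_line(const char *line,
                    char *name, size_t name_size,
                    char *value, size_t value_size)
{
    assert(line != NULL && name != NULL && value != NULL);

    const char *colon = strchr(line, ':');
    if (colon == NULL)
        return -1;

    /* Everything before the first colon is the name. */
    size_t name_len = (size_t)(colon - line);
    if (name_len >= name_size)
        return -1;
    memcpy(name, line, name_len);
    name[name_len] = '\0';

    /* Skip blanks after the colon, so "Date:9th" and "Date: 9th" agree. */
    const char *v = colon + 1;
    while (*v == ' ' || *v == '\t')
        v++;

    /* Stop at a trailing newline or carriage return, if present. */
    size_t value_len = strcspn(v, "\r\n");
    if (value_len >= value_size)
        return -1;
    memcpy(value, v, value_len);
    value[value_len] = '\0';
    return 0;
}
```

Splitting at the first colon keeps values that themselves contain colons
(a time of day, say) intact. Whether the real file would use exactly this
layout is an open question.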

It might also be possible to sniff out just the metadata quickly for some
formats, in which case you wouldn't need to read the whole file (which
would speed things up for very large documents).

What do you think?

Cheers

Martin




