Re: Will Beagle index PDFs?

You should just be able to re-use your existing html parser anyhow.

That's how I pulled the metadata in for my indexer.

There are also a number of other external converters that generate HTML,
so why not make an ExternalConverterViaHTml abstract base class? Each
specific subclass would then typically only need to override the actual
external converter.

Then you can use:
pdftotext with the -htmlmeta flag
rtf2html (
xlhtml  (
ppthtml (distributed with the above)
wvhtml (ships as part of wvware)

These would give you PDF, RTF, Excel, PowerPoint and Word indexing.
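
To make the base-class idea concrete, here is a rough sketch in Python (not Beagle's actual code, and not the language Beagle is written in). The method names command() and extract(), the helper parse_html() and the PdfConverter subclass are all hypothetical names made up for illustration. The only external detail it relies on is that 'pdftotext -htmlmeta' writes simple HTML with the metadata in <meta> tags, and that "-" as the output file sends it to stdout.

```python
import subprocess
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class _MetaAndTextParser(HTMLParser):
    """Collects <meta name=... content=...> pairs and visible text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.text = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("name") and a.get("content") is not None:
                self.meta[a["name"]] = a["content"]
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


def parse_html(html):
    """Shared HTML step: return (metadata dict, plain text)."""
    parser = _MetaAndTextParser()
    parser.feed(html)
    parser.close()
    return parser.meta, "\n".join(parser.text)


class ExternalConverterViaHTml(ABC):
    """Runs an external converter and feeds its HTML output to parse_html()."""

    @abstractmethod
    def command(self, path):
        """Return the argv list that writes HTML for `path` to stdout."""

    def extract(self, path):
        result = subprocess.run(self.command(path), capture_output=True, check=True)
        html = result.stdout.decode("utf-8", errors="replace")
        return parse_html(html)


class PdfConverter(ExternalConverterViaHTml):
    # Hypothetical subclass: only the command line differs per file type.
    def command(self, path):
        # Assumes pdftotext is on PATH; "-" means write to stdout.
        return ["pdftotext", "-htmlmeta", path, "-"]
```

With something like this, PdfConverter().extract("some.pdf") would return the metadata and the text together, and the other converters in the list would each need only their own command() override, assuming they can be told to write their HTML to stdout.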


On Tue, 2004-07-27 at 10:33 -0500, Jon Trowbridge wrote:
> On Tue, 2004-07-27 at 13:31 +0100, Christopher Orr wrote:
> > I'm not sure if it's doing things the right way within the context 
> > of the Beagle framework, but nevertheless it does work.
> Yes, it is doing things the right way. :)  I've committed your patch to
> CVS.
> It is too bad that pdftotext doesn't provide a straightforward way to
> get at the metadata.  Maybe we should be parsing the output of
> 'pdftotext -htmlmeta' instead --- it puts the metadata in <meta> tags,
> and the HTML it generates is so simplistic that we should be able to
> strip it out without too many problems.
> Thanks,
> -J
> _______________________________________________
> Dashboard-hackers mailing list
> Dashboard-hackers gnome org
