Re: Will Beagle index PDFs?
- From: Julian Satchell <j satchell eris qinetiq com>
- To: Jon Trowbridge <trow ximian com>
- Cc: dashboard-hackers gnome org, Christopher Orr <dashboard protactin co uk>
- Subject: Re: Will Beagle index PDFs?
- Date: Tue, 27 Jul 2004 15:43:11 +0100
You should be able to just re-use your existing HTML parser anyhow.
That's how I pulled the metadata in for my indexer.
There are also a number of other external converters that generate HTML,
so why not make an ExternalConverterViaHtml abstract base class, where
each specific subclass will typically only need to override the actual
external converter?
Then you can use:
pdftotext with the -htmlmeta flag
rtf2html (http://www.w3.org/Tools/HTMLGeneration/rtf2html.html)
xlhtml (http://chicago.sourceforge.net/xlhtml/)
ppthtml (distributed with the above)
wvHtml (ships as part of wvWare,
http://wvware.sourceforge.net/wvWare.html)
These would give you pdf, rtf, Excel, Powerpoint and Word indexing
respectively.
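
Roughly what I have in mind, as a minimal sketch in Python rather than actual Beagle code. The names ExternalConverterViaHtml, PdfConverter, parse_html, command and index are all made up for illustration, and I am assuming the converter writes its HTML to stdout (pdftotext does this when the output file is given as "-"). The other converters would each need their own command line checked before adding a subclass.

```python
import subprocess
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class _MetaAndTextParser(HTMLParser):
    """Collects <meta name=... content=...> pairs and all text nodes."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            name = attr_map.get("name")
            content = attr_map.get("content")
            if name and content is not None:
                self.meta[name] = content

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def parse_html(html):
    """Return (metadata dict, text) extracted from converter output."""
    parser = _MetaAndTextParser()
    parser.feed(html)
    parser.close()
    return parser.meta, "\n".join(parser.text_parts)


class ExternalConverterViaHtml(ABC):
    """Hypothetical base class: run an external tool, then parse its HTML."""

    @abstractmethod
    def command(self, path):
        """Return the argv list that converts `path` to HTML on stdout."""

    def convert_to_html(self, path):
        result = subprocess.run(self.command(path), capture_output=True, check=True)
        return result.stdout.decode("utf-8", errors="replace")

    def index(self, path):
        return parse_html(self.convert_to_html(path))


class PdfConverter(ExternalConverterViaHtml):
    """Only the external command differs between subclasses."""

    def command(self, path):
        # "-" as the output file makes pdftotext write to stdout.
        return ["pdftotext", "-htmlmeta", path, "-"]
```

With this shape, something like PdfConverter().index("paper.pdf") would hand back the metadata and the text in one go, and an RTF or Excel subclass would only have to supply a different command().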
Julian
On Tue, 2004-07-27 at 10:33 -0500, Jon Trowbridge wrote:
> On Tue, 2004-07-27 at 13:31 +0100, Christopher Orr wrote:
> > I'm not sure if it's doing things the right way within the context
> > of the Beagle framework, but nevertheless it does work.
>
> Yes, it is doing things the right way. :) I've committed your patch to
> CVS.
>
> It is too bad that pdftotext doesn't provide a straightforward way to
> get at the metadata. Maybe we should be parsing the output of
> 'pdftotext -htmlmeta' instead --- it puts the metadata in <meta> tags,
> and the HTML it generates is so simplistic that we should be able to
> strip it out without too many problems.
>
> Thanks,
> -J