Re: Will Beagle index PDFs?



Hi all,

On 20/07/04 19:19, Srikant Jakilinki wrote:
> It is quite easy to index PDF's. Just use the "pdftotext" as a filter and so, yes, in the very near future it will be done Ralph.

Based on pdftotext, I made a simple PDF filter for those who want to try it. I'm not sure if it's doing things the right way within the context of the Beagle framework, but nevertheless it does work.

> I use things in conjunction with "libextractor" which can extract
> metadata from PDF's as well...

If someone were to write a full PDF filter, I'd imagine something like
this would be needed to extract the (little) metadata that PDFs include.
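
To give a rough idea of the metadata side, here is a minimal sketch. Note that it does not use libextractor: it shells out to `pdfinfo` (which ships alongside pdftotext in xpdf) and splits the "Key: value" lines it prints into a table. The class and method names are made up for illustration, it assumes `pdfinfo` is on the PATH, and how the result would be attached to an indexed document in Beagle is deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper for illustration; not part of Beagle.
public static class PdfMetadataSketch {

	// Split "Key:   value" lines (the layout pdfinfo prints) into a table.
	public static Dictionary<string, string> ParseInfo (string output)
	{
		var info = new Dictionary<string, string> ();
		foreach (string line in output.Split ('\n')) {
			int colon = line.IndexOf (':');
			if (colon <= 0)
				continue; // not a "Key: value" line
			string key = line.Substring (0, colon).Trim ();
			string value = line.Substring (colon + 1).Trim ();
			if (value.Length > 0)
				info [key] = value;
		}
		return info;
	}

	// Run pdfinfo on a file and parse whatever it prints.
	// Assumption: pdfinfo is installed and on the PATH.
	public static Dictionary<string, string> Extract (string path)
	{
		Process pc = new Process ();
		pc.StartInfo.FileName = "pdfinfo";
		pc.StartInfo.Arguments = "\"" + path + "\"";
		pc.StartInfo.RedirectStandardOutput = true;
		pc.StartInfo.UseShellExecute = false;
		pc.Start ();
		string output = pc.StandardOutput.ReadToEnd ();
		pc.WaitForExit ();
		pc.Close ();
		return ParseInfo (output);
	}
}
```

Only the first colon on a line is treated as the separator, so values that contain colons themselves (dates, for instance) come through intact.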

Also, as pdftotext has no concept of headings and so on, this filter
can't provide "hotness" information.  I'm not sure how easily this
could be done anyway, though: even `pdftohtml` only uses CSS to make
text appear larger, so it doesn't seem to recognise headings either.
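
One possible workaround would be to guess headings from font size alone. The following is only a sketch of that heuristic, assuming some converter can hand us each run of text together with its font size; the `TextRun` type, the method names and the `factor` threshold are all invented for illustration and are not Beagle API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type for illustration: one run of text and its font size.
public class TextRun {
	public string Text;
	public double FontSize;

	public TextRun (string text, double fontSize)
	{
		Text = text;
		FontSize = fontSize;
	}
}

public static class HotnessSketch {

	// The body size is taken to be the size that covers the most characters.
	public static double BodyFontSize (IList<TextRun> runs)
	{
		var chars = new Dictionary<double, int> ();
		double best = 0;
		int bestCount = -1;
		foreach (TextRun run in runs) {
			int count;
			chars.TryGetValue (run.FontSize, out count); // 0 if unseen
			count += run.Text.Length;
			chars [run.FontSize] = count;
			if (count > bestCount) {
				best = run.FontSize;
				bestCount = count;
			}
		}
		return best;
	}

	// Anything set noticeably larger than the body text is guessed to be "hot".
	public static List<string> GuessHotText (IList<TextRun> runs, double factor)
	{
		double body = BodyFontSize (runs);
		var hot = new List<string> ();
		foreach (TextRun run in runs)
			if (run.FontSize > body * factor)
				hot.Add (run.Text);
		return hot;
	}
}
```

With a `factor` of, say, 1.2, a 14pt line in a 10pt document would count as hot. This would obviously misfire on documents that use large type for other things (pull quotes, title pages), so it is a guess rather than real heading detection.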

Regards,
Chris
--- cvs/beagle/Filters/Makefile.am	2004-07-25 00:30:08.000000000 +0100
+++ beagle/Filters/Makefile.am	2004-07-27 12:45:44.000000000 +0100
@@ -26,6 +26,7 @@
 	$(srcdir)/FilterMan.cs		\
 	$(srcdir)/FilterMusic.cs	\
 	$(srcdir)/FilterOpenOffice.cs	\
+	$(srcdir)/FilterPdf.cs		\
 	$(srcdir)/FilterPng.cs		\
 	$(srcdir)/FilterText.cs	
 
--- cvs/beagle/Filters/FilterPdf.cs	1970-01-01 01:00:00.000000000 +0100
+++ beagle/Filters/FilterPdf.cs	2004-07-27 12:35:53.124795827 +0100
@@ -0,0 +1,47 @@
+//
+// FilterPdf.cs: Very simplistic PDF filter
+//
+// Author:
+//   Christopher Orr <dashboard protactin co uk>
+//
+// Copyright 2004 by Christopher Orr
+//
+
+using System;
+using System.IO;
+using System.Diagnostics;
+
+namespace Beagle.Filters {
+
+	public class FilterPdf : Filter {
+
+		public FilterPdf ()
+		{
+			AddSupportedMimeType ("application/pdf");
+		}
+
+		protected override void DoPull ()
+		{
+			// get full file path from Filter
+			string path = CurrentFileInfo.FullName;
+			Console.WriteLine ("Converting PDF \"{0}\"", path);
+
+			// create new external process
+			Process pc = new Process ();
+			pc.StartInfo.FileName = "pdftotext";
+			pc.StartInfo.Arguments = "\""+ path +"\" -";
+			pc.StartInfo.RedirectStandardInput = false;
+			pc.StartInfo.RedirectStandardOutput = true;
+			pc.StartInfo.UseShellExecute = false;
+			pc.Start ();
+
+			// add pdftotext's output to pool
+			StreamReader pout = pc.StandardOutput;
+			AppendText (pout.ReadToEnd ());
+			pout.Close ();
+			pc.WaitForExit ();
+			pc.Close ();
+			Finished ();
+		}
+	}
+}

