Beagle content extraction question

From: Andrew Leung <aleung soe ucsc edu>
To: dashboard-hackers gnome org
Cc: Tim Bisson <Tim Bisson netapp com>
Subject: Beagle content extraction question
Date: Tue, 8 Jul 2008 17:47:08 -0700

Hi All,

I am using beagle's beagle-extract-content program to extract keywords from files on my desktop for some later analysis. I've written a highly parallelized file system crawler that can crawl the file system's namespace very fast. I have modified it to use beagle-extract-content on each file to extract the file's keywords. The program (written in C) uses popen() to run beagle-extract-content and reads the program's output from a socket and currently extracts contents successfully.
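For readers following along, here is a minimal sketch of the kind of popen() capture described above. It is not the crawler's actual code: the helper name run_and_capture is hypothetical, and the command string is left entirely to the caller, so nothing here assumes a particular beagle-extract-content command line.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: run `cmd` through the shell with popen() and return
 * everything it wrote to stdout as a malloc'd, NUL-terminated string.
 * The caller frees the result. Returns NULL on failure. If exit_status is
 * non-NULL, it receives the value returned by pclose(). */
char *run_and_capture(const char *cmd, int *exit_status)
{
    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return NULL;

    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) {
        pclose(pipe);
        return NULL;
    }

    size_t n;
    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(buf + len, 1, cap - len - 1, pipe)) > 0) {
        len += n;
        if (cap - len - 1 == 0) {
            /* Buffer is full: double it before the next read. */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                pclose(pipe);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    buf[len] = '\0';

    int status = pclose(pipe);
    if (exit_status != NULL)
        *exit_status = status;
    return buf;
}
```

A caller would pass the full command line for one file (for example, beagle-extract-content followed by a properly shell-quoted path) and parse the returned text. Every call forks a shell and starts a fresh process, so any per-process startup cost is paid once per file.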

However, it is slow. I noticed that beagle-extract-content will spend ~150ms opening the Filter and only 10ms or so actually crawling the file and extracting keywords. It seems to be taking the time to determine which filter to use on the file, even when I use the --mimetype flag to tell it the type of the file. Is there any way to speed up this process or to tell it specifically which filter to use? Alternatively, is there an API for beagle-extract-content so that I can simply invoke a function from C that doesn't need to spend as much time determining which filter to use?
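As a rough way to see the per-invocation cost from the C side, one could time each popen() call end to end. This is only a sketch with a hypothetical helper name (time_command_ms). It measures total wall-clock time including shell and process startup, so it cannot by itself separate filter selection from extraction.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: run `cmd` via popen(), discard its output, and
 * return the elapsed wall-clock time in milliseconds.
 * Returns -1.0 if the clock, popen() or pclose() fails. */
double time_command_ms(const char *cmd)
{
    struct timespec start, end;
    char sink[4096];

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return -1.0;

    /* Drain stdout so the child is never blocked on a full pipe. */
    while (fread(sink, 1, sizeof sink, pipe) > 0) {
        /* output intentionally discarded */
    }

    /* pclose() waits for the child to exit. */
    if (pclose(pipe) == -1)
        return -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Comparing the result for a trivial command such as true against the result for one extraction gives a baseline for how much of the time is plain process startup.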


Thanks for your help in advance.

Andrew Leung

Follow-Ups:
- Re: Beagle content extraction question
  - From: D Bera
