Beagle content extraction question



Hi All,

I am using Beagle's beagle-extract-content program to extract keywords from files on my desktop for later analysis. I've written a highly parallelized file system crawler that can crawl the file system's namespace very quickly, and I have modified it to run beagle-extract-content on each file to extract that file's keywords. The program (written in C) uses popen() to run beagle-extract-content and reads the program's output from the resulting pipe. This currently extracts contents successfully.
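
A minimal sketch of the popen() pattern described above, not the crawler's actual code. The helper name run_and_capture and the fixed-size output buffer are assumptions made for illustration; the command string is left to the caller, so no beagle-extract-content options are assumed here.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen()/pclose() under strict -std modes */
#include <stdio.h>

/* Hypothetical helper: run `cmd` through popen() and copy its standard
 * output into `out` (at most cap - 1 bytes, always NUL-terminated).
 * Returns the number of bytes stored, or -1 on failure. */
long run_and_capture(const char *cmd, char *out, size_t cap)
{
    if (cap == 0)
        return -1;
    out[0] = '\0';

    /* popen() starts a shell and returns a stdio stream backed by a pipe. */
    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return -1;

    size_t used = 0;
    size_t n;
    while (used < cap - 1 &&
           (n = fread(out + used, 1, cap - 1 - used, pipe)) > 0)
        used += n;
    out[used] = '\0';

    /* Drain whatever did not fit so the child can finish writing and exit. */
    char scratch[4096];
    while (fread(scratch, 1, sizeof scratch, pipe) > 0)
        ;

    /* pclose() waits for the child process to terminate. */
    if (pclose(pipe) == -1)
        return -1;
    return (long)used;
}
```

In the crawler, `cmd` would be the beagle-extract-content command line for a single file. Because popen() passes the string to a shell, quoting the file path safely is the caller's responsibility in this sketch.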

However, it is slow. I noticed that beagle-extract-content spends ~150ms opening the Filter and only 10ms or so actually crawling the file and extracting keywords. It seems to be spending that time determining which filter to use on the file, even when I use the --mimetype flag to tell it the type of the file. Is there any way to speed up this process, or to tell it specifically which filter to use? Alternatively, is there an API for beagle-extract-content, so that I can simply invoke a function from C that doesn't need to spend as much time determining which filter to use?
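
One rough way to see the fixed per-invocation cost from the C side is to time a whole popen() run, sketched below. The helper name time_command_ms is hypothetical. It measures total wall-clock time for the child process (startup, filter selection, and extraction together), so it cannot separate the ~150ms and ~10ms figures above; it only shows how much each invocation costs overall.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen(), pclose(), clock_gettime() */
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: run `cmd` via popen(), discard its output, and
 * return the elapsed wall-clock time in milliseconds (-1.0 on failure). */
double time_command_ms(const char *cmd)
{
    struct timespec start, end;
    char scratch[4096];

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL)
        return -1.0;

    /* Read until EOF so the child is never blocked on a full pipe. */
    while (fread(scratch, 1, sizeof scratch, pipe) > 0)
        ;

    /* pclose() waits for the child, so the interval covers its whole run. */
    if (pclose(pipe) == -1)
        return -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    return (double)(end.tv_sec - start.tv_sec) * 1000.0 +
           (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Timing the same command on a tiny file and on a large one gives a rough idea of how much of the cost is independent of file size.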

Thanks in advance for your help.

Andrew Leung

