Re: Beagle content extraction question

Hi Andrew,

> I am using beagle's beagle-extract-content program to extract keywords from
> files on my desktop for some later analysis. I've written a highly
> However, it is slow. I noticed that beagle-extract-content will spend ~150ms
> opening the Filter and only 10ms or so actually crawling the file and

The initialization takes some time. Part of this is due to Mono's VM
initialization, part is due to Beagle's own initialization. All the
filters are basically plugins to Beagle, so some time is also spent in
locating the filters from the plugins in the path.

> extracting keywords. It seems to be taking the time to determine which
> filter to use on the file, even when I use the --mimetype flag to tell it

The first call to "determine filters" does a bunch of other
initialization. beagle-extract-content takes multiple files as input.
If you give it multiple files, then you will notice that extraction is
pretty fast for all but the first file.
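
To illustrate the batching idea, here is a minimal sketch of a wrapper that
hands files to beagle-extract-content in groups instead of one at a time, so
the startup cost is paid once per group. It assumes beagle-extract-content is
on the PATH and accepts several file paths as plain arguments; the helper
names (batches, build_commands, extract_all) and the batch size are made up
for this example and are not part of Beagle.

```python
import subprocess
from itertools import islice

# Assumption: the tool is on PATH and takes file paths as positional arguments.
EXTRACTOR = "beagle-extract-content"


def batches(paths, size=10):
    """Yield successive lists of at most `size` paths."""
    it = iter(paths)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_commands(paths, size=10):
    """Return one argv list per batch (hypothetical helper, not a Beagle API)."""
    return [[EXTRACTOR, *chunk] for chunk in batches(paths, size)]


def extract_all(paths, size=10):
    """Run the extractor once per batch and return the concatenated stdout."""
    outputs = []
    for cmd in build_commands(paths, size):
        # One process per batch: initialization happens once per batch,
        # not once per file.
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        outputs.append(result.stdout)
    return "".join(outputs)
```

Calling extract_all(list_of_paths) from a crawler would start one process per
10 files rather than one per file. A larger batch size spreads the startup
cost over more files, at the price of getting output back in bigger chunks.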

> the type of the file. Is there anyway to speed up this process or to tell it
> specifically which filter to use? Alternatively, is there an API for the
> beagle-extract-content so that I can simply invoke a function from C that
> doesn't need to spend as much time determining which filter to use?

There is no way to specify which filter to use. But as I said, if you
pass it, say, 10 files at a time, you will notice a speedup for the
whole process.
beagle-extract-content is more of a debugging tool; there is no API for
it. If you want to keep using your own crawler, I see no easy way to
pay the initialization cost only once. If all you need is the data, you
can modify beagle-build-index to print all the properties as it indexes
the files. beagle-build-index is used to create an index on demand; it
simply crawls the filesystem and uses methods similar to those of
beagle-extract-content to extract properties.

- dBera

Debajyoti Bera
beagle / KDE fan
Mandriva / Inspiron-1100 user
