Re: Beagle content extraction question
- From: Andrew Leung <aleung soe ucsc edu>
- To: D Bera <dbera web gmail com>
- Cc: Tim Bisson <Tim Bisson netapp com>, dashboard-hackers gnome org
- Subject: Re: Beagle content extraction question
- Date: Tue, 8 Jul 2008 18:17:51 -0700
Thanks a lot for your help. I think I'll look at inputting multiple
files at a time and seeing how that improves things.
Andrew
On Jul 8, 2008, at 6:04 PM, D Bera wrote:
Hi Andrew,
> I am using beagle's beagle-extract-content program to extract keywords from
> files on my desktop for some later analysis. I've written a highly
> ...
> However, it is slow. I noticed that beagle-extract-content will spend ~150ms
> opening the Filter and only 10ms or so actually crawling the file and
The initialization takes some time. Part of this is due to Mono's VM
initialization, part is due to Beagle's own initialization. All the
filters are basically plugins to beagle, so some time is also spent in
locating the filters from the plugins in the path.
> extracting keywords. It seems to be taking the time to determine which
> filter to use on the file, even when I use the --mimetype flag to tell it
The first call to "determine filters" does a bunch of other
initialization. beagle-extract-content takes multiple files as input.
If you give it multiple files, then you will notice that extraction is
pretty fast for all but the first file.
> the type of the file. Is there any way to speed up this process or to tell it
> specifically which filter to use? Alternatively, is there an API for the
> beagle-extract-content so that I can simply invoke a function from C that
> doesn't need to spend as much time determining which filter to use?
There is no way to specify which filter to use. But as I said, if you
pass it e.g. 10 files at a time, then you will notice speedup for the
whole process.
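A rough sketch of that batching idea from a crawler's side, in Python. It assumes only what is said above, namely that beagle-extract-content accepts several file paths as plain arguments. The helper names (batches, build_command, extract) are made up for illustration, and the output is returned as raw text because its format is not described here.

```python
import subprocess
from typing import Iterable, Iterator, List


def batches(paths: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def build_command(batch: List[str],
                  tool: str = "beagle-extract-content") -> List[str]:
    # Assumption: the tool takes file paths as positional arguments,
    # so one process start-up is shared by the whole batch.
    return [tool, *batch]


def extract(paths: Iterable[str], size: int = 10) -> Iterator[str]:
    """Run the tool once per batch and yield its raw stdout."""
    for batch in batches(paths, size):
        result = subprocess.run(build_command(batch),
                                capture_output=True, text=True, check=False)
        yield result.stdout
```

With a batch size of 10, the initialization cost should be paid roughly once per ten files rather than once per file. Splitting the combined output back into per-file results is left to the caller.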
Beagle-extract-content is more of a debugging tool; there is no API to
use it. If you want to use your crawler then I see no easy way of
paying for the initialization cost only once. If all you need is the
data, then you can modify beagle-build-index to print all the
properties as it indexes the files. beagle-build-index is used to
create an index on-demand; it simply crawls the filesystem and uses
similar methods as beagle-extract-content to extract properties.
- dBera
--
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user