Re: [Tracker] performance problems in the pdf extractor?



Hi!,

On Wed, 2010-02-10 at 11:04 +0100, Michael Biebl wrote:
> Hi,
>
> I'm running the latest tracker from master (0.7.19-19-g97bcf4f)
> When I do the initial indexing run (wiped all old database before
> using tracker-control -r)
> I get messages from tracker-miner-fs like this (around 30 or so)
> (tracker-miner-fs:11608): Tracker-CRITICAL **: Could not process
> 'file:///home/michael/docs/ldd3_pdf/ldr3TOC.fm.pdf': Did not receive a
> reply. Possible causes include: the remote application did not send a
> reply, the message bus security policy blocked the reply, the reply
> timeout expired, or the network connection was broken.
>
> This only seems to happen for pdf documents afaics.
> I'm pretty sure it is not a dbus security policy blocking the reply,
> so I guess it's a timeout problem.
>
> Anyone seen this before? Is this a problem in the miner or the pdf extractor?

I've also seen this with large PDF files that contain barely any text: the
extractor looks through every page for text, which can be a
time-consuming operation.

Ideally (at least for my case), there would be a quick has_text()
function in poppler, so we don't make it go through all the streams
for nothing. But this boils down to a more general problem: extractors
should give up after some time, and there should be a more elegant way to
tell the miner so than a DBus timeout. I plan to work on this soon.
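
To make the "give up after some time" idea concrete, here is a minimal
Python sketch. It is only an illustration under assumptions, not how
Tracker or poppler work: extract_with_budget, ExtractionResult and the
per-page callables are all hypothetical names. The point is that the
extractor stops by itself when its time budget runs out and reports that
explicitly, together with whatever text it found so far, instead of
leaving the caller to hit a bus timeout.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ExtractionResult:
    text: str        # text gathered from the pages that were processed
    pages_done: int  # number of pages fully processed
    gave_up: bool    # True when the time budget ran out before the last page


def extract_with_budget(
    pages: Iterable[Callable[[], str]],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> ExtractionResult:
    """Extract text page by page, stopping once the time budget is spent.

    Each item in `pages` is a hypothetical callable returning one page's
    text; a real extractor would call into its PDF library here.
    """
    deadline = clock() + budget_seconds
    chunks = []
    pages_done = 0
    for get_page_text in pages:
        # Check the budget before starting a page, so a partial result
        # is returned rather than silently running past the deadline.
        if clock() >= deadline:
            return ExtractionResult(" ".join(chunks), pages_done, gave_up=True)
        text = get_page_text().strip()
        if text:
            chunks.append(text)
        pages_done += 1
    return ExtractionResult(" ".join(chunks), pages_done, gave_up=False)
```

A caller (the miner, in this thread's terms) could then check gave_up and
decide whether to index the partial text or skip the file. Note that the
budget is only checked between pages, so one very slow page can still
overrun it.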


> Cheers,
> Michael




