Re: [Tracker] PATCH: Faster PNG extractor

On 04/07/13 08:10, Jonatan Pålsson wrote:
On 28 June 2013 18:37, Martyn Russell <martyn lanedo com> wrote:
On 28/06/13 07:45, Philip Van Hoof wrote:

On 28/06/2013 8:30, Jonatan Pålsson wrote:

Hi Jonatan,


Hello Jonatan, Philip,

Hey Martyn!

Hello,

I definitely see your point here. Having one extractor fail and
switching over to a different one several thousand times is a huge
waste of time. This would happen when png-faster fails to skip to the
end of the file, most likely due to the IDAT chunks being of variable
size. I'd like to point out, however, that this should be an uncommon
scenario (based on the fact that I have never seen such a file). If it
turns out to be much more common than I anticipate, the usefulness of
png-faster can be debated :) The worst case for png-faster that I can
think of is if the same software/camera produces all the PNG files
scanned by Tracker, and these PNGs have variable-sized IDATs. That
would be bad.
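
For illustration, here is a rough, hypothetical sketch of the general chunk-skipping idea in C. It is not the actual png-faster code, and it does not model the variable-IDAT-size failure case described above; it only shows why the IDAT payload itself never needs to be read, because every PNG chunk declares its own length.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint32_t
read_be32 (FILE *f)
{
        unsigned char b[4];

        if (fread (b, 1, 4, f) != 4)
                return 0;

        return ((uint32_t) b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
}

static void
walk_chunks (FILE *f)
{
        /* Skip the 8 byte PNG signature */
        fseek (f, 8, SEEK_SET);

        for (;;) {
                uint32_t length = read_be32 (f);
                char type[5] = { 0 };

                if (fread (type, 1, 4, f) != 4)
                        break;

                if (strcmp (type, "IEND") == 0)
                        break;

                /* The chunk data is never read; the declared length plus
                 * the 4 byte CRC is simply seeked over. Metadata chunks
                 * such as tEXt/iTXt/tIME would be parsed here instead. */
                fseek (f, (long) length + 4, SEEK_CUR);
        }
}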

I agree. But if that's as unlikely as you say, it's better to be fast for the common case and bug fix or patch the unlikely scenarios on a case-by-case basis. Were this to become more likely (i.e. needing the slow extractor), we could make it a configure switch, sure...

I'm obviously partial here, due to the approach taken in png-faster,
but I like the idea of separating different extraction strategies into
different extractor modules. This means they can easily be disabled,
prioritized, etc. A different approach (which would be taken if the
two extractors are merged) would be to use #ifdefs within the
extractor module; this means we can select extractors at compile time,
but only at compile time.
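
As a hypothetical sketch of what that merged, #ifdef-based variant could look like (the macro and helper names below are made up for illustration):

#include <stdbool.h>

bool extract_png_fast (const char *path); /* assumed helper */
bool extract_png_full (const char *path); /* assumed helper */

static bool
extract_png (const char *path)
{
#ifdef HAVE_PNG_FASTER
        /* Chosen at configure/compile time; cannot change at runtime */
        return extract_png_fast (path);
#else
        return extract_png_full (path);
#endif
}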

I would prefer to avoid #ifdefs where possible. We have a lot of those already and they add to the maintenance burden.

Also, if the faster approach is the common case, it makes more sense to go with that and fall back. If vendors find all their PNGs fall into the slower case, it would be easy to patch the extractor to never check the IDAT the way the faster one does, to save some small amount of time.

Bottom line, I don't expect this to be something people need to configure in 99% of the use cases out there.
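As a hypothetical sketch of that fall-back shape (reusing the made-up helper names from the sketch above, not any real Tracker API):

#include <stdbool.h>

bool extract_png_fast (const char *path); /* assumed helper */
bool extract_png_full (const char *path); /* assumed helper */

static bool
extract_png (const char *path)
{
        /* Common case: the fast chunk-skipping path succeeds */
        if (extract_png_fast (path))
                return true;

        /* Rare case (e.g. an unexpected IDAT layout): fall back to
         * the slower, exhaustive pass */
        return extract_png_full (path);
}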

On a slightly different note: right now, some extractors can fall back
to a more generic extractor, for example GStreamer, which is exactly
what I am going for in png-faster as well. The argument you make
concerning when the "faster" extractor fails is very valid for these
extractors as well, and I wonder... wouldn't it be nice to blacklist
certain extractors dynamically if they are prone to errors? Say
png-faster or the mp3 extractor has failed five times in a row (or
several times within a short period of time), and there is a more
generic extractor available; the specialized extractor could then

That sounds good in principle, but in reality what happens is, people have very different content. All it takes is 5 large PDFs and you're now not indexing PDFs at all, because the 5 before this one took too long. Or perhaps you index a directory with a bunch of PNGs all written by an application which does it incorrectly; now all other PNGs are discriminated against.

I would certainly accept patches on this, but it shouldn't be the default, because it's hard to correctly guess a heuristic for what content is acceptable to a user for indexing, and there are always false positives.

automatically be skipped by tracker-extract, and the extractor with a
lower priority could be chosen. With this functionality in place, the
original concern that png-faster might fail very many times should be
mitigated, while also possibly contributing to an overall performance
boost for the other modules which have more generic extractors
available. The blacklist could be kept in the memory of the
tracker-extract process, thus invalidating it after each mining run (I
assume permanently faulty extractor modules are not common). Thoughts
on this?
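
As a hypothetical sketch of what such an in-memory blacklist could look like (none of these names exist in Tracker, and the counters live only as long as the tracker-extract process):

#include <stdbool.h>

#define MAX_CONSECUTIVE_FAILURES 5

typedef struct {
        const char *name;
        int         consecutive_failures;
        bool      (*extract) (const char *path);
} ExtractorModule;

/* Modules are assumed to be ordered by priority, most specific first */
static bool
run_extractors (ExtractorModule *modules, int n_modules, const char *path)
{
        int i;

        for (i = 0; i < n_modules; i++) {
                ExtractorModule *m = &modules[i];

                /* Temporarily blacklisted: fall through to a more
                 * generic module further down the list */
                if (m->consecutive_failures >= MAX_CONSECUTIVE_FAILURES)
                        continue;

                if (m->extract (path)) {
                        m->consecutive_failures = 0;
                        return true;
                }

                m->consecutive_failures++;
        }

        return false;
}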

It really depends on the content.

The reason we do the extraction in a separate process is that the content and the libraries we use to do the extracting are so variable and can crash from time to time. The reasons vary, from dodgy new library versions to updates to file formats in the content we extract.

This is just my opinion from our experience. I would certainly go with patches that show improvement for indexing content generally speaking! :)

--
Regards,
Martyn

Founder and CEO of Lanedo GmbH.

