Re: [Tracker] PATCH: Faster PNG extractor

On 04/07/13 08:10, Jonatan Pålsson wrote:
On 28 June 2013 18:37, Martyn Russell <martyn lanedo com> wrote:
On 28/06/13 07:45, Philip Van Hoof wrote:

On 28/06/2013 8:30, Jonatan Pålsson wrote:

Hi Jonatan,


Hello Jonatan, Philip,

Hey Martyn!

Hello,

I definitely see your point here. Having one extractor fail and
switching over to a different one several thousand times is a huge
waste of time. This would happen when png-faster fails to skip to the
end of the file, most likely due to the IDAT chunks being of variable
size. I'd like to point out, however, that this should be an uncommon
scenario (based on the fact that I have never seen such a file). If it
turns out to be much more common than I anticipate, the usefulness of
png-faster can be debated :) The worst case for png-faster that I can
think of is if the same software/camera produces all the PNG files
scanned by Tracker, and these PNGs have variable-sized IDATs. That
would be bad.
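
For illustration, here is a rough, hypothetical sketch of the general chunk-skipping idea in C. It is not the actual png-faster code, and it does not model the variable-IDAT-size failure case described above; it only shows why the IDAT payload itself never needs to be read, because every PNG chunk declares its own length.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint32_t
read_be32 (FILE *f)
{
        unsigned char b[4];

        if (fread (b, 1, 4, f) != 4)
                return 0;

        return ((uint32_t) b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
}

static void
walk_chunks (FILE *f)
{
        /* Skip the 8 byte PNG signature */
        fseek (f, 8, SEEK_SET);

        for (;;) {
                uint32_t length = read_be32 (f);
                char type[5] = { 0 };

                if (fread (type, 1, 4, f) != 4)
                        break;

                if (strcmp (type, "IEND") == 0)
                        break;

                /* The chunk data is never read; the declared length plus
                 * the 4 byte CRC is simply seeked over. Metadata chunks
                 * such as tEXt/iTXt/tIME would be parsed here instead. */
                fseek (f, (long) length + 4, SEEK_CUR);
        }
}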

I agree. But if that's as unlikely as you say, it's better to be fast for the common case and bug fix or patch the unlikely scenarios on a case-by-case basis. Were this to become more likely (i.e. needing the slow extractor), we could make it a configure switch, sure...

I'm obviously partial here, due to the approach taken in png-faster,
but I like the idea of separating different extraction strategies into
different extractor modules. This means they can easily be disabled,
prioritized, etc. A different approach (which would be taken if the
two extractors are merged) would be to use #ifdefs within the
extractor module; this means we can select extractors at compile time,
but only at compile time.
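
As a hypothetical sketch of what that merged, #ifdef-based variant could look like (the macro and helper names below are made up for illustration):

#include <stdbool.h>

bool extract_png_fast (const char *path); /* assumed helper */
bool extract_png_full (const char *path); /* assumed helper */

static bool
extract_png (const char *path)
{
#ifdef HAVE_PNG_FASTER
        /* Chosen at configure/compile time; cannot change at runtime */
        return extract_png_fast (path);
#else
        return extract_png_full (path);
#endif
}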

I would prefer to avoid #ifdefs where possible. We have a lot of those already and they add to the maintenance burden.

Also, if the faster approach is the common case, it makes more sense to go with that and fall back. If vendors find all their PNGs fall into the slower case, it would be easy to patch the extractor to never check the IDAT the way the faster one does, to save some small amount of time.

Bottom line, I don't expect this to be something people need to configure in 99% of the use cases out there.
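As a hypothetical sketch of that fall-back shape (reusing the made-up helper names from the sketch above, not any real Tracker API):

#include <stdbool.h>

bool extract_png_fast (const char *path); /* assumed helper */
bool extract_png_full (const char *path); /* assumed helper */

static bool
extract_png (const char *path)
{
        /* Common case: the fast chunk-skipping path succeeds */
        if (extract_png_fast (path))
                return true;

        /* Rare case (e.g. an unexpected IDAT layout): fall back to
         * the slower, exhaustive pass */
        return extract_png_full (path);
}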

On a slightly different note: right now, some extractors can fall back
to a more generic extractor, for example GStreamer, which is exactly
what I am going for in png-faster as well. The argument you make
concerning when the "faster" extractor fails is very valid for these
extractors as well, and I wonder... wouldn't it be nice to blacklist
certain extractors dynamically if they are prone to errors? Say
png-faster or the mp3 extractor has failed five times in a row (or
several times within a short period of time), and there is a more
generic extractor available; the specialized extractor could then

That sounds good in principle, but in reality what happens is, people have very different content. All it takes is 5 large PDFs and you're now not indexing PDFs at all, because the 5 before this one took too long. Or perhaps you index a directory with a bunch of PNGs all written by an application which does it incorrectly; now all other PNGs are discriminated against.

I would certainly accept patches on this, but it shouldn't be the default, because it's hard to correctly guess a heuristic for what content is acceptable to a user for indexing, and there are always false positives.

automatically be skipped by tracker-extract, and the extractor with a
lower priority could be chosen. With this functionality in place, the
original concern that png-faster might fail very many times should be
mitigated, while also possibly contributing to an overall performance
boost for the other modules which have more generic extractors
available. The blacklist could be kept in the memory of the
tracker-extract process, thus invalidating it after each mining run (I
assume permanently faulty extractor modules are not common). Thoughts
on this?
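
As a hypothetical sketch of what such an in-memory blacklist could look like (none of these names exist in Tracker, and the counters live only as long as the tracker-extract process):

#include <stdbool.h>

#define MAX_CONSECUTIVE_FAILURES 5

typedef struct {
        const char *name;
        int         consecutive_failures;
        bool      (*extract) (const char *path);
} ExtractorModule;

/* Modules are assumed to be ordered by priority, most specific first */
static bool
run_extractors (ExtractorModule *modules, int n_modules, const char *path)
{
        int i;

        for (i = 0; i < n_modules; i++) {
                ExtractorModule *m = &modules[i];

                /* Temporarily blacklisted: fall through to a more
                 * generic module further down the list */
                if (m->consecutive_failures >= MAX_CONSECUTIVE_FAILURES)
                        continue;

                if (m->extract (path)) {
                        m->consecutive_failures = 0;
                        return true;
                }

                m->consecutive_failures++;
        }

        return false;
}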

It really depends on the content.

The reason we do the extraction in a separate process is that the content and the libraries we use to do the extracting are so variable and can crash from time to time. The reasons vary, from dodgy new library versions to updates to file formats in the content we extract.

This is just my opinion from our experience. I would certainly go with patches that show improvement for indexing content generally speaking! :)

--
Regards,
Martyn

Founder and CEO of Lanedo GmbH.

