Re: [Tracker] PATCH: Faster PNG extractor



On 28 June 2013 18:37, Martyn Russell <martyn lanedo com> wrote:
On 28/06/13 07:45, Philip Van Hoof wrote:

On 28/06/2013 8:30, Jonatan Pålsson wrote:

Hi Jonatan,


Hello Jonatan, Philip,

Hey Martyn!



The main reason for not modifying the original extractor is that I
want to keep it as a fallback if this new extractor fails due to an
unexpected file structure. Since png-faster tries to skip to the end
of the file by estimating the location of the metadata stored there,
using the file size and the IDAT chunk size, I expect it to fail more
often than the original. However, since tracker-extract handles these
failures gracefully, this is not a problem.
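
To illustrate the idea, here is a rough sketch (not the actual patch
code; the helper names are made up). It walks chunks until the first
IDAT, then assumes every IDAT has the same size and that the trailing
metadata is shorter than one IDAT chunk:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* PNG layout: 8-byte signature, then chunks of 4-byte big-endian
 * length + 4-byte type + data + 4-byte CRC. */
static uint32_t
read_be32 (const uint8_t *p)
{
        return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
               ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Returns the guessed offset of the trailing metadata chunks, or -1
 * when the guess does not check out and the caller should fall back
 * to the ordinary extractor. */
static long
guess_trailer_offset (FILE *f, long file_size)
{
        uint8_t hdr[8];
        long pos = 8; /* skip the PNG signature */

        while (fseek (f, pos, SEEK_SET) == 0 &&
               fread (hdr, 1, 8, f) == 8) {
                long chunk = 12 + (long) read_be32 (hdr);

                if (memcmp (hdr + 4, "IDAT", 4) == 0) {
                        /* n same-sized IDATs plus a trailer smaller
                         * than one chunk: this lands just after the
                         * last IDAT. */
                        long guess = pos + ((file_size - pos) / chunk) * chunk;

                        if (fseek (f, guess, SEEK_SET) != 0 ||
                            fread (hdr, 1, 8, f) != 8 ||
                            memcmp (hdr + 4, "IDAT", 4) == 0)
                                return -1; /* variable-sized IDATs, bail out */
                        /* A real implementation would also validate the
                         * chunk type (and maybe CRC) found at 'guess'. */
                        return guess;
                }
                pos += chunk;
        }
        return -1;
}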

The best way I can see to get similar functionality into the
existing extractor would be to modify libpng to allow skipping to the
end of the file (right now there is a comment in the existing png
extractor noting that this functionality is missing from the library),
but since reading the PNG format is relatively simple, I opted to put
this functionality in the extractor rather than first patching libpng
(I am not sure how much work that would be, either).

What are your thoughts on keeping png-faster as a separate, optional
extractor module which can be enabled when extraction speed is of
primary concern?


After reading this, my opinion is that we should keep it as separate
modules. What do you think, Martyn?


I think that really depends on the diff. It's harder to see what the real
changes are with an entirely new module.

Yes, I agree. It is definitely more difficult to see what has changed
now. I did not really base the new extractor on the code from the old
one, so getting a useful diff is a bit difficult. It would, however,
be possible to merge the two extractors (and thus get a proper diff),
but this would take some time.


Ideally, I would like to share functionality where possible. That's the
first thing.

If it's a case of switching between fast() and slow() functions
depending on IDAT chunks, then over thousands of files it's likely to
be quicker to do that quick test first and then call the slow function
in the one extractor, instead of switching out to the next module,
where we have to do a bunch of new seeking from scratch. It's hard to
say without looking at the real changes, and it should be tested too.
I've not looked in detail myself yet, but I plan to :)

I definitely see your point here. Failing one extractor and switching
over to a different one several thousand times is a huge waste of
time. This would happen when png-faster fails to skip to the end of
the file, most likely due to the IDAT chunks being of variable size.
I'd like to point out, however, that this should be an uncommon
scenario (I have never seen such a file). If it turns out to be much
more common than I anticipate, the usefulness of png-faster can be
debated :) The worst case for png-faster that I can think of is if all
the PNG files scanned by Tracker are produced by the same
software/camera, and these PNGs have variable-sized IDATs. This would
be bad.
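
To make the comparison concrete, the one-extractor variant you
describe would be roughly this (a sketch; the function names and the
PngMetadata type are invented):

/* Try the cheap IDAT-size shortcut first; on failure, fall back to
 * the full libpng parse in-process instead of failing over to a
 * separate module. */
static int
extract_png (const char *path, PngMetadata *md)
{
        if (extract_png_fast (path, md))
                return 1;
        return extract_png_slow (path, md);
}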

I'm obviously partial here, due to the approach taken in png-faster,
but I like the idea of separating different extraction strategies into
different extractor modules. This means they can easily be disabled,
prioritized, and so on. A different approach (which merging the two
extractors would imply) would be to use #ifdefs within the extractor
module; this lets us select extractors at compile time, but only at
compile time.
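
Something like this, where ENABLE_PNG_FASTER would be a made-up
configure switch (illustration only):

#ifdef ENABLE_PNG_FASTER
        if (!extract_png_fast (path, md)) /* fixed at build time */
#endif
                extract_png_slow (path, md);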

On a slightly different note: right now, some extractors can fall
back to a more generic extractor, for example GStreamer, which is
exactly what I am going for in png-faster as well. The argument you
make about the "faster" extractor failing is just as valid for those
extractors, and I wonder: wouldn't it be nice to blacklist certain
extractors dynamically if they turn out to be error-prone? Say
png-faster or the mp3 extractor has failed five times in a row (or
several times within a short period), and a more generic extractor is
available; tracker-extract could then automatically skip the
specialized extractor and choose the lower-priority one instead. With
this functionality in place, the original concern about png-faster
failing very many times should be mitigated, while possibly also
giving an overall performance boost to the other modules that have
more generic extractors available. The blacklist could be kept in the
memory of the tracker-extract process, thus invalidating it after each
mining run (I assume permanently faulty extractor modules are not
common). Thoughts on this?
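
Roughly what I have in mind, as a sketch (the threshold and all the
names are invented): tracker-extract would consult this before
dispatching, and fall through to the lower-priority module when a
specialized one is blacklisted.

#include <glib.h>

#define MAX_CONSECUTIVE_FAILURES 5

static GHashTable *failure_counts = NULL; /* module name -> guint */

static gboolean
extractor_is_blacklisted (const gchar *module)
{
        guint n;

        if (failure_counts == NULL)
                return FALSE;
        n = GPOINTER_TO_UINT (g_hash_table_lookup (failure_counts, module));
        return n >= MAX_CONSECUTIVE_FAILURES;
}

static void
extractor_report_result (const gchar *module, gboolean success)
{
        guint n;

        if (failure_counts == NULL)
                failure_counts = g_hash_table_new_full (g_str_hash,
                                                        g_str_equal,
                                                        g_free, NULL);
        if (success) {
                /* A success resets the consecutive-failure count. */
                g_hash_table_remove (failure_counts, module);
        } else {
                n = GPOINTER_TO_UINT (g_hash_table_lookup (failure_counts,
                                                           module));
                g_hash_table_insert (failure_counts, g_strdup (module),
                                     GUINT_TO_POINTER (n + 1));
        }
}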

--
Regards,
Jonatan Pålsson

Pelagicore AB
Ekelundsgatan 4, 6th floor, SE-411 18 Gothenburg, Sweden

