Re: Speeding up thumbnail generation (like multi threaded). Thoughts please.



On Sat, Aug 29, 2009 at 1:04 AM, Christian Hergert <chris dronelabs com> wrote:
>
>> On Fri, Aug 28, 2009 at 11:49 PM, Christian Hergert
>> <chris dronelabs com> wrote:
>>>
>>> Hi,
>>>
>>> What you mentioned is good information to start hunting.  Was the
>>> CPU time related to IO wait at all?  Always get accurate numbers
>>> before performance tuning.  "Measure, measure, measure" or so the
>>> mantra goes.
>>
>> Perhaps a stupid question, but what is a good way of profiling IO?
>> CPU is easy, but I've never done IO.
>> In this case my HDD is certainly able to do more than 10 thumbnails
>> per second; however, I could see a potential issue when someone with
>> a slower HDD and a faster CPU than mine is thumbnailing a lot of
>> images.  There the HDD will likely be the bottleneck.
>
> You can do something really crude by reading from /proc/pid/* (man
> proc for more info).  Or you could try using tools like sysstat,
> oprofile, SystemTap, etc.  We really need a generic profiling tool
> that can do all of this stuff from a single interface.  However, at
> the current time, I've been most successful with just writing one-off
> graphing for the specific problem.  For example, put in some
> g_print() lines, grep for those, and then graph them using your
> favorite plotter or cairo goodness.
>
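If I understand the g_print() idea correctly, it amounts to dropping
timestamped markers around each phase and plotting the grepped output,
roughly like this (an untested sketch; the "PROF" tag and the phase
names are just what I'd pick):

  #include <glib.h>

  /* Print a greppable marker with a timestamp for each phase.  The
     output can be filtered with "grep PROF" and fed to a plotter. */
  static void
  mark_phase (const char *phase, const char *file)
  {
    GTimeVal tv;

    g_get_current_time (&tv);
    g_print ("PROF %ld.%06ld %s %s\n", tv.tv_sec, tv.tv_usec, phase, file);
  }

  /* ...and around each step:
   *   mark_phase ("load-start", file);
   *   pixbuf = gdk_pixbuf_new_from_file (file, NULL);
   *   mark_phase ("load-done", file);
   * and the same for scaling and saving. */
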
>>> Unfortunately, the symptom you see regarding IO will very likely
>>> change under a different processing model.  If the problem is truly
>>> CPU bound, then you will only be starting IO requests after you are
>>> done processing.  This means valuable time is wasted while waiting
>>> for the pages to be loaded into the buffers.  The code will just be
>>> blocking while this is going on.
>>
>> And how can I test that?
>
> ltrace works for simple non-threaded applications.  Basically, you
> should see in the profiling timings that one work item happens
> sequentially after the previous, such as (load, process, load,
> process, ...).
>
> I would hate to provide conjecture about the proper design until we
> have more measurements.  It is a good idea to optimize the
> single-threaded approach before the multi-core approach, since that
> work would have to be done anyway and it is likely a less complex
> problem to tackle before the additional threads come in.
>
>>> What could be done easily is that every time an item starts
>>> processing, it could asynchronously begin loading the next image
>>> using gio.  This means the kernel can start paging that file into
>>> the VFS cache while you are processing the image.  This of course
>>> would still mean you are limited to a single processor doing the
>>> scaling.  But if the problem is in fact CPU bound, that next image
>>> will almost always be loaded by the time you finish the scale,
>>> meaning you've maximized the processing potential per core.
>>
>> That sounds like a nice way to optimize it for one core.  But could
>> there be any optimization possible in my case, since I have 100% CPU
>> usage on one core with just the benchmark?
>
> You can't properly optimize for the multi-core scenario until the
> single-core scenario is fixed.
>
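If I follow the gio idea, kicking off the next image's load just before
scaling the current one would look roughly like this (an untested
sketch; it assumes a GMainLoop is running so the callback can fire, and
the function names are mine):

  #include <gio/gio.h>

  /* We throw the loaded data away; the point is that the kernel has
     now paged the file into the cache while the CPU was scaling the
     previous image. */
  static void
  preload_done (GObject *source, GAsyncResult *res, gpointer user_data)
  {
    char *contents = NULL;
    gsize length = 0;

    if (g_file_load_contents_finish (G_FILE (source), res,
                                     &contents, &length, NULL, NULL))
      g_free (contents);
  }

  static void
  preload_next (const char *next_path)
  {
    GFile *file = g_file_new_for_path (next_path);

    g_file_load_contents_async (file, NULL, preload_done, NULL);
    g_object_unref (file);
  }

Is that roughly what you mean?
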
>>> To support multi-core, like it sounds like you want, a queue could
>>> be used to store the upcoming work items.  A worker per core, for
>>> example, can get its next file from that queue.  FWIW, I wrote a
>>> library, iris[1], built specifically for doing work like this while
>>> efficiently using threads with minimum lock contention.  It would
>>> allow for scaling up threads to the number of cores and back down
>>> when they are no longer needed.
>>>
>> That sounds very interesting.
>> Just one question about the queue.  Would it be better to thread the
>> application (nautilus) or the library (glib)?  If your answer is the
>> library, then the queue has to be passed from nautilus to glib.  I
>> would say glib, because then all applications benefit from it
>> without adjusting their code.
>
> I haven't looked at this code in detail yet, so I cannot confirm or
> deny.  My initial assumption would be that the thumbnailing API
> (again, I have no experience with it yet) should be restructured
> around an asynchronous design (begin/end methods) and the synchronous
> implementation built around that.  And of course, nobody should use
> the synchronous version unless they *really* have a reason to.
>
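So something shaped like the usual GIO _async/_finish pairs, I guess?
Along these lines (the names are obviously made up):

  void       thumbnailer_generate_async  (const char          *path,
                                          GAsyncReadyCallback  callback,
                                          gpointer             user_data);
  GdkPixbuf *thumbnailer_generate_finish (GAsyncResult        *res,
                                          GError             **error);

with the synchronous version implemented on top of those.
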
> FWIW, I would be willing to help hack on this, but I'm swamped for at least
> the next few weeks.
>
> -- Christian
>
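Before I get to the profiling: just to check that I understood the
queue-plus-worker-per-core idea, my mental model is something like
plain GThreadPool below (I realize iris is smarter about scaling the
thread count up and down):

  #include <glib.h>

  /* Runs in a pool thread; each call gets the next queued file. */
  static void
  worker (gpointer data, gpointer user_data)
  {
    char *path = data;

    /* load, scale and save one thumbnail here */
    g_free (path);
  }

  int
  main (int argc, char **argv)
  {
    GThreadPool *pool;
    int i;

    g_thread_init (NULL);  /* needed before using threads in current GLib */

    /* One worker per core; the pool's internal queue holds the
       upcoming work items. */
    pool = g_thread_pool_new (worker, NULL, 2 /* n_cores */, FALSE, NULL);

    for (i = 1; i < argc; i++)
      g_thread_pool_push (pool, g_strdup (argv[i]), NULL);

    g_thread_pool_free (pool, FALSE, TRUE);  /* wait for the queue to drain */
    return 0;
  }
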

Okay, I did some ltracing now (ltrace -ttt -S, which is where the
timestamps and the SYS_* lines below come from).  Here is the ltrace
output for 5 images that get thumbnailed.  I hope someone here can help
me a bit in explaining this output, since it's not all obvious to me.
Before you read the output: I put a 2-second sleep in after an image is
done and the pixbufs have been removed from memory.  That makes it a
lot easier to identify where the image loading/scaling/saving is
happening.
The output is here: http://codepad.org/ydtCcJ5d
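
For reference, the per-image part of the benchmark boils down to this
(simplified; the path handling and error checking are left out):

  #include <unistd.h>
  #include <gdk-pixbuf/gdk-pixbuf.h>

  static void
  thumbnail_one (const char *src, const char *dest)
  {
    GdkPixbuf *pixbuf, *thumb;
    int width, height;

    pixbuf = gdk_pixbuf_new_from_file (src, NULL);
    width  = gdk_pixbuf_get_width (pixbuf);
    height = gdk_pixbuf_get_height (pixbuf);

    /* e.g. 1600x1200 -> 200x150; the "2" in the trace below is
       GDK_INTERP_BILINEAR */
    thumb = gdk_pixbuf_scale_simple (pixbuf, 200, 200 * height / width,
                                     GDK_INTERP_BILINEAR);

    gdk_pixbuf_save (thumb, dest, "png", NULL, NULL);

    g_object_unref (thumb);
    g_object_unref (pixbuf);

    sleep (2);  /* the artificial gap between images */
  }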

Now for the splitting up that I did to sort out IO and CPU.
This is all in the 1251553123 timeframe, so it's just one image.

1251553123.625556
_ZStlsIcSt11char_traitsIcESaIcEERSt13basic_ostreamIT_T0_ES7_RKSbIS4_S5_T1_E(0x613de0,
0x701ac0, 16, 15416, 0x7ffffa4b2a90) = 0x613de0
1251553123.625635 _ZNSsC1Ev(0x7ffffa4b2f90, 0x6ed100, 8, 8,
0x5f7974696e616d75) = 0x7fcaa64de158
1251553123.625683
_ZNSt18basic_stringstreamIcSt11char_traitsIcESaIcEE3strERKSs(0x7ffffa4b2de0,
0x7ffffa4b2f90, 0x7ffffa4b2f90, 8, 0x5f7974696e616d75) = 0x781578
1251553123.625733 _ZNSsD1Ev(0x7ffffa4b2f90, 0x781578, 0x781578, 0,
0x781778) = 0x7fcaa64de158
1251553123.625781
_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc(0x7ffffa4b2df0,
0x40c950, 16, 0, 0x781778) = 0x7ffffa4b2df0
1251553123.625830
_ZStlsIcSt11char_traitsIcESaIcEERSt13basic_ostreamIT_T0_ES7_RKSbIS4_S5_T1_E(0x7ffffa4b2df0,
0x701ac0, 88, 88, 0x2f6e6f6974616d69) = 0x7ffffa4b2df0
1251553123.625878
_ZNKSt18basic_stringstreamIcSt11char_traitsIcESaIcEE3strEv(0x7ffffa4b2fa0,
0x7ffffa4b2de0, 0x7ffffa4b2de0, 88, 0x5f7974696e616d75) =
0x7ffffa4b2fa0
1251553123.625929 _ZNSsaSERKSs(0x7ffffa4b2f60, 0x7ffffa4b2fa0,
0x7ffffa4b2fa0, 0x7fcaa64de140, 0x69525f7974696e61) = 0x7ffffa4b2f60
1251553123.625977 _ZNSsD1Ev(0x7ffffa4b2fa0, 0, 0x786c50, 0x5cf210,
0x69525f7974696e61) = 0x782018

I have no clue what is going on in those _Z* functions... what are those?

1251553123.626120 gdk_pixbuf_new_from_file(0x782018, 0x7ffffa4b2f88,
0x7ffffa4b2f88, 0x4a992f63, 0x69525f7974696e61 <unfinished ...>
1251553123.626161
SYS_open("/home/mark/ThumbnailingBenchmarks/2000_Wallpapers-Reanimation/Sea_of_Humanity_Rio_de_Janeiro_Brazil."...,
0, 0666) = 4
1251553123.626216 SYS_fstat(4, 0x7ffffa4b1b20)                             = 0
1251553123.626238 SYS_mmap(0, 4096, 3, 34, 0xffffffff)
    = 0x7fcaa928a000
1251553123.626273 SYS_read(4, "\377\330\377\340", 4096)
    = 4096
1251553123.626355 SYS_lseek(4, 0, 0)                                       = 0
1251553123.626379 SYS_rt_sigprocmask(0, 0, 0x7ffffa4a1af0, 8, 0)           = 0
1251553123.626417 SYS_read(4, "\377\330\377\340", 65536)
    = 65536
1251553123.636544 SYS_read(4,
"Ob\316Z\330\271\003\030\007\004TJ\t-\312N\347f>i\001\344\236\347\332\271/\023\370b\373Z\324\214\366\332\2040\256\320\241Y7c\025\240\276"\201\320\223\003\257\004",
65536) = 65536
1251553123.643681 SYS_read(4,
"\221\301\256\332)u9gwt\212_\331r\\[[\264eS1\202\304\234\222ME\007\207\334\027\3350\371rO<Tz\307\332-\322\3365\221\227l8
\022\006k*\326[\247.\032W#\034\214\340\032\350\216\333\034\365%\016v\254n\256\213\003\002~\320\243\007\200Mmj7\247G\360\235\265\265\276\331U\345b\\"...,
65536) = 65536
1251553123.650134 SYS_read(4,
"\3749w\241\276\254/\2368\334\022eq\202\207\270\375j\343Bo\2315\251\234\261t\242\242\357\243\330\304\360\307\207\216\257i`\333\205\241\215\216\340\374\345Go\326\275 \337\341\235\265\314\022\006\324\345X\336M\376 \031A\\\356\203\033\266\224#\323\360#\212\365\241R\314\001\233\216\325\351\220=\325\207\207#i%\020\334L"...,
65536) = 65536
1251553123.656189 SYS_read(4,
"\357\255\254\214w\321\233\231Y\367o\017\234\n\332p \004>7s\323\004R\205<",
65536) = 65536
1251553123.662128 SYS_read(4,
"\272\206\250\222i\236'm\303\316{\333wU'\223\217\363\372S\224\264\320\342\204/d\374\277\366\323\242\370e\243\330H\251w\250Zl\235'\016\222\311\300UA\226\030\374\253WA:w\210<W\255\335\315\022\\K\347f\022\343;Tq\376\025\3473k\232\234\227q\013\206o(n\350pM%\216\275\253\370f\376\333R"...,
65536) = 65536
1251553123.668090 SYS_read(4, "\016p", 65536)
    = 65536
1251553123.673774 SYS_read(4, "s\00426A\367\377", 65536)
    = 23360
1251553123.673819 SYS_read(4, "", 40960)                                   = 0
1251553123.676090 SYS_close(4)                                             = 0
1251553123.676119 SYS_munmap(0x7fcaa928a000, 4096)                         = 0
1251553123.676199 <... gdk_pixbuf_new_from_file resumed> )
    = 0x7818f0

There is an obvious file read going on here, but what I don't
understand is why there are multiple SYS_read calls.  I thought it
would just read the contents of one file in a single read call.

1251553123.676269 printf("\n - %f", ... <unfinished ...>
1251553123.676301 SYS_write(1,
"Sea_of_Humanity_Rio_de_Janeiro_Brazil.jpg\n", 42) = 42
1251553123.676364 <... printf resumed> )

So printf is writing to a file as well?  What's happening here?
Then right after 1251553123.676364 there are 9 more _Z* functions...
what do they do?

1251553123.676818 gdk_pixbuf_get_width(0x7818f0, 200, 2, 200,
0x7974696e616d7548) = 1600
1251553123.676866 gdk_pixbuf_get_height(0x7818f0, 200, 0x75a310, 200,
0x7974696e616d7548) = 1200
1251553123.676915 gdk_pixbuf_scale_simple(0x7818f0, 200, 150, 2,
0x7974696e616d7548) = 0x781940
1251553123.682973

The actual downscaling of the image to a thumbnail happens here;
obviously CPU work, and quite fast.

1251553123.683186 gdk_pixbuf_save(0x781940, 0x782d18, 0x40c9f7,
0x7ffffa4b2f88, 0 <unfinished ...>
....
a range of memcpy calls
1251553123.688447 memcpy(0x0079cc02,
"D\274Y\257\255\313u\0356\306\234U\337\267\326n\3169\373\364\315mx\311\313{\331\211b\003Q\246%'\024lZ\nd\305\266\002\004Nl
H\203", 8190) = 0x0079cc02
1251553123.688507 SYS_write(4, "\211PNG\r\n\032\n", 4096)
    = 4096
1251553123.688547 SYS_write(4,
"\234\335d\304\332D\262)\213`i\222\212a\273", 4096) = 4096
....
and that repeats roughly 4 more times.  I guess a total of ~100 memcpy calls.

I have a few questions here:
- Why are there so many memcpy calls?  Could it be that there is one
memcpy per horizontal pixel line (since that would be roughly equal to
the number of memcpy calls)?
- Why is it doing memcpy (~20 times), then writing to the file, then
memcpy again, then writing again until the image is done?  Is that
logical?  I might be missing the point, but I don't see the logic in
that.  It seems to write chunks of data instead of writing everything
in one go.

1251553123.706736 SYS_write(4,
"\245\335\375*\300P\024\2067wQI>\250.\036t;"0,\014\006G\243\370\231\363\307'O>\363\351\373\357;=\256\\\255\356\bDP\004\203Q\230\240\210X"j\001\0042,\233\303\006\001S\024\005W<\256\313r\334\311d\377\367\022m\243\235\241\310\251\013_\335*T\364\203\365\007\237/\375\366Z&\031\333z"...,
3091) = 3091
1251553123.706790 SYS_close(4)                                             = 0
1251553123.707040 SYS_munmap(0x7fcaa928a000, 4096)                         = 0
1251553123.707090 <... gdk_pixbuf_save resumed> )                          = 1

The image is saved to the drive here.
After that come some more _Z* functions and some GDK unref calls for
the memory that was reserved; nothing strange there.

So, if I'm correct (please say so if I'm not), every single SYS_read
is an IO call, and the same goes for write and close.  But how do I
get the numbers from this that you wanted, Christian?  It does show
the read -- CPU -- write pattern (read as in reading the original
image, CPU as in downscaling it, write as in saving it).

Sorry for asking so many questions about this, but it's my first time
with this level of profiling, and I really would like to learn this
kind of stuff.

Thanx for your patience,
Mark.

