On Tue, 10 Oct 2006 16:12:08 -0700, Carl Worth wrote:
> As before, I'll just attach them here, and follow up to add a bit of
> analysis.

First a quick scan of things that jump out from the results with the
image backend:

[ # ]  backend-content  test-size                      mean ms  std dev.  iterations
[ 32]  image-rgba       paint_linear_rgb_over-128       22.498     0.18%         100
[ 33]  image-rgba       paint_linear_rgb_source-128     18.625     0.45%         100
[ 34]  image-rgba       paint_linear_rgba_over-128      22.498     0.16%         100
[ 35]  image-rgba       paint_linear_rgba_source-128    18.620     0.46%         100
[ 36]  image-rgba       paint_radial_rgb_over-128      355.267     0.75%         100
[ 37]  image-rgba       paint_radial_rgb_source-128    351.519     0.75%         100
[ 38]  image-rgba       paint_radial_rgba_over-128     356.131     0.61%         100
[ 39]  image-rgba       paint_radial_rgba_source-128   352.099     0.78%         100

Here we see that radial gradients are about 17 times slower than linear
gradients on this device. This is a big difference compared to the
results on my x86 laptop, where radial gradients are only 2 times slower
than linear gradients. So this is definitely a problem spot, and I'm
looking forward to watching how David Turner's gradient improvements
help here.

Next, I want to analyze the performance of the fundamental paint
operation for both the image and the xlib backends. I'll use only the
512x512 pixel case since it should have the best numbers (and the
smaller cases seem to show similar trends):

[ 60]  image-rgba       paint_solid_rgb_over-512         8.626     0.17%         100
[ 61]  image-rgba       paint_solid_rgb_source-512       8.625     0.18%         100
[ 62]  image-rgba       paint_solid_rgba_over-512       65.942     0.53%         100
[ 63]  image-rgba       paint_solid_rgba_source-512      8.634     0.17%         100
[ 64]  image-rgba       paint_image_rgb_over-512        17.172     0.18%         100
[ 65]  image-rgba       paint_image_rgb_source-512      17.162     0.19%         100
[ 66]  image-rgba       paint_image_rgba_over-512       77.566     0.47%         100
[ 67]  image-rgba       paint_image_rgba_source-512      9.414     0.23%         100

OK. There is some interesting data to be seen above. First, let's
assume that the 8-9 ms time represents a well-optimized blit speed.
So, in the case of a solid color, we're getting that good speed in the
3 cases that are blits (rgb_over, rgb_source, and rgba_source). And when
the source pattern is an image instead of a solid color we also get a
good speed for the blit case (rgba_source).

Two other cases (image_rgb_over and image_rgb_source) are slightly
harder than blits since we have to expand the data to include a constant
alpha channel that does not exist in the source surface. These cases are
2x slower than a blit. Is that expected? Or could we easily do better
than that?

Meanwhile, the slowest cases above are the two where we are actually
doing something "hard" (having to blend a source over a destination
where both have alpha). These are the solid_rgba_over and
image_rgba_over cases. Currently they are roughly 8x slower than a blit.
Is that just what it costs to do the multiplication of the blend? Maybe.

Let's next look at how the same cases change when there is no alpha
channel in the destination. I'll show only the rows that have
significantly different numbers than above:

[ 64]  image-rgb        paint_image_rgb_over-512         9.623     0.23%         100
[ 65]  image-rgb        paint_image_rgb_source-512       9.611     0.22%         100
[ 67]  image-rgb        paint_image_rgba_source-512     12.140     0.42%         100

The rgb_over and rgb_source cases have now changed from "copy and
augment with constant alpha" to "simple blit" and the numbers reflect
that. That's good.

The rgba_source case is a bit funny. We've got an alpha channel in the
source surface, but not in the destination. I'm not quite sure what the
semantics of SOURCE are in that case. Is it still a simple blit (just
not caring what we put in the unused bits of the destination)? If so,
why is it 25% slower? If not, what extra work is it doing? It's
obviously not costing as much as the complementary case where a SOURCE
from rgb->rgba is 2x slower than a blit. So comparing the rgb->rgba and
rgba->rgb implementations might be useful.

OK, that's the image backend.
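
The "Nx slower than a blit" figures for the image backend can be
reproduced directly from the mean times in the image-rgba 512x512 rows
above. The following is only a small sketch: the names `image_rgba_ms`
and `slowdown` are made up for illustration, and using solid_rgb_source
as the blit baseline is an assumption (the post only says that 8-9 ms
looks like a well-optimized blit).

```python
# Mean times in ms, copied from the image-rgba 512x512 rows of the report.
image_rgba_ms = {
    "solid_rgb_over": 8.626,
    "solid_rgb_source": 8.625,
    "solid_rgba_over": 65.942,
    "solid_rgba_source": 8.634,
    "image_rgb_over": 17.172,
    "image_rgb_source": 17.162,
    "image_rgba_over": 77.566,
    "image_rgba_source": 9.414,
}


def slowdown(times, baseline_key):
    """Express every test's time as a multiple of the baseline test's time."""
    base = times[baseline_key]
    return {name: ms / base for name, ms in times.items()}


# Assumption: solid_rgb_source stands in for "a well-optimized blit".
ratios = slowdown(image_rgba_ms, "solid_rgb_source")

# Print the tests from fastest to slowest relative to the baseline.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:20s} {ratio:5.1f}x")
```

The same helper could be pointed at the xlib rows by swapping in a
different dictionary of times.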
Now let's look at these same cases with the xlib backend:

[ 60]  xlib-rgba        paint_solid_rgb_over-512         9.672     0.47%         100
[ 61]  xlib-rgba        paint_solid_rgb_source-512       9.663     0.47%         100
[ 62]  xlib-rgba        paint_solid_rgba_over-512      436.860     0.45%         100
[ 63]  xlib-rgba        paint_solid_rgba_source-512      9.627     0.46%         100
[ 64]  xlib-rgba        paint_image_rgb_over-512       200.226     0.55%         100
[ 65]  xlib-rgba        paint_image_rgb_source-512     179.953     0.56%         100
[ 66]  xlib-rgba        paint_image_rgba_over-512      142.724     0.56%         100
[ 67]  xlib-rgba        paint_image_rgba_source-512     62.047     0.47%         100

Here we can see some of the same patterns as in the image case. Solid
color blits are all acting well. But here, image_rgba_source, which
was a fast blit above, now takes about 6x as long as a blit. So there's
a definite performance bug there.

Also, the image_rgb_over and image_rgb_source cases, which are "blit and
set alpha channel to constant", are 18-20 times slower than blits
(compared to 2x slower with the image backend), so that looks like
another performance bug.

Finally, the solid_rgba_over case is horrible. With the image backend it
was 8x slower than a blit; here it is 45x slower! Remarkably, this
solid-color case is 3x slower than the image_rgba_over case. Something's
really broken when it takes cairo 3 times longer to render a solid color
than a complete image. Meanwhile that image blending itself is almost
15x slower than a blit (compared to the image backend, where it was only
8x slower).

In addition, with the xlib backend, we can look at what happens when the
source pattern is an xlib surface rather than an image surface:

[ 68]  xlib-rgba        paint_similar_rgb_over-512     148.130     0.56%         100
[ 69]  xlib-rgba        paint_similar_rgb_source-512   127.762     0.29%         100
[ 70]  xlib-rgba        paint_similar_rgba_over-512     91.166     0.40%         100
[ 71]  xlib-rgba        paint_similar_rgba_source-512   10.681     0.44%         100

There's one very encouraging point here, namely that the rgba_source
time is down close to what we expect for a blit (just about 10% slower
than solid_rgba_source).
So it looks like we've got at least one thing right in the xlib backend!

The other cases here are also faster than the corresponding
image-surface source pattern cases with the xlib backend, but not as
significantly. The rgb_over and rgb_source "blit and set alpha channel
to constant" cases are 13-15x slower than a blit (compared to 18-20x
with image surface sources with the xlib backend), but still not
comparing favorably with the image backend, where these cases are only
2x slower than a blit.

Finally, the "hard" case of actually blending one surface over another
(rgba_over) is here about 9x slower than a blit (rgba_source). That does
compare quite favorably with the 8x we saw in the image backend.

So there are definitely some performance bugs in the xlib backend. It
will probably take a combination of fixes in both cairo and the X server
to address all of these. Some of the cairo fixes should be really easy
(things like replacing OVER with SOURCE if the source pattern has no
alpha channel). Almost any clearly identified performance bug in the
above can be fixed by appropriately calling code that already exists, so
that's encouraging.
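
The OVER-to-SOURCE replacement works because, with premultiplied alpha,
OVER computes src + dst * (1 - src_alpha), which collapses to plain src
when the source is fully opaque. Below is a minimal sketch of that
reduction, written in Python rather than cairo's C. The names
`Operator`, `reduce_operator`, and `over` are hypothetical and are not
cairo's actual API, and the sketch only covers the unmasked paint case
measured in this report.

```python
from enum import Enum


class Operator(Enum):
    # Hypothetical stand-ins for the two compositing operators discussed.
    OVER = "over"
    SOURCE = "source"


def reduce_operator(op, source_has_alpha):
    """Pick a cheaper equivalent operator when the source has no alpha."""
    if op is Operator.OVER and not source_has_alpha:
        # An opaque source fully replaces the destination, so this is a blit.
        return Operator.SOURCE
    return op


def over(src, dst):
    """Porter-Duff OVER on premultiplied (r, g, b, a) tuples in [0, 1]."""
    inverse_alpha = 1.0 - src[3]
    return tuple(s + d * inverse_alpha for s, d in zip(src, dst))


# With an opaque source, OVER yields exactly the source pixel.
opaque_src = (0.2, 0.4, 0.6, 1.0)
dst = (0.5, 0.25, 0.0, 0.5)
print(over(opaque_src, dst) == opaque_src)
print(reduce_operator(Operator.OVER, source_has_alpha=False))
```

A backend applying this kind of check before dispatch could send the
rgb_over cases down the same path as rgb_source, though whether that
alone closes the gap seen above is not something this sketch can show.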
Finally, let's look at what happens when the xlib destination surface
does not have an alpha channel:

[ 60]  xlib-rgb         paint_solid_rgb_over-512         5.208     0.86%         100
[ 61]  xlib-rgb         paint_solid_rgb_source-512       5.204     0.89%         100
[ 62]  xlib-rgb         paint_solid_rgba_over-512      537.259     0.53%         100
[ 63]  xlib-rgb         paint_solid_rgba_source-512      5.122     0.77%         100
[ 64]  xlib-rgb         paint_image_rgb_over-512       218.250     0.14%         100
[ 65]  xlib-rgb         paint_image_rgb_source-512     199.015     0.14%         100
[ 66]  xlib-rgb         paint_image_rgba_over-512      176.539     0.13%         100
[ 67]  xlib-rgb         paint_image_rgba_source-512    197.005     0.12%         100
[ 68]  xlib-rgb         paint_similar_rgb_over-512       5.917     0.74%         100
[ 69]  xlib-rgb         paint_similar_rgb_source-512     5.918     0.72%         100
[ 70]  xlib-rgb         paint_similar_rgba_over-512    125.132     0.10%         100
[ 71]  xlib-rgb         paint_similar_rgba_source-512  145.492     0.10%         100

Interestingly, all of the blit speeds got nearly twice as fast. Perhaps
someone more familiar with the details of this X server could easily
explain why that is. The remainder of the tests seemed to follow a
pattern similar to the image backend.

Wow, that was a lot of prose and a lot of numbers. I don't know if
anyone is really going to absorb all that. It probably would have been
better for me to rewrite this by grouping the operations that should
have similar performance characteristics. That would have made the
problematic cases stand out much better. But it's late, and I'd rather
just send this out now that I've typed it all up.

Looking forward to lots of good improvements...

-Carl