Bug 69515

Summary: [Qt] Parallel imagedecoders
Product: WebKit Reporter: Zoltan Horvath <zoltan>
Component: ImagesAssignee: Nobody <webkit-unassigned>
Status: RESOLVED INVALID    
Severity: Normal CC: hausmann, kbalazs, kevin.simons, kling, loki, skyul
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: All   
OS: Linux   
Bug Depends on: 71555    
Bug Blocks:    
Attachments:
Description Flags
approach #1 none

Description Zoltan Horvath 2011-10-06 04:09:18 PDT
Created attachment 109942 [details]
approach #1

Hey!

I made some investigations on how we can do imagedecoding on dedicated thread. I tried to use WebCore imagedecoders instead of current QImageDecoder implementation, because apparently it is the only way to a generic solution. 

I upload the patch for discussion and not for review, so please don't comment style issues and other violations. :)

About my current implementation:

* 1st I had to change to WebCore's imagedecoders, it was simple. QtWebKit is working well with WebCore's imagedecoders since both approach are using the same system libraries on my system.
* 2nd I had to change QPixmaps to QImages since unfortunately, QPixmap doesn't tolerate manipulating images from thread. (Unfortunately, this brings some performance loss, see below.)
* 3rd I added a dedicated worker thread with a queue for image decoding.
 * We're doing the imagedecoding in the mainthread until we have determined the size (w,h) of the image then we choose to work either in the mainthread or in the workerthread.
 * I tried to avoid copying the raw image data to the thread, but in this case I had to use locks (I didn't remove ReadWrite lock implementation from this patch, but it isn't used currently.) and it didn't show satisfying performance results. (e.g. mainthread had to wait for imgdecoding completion at a point)
 * In this patch I'm copying the raw data to the thread (memory consumption :( ), so I can avoid locks, which leads to performance gain in some cases.
 * We can render images (what has been decoded by worker thread) only when the whole image is decoded, so WK with this patch won't show the part of the image until decoding is fully finished. (This is a limitation of decoderlibs e.g. JPEG, we can't access decoded data during decoding.)

* Note: the patch doesn't work for cases that trigger decoding from main thread in case of big images (width||height > 2000px) e.g. drag and drop image with mouse
* Note: I chose 2000px for size limit, because I didn't measure performance gain for usual size (500px-100px) images.
* Note: I commented out GIF imagedecoding for the present.
* Note: WebCore.pri contains paths to my local png1.5.4 library because I had problems with png1.2 in threaded case.

Some numbers:

* Test case (goal to measure performance gain): 3 big (decoding on workerthread) and 3 small image (decoding on mainthread)
 * http://zoltan.sed.hu/WebKit/methanol_imgdecoder/sample/images.htm
 * Reference: avg 2770 ms +/-0.3% 
 * ThreadedImageDecoding: avg 1432 ms +/-4.4%
 * The results looks pretty cool!

* Test case (goal to measure overhead of the patch): simple Methanol run with 24 locally mirrored popular sites from Alexa top 100 (server: apache)
 * http://zoltan.sed.hu/WebKit/methanol_imgdecoder/fire.html?iter=1
 * Reference: avg 9180ms +/-3.3% 
 * Threaded: avg 12465ms +/-7.5%
 * In this case threaded version didn't run since Methanol doesn't contain big enough images.
 * The results looks woeful.
  * QPixmap -> QImage change
  * Conversions 
  * Data copying
  * Other?

I just want to talk about my approach... What could we do to decrease the overhead of the patch? Does it make sense to use QImage? etc...
Comment 1 Balazs Kelemen 2011-10-10 08:35:49 PDT
Some thoughts and questions after a brief overview:
  - It is a serious limitation that we can only paint the image when it has been fully downloaded. We should discuss whether it is acceptable in a user experience point of view.
  - This approach needs to be generalized across ports. I guess every port has a thread-safe image format so it should be possible.
  - Do you think Qt5 has a way to make using QImage not so slow?
  - Do you think it would be possible to still use QPixmap for smaller images that are decoded on the main thread and QImage for the big ones?
Comment 2 Zoltan Horvath 2011-10-11 01:03:57 PDT
I measured the overhead of the patch with starting the testbrowser with "-graphicssystem raster" parameter, in this way the overhead is decreased to a negligible level, so the considerable overhead is coming from the QPixmap -> QImage modification.
Comment 3 Zoltan Horvath 2011-10-11 01:09:37 PDT
(In reply to comment #1)
>   - Do you think it would be possible to still use QPixmap for smaller images that are decoded on the main thread and QImage for the big ones?

It's possible, we need to turn some functions to template functions or duplicate code. I wonder whether it is worth powder and shot.
Comment 4 Zoltan Horvath 2011-11-03 05:11:01 PDT
Simon, what is your opinion about get rid of using QImageDecoder.* things and switch to WebCore's imagedecoders? Both use the same system libraries, and produce the same results on benchmarks. 

Is there any blocker (api, compatibility, etc...) to switch to the WebCore's imagedecoder implementation?
Comment 5 Zoltan Horvath 2011-12-01 04:45:45 PST
I have a bug for WebCore vs. QtImageDecoder comparison: bug #71555

I made some measurements with libjpeg-turbo, it's related a bit to this topic, so I share my results with you.
About libjpeg-turbo: http://libjpeg-turbo.virtualgl.org/
"libjpeg-turbo is a derivative of libjpeg that uses SIMD instructions (MMX, SSE2, NEON) to accelerate baseline JPEG compression and decompression on x86, x86-64, and ARM systems. On such systems, libjpeg-turbo is generally 2-4x as fast as the unmodified version of libjpeg, all else being equal."

Sounds delicious :) Let's see the numbers:

PC: Intel(R) Core(TM)2 Duo CPUE6550@2.33GHz (2 cores)
Slackware 13.1 - 32bit, Qt4.8, r91038,

http://zoltan.sed.hu/WebKit/methanol_imgdecoder/fire.html?iter=1 
libjpeg-turbo (1.1.0): avg 14 339 ms (+/-4.8%) min: 13 195 ms max: 14 897 ms 
libjpeg (v8a): avg 22 423 ms (+/-7.1%) min: 20 926 ms max: 25 140 ms

libjpeg-turbo is 36.1% faster than libjpeg on my imgdecoder-specific benchmark.

Slackware 13.1 - 32bit, Qt4.8, r91038 
http://zoltan.sed.hu/WebKit/methanolx/fire.html?iter=1 
libjpeg-turbo (1.1.0): avg 8 055 ms (+/-15.5%) min: 6 572 ms max: 9 768 ms 
libjpeg (v8a): avg 8 946 ms (+/-11.7%) min: 7 455 ms max: 9 824 ms 

libjpeg-turbo is 10% faster than libjpeg on Methanol benchmark.

It may performs better with more than 2 cores. Impressive gain even on small pictures! What is your opinion guys?
Comment 6 Kwang Yul Seo 2012-02-21 00:19:57 PST
The current design delays decoding until painting, but your patch seems to decode an image as soon as all data is received. It means the new parallel image decoder eagerly decodes all the images even though many of them end up not being drawn to screen. Is this an acceptable change considering the performance gain?
Comment 7 Zoltan Horvath 2012-02-21 00:56:13 PST
(In reply to comment #6)
> The current design delays decoding until painting, but your patch seems to decode an image as soon as all data is received. It means the new parallel image decoder eagerly decodes all the images even though many of them end up not being drawn to screen. Is this an acceptable change considering the performance gain?

I think this is not a blocker problem. The main problem is that with this approach we can't achieve satisfying performance gain. For big images this works perfectly, but for the usually used relatively small images it's too slow (you can't make decision about which image should be decoded first.), furthermore in case of user events (drag&drop) you need a lots of extra copying. Monil Parmar was trying to implement this on gtk side, but he finished his examinations with the same result.

We can improve on the a library side I think. E.g. as I know chromium changed to libJPEG turbo last week. :)
Comment 8 Kwang Yul Seo 2012-02-21 01:28:19 PST
(In reply to comment #7)
> I think this is not a blocker problem. The main problem is that with this approach we can't achieve satisfying performance gain. For big images this works perfectly, but for the usually used relatively small images it's too slow (you can't make decision about which image should be decoded first.), furthermore in case of user events (drag&drop) you need a lots of extra copying. Monil Parmar was trying to implement this on gtk side, but he finished his examinations with the same result.

Yes, I agree. Hyung Song did the same experiment last November on WebKit SDL port and got the exact same result.

http://dev.dorothybrowser.com/?p=276

> We can improve on the a library side I think. E.g. as I know chromium changed to libJPEG turbo last week. :)

It looks promising. Thanks for the information.
Comment 9 Zoltan Horvath 2012-03-06 01:41:56 PST
Based on my and others experiments we couldn't achieve improvement on common cases, so I close the bug as invalid.
Comment 10 Zoltan Horvath 2012-06-14 02:45:51 PDT
*** Bug 40159 has been marked as a duplicate of this bug. ***