Bug 90375

Summary: Parallel image decoders
Product: WebKit    Reporter: Kwang Yul Seo <skyul>
Component: Images    Assignee: Nobody <webkit-unassigned>
Status: ASSIGNED ---    
Severity: Normal CC: ap, ararunprasad, arurajku, benjamin, choco, cmarcelo, dev, dglazkov, d.nomiyama, donggwan.kim, dongseong.hwang, eric, galpeter, gram, gyuyoung.kim, hclam, huangxueqing, joone, laszlo.gombos, levin+threading, loic.yhuel, menard, mifenton, mrowe, nick, noam, noel.gordon, pkasting, psolanki, rafael.lobo, rakuco, rdanailova6, s.choi, sergio, simon.fraser, syoichi, tomhudson, tonyg, webkit.arunp, webkit.review.bot, yong.li.webkit, yurys, zoltan
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 90584, 90935, 91172, 91203, 92470, 93464, 93465, 93467, 93590, 94190, 90443, 90721, 90755, 90756, 90757, 90758, 90780, 90869, 91075, 93466    
Bug Blocks: 93171    
Attachments:
Description Flags
Parallel image decoders (not for review)
none
Parallel image decoders (not for review)
buildbot: commit-queue-
Draft patch for refactoring in BitmapImage and Image
none
Remove Image::currentFrameHasAlpha()
none
Remove Image::currentFrameHasAlpha()
none
Parallel image decoders based on Lam's preparation patches.
none
Remove sync image decoding from currentFrameHasAlpha and currentFrameOrientation
none
Parallel image decoders based on Lam's preparation patches.
none

Description Kwang Yul Seo 2012-07-02 05:35:08 PDT
This is an umbrella bug for our new parallel image decoders. Our team at Company 100 has worked on parallel image decoders and resolved several issues addressed in Bug 69515.

About our implementation:
- We parallelize WebCore's image decoders. Currently, only CG does not use WebCore's image decoders, so many ports can get a performance boost from this work.
- We used a pool of worker threads (n = number of cores) instead of a single dedicated worker, so that we can decode multiple images simultaneously.
- We support incremental image decoding. It means we can render images while they are still being decoded.
- We avoid the locking overhead as much as possible by copying raw data to and from the decoder thread.
- If we decode an image in the main thread (e.g., when the image is small), we avoid the locking overhead entirely.
- All layout tests pass; we have no regressions in image decoding.


We selectively use parallel image decoders:
- We chose to use parallel image decoders only when rendering HTML images, SVG images, and CSS box model backgrounds, borders and masks. Note that rendering an HTML image element in canvas does not use parallel image decoders. Because some image decoding in WebCore is inherently synchronous, it was not easy to parallelize all image decoding use cases.
- We decode in the main thread until we have determined the size (w, h) of the image, and then choose to continue either in the main thread or in a worker thread. We stay in the main thread if the image is smaller than 262144 px (512x512) because performance is not good for small images. The current limit is arbitrary, so we need to tune it with more extensive experiments.
- ICO and GIF formats are not supported. These formats support multiple frames, which complicates the implementation of parallel image decoders. Because ICO and GIF images are usually very small, we would not get much performance gain by supporting them anyway.
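The size-threshold decision described above can be sketched as follows. The function name is hypothetical; the threshold is the 512x512 value quoted above, which this report explicitly calls arbitrary.

```cpp
#include <cstddef>

// Threshold quoted in this report: images below 262144 px (512x512) are
// decoded synchronously on the main thread, since synchronization overhead
// outweighs the benefit for small images. The value is arbitrary.
static const size_t kParallelDecodingThresholdPx = 512 * 512; // 262144 px

// Called once the size (w, h) has been parsed from the image header;
// decides whether to hand the decode off to the worker pool.
bool shouldDecodeInWorkerThread(size_t width, size_t height)
{
    return width * height >= kParallelDecodingThresholdPx;
}
```

A later comment in this thread revises the break-even point downward (toward 300x300 and then 128x128), which is exactly the kind of tuning this constant is meant to absorb.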


Results:

* Test Case: DOMContentLoaded (5 times)
 * http://ie.microsoft.com/testdrive/HTML5/DOMContentLoaded/Default.html
 * Reference: 850 ms (stddev: 97 ms)
 * Threaded: 817 ms (stddev: 90 ms)

 * Reference (page cached): 233 ms (stddev: 0 ms)
 * Threaded (page cached): 146 ms (stddev: 4 ms)

* Test Case: 35 Images (5 times)
 * http://www.dorothybrowser.com/test/webkitTest/imgdecode/bgimage-png/test.html
 * Reference: 3759 ms (stddev: 229 ms)
 * Threaded: 3652 ms (stddev: 117 ms)

 * Reference (page cached): 1204 ms (stddev: 11ms)
 * Threaded (page cached): 58 ms (stddev: 48ms)

We will update the result with Methanol test used in Bug 69515.

Dongsung Huang, the primary author of our parallel image decoder, will upload patches for review in a couple of days, once we finish our internal review process.
Comment 1 Kwang Yul Seo 2012-07-02 05:46:08 PDT
All the above results are measured on a 6-core Xeon machine.
Comment 2 Kwang Yul Seo 2012-07-05 19:40:54 PDT
Methanol test (Alexa Top 25):
http://www.dorothybrowser.com/test/methanol/fire.html

Reference:
- Summary: 15417.1 ms (15.530825250726393%) 

Threaded:
- Summary: 15148.2 ms (11.760063727481292%)

It seems our implementation does not show the massive slowdown that appeared in Bug 69515.
Comment 3 Kwang Yul Seo 2012-07-06 04:40:21 PDT
Please refer to the following document for the detailed design and implementation of our parallel image decoders:

- Parallel Image Decoding Design and Implementation
https://docs.google.com/document/pub?id=12gf7MhNHfupeR3GdRF-h2Vzg86viCbwXq8_JByHKlg0


We are cleaning up the remaining patches for review. They will be available soon.
Comment 4 Alexey Proskuryakov 2012-07-09 00:51:43 PDT
Has this work been discussed on webkit-dev? It's not good form to post a number of patches with r? flag otherwise.
Comment 5 Kwang Yul Seo 2012-07-09 01:06:14 PDT
(In reply to comment #4)
> Has this work been discussed on webkit-dev? It's not good form to post a number of patches with r? flag otherwise.

Sorry. I will send an email to webkit-dev now.
Comment 6 Benjamin Poulain 2012-07-09 14:31:19 PDT
The results look weird:
 * Reference (page cached): 1204 ms (stddev: 11ms)
 * Threaded (page cached): 58 ms (stddev: 48ms)
How many cores does your test machine have that you get a 20x performance boost?

 * Reference: 3759 ms (stddev: 229 ms)
 * Threaded: 3652 ms (stddev: 117 ms)
Why is the gain so small in this case?

I don't know much about Methanol but I have seen ridiculous claim coming from it before so I would prefer if you have some analysis of the results.
Comment 7 Kwang Yul Seo 2012-07-09 16:06:56 PDT
(In reply to comment #6)
> The results look weird:
>  * Reference (page cached): 1204 ms (stddev: 11ms)
>  * Threaded (page cached): 58 ms (stddev: 48ms)
> How many core do you have on the test machine that you have a 20x performance boost?

I realized that the test result here is a bit misleading because we measured the page loading time, not the actual image decoding time. So even after page loading is finished, it is still possible that many images have not actually been decoded.

I will measure time to first paint instead, as Maciej suggested on the webkit-dev mailing list.

> 
>  * Reference: 3759 ms (stddev: 229 ms)
>  * Threaded: 3652 ms (stddev: 117 ms)
> Why is the gain so small in this case?

This test case contains 35 big images, so it spends most of its time fetching data from the network if the images are not already in the cache. Because WebKit has plenty of spare time to decode, it is hard to improve page loading time just by speeding up the image decoders.

> 
> I don't know much about Methanol but I have seen ridiculous claim coming from it before so I would prefer if you have some analysis of the results.

Methanol is a kind of page load test. We locally cached the Alexa Top 25 sites. Methanol automatically loads each page 10 times and measures page loading time.

Because most of these sites do not have big images, parallel image decoders are seldom used. We wanted to make sure that there is no performance regression in this case.
Comment 8 Zoltan Horvath 2012-07-10 01:52:09 PDT
(In reply to comment #6)
> I don't know much about Methanol but I have seen ridiculous claim coming from it before so I would prefer if you have some analysis of the results.

I think it was not a problem with Methanol, but rather a misuse of it or something else. You can check out Methanol here: http://gitorious.org/methanol It's not dark magic, just JS like we have in the performance tests.
Please don't be so negative about Methanol. :)
Comment 9 Kwang Yul Seo 2012-07-10 03:34:02 PDT
Created attachment 151429 [details]
Parallel image decoders (not for review)

Parallel image decoder code for preview (not for review). Currently, it builds only on the WebKit Qt port (both Qt4 and Qt5).

I will upload the patch again once we fix build for other ports.
Comment 10 Yong Li 2012-07-10 07:10:49 PDT
One of the difficulties of parallel image decoding is that it is not always safe to call the observers in the main thread, because that can trigger JS execution.

Please take a look at PageGroupLoadDeferrer and the relevant code. For example, when a JS modal dialog is being shown, no JS event is supposed to be dispatched.

The other side of this problem is that one image source can have different observers from different pages, so it cannot get a pointer to a page. One solution may be making the load-defer state globally accessible.
Comment 11 Yong Li 2012-07-10 07:17:43 PDT
I would pre-"r-" until the page deferring issue is addressed.
Comment 12 Kwang Yul Seo 2012-07-10 23:31:13 PDT
(In reply to comment #10)
> One of the difficulties of parallel image decoding is that it is not always safe to call the observers in the main thread, because that can trigger JS execution.
> 
> Please take a look at PageGroupLoadDeferrer and the relevant code. For example, when a JS modal dialog is being shown, no JS event is supposed to be dispatched.
> 
> The other side of this problem is that one image source can have different observers from different pages, so it cannot get a pointer to a page. One solution may be making the load-defer state globally accessible.

Thanks for the comment. We will investigate this issue!
Comment 13 Kwang Yul Seo 2012-07-11 00:13:31 PDT
You can check out the updated test results in the following URL:

https://docs.google.com/spreadsheet/pub?key=0Ar2smwimcenMdGpnTEw2clZjSTNkbXNFNFM5dkYyRGc&output=html

A few notes:
- Adjust the threshold from 512x512 to 300x300
- Measure first painting time to make sure all the images are decoded.
- Add flickr.com: It is hard to measure page loading time (or first paint time) accurately because it varies a lot due to the dynamic resource requests made by scripts, so we locally mirrored the site and removed the scripts.

As mentioned above, parallel image decoders can't improve page loading time (or first paint time) dramatically if images are not cached, because it takes a long time to fetch large images and image decoding can be performed in the meantime (interleaved with loading). If images are already cached, then parallel image decoders shine! 

However, there is still a limitation. Even image-heavy sites do not show all their images at once; to see all the images, we need to scroll down many pages. Because WebKit decodes images only when they need to be painted, not all of them are decoded during page loading.

Fortunately, parallel image decoders seem to increase scrolling responsiveness because images are decoded asynchronously in the background while the main thread repaints. I don't have a way to express this in numbers, but we could feel it on image-heavy sites such as http://kiru.kr/kiru/html/singlelongopst.htm
Comment 14 Zoltan Horvath 2012-07-11 01:33:12 PDT
(In reply to comment #13)
> You can check out the updated test results in the following URL:
> 
> https://docs.google.com/spreadsheet/pub?key=0Ar2smwimcenMdGpnTEw2clZjSTNkbXNFNFM5dkYyRGc&output=html

The results look promising!

> A few notes:
> - Adjust the threshold from 512x512 to 300x300

This sounds cool, because I earlier investigated the image sizes used on the first 150 Alexa top sites, and I can say that most images are smaller than 512*512: https://bugs.webkit.org/show_bug.cgi?id=71555#c7

> - Measure first painting time to make sure all the images are decoded.

How do you measure first painting time?

> However, there is still a limitation. Even image-heavy sites do not show all their images at once; to see all the images, we need to scroll down many pages. Because WebKit decodes images only when they need to be painted, not all of them are decoded during page loading.

I think it's normal behavior; I see it as an advantage. If we decoded all images, including the ones that are never shown, we would incur useless extra CPU cost on the worker thread.
Comment 15 Kwang Yul Seo 2012-07-11 01:58:35 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > You can check out the updated test results in the following URL:
> > 
> > https://docs.google.com/spreadsheet/pub?key=0Ar2smwimcenMdGpnTEw2clZjSTNkbXNFNFM5dkYyRGc&output=html
> 
> The results look promising!
> 
> > A few notes:
> > - Adjust the threshold from 512x512 to 300x300
> 
> This sounds cool, because I earlier investigated the image sizes used on the first 150 Alexa top sites, and I can say that most images are smaller than 512*512: https://bugs.webkit.org/show_bug.cgi?id=71555#c7

This is really good information. Thanks. We will experiment more with various thresholds.

> 
> > - Measure first painting time to make sure all the images are decoded.
> 
> How do you measure first painting time?

We measure the time of the last call to Image::draw. This ensures that all the images inside the initial viewport have been decoded and painted.

> 
> > However, there is still a limitation. Even image-heavy sites do not show all their images at once; to see all the images, we need to scroll down many pages. Because WebKit decodes images only when they need to be painted, not all of them are decoded during page loading.
> 
> I think it's normal behavior; I see it as an advantage. If we decoded all images, including the ones that are never shown, we would incur useless extra CPU cost on the worker thread.

I do like the current lazy behavior of WebKit image decoding. I just wanted to mention that we also need to consider images outside the initial viewport because they are decoded by parallel image decoders when we scroll. These images are also decoded faster thanks to parallel image decoders!
Comment 16 noel gordon 2012-07-12 01:46:45 PDT
(In reply to comment #13)

> As mentioned above, parallel image decoders can't improve page loading time (or first paint time) dramatically if images are not cached, because it takes a long time to fetch large images and image decoding can be performed in the meantime (interleaved with loading). If images are already cached, then parallel image decoders shine! 

The abstract of your design document suggests you consistently outperform sequential image decoding.  Yet small images are excluded from threaded decodes for some reason, so perhaps that means that most web pages would be excluded?  Another result is that large image decodes are network bound, and also see no benefit.  I don't see how you can "consistently outperform" the sequential image decoders.  Perhaps the improvement is marginal at best?  It states that the cached image case "shines" (I don't know what that means) but it is the only case that benefits in some way.  Why should I care about that case?
Comment 17 Kwang Yul Seo 2012-07-12 02:43:23 PDT
(In reply to comment #16)
> (In reply to comment #13)
> 
> > As mentioned above, parallel image decoders can't improve page loading time (or first paint time) dramatically if images are not cached, because it takes a long time to fetch large images and image decoding can be performed in the meantime (interleaved with loading). If images are already cached, then parallel image decoders shine! 
> 
> The abstract of your design document suggests you consistently outperform sequential image decoding.  Yet small images are excluded from threaded decodes for some reason, so perhaps that means that most web pages would be excluded?  

Yes, small images are slow due to synchronization overhead. We are still experimenting with various threshold values; we've just found that the break-even point is near 128x128. According to Zoltan's survey (https://bugs.webkit.org/show_bug.cgi?id=71555#c7), it looks like many sites can get performance benefits from parallel image decoders. Of course, the improvement is marginal near 128x128.

>Another result is that large image decodes are network bound, and also see no benefit.  I don't see how you can "consistently outperform" the sequential image decoders.  Perhaps the improvement is marginal at best?  

Sadly, yes. However, I see other use cases. What about Offline Web Applications or Widgets? Because images are locally stored, parallel image decoders can speed up initial loading time.

By "consistently outperform", I meant parallel image decoders are used only when they can do better based on various criteria. Otherwise, serial image decoders are used instead with no switching overhead.

>It states that the cached image case "shines" (I don't know what that means) but it is the only case that benefits in some way.  Why should I care about that case?

We do understand that parallelization usually introduces massive code changes with only marginal performance improvements. Our test results show that parallel image decoders are faster in some cases, and at least not slower in most cases.

We are trying to further improve performance and reduce the complexity. So I hope we could discuss this in a more positive direction.
Comment 18 Dongseong Hwang 2012-07-13 00:12:37 PDT
(In reply to comment #10)
> One of the difficulties of parallel image decoding is that it is not always safe to call the observers in the main thread, because that can trigger JS execution.
> 
> Please take a look at PageGroupLoadDeferrer and the relevant code. For example, when a JS modal dialog is being shown, no JS event is supposed to be dispatched.
> 
> The other side of this problem is that one image source can have different observers from different pages, so it cannot get a pointer to a page. One solution may be making the load-defer state globally accessible.

Thank you for good information.

You said that "it is not always safe to call the observers in main thread".
By "to call the observers", do you mean CachedImageClient::imageChanged?
If so, could you explain where imageChanged triggers JS execution? I can't find the code.
Comment 19 Yong Li 2012-07-13 07:27:23 PDT
(In reply to comment #18)
> (In reply to comment #10)
> > One of the difficulties of parallel image decoding is that it is not always safe to call the observers in the main thread, because that can trigger JS execution.
> > 
> > Please take a look at PageGroupLoadDeferrer and the relevant code. For example, when a JS modal dialog is being shown, no JS event is supposed to be dispatched.
> > 
> > The other side of this problem is that one image source can have different observers from different pages, so it cannot get a pointer to a page. One solution may be making the load-defer state globally accessible.
> 
> Thank you for good information.
> 
> You said that "it is not always safe to call the observers in main thread".
> By "to call the observers", do you mean CachedImageClient::imageChanged?
> If so, could you explain where imageChanged triggers JS execution? I can't find the code.

Never mind about this deferrer thing. I got that impression because CachedResourceClient::notifyFinished() can trigger a JS event. But it doesn't seem to be an issue with your solution, as you don't hack the data delivery path (CachedImage::data() or Image::setData()). Sorry for the false alarm. Go ahead.
Comment 20 Kwang Yul Seo 2012-07-17 16:20:41 PDT
Created attachment 152865 [details]
Parallel image decoders (not for review)

Rebased to the latest revision.
Comment 21 WebKit Review Bot 2012-07-17 19:34:58 PDT
Attachment 152865 [details] did not pass style-queue:

Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Source/WTF/ChangeLog', u'Source/WTF/GNUmak..." exit_code: 1
Source/WebCore/platform/graphics/ImageSource.cpp:289:  Should have only a single space after a punctuation in a comment.  [whitespace/comments] [5]
Total errors found: 1 in 77 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 22 Build Bot 2012-07-17 19:44:47 PDT
Comment on attachment 152865 [details]
Parallel image decoders (not for review)

Attachment 152865 [details] did not pass mac-ews (mac):
Output: http://queues.webkit.org/results/13280329
Comment 23 Build Bot 2012-07-17 19:49:01 PDT
Comment on attachment 152865 [details]
Parallel image decoders (not for review)

Attachment 152865 [details] did not pass win-ews (win):
Output: http://queues.webkit.org/results/13280331
Comment 24 WebKit Review Bot 2012-07-17 19:55:00 PDT
Comment on attachment 152865 [details]
Parallel image decoders (not for review)

Attachment 152865 [details] did not pass chromium-ews (chromium-xvfb):
Output: http://queues.webkit.org/results/13274340
Comment 25 Hin-Chung Lam 2012-07-17 21:17:58 PDT
Thanks for sharing the patch.

As mentioned before, we share the same goal of allowing a notion of asynchronous image decoding. Under the hood it can be parallel image decoding, as your patch does, or, in the Chromium case, deferring the image decoding to a later time. We have a common goal here.

I have some questions; I would appreciate it very much if you could help me understand your patch better.

If I understand your code correctly, RetainedModeBitmapImage is used to detect drawing and metadata calls while image decoding can be done asynchronously, using a counter; on a later call to ensureFrameIsCached(), requestFrameAtIndex() will attempt parallel image decoding.

One part I have trouble understanding is when an image is decoded in parallel. toRetainedModeBitmapImageIfPossible() seems to be called only when a BitmapImage is painted to a GraphicsContext, yet calls to BitmapImage::frameDurationAtIndex() and other BitmapImage methods also seem to trigger parallel image decoding.

Q1: Will BitmapImage::frameDurationAtIndex(), orientationAtIndex() and these metadata friends trigger parallel image decoding?
Q2: Are these code paths currently used? Or do you expect all parallel image decoding to be triggered by drawing a BitmapImage into a GraphicsContext?
Comment 26 Kwang Yul Seo 2012-07-17 21:53:10 PDT
(In reply to comment #25)
> One part I have trouble understanding is when an image is decoded in parallel. toRetainedModeBitmapImageIfPossible() seems to be called only when a BitmapImage is painted to a GraphicsContext, yet calls to BitmapImage::frameDurationAtIndex() and other BitmapImage methods also seem to trigger parallel image decoding.
> 
> Q1: Will BitmapImage::frameDurationAtIndex(), orientationAtIndex() and these metadata friends trigger parallel image decoding?

Yes, all the metadata friends trigger parallel image decoding if they are used through RetainedModeBitmapImage. For example, RenderLayerBacking checks whether the current frame is complete, as in the following:

1083     // We have to wait until the image is fully decoded before setting it on the layer.
1084     if (image->isBitmapImage() && !toRetainedModeBitmapImage(image)->currentFrameIsComplete())
1085         return;

This can trigger parallel image decoding.

> Q2: Are these code paths currently used? Or do you expect all parallel image to be triggered by drawing a BitmapImage into a GraphicsContext?

Yes, as you can see in the example above. However, it seems BitmapImage::frameDurationAtIndex() and BitmapImage::orientationAtIndex() are not used to trigger parallel image decoding in the current code paths.

If you wrap a bitmap with RetainedModeBitmapImage, it can trigger parallel image decoding. While parallel image decoding is in progress, if another code path calls BitmapImage::frameDurationAtIndex() or another metadata friend without wrapping the bitmap with RetainedModeBitmapImage, parallel image decoding is cancelled and the image is decoded serially.
Comment 27 Hin-Chung Lam 2012-07-17 22:34:54 PDT
Comment on attachment 151429 [details]
Parallel image decoders (not for review)

View in context: https://bugs.webkit.org/attachment.cgi?id=151429&action=review

> Source/WebCore/platform/graphics/BitmapImage.cpp:272
> +        m_isCachingFrame = true;

I would appreciate it if you could explain more about the use of this flag. I can't find any other use of this member variable outside this method.
Comment 28 Kwang Yul Seo 2012-07-17 22:38:48 PDT
(In reply to comment #26)

It seems RenderLayerBacking is the only place where one of the metadata friends is used to trigger parallel image decoding in the current code paths.

We've realized that we don't need to use parallel image decoding for these metadata friends.

So we will change our implementation to use parallel image decoders only when we need to paint images.
Comment 29 Kwang Yul Seo 2012-07-17 22:46:27 PDT
(In reply to comment #27)
> (From update of attachment 151429 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=151429&action=review
> 
> > Source/WebCore/platform/graphics/BitmapImage.cpp:272
> > +        m_isCachingFrame = true;
> 
> I would appreciate it if you could explain more about the use of this flag. I can't find any other use of this member variable outside this method.

After we changed the condition for needToCache as in the following:

270     bool needToCache = index >= m_frames.size() || !m_frames[index].m_frame || !m_frames[index].m_isComplete;

cacheFrame(index) in BitmapImage::ensureFrameIsCached() can now recursively call BitmapImage::ensureFrameIsCached(). For example,

BitmapImage::didReceiveFrameAtIndex
-> BitmapImage::checkForSolidColor (ImageQt.cpp)
-> BitmapImage::frameAtIndex
-> BitmapImage::ensureFrameIsCached

So we added m_isCachingFrame as a guard to return early in this case.
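A minimal stand-alone sketch of that re-entrancy guard follows. The class is a simplified stand-in for BitmapImage, not the actual patch code; only the m_isCachingFrame flag and the needToCache condition come from the comment above.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for BitmapImage showing how m_isCachingFrame stops
// the recursive call chain ensureFrameIsCached -> cacheFrame ->
// frameAtIndex -> ensureFrameIsCached described above.
class BitmapImageSketch {
public:
    void ensureFrameIsCached(size_t index)
    {
        if (m_isCachingFrame)
            return; // already caching: bail out of the recursive call
        bool needToCache = index >= m_frames.size() || !m_frames[index].complete;
        if (!needToCache)
            return;
        m_isCachingFrame = true;
        cacheFrame(index);
        m_isCachingFrame = false;
    }

    bool frameIsComplete(size_t index) const
    {
        return index < m_frames.size() && m_frames[index].complete;
    }

private:
    struct Frame { bool complete { false }; };

    void cacheFrame(size_t index)
    {
        if (index >= m_frames.size())
            m_frames.resize(index + 1);
        // Decoding may call back into ensureFrameIsCached() (e.g. via
        // checkForSolidColor -> frameAtIndex); the flag makes that call
        // return immediately instead of recursing forever.
        ensureFrameIsCached(index);
        m_frames[index].complete = true;
    }

    std::vector<Frame> m_frames;
    bool m_isCachingFrame { false };
};
```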
Comment 30 Hin-Chung Lam 2012-07-18 11:54:19 PDT
(In reply to comment #28)
> (In reply to comment #26)
> 
> It seems RenderLayerBacking is the only place where one of the metadata friends is used to trigger parallel image decoding in the current code paths.
> 
> We've realized that we don't need to use parallel image decoding for these metadata friends.
> 
> So we will change our implementation to use parallel image decoders only when we need to paint images.

That makes sense. My observation of the code is that those metadata functions are supposed to return default values when a frame is not ready (not yet downloaded) as of today, so I think they probably don't need any handling.

If the starting point for parallel image decoding is just painting a BitmapImage to a GraphicsContext, I wonder if something like this would make the intention clearer:

drawImage(xxx, bool prepareImageAsynchronouslyIfPossible = false);

Under the hood of GraphicsContext, all your changes including ParallelImageDecoder remain the same. My point is that this eliminates using a type (RetainedModeBitmapImage) to denote the desired image decoding behavior caused by painting, and instead builds it into the painting methods.

One other tiny performance benefit is that GraphicsContext can detect that the frame is not ready and skip painting an empty frame.

What do you think?
Comment 31 Kwang Yul Seo 2012-07-18 16:12:56 PDT
(In reply to comment #30)
> (In reply to comment #28)
> > (In reply to comment #26)
> > 
> > It seems RenderLayerBacking is the only place where one of the metadata friends is used to trigger parallel image decoding in the current code paths.
> > 
> > We've realized that we don't need to use parallel image decoding for these metadata friends.
> > 
> > So we will change our implementation to use parallel image decoders only when we need to paint images.
> 
> That makes sense. My observation of the code is that those metadata functions are supposed to return default values when a frame is not ready (not yet downloaded) as of today, so I think they probably don't need any handling.
> 
> If the starting point for parallel image decoding is just when painting a BitmapImage to GraphicsContext, I wonder if it will make the intention clearer something like:
> 
> drawImage(xxx, bool prepareImageAsynchronouslyIfPossible = false);
> 
> Under the hood GraphicsContext all your changes including ParallelImageDecoder remains the same. My point is this eliminates using a type (RetainedModeBitmapImage) to denote the desired behavior of image decoding caused by painting, instead have this built-in to the painting methods.
> 
> One other tiny performance benefit of this is when GraphicsContext detects the frame is not ready and skip painting an empty frame.
> 
> What do you think?

That depends on how we refactor RenderLayerBacking. In this case, we retrieve a NativeImagePtr from RetainedModeBitmapImage, so we can't simply delegate the decision to GraphicsContext.

I will let you know once we decide how we refactor RenderLayerBacking.
Comment 32 Hin-Chung Lam 2012-07-18 18:55:49 PDT
> That depends on how we refactor RenderLayerBacking. In this case, we retrieve a NativeImagePtr from RetainedModeBitmapImage, so we can't simply delegate the decision to GraphicsContext.
> 
> I will let you know once we decide how we refactor RenderLayerBacking.

Ah, that's right; I forgot RenderLayerBacking. RenderLayerBacking does present an interesting case. I now understand that in your code these two operations do actually trigger parallel decoding:

RetainedModeBitmapImage::currentFrameIsComplete() and
RetainedModeBitmapImage::nativeImageForCurrentFrame()

Without this extra type, calling nativeImageForCurrentFrame() would cause synchronous decoding for RenderLayerBacking.

Just a thought here: what about adding the notion of asynchronous decoding to BitmapImage itself, e.g.

BitmapImage::prepareCurrentFrameAsynchronouslyIfPossible()
BitmapImage::nativeImageForCurrentFrame(); // returns 0 if async frame requested but not ready
BitmapImage::currentFrameIsComplete(); // just a getter to see if current frame is ready

prepareCurrentFrameAsynchronouslyIfPossible() is similar to your retained-mode counter, but built into BitmapImage.
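The three-method shape proposed above can be sketched as follows. The class and its members are illustrative stand-ins, not WebKit code; NativeImagePtr is a placeholder for the port-specific type, and startAsyncDecode() completes synchronously here only so the sketch is self-contained (the real decode would run on a worker thread).

```cpp
#include <atomic>
#include <cstdint>

typedef uintptr_t NativeImagePtr; // placeholder for the port-specific type

// Illustrative stand-in for a BitmapImage with built-in async decoding.
class AsyncBitmapImageSketch {
public:
    // Kick off asynchronous decoding if possible; otherwise the caller
    // falls back to the synchronous path.
    void prepareCurrentFrameAsynchronouslyIfPossible()
    {
        if (!m_frameComplete.load())
            startAsyncDecode();
    }

    // Returns 0 if an async frame was requested but is not ready yet.
    NativeImagePtr nativeImageForCurrentFrame() const
    {
        return m_frameComplete.load() ? m_nativeImage : 0;
    }

    // Just a getter: reports readiness without triggering any decoding.
    bool currentFrameIsComplete() const { return m_frameComplete.load(); }

private:
    void startAsyncDecode()
    {
        // Stand-in for dispatching to a decoder worker pool; here we mark
        // the frame complete immediately to keep the sketch runnable.
        m_nativeImage = 1;
        m_frameComplete.store(true);
    }

    NativeImagePtr m_nativeImage { 0 };
    std::atomic<bool> m_frameComplete { false };
};
```

The key design point is that only the prepare call can start a decode; the other two methods are pure queries, which is exactly the property debated for the metadata friends later in this thread.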
Comment 33 Kwang Yul Seo 2012-07-19 01:30:31 PDT
(In reply to comment #32)
> ah! That's right I forgot RenderLayerBacking. RenderLayerBacking does present an interesting case. In your code I understand now these two operations does actually trigger parallel decoding:
> 
> RetainedModeBitmapImage::currentFrameIsComplete() and
> RetainedModeBitmapImage::nativeImageForCurrentFrame()
> 
> Without this extra type calling nativeImageForCurrentFrame() would cause a synchronous decoding for RenderLayerBacking.
> 
> Just a thought here, actually adding the notion of asynchronous decoding to BitmapImage, e.g.
> 
> BitmapImage::prepareCurrentFrameAsynchronouslyIfPossible()
> BitmapImage::nativeImageForCurrentFrame(); // returns 0 if async frame requested but not ready
> BitmapImage::currentFrameIsComplete(); // just a getter to see if current frame is ready
> 
> prepareCurrentFrameAsynchronouslyIfPossible() is similar to your retained mode counter but have it built-in to BitmapImage.

Thanks. Your idea sounds good. However, it seems we have a few problems.

In addition to the methods you mentioned, the following metadata query methods of BitmapImage currently trigger (parallel) image decoding:

float frameDurationAtIndex(size_t);
bool frameHasAlphaAtIndex(size_t);
bool frameIsCompleteAtIndex(size_t);
ImageOrientation orientationAtIndex(size_t) const;

If you call one of these methods without wrapping the bitmap with RetainedModeBitmapImage, it cancels any ongoing parallel image decoding and starts decoding again serially. If a client wants to use parallel image decoding, it must use these metadata methods very carefully.


We want only NativeImagePtr createFrameAtIndex(size_t) to trigger image decoding, so we checked whether it is possible to get the metadata without full decoding, as we do with isSizeAvailable.

1) float frameDurationAtIndex(size_t)
This is available only when GIF decoding is done. We can't get the duration simply by parsing the header, as we do for the size.

2) bool frameIsCompleteAtIndex(size_t)
This represents the current status of image decoding. (I am not sure we can call this metadata.) If we return this without decoding, it can return false even when allDataReceived == true.

3) ImageOrientation orientationAtIndex(size_t) const
ImageSource always returns DefaultImageOrientation. So this is available before decoding.

4) bool frameHasAlphaAtIndex(size_t)
This is the most difficult to know before decoding. We need to decode at least one row. As ImageDecoder::isSizeAvailable() just decodes the header and lets us know the size, we can add ImageDecoder::frameHasAlphaAtIndex(size_t) and decode one row. However, this is still problematic.

ImageSource unconditionally sets alpha to true before decoding as in the following:

bool ImageSource::frameHasAlphaAtIndex(ImageFrame* buffer)
{
    // When a frame has not finished decoding, always mark it as having alpha.
    // Ports that check the result of this function to determine their
    // compositing op need this in order to not draw the undecoded portion as
    // black.
    // TODO: Perhaps we should ensure that each individual decoder returns true
    // in this case.
    return !frameIsCompleteAtIndex(buffer) || buffer->hasAlpha();
}

If ImageSource::frameHasAlphaAtIndex(size_t) does not trigger full image decoding, frameHasAlphaAtIndex can't know whether decoding is complete. If GraphicsContext3D uses BitmapImage as in the following, it is problematic.

Assumption: allDataReceived == true.

bool hasAlpha = image->currentFrameHasAlpha(); // <- may return true.
// do something with hasAlpha.
image->nativeImagePtr(); // <- image's actual alpha can be false.



As you suggested, we can implement these metadata friends as simple getters which do not trigger image decoding, but then we must be very careful in using these getters.

That's the reason why we introduced RetainedModeBitmapImage class. If a client wants to use BitmapImage asynchronously, we can simply wrap it with RetainedModeBitmapImage and everything is taken care of. We know this is not the best solution, but we couldn't find a better solution.

Do you have any idea on how we can make these metadata queries simple getters?
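To make the retained-mode idea concrete, here is a toy sketch of the wrapper we have in mind. All class bodies below are hypothetical stand-ins (the real patch decodes on a worker pool and repaints via CachedImageClient::imageChanged()); only the shape of the interface matters: every query first kicks off the asynchronous decode, so a wrapped client can never trigger a synchronous decode by accident.

```cpp
#include <cassert>

struct NativeImage {};

class BitmapImage {
public:
    int decodeState = 0; // 0 = not started, 1 = in progress, 2 = done
    NativeImage frame;

    void startParallelDecodingIfNeeded()
    {
        if (!decodeState)
            decodeState = 1; // would post work to the decoder thread pool
    }
    bool currentFrameIsComplete() const { return decodeState == 2; }
    NativeImage* nativeImageForCurrentFrame()
    {
        // Never decodes synchronously: returns null until the worker finishes.
        return currentFrameIsComplete() ? &frame : nullptr;
    }
};

class RetainedModeBitmapImage {
public:
    explicit RetainedModeBitmapImage(BitmapImage& image) : m_image(image) {}

    // Every query first kicks off (or keeps alive) the asynchronous decode.
    bool currentFrameIsComplete()
    {
        m_image.startParallelDecodingIfNeeded();
        return m_image.currentFrameIsComplete();
    }
    NativeImage* nativeImageForCurrentFrame()
    {
        m_image.startParallelDecodingIfNeeded();
        return m_image.nativeImageForCurrentFrame();
    }

private:
    BitmapImage& m_image;
};
```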
Comment 34 Hin-Chung Lam 2012-07-19 20:54:54 PDT
> As you suggested, we can implement these metadata friends as simple getters which do not trigger image decoding, but then we must be very careful in using these getters.
> 
> That's the reason why we introduced RetainedModeBitmapImage class. If a client wants to use BitmapImage asynchronously, we can simply wrap it with RetainedModeBitmapImage and everything is taken care of. We know this is not the best solution, but we couldn't find a better solution.
> 
> Do you have any idea on how we can make these metadata queries simple getters?

Thanks for showing me the thoughts that went into making this decision; it makes much more sense to me now. Having reviewed the material, I agree this seems like the best possible solution without having to refactor all usage of BitmapImage.

My gut feeling still tells me that by going through a hard refactoring we can eventually remove the need for RetainedModeBitmapImage. However, I agree this is a very good intermediate step.
Comment 35 Peter Kasting 2012-07-19 22:35:28 PDT
I know this is tangential, but while you're pondering the image decoding pipeline and any eventual refactorings, keep in mind also that it would be nice someday to be able to tell the image decoders to decode directly to a scaled-down output buffer (for large images that are not displayed at 1:1 on the page, a surprisingly common case).  We've never had this because it requires plumbing the desired output size over to the decoders and then making decisions about how to cache the results (do we decode full-size, downsample, and cache both? Or for some image formats can we efficiently decode directly to a scaled-down buffer, cache that, but retain sufficient metadata to correctly handle later requests for other sizes?  etc.)

I don't want to distract from the goal at hand here -- just figuring this would be a good audience to mention some of these concerns to :)
Comment 36 Hin-Chung Lam 2012-07-19 23:16:02 PDT
(In reply to comment #35)
> I know this is tangential, but while you're pondering the image decoding pipeline and any eventual refactorings, keep in mind also that it would be nice someday to be able to tell the image decoders to decode directly to a scaled-down output buffer (for large images that are not displayed at 1:1 on the page, a surprisingly common case).  We've never had this because it requires plumbing the desired output size over to the decoders and then making decisions about how to cache the results (do we decode full-size, downsample, and cache both? Or for some image formats can we efficiently decode directly to a scaled-down buffer, cache that, but retain sufficient metadata to correctly handle later requests for other sizes?  etc.)
> 
> I don't want to distract from the goal at hand here -- just figuring this would be a good audience to mention some of these concerns to :)

It is great that the vision of optimizing downscaled image decoding pipeline is shared by someone else. In fact this is what got me started on this work. I am very happy to discuss this matter with the right audience on a separate thread. :)
Comment 37 Hin-Chung Lam 2012-07-22 19:11:21 PDT
I am in the process of going through all the places that use BitmapImage synchronously. I would like to continue the discussion of the asynchronous image decoding interface with a concrete patch we are drafting, so please be patient for our results. I really appreciate the discussion we had; it certainly helped us understand the problem better.

One separate issue is about enabling this parallel image decoding feature. Is there a way we can disable it and make sure the code path for serial image decoding won't be affected by your change?

Thanks.
Comment 38 Kwang Yul Seo 2012-07-23 01:43:42 PDT
(In reply to comment #34)
> My gut feeling still tells me that by going through a hard refactoring we can eventually remove the need for RetainedModeBitmapImage. However, I agree this is a very good intermediate step.

Yes, we need to get rid of RetainedModeBitmapImage eventually by refactoring all usage of BitmapImage. In the meantime, it is a necessary evil.
Comment 39 Kwang Yul Seo 2012-07-23 01:59:40 PDT
(In reply to comment #37)
> One separate issue is about enabling this parallel image decoding feature. Is there a way we can disable it and make sure that code path for serial image decoding won't be affected with your change?

No, it is always enabled because we dynamically switch to serial image decoding when parallel image decoding seems unprofitable. We can add a runtime flag to enable/disable it. Currently, if ImageSource::shouldDecodeParallel() returns false, the code path for serial image decoding is not affected at all.
Comment 40 Hin-Chung Lam 2012-07-26 19:25:06 PDT
Created attachment 154812 [details]
Draft patch for refactoring in BitmapImage and Image

I have drafted a patch for the refactoring of BitmapImage and Image to show the direction we want to take. Here are some highlights of this patch, particularly the differences; I would like to use this to continue our discussion of the refactoring.

Here are the highlights of the design / refactoring:

1. Getters for metadata go directly into the decoder without decoding the entire frame.

This is different from the design you suggested. We really want to restrict the start of decoding to only one case, i.e. when nativeImageForCurrentFrame() is called. This is important for us because we don't want a metadata getter to start decoding, and we then have a clear indication of when we can defer decoding.

This is also important for performance because the decoder should make a best effort to fetch the metadata without decoding; this can be implemented for frameIsCompleteAtIndex(), frameDurationAtIndex() and frameOrientationAtIndex(). The current implementation of always decoding the entire frame is not optimal.

2. Specifically call out the desire of asynchronous image decoding in BitmapImage.

I added methods to BitmapImage to indicate the desire to use asynchronous decoding, i.e. BitmapImage::setUseAsynchronousDecoding(). I would like each site that uses BitmapImage to explicitly call out its intention. This is also important because a use of BitmapImage can disable asynchronous decoding, and that case can be handled explicitly.

So an example of usage of this would be:

if (image->isBitmapImage())
  static_cast<BitmapImage*>(image)->setUseAsynchronousDecoding(true);

draw(image, ...);

And if the usage of the image requires synchronous decoding like that of a canvas, then:

if (image->isBitmapImage())
  static_cast<BitmapImage*>(image)->setUseAsynchronousDecoding(false);

draw(image, ...);

I find it important to identify the intention of use at each site. And having this method in BitmapImage can make things more explicit.

3. Limit the asynchronous interface only to ImageSource

I noticed in your patch that ImageSource and ImageDecoder are both asynchronous. I would like to reduce this down to only ImageSource. In fact, because ImageSource.cpp is only used by non-CG ports, we can just implement a parallel decoding mode there and keep ImageDecoder always synchronous.

4. Use of default state before frame is fully decoded is acceptable

We talked about the worries of returning default states before an image is fully decoded, I would like to go into details about each of these metadata getters and discuss why it is okay to return a default value:

a. frameOrientationAtIndex()

This method currently *always* returns a default value for all formats. Even so, this can properly be decoded from the EXIF data, so calling directly into the decoder can return it correctly without decoding the full frame, provided this is implemented in WebKit's JPEG decoder.

b. frameIsCompleteAtIndex()

This method returns whether a frame is fully decoded. This doesn't contradict asynchronous decoding at all: when asynchronous decoding is in use and the frame is not fully decoded, the frame is always incomplete. We can also check whether we have enough bytes to determine whether a frame is complete.

Also note there are only two main usages of this method:

i. GIF animation

For the case of asynchronous decoding, a frame that is not fully decoded is always incomplete, and this doesn't contradict the animation strategy.

ii. Extract image data for GraphicsContext3D for a fully loaded image

Use of frameIsCompleteAtIndex() is really not necessary there; it is used incorrectly to check whether the image can be decoded.

Ultimately this method should be protected and shouldn't be exposed. I think further refactoring of GraphicsContext3DSkia and GraphicsContext3DCairo will eliminate the use of this method. Once this method is fully internal to BitmapImage, there is no problem for asynchronous decoding.

c. frameHasAlphaAtIndex()

Note that current WebKit code already returns hasAlpha = true for an incomplete frame, and naturally a partially decoded image always has alpha. Again, this method is only meaningful if the image is fully decoded; otherwise hasAlpha should always be true anyway.

d. frameDurationAtIndex()

This is only meaningful for GIF, and for GIF it can be determined without decoding the frame. Also note that it is only meaningful once a frame is complete (because we don't care about duration until a frame is completely loaded / decoded). Grabbing the value of this attribute is the responsibility of the GIF decoder and hence should be delegated directly to the decoder.
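As an aside, the reason the GIF delay needs no pixel decoding is that it lives in the Graphic Control Extension block (introducer 0x21, label 0xF9, block size 4), stored little-endian in 1/100ths of a second. The scanner below is our own simplification, not WebKit's GIFImageReader; it only shows that no pixel data is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Return the first frame delay found in a GIF byte stream, in seconds.
// Naive scan for a Graphic Control Extension; real parsers walk the
// block structure instead of pattern-matching.
float gifFrameDelaySeconds(const uint8_t* data, size_t size)
{
    for (size_t i = 0; i + 7 < size; ++i) {
        if (data[i] == 0x21 && data[i + 1] == 0xF9 && data[i + 2] == 4) {
            // Bytes 4..5 of the block are the little-endian delay.
            unsigned hundredths = data[i + 4] | (data[i + 5] << 8);
            return hundredths / 100.0f;
        }
    }
    return 0.1f; // common fallback when no delay is present
}
```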

5. Metadata state mismatch

We discussed about having a mismatch of metadata state, for example:

image->currentFrameHasAlpha() // returns true
... do something else ..
NativeImage* nativeImage = image->nativeImageForCurrentFrame() // nativeImage doesn't have alpha

This actually won't be the case, because of the following:

a. Suppose when we call image->currentFrameHasAlpha() the image has not been decoded. This method will return true, and when we later call image->nativeImageForCurrentFrame(), the nativeImage will either be null or contain an empty image because decoding is just about to start, and it *will* have alpha.

b. Suppose when we call image->currentFrameHasAlpha() the image is partially decoded. This method will return true, and again image->nativeImageForCurrentFrame() will not decode anything synchronously, so the alpha state doesn't change.

c. Suppose when we call image->currentFrameHasAlpha() the image is fully decoded and cached. Then we return the correct value, and image->nativeImageForCurrentFrame() will not lead to a different result.

Please let me know what you think about the approach in the posted patch. I think the next thing I'll do is review all the places where nativeImageForCurrentFrame is called and note whether asynchronous decoding should be used at each spot. My gut feeling is that asynchronous decoding is the common case while synchronous decoding is rare.
Comment 41 Kwang Yul Seo 2012-07-26 21:22:32 PDT
(In reply to comment #40)
Thanks for your efforts. The patch looks reasonable to me. Dongsung will comment on the details.
Comment 42 Dongseong Hwang 2012-07-26 23:11:10 PDT
Thanks for your efforts. I agreed on most of your approach.
I have some questions and concerns.

(In reply to comment #40)
> Created an attachment (id=154812) [details]
> Draft patch for refactoring in BitmapImage and Image
> 
> I have drafted a patch for the refactoring of BitmapImage and Image to show the direction we want to take. Here are some highlights of this patch, particularly the differences; I would like to use this to continue our discussion of the refactoring.
> 
> Here are the highlights of the design / refactoring:
> 
> 1. Getters for metadata go directly into the decoder without decoding the entire frame.
> 
> This is different from the design you suggested. We really want to restrict the start of decoding to only one case, i.e. when nativeImageForCurrentFrame() is called. This is important for us because we don't want a metadata getter to start decoding, and we then have a clear indication of when we can defer decoding.
> 
> This is also important for performance because the decoder should make a best effort to fetch the metadata without decoding; this can be implemented for frameIsCompleteAtIndex(), frameDurationAtIndex() and frameOrientationAtIndex(). The current implementation of always decoding the entire frame is not optimal.
> 

Absolutely, it will be easy to use BitmapImage and implement async decoders if only nativeImageForCurrentFrame() can cause full image decoding.


> 2. Specifically call out the desire of asynchronous image decoding in BitmapImage.
> 
> I added methods to BitmapImage to indicate the desire to use asynchronous decoding, i.e. BitmapImage::setUseAsynchronousDecoding(). I would like each site that uses BitmapImage to explicitly call out its intention. This is also important because a use of BitmapImage can disable asynchronous decoding, and that case can be handled explicitly.
> 
> So an example of usage of this would be:
> 
> if (image->isBitmapImage())
>   static_cast<BitmapImage*>(image)->setUseAsynchronousDecoding(true);
> 
> draw(image, ...);
> 
> And if the usage of the image requires synchronous decoding like that of a canvas, then:
> 
> if (image->isBitmapImage())
>   static_cast<BitmapImage*>(image)->setUseAsynchronousDecoding(false);
> 
> draw(image, ...);
> 
> I find it important to identify the intention of use at each site. And having this method in BitmapImage can make things more explicit.
> 

If we have setUseAsynchronousDecoding, we must use it with nativeImageForCurrentFrame() too.
For examples,

 if (image->isBitmapImage())
   static_cast<BitmapImage*>(image)->setUseAsynchronousDecoding(false);
 
 image->nativeImageForCurrentFrame();


There are so many sites that call nativeImageForCurrentFrame(), and there are only several (maybe 4) sites that need to call static_cast<BitmapImage*>(image)->setUseAsynchronousDecoding(true).
BitmapImage is used for various purposes such as <image>, backing stores, theme images, etc. I think it is hard for all developers to remember to call setUseAsynchronousDecoding(false) before draw(image, ...) and nativeImageForCurrentFrame().
So, I think the default policy should be to decode synchronously. When we want async decoding, we need to give the hint to BitmapImage.
For example,

 draw(RetainedModeBitmapImage(image), ...);

I think you want to ask why I put RetainedModeBitmapImage in here again.

In the following situation, BitmapImage cannot determine whether to decode synchronously or asynchronously.

 #1 RenderImage calls

 if (image->isBitmapImage())
   static_cast<BitmapImage*>(image)->setUseAsynchronousDecoding(true);
 
 draw(image, ...);


 #2 CanvasRenderingContext2D calls with the same image.

 draw(image, ...);  <- We should do the following jobs: 1. cancel async image decoding, 2. decode the image synchronously, 3. notify CachedImageClient::imageChanged(), 4. draw the image in succession.
                              BUT we cannot know that this call is from CanvasRenderingContext2D.

I suggested RetainedModeBitmapImage instead of a setter for this reason.


> 3. Limit the asynchronous interface only to ImageSource
> 
> I noticed in your patch that ImageSource and ImageDecoder are both asynchronous. I would like to reduce this down to only ImageSource. In fact, because ImageSource.cpp is only used by non-CG ports, we can just implement a parallel decoding mode there and keep ImageDecoder always synchronous.
> 

I think your concern is that ImageSource delegates its ImageDecoder to ParallelImageDecoder, maybe because ImageSource must directly call ImageDecoder::frameHasAlphaAtIndex and other metadata query functions.
If ParallelImageDecoder creates its own ImageDecoder, it will be OK.


> 4. Use of default state before frame is fully decoded is acceptable
> 
> We talked about the worries of returning default states before an image is fully decoded, I would like to go into details about each of these metadata getters and discuss why it is okay to return a default value:
> 
> a. frameOrientationAtIndex()
> 
> This method currently *always* returns a default value for all formats. Even so, this can properly be decoded from the EXIF data, so calling directly into the decoder can return it correctly without decoding the full frame, provided this is implemented in WebKit's JPEG decoder.
> 
> b. frameIsCompleteAtIndex()
> 
> This method returns whether a frame is fully decoded. This doesn't contradict asynchronous decoding at all: when asynchronous decoding is in use and the frame is not fully decoded, the frame is always incomplete. We can also check whether we have enough bytes to determine whether a frame is complete.
> 
> Also note there are only two main usages of this method:
> 
> i. GIF animation
> 
> For the case of asynchronous decoding, a frame that is not fully decoded is always incomplete, and this doesn't contradict the animation strategy.
> 
> ii. Extract image data for GraphicsContext3D for a fully loaded image
> 
> Use of frameIsCompleteAtIndex() is really not necessary there; it is used incorrectly to check whether the image can be decoded.
> 
> Ultimately this method should be protected and shouldn't be exposed. I think further refactoring of GraphicsContext3DSkia and GraphicsContext3DCairo will eliminate the use of this method. Once this method is fully internal to BitmapImage, there is no problem for asynchronous decoding.
> 
> c. frameHasAlphaAtIndex()
> 
> Note that current WebKit code already returns hasAlpha = true for an incomplete frame, and naturally a partially decoded image always has alpha. Again, this method is only meaningful if the image is fully decoded; otherwise hasAlpha should always be true anyway.
> 
> d. frameDurationAtIndex()
> 
> This is only meaningful for GIF, and for GIF it can be determined without decoding the frame. Also note that it is only meaningful once a frame is complete (because we don't care about duration until a frame is completely loaded / decoded). Grabbing the value of this attribute is the responsibility of the GIF decoder and hence should be delegated directly to the decoder.
> 
> 5. Metadata state mismatch
> 
> We discussed about having a mismatch of metadata state, for example:
> 
> image->currentFrameHasAlpha() // returns true
> ... do something else ..
> NativeImage* nativeImage = image->nativeImageForCurrentFrame() // nativeImage doesn't have alpha
> 
> This actually won't be the case, because of the following:
> 
> a. Suppose when we call image->currentFrameHasAlpha() the image has not been decoded. This method will return true, and when we later call image->nativeImageForCurrentFrame(), the nativeImage will either be null or contain an empty image because decoding is just about to start, and it *will* have alpha.

It is true when we need to decode an image asynchronously. If we need to decode synchronously and all data has been received, image->nativeImageForCurrentFrame() must return a fully decoded nativeImage, and that nativeImage can have alpha.
In a synchronous decoding case like CanvasRenderingContext2D, we cannot return either null or an empty image because we do not get a chance to draw again after decoding is complete.

Other metadata friends have similar problems. For example,

 image->frameIsCompleteAtIndex() // returns false
 ... do something else ..
 NativeImage* nativeImage = image->nativeImageForCurrentFrame() // nativeImage can be complete when decoding is synchronous.

This is the trickiest problem, and I have not found a promising solution yet. I want to hear your opinion.


> 
> b. Suppose when we call image->currentFrameHasAlpha() the image is partially decoded. This method will return true, and again image->nativeImageForCurrentFrame() will not decode anything synchronously, so the alpha state doesn't change.
> 

Ditto.


> c. Suppose when we call image->currentFrameHasAlpha() the image is fully decoded and cached. Then we return the correct value, and image->nativeImageForCurrentFrame() will not lead to a different result.
> 

Yes, exactly.


> Please let me know what you think about the approach in the posted patch. I think the next thing I'll do is review all the places where nativeImageForCurrentFrame is called and note whether asynchronous decoding should be used at each spot. My gut feeling is that asynchronous decoding is the common case while synchronous decoding is rare.

As I mentioned earlier, if our default policy is serial image decoding, it may not be necessary to review all the places where nativeImageForCurrentFrame is called.
I think most cases that WebKit users encounter are covered by asynchronous decoding, but in the WebKit code there are only 4 sites that we can change to decode asynchronously: RenderImage, RenderSVGImage, RenderObject (CSS border, mask and background images) and RenderLayerBacking (for <image>).
This is because these 4 cases have the ability to receive a decoding-complete notification via CachedImageClient::imageChanged() and draw the image again.
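The notification loop these four sites rely on can be reduced to a toy observer sketch. The names mirror WebKit's CachedImage / CachedImageClient, but the bodies (and the notifyDecodingComplete() hook) are hypothetical, for illustration only: the worker finishes decoding, imageChanged() fires on each client, and the client schedules a repaint.

```cpp
#include <cassert>
#include <vector>

struct CachedImageClient {
    bool needsRepaint = false;
    // A real RenderImage would invalidate and schedule a repaint here.
    void imageChanged() { needsRepaint = true; }
};

struct CachedImage {
    std::vector<CachedImageClient*> clients;
    // Hypothetical hook run on the main thread when the async decode lands.
    void notifyDecodingComplete()
    {
        for (CachedImageClient* client : clients)
            client->imageChanged();
    }
};
```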


I am very glad that there are people who dig this subject like you. I think we almost grasp how to do it.
Comment 43 Dongseong Hwang 2012-07-26 23:29:51 PDT
I encountered "You are not authorized to edit attachment 154812 [details]."
So, I commented it by hand. :)

(From update of attachment 154812 [details])
View in context: https://bugs.webkit.org/attachment.cgi?id=154812&action=review

> Source/WebCore/ChangeLog:140
> 2012-07-25  Alpha Lam  <hclam@chromium.org>
>
>        Remove destroyDecodedData() and decodedSize() from Image.

This is good, but it is not related to async image decoding.
How about submitting it as a separate patch first?


> Source/WebCore/platform/graphics/BitmapImage.h:95
>    bool m_hasAlpha : 1;
>    bool m_haveDefinedOrientation : 1;
>    bool m_haveDefinedDuration : 1;
>    bool m_haveDefinedIsComplete : 1;
>    bool m_haveDefinedHasAlpha : 1;

What is the purpose of these new variables?
Comment 44 Kwang Yul Seo 2012-07-27 01:20:40 PDT
I think we also need to talk more about testing. Because parallel image decoding (or asynchronous image decoding) extensively changes the behavior, we must have pretty good test coverage. But it seems image decoding has been historically under-tested due to a lack of tools. I filed Bug 92470 to discuss this issue.
Comment 45 Hin-Chung Lam 2012-07-30 11:50:47 PDT
(In reply to comment #44)
> I think we also need to talk more about testing. Because parallel image decoding (or asynchronous image decoding) extensively changes the behavior, we must have pretty good test coverage. But it seems image decoding has been historically under-tested due to a lack of tools. I filed Bug 92470 to discuss this issue.

Thanks for filing the bug on testing.

I have a refactoring patch in mind that will address the issues raised by Dongsung, in particular the use of metadata functions.

My goal for this patch is to remove currentFrameHasAlpha() and frameHasAlphaAtIndex() from Image. This is a tricky task, but I think it is achievable. Doing so would leave most metadata functions protected / internal to BitmapImage, and I shall have better answers to your questions.

I'll post a patch in a couple days.
Comment 46 Dongseong Hwang 2012-07-30 18:56:10 PDT
(In reply to comment #45)
> 
> My goal of this patch is to remove currentFrameHasAlpha() and frameHasAlphaAtIndex() from Image. This is a tricky task but I think is achievable. Doing so would leave most metadata functions protected / internal to BitmapImage and I shall have better answers to your questions.
> 

I think making most metadata functions protected / internal to BitmapImage is the most appropriate solution.

I also think we should remove the explicit metadata query methods when implementing parallel image decoders.
There may be two approaches to do this.

1. Remove all metadata query methods, rename BitmapImage::nativeImageForCurrentFrame to BitmapImage::currentFrame, and make the function return the current frame struct.
2. Remove all metadata query methods and call platform specific metadata query functions using nativeImagePtr.

Both cause a big refactoring, so I postponed it.
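As a rough illustration of route (1), the frame struct might look like the sketch below. Everything here is hypothetical, not WebKit API: the point is that a single call hands back the frame pointer and its metadata together, so callers can never observe a mismatched combination, and an incomplete frame always reports alpha.

```cpp
#include <cassert>

struct NativeImage {};

struct CurrentFrame {
    NativeImage* nativeImage; // null while an asynchronous decode is pending
    bool isComplete;
    bool hasAlpha;            // forced true until the frame is complete
    float duration;           // seconds; only meaningful for GIF
};

class BitmapImage {
public:
    bool m_complete = false;
    bool m_hasAlpha = false;
    float m_duration = 0;
    NativeImage m_frame;

    CurrentFrame currentFrame()
    {
        // A real implementation would snapshot decoder state under one
        // lock so the four fields stay mutually consistent.
        return { m_complete ? &m_frame : nullptr, m_complete,
                 m_complete ? m_hasAlpha : true, m_duration };
    }
};
```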

I appreciate you are doing this refactoring.
Comment 47 Hin-Chung Lam 2012-07-31 11:39:31 PDT
> I think making most metadata functions protected / internal to BitmapImage is the most appropriate solution.
> 
> I also think we should remove the explicit metadata query methods when implementing parallel image decoders.
> There may be two approaches to do this.

Agreed. Since we're doing this (asynchronous image decoding / parallel image decoding), we should do the refactoring that best fits the purpose.

> 
> 1. Remove all metadata query methods, rename BitmapImage::nativeImageForCurrentFrame to BitmapImage::currentFrame, and make the function return the current frame struct.
> 2. Remove all metadata query methods and call platform specific metadata query functions using nativeImagePtr.
> 
> Both cause a big refactoring, so I postponed it.
> 
> I appreciate you are doing this refactoring.

I'm going for route (2).

My goal is to remove metadata queries from Image/BitmapImage for all non-CG ports. Since CG uses their own ImageSource and image decoders we should keep those metadata queries intact.

Removing currentFrameHasAlpha/frameHasAlphaAtIndex is straightforward for Skia ports. Generally the ImageDecoder already sets this information via ImageFrame::setHasAlpha, and it is ultimately set on the platform nativeImagePtr, so this information should already be in nativeImagePtr.

The really tricky part is the combined use of currentFrameHasAlpha and frameHasAlphaAtIndex with GraphicsContext3D::extractImageData. This is used in texmap, where an image is represented by an array of bytes and not a nativeImagePtr. Fortunately there's very little use of this, and it can be fixed.

Once currentFrameHasAlpha and frameHasAlphaAtIndex are unplugged, frameDurationAtIndex and frameIsCompleteAtIndex are both internal to BitmapImage and only meaningful for GIF; fixing them to work with asynchronous image decoding will involve refactoring the GIF image decoder. Again, this is doable.
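To illustrate route (2) for hasAlpha, the idea is that the query moves from the decoder to the native image itself (on Skia ports this would map to SkBitmap::isOpaque() on the decoded bitmap). The stand-in bitmap type and function name below are ours, so the sketch stays self-contained:

```cpp
#include <cassert>

// Tiny stand-in for SkBitmap; real code would use SkBitmap::isOpaque().
struct FakeSkBitmap {
    bool opaque = false;
    bool isOpaque() const { return opaque; }
};

// What replaces Image::currentFrameHasAlpha(): the answer comes from the
// native image, with no round trip into the decoder.
bool nativeImageHasAlpha(const FakeSkBitmap* nativeImage)
{
    // No native image yet means an asynchronous decode is pending, so
    // report alpha to keep compositors from painting the region black.
    if (!nativeImage)
        return true;
    return !nativeImage->isOpaque();
}
```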
Comment 48 Dongseong Hwang 2012-07-31 19:39:08 PDT
(In reply to comment #47)
> My goal is to remove metadata queries from Image/BitmapImage for all non-CG ports. Since CG uses their own ImageSource and image decoders we should keep those metadata queries intact.
> 
> Removing currentFrameHasAlpha/frameHasAlphaAtIndex is straightforward for Skia ports. Generally the ImageDecoder already sets this information via ImageFrame::setHasAlpha, and it is ultimately set on the platform nativeImagePtr, so this information should already be in nativeImagePtr.
> 
> The really tricky part is the combined use of currentFrameHasAlpha and frameHasAlphaAtIndex with GraphicsContext3D::extractImageData. This is used in texmap, where an image is represented by an array of bytes and not a nativeImagePtr. Fortunately there's very little use of this, and it can be fixed.
> 
> Once currentFrameHasAlpha and frameHasAlphaAtIndex are unplugged, frameDurationAtIndex and frameIsCompleteAtIndex are both internal to BitmapImage and only meaningful for GIF; fixing them to work with asynchronous image decoding will involve refactoring the GIF image decoder. Again, this is doable.

Fantastic! I'm looking forward to your patch. It will replace the patch we are preparing (Bug 90443). After you submit it, I'll merge the parallel image decoders with your patch.
Comment 49 Dongseong Hwang 2012-08-01 01:32:10 PDT
UPDATE! Here are the test results.

The tests were reinforced in two ways.
1. Response times of interactive web pages.
2. First painting and response time on a 2-core embedded device.

Previously, we measured the first painting time on a high-end PC (Xeon 5650).
We were curious about two questions.
1. How can we show other advantages of parallel image decoders in addition to the first painting time?
2. Do parallel image decoders perform well on embedded devices?

See the results in the following link.
https://docs.google.com/spreadsheet/pub?key=0Ar2smwimcenMdGpnTEw2clZjSTNkbXNFNFM5dkYyRGc&output=html


@ Summary of the results
1. The response times of interactive web pages improved by approximately 20%.
2. The embedded device shows a similar performance enhancement to the high-end PC.


@ Test Environment
1. High-end PC: Intel Xeon CPU X5650 @ 2.67GHz × 6
2. Embedded device: Pandaboard, ARM Cortex-A9 @ 1.2GHz × 2

We used WebKit1 compiled with Qt 4.8.1.



@ Details of each test

eBay Scrolling Test: Gmarket is South Korea's eBay. We measured the FPS of scrolling the page, which included many pretty girls' pictures.
High-end PC: Measured 99 FPS in the threaded case. The FPS limit in Qt WebKit1 is 100 FPS. I think the threaded case might exceed 99 FPS in other ports.
Embedded device: 21% performance improvement.

35 Images: There are 35 large images scaled down to fit on the screen.
High-end PC: The first painting time is four times faster when MemoryCache cached raw data.
Embedded device: The first painting time is 20% faster when MemoryCache cached raw data.

DOMContentLoaded: There are 3 large images that are scaled down to fit on one page. This test belongs to IE Test Drive.
High-end PC: The first painting time is 56% faster when MemoryCache cached raw data.
Embedded device: The first painting time is 20% faster when MemoryCache cached raw data.

RomainGuy: Amateur photographer's blog.
High-end PC: The first painting time is 17% faster when MemoryCache cached raw data.
Embedded device: It makes very little difference.

Apple: iMac's introduction page
High-end PC: The first painting time is 82% faster when MemoryCache cached raw data.

Tumblr : There are many medium size images in the blog.
High-end PC: The first painting time is 7.5% faster when MemoryCache cached raw data.


@ The interpretation of the test results
1. Interactive pages' response times improved. Image decoding runs off the main thread, so the main thread can concentrate on animation or scrolling.
2. 35 Images and DOMContentLoaded improved the first painting time significantly, because decoding several large images takes advantage of multiple CPU cores.
On the other hand, RomainGuy, Apple and Tumblr need a closer look: although the three sites are very similar, only the Apple case shows an exceptionally high performance improvement.
In the Apple case, a lot of work is done on the main thread after image decoding is requested from the parallel image decoders. Web Inspector shows that CSS styling, layout and rendering are performed continuously. However, in the RomainGuy and Tumblr cases, the request for image decoding comes near the end of the page's loading and painting. After the main thread requests image decoding, it usually waits for decoding to complete without any other heavy jobs running. I removed JavaScript code to prevent resources from loading dynamically, which could exacerbate the situation.
Currently, the trigger conditions for the parallel image decoders are the image size, the number of CPU cores and the image format. In the future, we can add the main thread's workload to the trigger conditions, to reduce unnecessary thread communication overhead.
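The trigger conditions above could be sketched as a simple predicate. This is a hypothetical illustration, not the actual patch: the function name, the size threshold, and the format strings are all assumptions.

```cpp
#include <cstddef>
#include <string>
#include <thread>

// Hypothetical sketch of the trigger conditions: decode in parallel only when
// the image is large enough, more than one core is available, and the format
// is a supported single-frame format (GIF/ICO are excluded).
static bool shouldDecodeInParallel(size_t encodedByteSize, const std::string& format)
{
    const size_t minimumSizeWorthOffloading = 100 * 1024; // assumed threshold
    if (encodedByteSize < minimumSizeWorthOffloading)
        return false; // too small: thread communication overhead would dominate
    if (std::thread::hardware_concurrency() <= 1)
        return false; // no spare core to decode on
    if (format == "gif" || format == "ico")
        return false; // multiframe formats stay on the synchronous path
    return true;
}
```

Adding the main thread's workload, as suggested above, would mean one more early-return clause in such a predicate.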


@ Settings for testing
1. All test data was stored locally in order to avoid the effects of network latency. I removed most of the JavaScript code in order to prevent resources from loading dynamically.
2. The difference in full loading time between the threaded and original cases is negligible. This is because network latency is the dominant factor, so the main thread often waits until subresource loading finishes. We kept most CachedImages in MemoryCache for a more precise measurement of image decoding time.

Code stub for #2:
settings()->setUsesPageCache(false);
memoryCache()->setDeadDecodedDataDeletionInterval(0.01);
memoryCache()->setCapacities(0, 256 * 1024 * 1024, 256 * 1024 * 1024);


@ Details of the test embedded device
- Ubuntu 12.04 on Pandaboard. http://pandaboard.org/
- Xorg occupies about 35% of the CPU, so the parallel image decoders cannot fully use both cores.
- Qt was configured as in Bug 84321.
- printf from a shared library does not output anything on the terminal, so I compiled WebKit statically.
- The static build caused crashes on sites with external JavaScript files, so I tested only the sites that do not crash.
Comment 50 Hin-Chung Lam 2012-08-01 15:12:47 PDT
Created attachment 155898 [details]
Remove Image::currentFrameHasAlpha()

In this patch I removed Image::currentFrameHasAlpha() and defined it only in the CG port.

I also removed the usages of this method in multiple places.
Comment 51 Hin-Chung Lam 2012-08-01 15:13:36 PDT
Created attachment 155899 [details]
Remove Image::currentFrameHasAlpha()

Uploaded again and marked as a patch.
Comment 52 Hin-Chung Lam 2012-08-01 16:20:16 PDT
Now that I have drafted a patch to remove currentFrameHasAlpha() and understand the problem better, I'm going back to our earlier discussion.

> 
> I suggested RetainedModeBitmapImage instead of setter because of it.

I think having a wrapper class is fine, as long as there's only one way to trigger image decoding, i.e. nativeImageForCurrentFrame().

> 
> 
> > 3. Limit the asynchronous interface only to ImageSource
> > 
> > I noticed in your patch that ImageSource and ImageDecoder are both asynchronous. I would like to reduce this down to only ImageSource. In fact because ImageSource.cpp is only used by non-cg port, we can just implement a parallel decoding mode there and keep ImageDecoder always synchronous.
> > 
> 
> I think your concern is that ImageSource delegates its ImageDecoder to ParallelImageDecoder, maybe because ImageSource must directly call ImageDecoder::frameHasAlphaAtIndex and other metadata query functions.
> If ParallelImageDecoder creates its own ImageDecoder, it will be okay.

My concern here is to have a simpler interface and contract between classes, and having fewer asynchronous interfaces helps. My suggestion is to keep ImageDecoder synchronous but have ImageSource do the work to make things parallel.
Comment 53 Hin-Chung Lam 2012-08-01 16:44:06 PDT
Now that I have tried to remove currentFrameHasAlpha, I find the results unsatisfactory: I need to disable some optimizations to achieve this, which might upset users that rely on this API.

I think it's better to keep Image::currentFrameHasAlpha() but change the implementation in BitmapImage. As in the patch I drafted, BitmapImage shouldn't call ensureFrameIsCached() to decode a frame and return the result. Instead, BitmapImage should just return the value of ImageSource::frameHasAlphaAtIndex(), which calls into the image decoder to answer the query.

Yes, it has the problem that you mentioned before, e.g.

image->currentFrameHasAlpha(); // returns true
... some time later ...
NativeImagePtr frame = image->nativeImageForCurrentFrame(); // frame doesn't have alpha

However, it really doesn't matter: assuming an image has alpha is safe. Look at ImageSourceCG.cpp, for example; it shows that currentFrameHasAlpha() doesn't need to return the exact value.

bool ImageSource::frameHasAlphaAtIndex(size_t)
{
    if (!m_decoder)
        return false;

    CFStringRef imageType = CGImageSourceGetType(m_decoder);

    // Return false if there is no image type or the image type is JPEG, because
    // JPEG does not support alpha transparency.
    if (!imageType || CFEqual(imageType, CFSTR("public.jpeg")))
        return false;

    // FIXME: Could return false for other non-transparent image formats.
    // FIXME: Could maybe return false for a GIF Frame if we have enough info in the GIF properties dictionary
    // to determine whether or not a transparent color was defined.
    return true;
}

Instead, it is enough for currentFrameHasAlpha to return a best guess. By calling directly into each image decoder we can answer this query well enough until the image is fully decoded.
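For the non-CG ports, the same best-guess idea could be approximated by sniffing the encoded signature bytes instead of decoding. This is a hypothetical sketch: `bestGuessFrameHasAlpha` and its signature are illustrative, not WebCore API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical "best guess" alpha query in the spirit of ImageSourceCG.cpp:
// answer from the encoded bytes alone, without decoding a frame. Returning
// true when unsure is safe; only a false return enables opacity optimizations.
static bool bestGuessFrameHasAlpha(const uint8_t* data, size_t length)
{
    // JPEG streams start with the SOI marker 0xFF 0xD8 and have no alpha channel.
    if (length >= 2 && data[0] == 0xFF && data[1] == 0xD8)
        return false;
    // FIXME: could also return false for other known-opaque formats.
    return true; // assume alpha for PNG, GIF, and anything unrecognized
}
```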

So I suggest the first step is to remove the decoding caused by currentFrameHasAlpha, so that nativeImageForCurrentFrame is the only public method that triggers image decoding.

frameIsCompleteAtIndex and frameDurationAtIndex are for GIF animation, I would consider them a separate problem.
Comment 54 Dongseong Hwang 2012-08-01 18:20:47 PDT
(In reply to comment #52)
> Now that I have drafted a patch to remove currentFrameHasAlpha() and understand the problem better, I'm going back to our earlier discussion.
> 
> I think having a wrapper class is fine, as long as there's only one way to trigger image decoding, i.e. nativeImageForCurrentFrame().

Yes, we use the wrapper class only with nativeImageForCurrentFrame() and GraphicsContext::draw().


> > I think your concern is that ImageSource delegates its ImageDecoder to ParallelImageDecoder, maybe because ImageSource must directly call ImageDecoder::frameHasAlphaAtIndex and other metadata query functions.
> > If ParallelImageDecoder creates its own ImageDecoder, it will be okay.
> 
> My concern here is to have a simpler interface and contract between classes, and having fewer asynchronous interfaces helps. My suggestion is to keep ImageDecoder synchronous but have ImageSource do the work to make things parallel.

Frankly, I don't fully understand your concern. Our parallel image decoder implementation keeps ImageDecoder synchronous. Our implementation introduces ParallelImageDecoder, which has an asynchronous API because it decodes images asynchronously. I think this is an implementation detail of ImageSource, and other classes cannot know about ParallelImageDecoder.
If I have misunderstood your opinion, could you give me feedback again?
Comment 55 Hin-Chung Lam 2012-08-01 18:23:33 PDT
> 
> Frankly, I don't fully understand your concern. Our parallel image decoder implementation keeps ImageDecoder synchronous. Our implementation introduces ParallelImageDecoder, which has an asynchronous API because it decodes images asynchronously. I think this is an implementation detail of ImageSource, and other classes cannot know about ParallelImageDecoder.
> If I have misunderstood your opinion, could you give me feedback again?

Whoops, I misunderstood the code. Now that I look at it again, it makes sense to me.
Comment 56 Dongseong Hwang 2012-08-01 18:37:32 PDT
(In reply to comment #53)
> Now that I have tried to remove currentFrameHasAlpha, I find the results unsatisfactory: I need to disable some optimizations to achieve this, which might upset users that rely on this API.

Your patch said there are two clients: Texmap and box shadow painting.

I know Texmap well, and I think we can handle the Texmap case: I can remove currentFrameHasAlpha() completely from the code related to Texmap.
However, I lack knowledge about box shadow painting. My gut feeling still favors removing currentFrameHasAlpha() completely.
Otherwise, clients could be confused about whether to use currentFrameHasAlpha() or a platform-specific function.

> 
> Instead, it is enough for currentFrameHasAlpha to return a best guess. By calling directly into each image decoder we can answer this query well enough until the image is fully decoded.
> 
> So I suggest the first step is to remove the decoding caused by currentFrameHasAlpha, so that nativeImageForCurrentFrame is the only public method that triggers image decoding.
> 
> frameIsCompleteAtIndex and frameDurationAtIndex are for GIF animation, I would consider them a separate problem.

I want to try to remove the metadata query functions based on your patch, and then refactor the parallel image decoders on top of that preparation.
The parallel image decoders change almost only ImageSource, so I think they will get along with your async image decoder.

If you don't mind, I will post a patch in several hours.

I am curious about the progress of your async image decoder, how we should collaborate on the preparation, and how we should land our async and parallel image decoders.
Comment 57 Hin-Chung Lam 2012-08-01 18:58:47 PDT
> I know Texmap well, and I think we can handle the Texmap case: I can remove currentFrameHasAlpha() completely from the code related to Texmap.
> However, I lack knowledge about box shadow painting. My gut feeling still favors removing currentFrameHasAlpha() completely.
> Otherwise, clients could be confused about whether to use currentFrameHasAlpha() or a platform-specific function.

If we can remove this method that would be the best outcome.

This method is also used in LayerTreeCoordinator.cpp, which is in WebKit2 and which I have no knowledge of.

RenderImage's use of currentFrameHasAlpha comes from change 91628, and bug 65030 was filed to return the correct value for PNG and GIF. I would check with Simon Fraser to see if it is okay to remove the optimization for RenderImage. (Just added him to this bug.)

> I want to try to remove the metadata query functions based on your patch, and then refactor the parallel image decoders on top of that preparation.
> The parallel image decoders change almost only ImageSource, so I think they will get along with your async image decoder.
> 
> If you don't mind, I will post a patch in several hours.
> 
> I am curious about the progress of your async image decoder, how we should collaborate on the preparation, and how we should land our async and parallel image decoders.

Please go ahead and take the patch. We are still in the process of refactoring, which is what we're discussing now. :)

I think once we are both happy with the refactoring we can put the refactoring patch up for review first. If you're happy trading patches with buganizer just as we're doing now, we can continue with this method.
Comment 58 Dongseong Hwang 2012-08-01 19:10:55 PDT
(In reply to comment #57)
> 
> This method is also used in LayerTreeCoordinator.cpp which is in WebKit2 and I have no knowledge of.

It is related to Texmap in WebKit2. I know this.


> 
> Please go ahead and take the patch. We are still in the progress of refactoring, which is what we're discussing now. :)

Thank you for your kind response. I'm a relative newbie at WebKit, and I haven't collaborated with people in the WebKit community before. I'm very glad to collaborate with kind people here like you. :)
Comment 59 Simon Fraser (smfr) 2012-08-01 19:11:02 PDT
I would prefer not to remove the optimizations that currentFrameHasAlpha() allows. I hope we can make more, not less use of it in future.
Comment 60 Dongseong Hwang 2012-08-01 19:47:02 PDT
(In reply to comment #59)
> I would prefer not to remove the optimizations that currentFrameHasAlpha() allows. I hope we can make more, not less use of it in future.

I see. As Hin-Chung mentioned in comment #53, currentFrameHasAlpha returns a best guess.

I'll post the WIP patch of the changed parallel image decoders based on what we discussed.
Comment 61 Hin-Chung Lam 2012-08-01 19:49:37 PDT
(In reply to comment #59)
> I would prefer not to remove the optimizations that currentFrameHasAlpha() allows. I hope we can make more, not less use of it in future.

I agree that having the alpha channel information can allow some nice optimizations. However, as it stands it is a blocker for parallel (or deferred) image decoding: in particular, the implementation requires decoding an entire frame to get this information, while we would like to start image decoding only during painting.

For most cases, getting the alpha channel information can be done differently. Many use cases of this method are optimizing texture upload or having an opaque texture. For those use cases, which involve a NativeImagePtr, each port can query the native image directly.

The use in RenderImage is the outstanding one that doesn't involve a NativeImagePtr. It seems to benefit a specific use case: according to http://trac.webkit.org/changeset/91628 it is the case of a JPEG <img> with a GIF background. Maybe we should do this optimization a different way to allow parallel image decoding? For example, halting GIF animation in the background?
Comment 62 Dongseong Hwang 2012-08-02 06:07:10 PDT
Created attachment 156058 [details]
Parallel image decoders based on Lam's preparation patches.

This is WIP. You can build this patch on Qt. I think this patch can be compiled on other ports, but I have not tested that yet.

I merged parallel image decoders with Lam's preparation patches.

In this patch, only draw() and nativeImagePtr() can trigger parallel image decoding.

The difference between the previous implementation and this one is that ImageSource no longer delegates its ImageDecoder to ParallelImageDecoder.
ParallelImageDecoder creates its own ImageDecoder in ParallelImageDecoder::initialize(), so the main thread is free to access ImageSource::m_imageDecoder.
When we call the metadata query functions on BitmapImage, BitmapImage calls into ImageSource, and ImageSource calls into ImageDecoder.

I am curious whether this implementation can get along with Lam's deferred image decoder.
I welcome all folks' opinion. :)
Comment 63 Hin-Chung Lam 2012-08-02 15:29:39 PDT
Created attachment 156182 [details]
Remove sync image decoding from currentFrameHasAlpha and currentFrameOrientation

Thanks for merging and for the new patch. I still have to go through your patch. In the meantime, I want to put our focus on removing sync image decoding from the metadata functions. This patch removes sync image decoding from currentFrameHasAlpha() and currentFrameOrientation(), which should allow async image decoding for images other than GIF. I want to put this up for review if you think this is right.
Comment 64 Hin-Chung Lam 2012-08-02 17:02:01 PDT
Comment on attachment 156058 [details]
Parallel image decoders based on Lam's preparation patches.

View in context: https://bugs.webkit.org/attachment.cgi?id=156058&action=review

> Source/WebCore/platform/graphics/ImageSource.cpp:191
> +    if (fileExt == "ico" || fileExt == "gif")

I looked again to see if it is possible to make BitmapImage::frameDurationAtIndex() and BitmapImage::frameIsCompleteAtIndex() not trigger image decoding. It proved very difficult to do correctly. I think we're much safer leaving these two methods untouched. This is okay because they are used only for GIF animation, which you have already excluded from parallel image decoding.

One way to loosen the restriction on this line and still have correct behavior is to check frameCount(): if frameCount() == 1, it's okay even for GIF and ICO.
Comment 65 Dongseong Hwang 2012-08-02 19:10:05 PDT
(In reply to comment #63)

When I pushed the Publish button in Review Patch, Bugzilla said "You may not search, or create saved searches, without any search terms."
So, I comment by hand.


> Source/WebCore/ChangeLog:3
> +    BitmapImage::currentFrameOrientation() shouldn't decode the current frame

You already made BitmapImage::currentFrameOrientation() and BitmapImage::currentFrameHasAlpha() not decode the current frame in the previous patch.
However, this patch is a little bit different. :)



> Source/WebCore/platform/graphics/ImageSource.cpp:193
> -     return !frameIsCompleteAtIndex(index)
> -         || m_decoder->frameBufferAtIndex(index)->hasAlpha();
> +    return !m_decoder->isAllDataReceived()
> +        || m_decoder->frameHasAlphaAtIndex(index);

We discussed that frameIsCompleteAtIndex should also not trigger decoding, but in this patch frameIsCompleteAtIndex can still trigger decoding.
I think your concern is that GIF animation needs to trigger decoding when frameIsCompleteAtIndex is called.
There are only two places which need to decode when calling frameIsCompleteAtIndex; both are in BitmapImage::startAnimation.
How about changing the code like this?

In BitmapImage::startAnimation, around line 433:

    ensureFrameIsCached(nextFrame);          <-- add this line.
    if (!m_allDataReceived && !frameIsCompleteAtIndex(nextFrame)) {
        return;
    }
Comment 66 Dongseong Hwang 2012-08-02 19:16:19 PDT
(In reply to comment #64)
> I looked again to see if it is possible to make BitmapImage::frameDurationAtIndex() and BitmapImage::frameIsCompleteAtIndex() not trigger image decoding. It proved very difficult to do correctly. I think we're much safer leaving these two methods untouched. This is okay because they are used only for GIF animation, which you have already excluded from parallel image decoding.
As I mentioned in comment #65, I think we can touch BitmapImage::frameIsCompleteAtIndex() because only BitmapImage::startAnimation calls it.
It is hard to make the GIF decoder provide frameDurationAtIndex without decoding, but frameDurationAtIndex is also called only from BitmapImage::startAnimation.
I think if we manually trigger image decoding at the start of BitmapImage::startAnimation, we can touch both BitmapImage::frameDurationAtIndex() and BitmapImage::frameIsCompleteAtIndex(), because they are implementation details of BitmapImage.

> To loosen the restriction of this line and still have the correct behavior is to check for frameCount(). If the frameCount() == 1 it's okay to be GIF and ICO.
Absolutely!
Comment 67 Hin-Chung Lam 2012-08-02 19:49:06 PDT
> You already made BitmapImage::currentFrameOrientation() and BitmapImage::currentFrameHasAlpha() not decode the current frame in the previous patch.
> However, this patch is a little bit different. :)

Yeah sorry about going backwards.

> In BitmapImage::startAnimation, around line 433:
> 
>     ensureFrameIsCached(nextFrame);          <-- add this line.
>     if (!m_allDataReceived && !frameIsCompleteAtIndex(nextFrame)) {
>         return;
>     }

I think just unplugging image decoding from currentFrameHasAlpha() can get us going with everything except multiframe ICO and GIF. The benefit of supporting ICO and GIF seems small given the complexity of doing it correctly.

My thought is that frameIsCompleteAtIndex() and frameDurationAtIndex() are only needed for GIF animation, which your parallel image decoding doesn't support anyway. Thinking about what happens if we make animated GIF decoding asynchronous, I think we'll run into problems if frameIsCompleteAtIndex() doesn't decode: if you look into startAnimation() (the for loop), it checks whether the next frame is complete before switching to the new frame, and because decoding of that frame has not started, it will never be complete and hence the animation will stall. That's why I backed off from removing decoding from frameIsCompleteAtIndex() and frameDurationAtIndex().

startAnimation() is only meaningful for multiframe images, and with your patch those are not decoded in parallel anyway. I suggest we exclude this for the moment: make single-frame image decoding work well first, and then tackle multiframe images (which involves fixing frameIsCompleteAtIndex and frameDurationAtIndex).
Comment 68 Hin-Chung Lam 2012-08-02 19:55:26 PDT
The latest patch I uploaded is a tiny change, but it does what our deferred image decoding needs, i.e. only draw triggers decoding of a single frame image. If that looks good to you I'll put it up for review.

And thanks for looking!
Comment 69 Dongseong Hwang 2012-08-02 20:24:20 PDT
(In reply to comment #67)
> I think just unplugging image decoding from currentFrameHasAlpha() can get us going with everything except multiframe ICO and GIF. The benefit of supporting ICO and GIF seems small given the complexity of doing it correctly.

I agree with you, so the parallel image decoders don't support ICO and GIF.

> 
> That's why I backed off from removing decoding from frameIsCompleteAtIndex() and frameDurationAtIndex().

I think we can make it clear when image decoding should start.
After our discussion, I think there are only two methods which should trigger decoding: BitmapImage::frameAtIndex and BitmapImage::startAnimation, not frameIsCompleteAtIndex() and frameDurationAtIndex().

If we don't stop frameIsCompleteAtIndex() from triggering image decoding, I'm afraid other developers will easily make the mistake of using frameIsCompleteAtIndex(), which could defeat parallel image decoding.
Comment 70 Dongseong Hwang 2012-08-02 20:31:51 PDT
(In reply to comment #68)
> The latest patch I uploaded is a tiny change, but it does what our deferred image decoding needs, i.e. only draw triggers decoding of a single frame image. If that looks good to you I'll put it up for review.

All LGTM. I'd like you to put it up for review. If reviewers review the deferred image decoding, could they review the parallel image decoders too?

> 
> And thanks for looking!

Thank you too. I really appreciate all discussions with you, Zoltan and other folks. :)
Comment 71 Hin-Chung Lam 2012-08-03 19:05:39 PDT
> If we don't stop frameIsCompleteAtIndex() from triggering image decoding, I'm afraid other developers will easily make the mistake of using frameIsCompleteAtIndex(), which could defeat parallel image decoding.

To do this I think we need to detach frameIsCompleteAtIndex() from general image frames and relate it only to animated GIF. My thought is createFrameAtIndex() (or in the future requestFrameAtIndex()) would include the state of the frame, i.e. isComplete, etc. This way we can push the logic of GIF animation into GIF image decoder (or another class).

Before that we'll need to modify GIFImageDecoder to provide this information before actually decoding a frame. It feels like a distraction to me, though.
Comment 72 Hin-Chung Lam 2012-08-03 19:07:20 PDT
I've created a bug: https://bugs.webkit.org/show_bug.cgi?id=93171 to remove image decoding from frameHasAlphaAtIndex and frameOrientationAtIndex. Patch is also uploaded for review.
Comment 73 Dongseong Hwang 2012-08-04 01:01:15 PDT
(In reply to comment #71)
> To do this I think we need to detach frameIsCompleteAtIndex() from general image frames and relate it only to animated GIF. My thought is createFrameAtIndex() (or in the future requestFrameAtIndex()) would include the state of the frame, i.e. isComplete, etc. This way we can push the logic of GIF animation into GIF image decoder (or another class).

As you mentioned, only GIF animation uses frameIsCompleteAtIndex() and frameDurationAtIndex(), and neither method is public, so both are implementation details.
I don't want other developers to misuse these methods after we complete the async image decoding implementations, so I renamed both methods to prevent mistakes:
frameDurationAtIndexDecodingIfNecessary()
frameIsCompleteAtIndexDecodingIfNecessary()

You can see this in the next patch that I will post.
Comment 74 Dongseong Hwang 2012-08-04 01:15:33 PDT
Created attachment 156525 [details]
Parallel image decoders based on Lam's preparation patches.

I fixed some bugs.

This patch is huge, so I made a GitHub repository. I will update the repository in real time. :)
https://github.com/luxtella/webkit
git://github.com/luxtella/webkit.git
parallel-image-decoders branch

I rebased the parallel image decoders on Lam's preparation patches, including Bug 93171.
I hope Lam's patches land ASAP, because our parallel image decoders depend on them.
Comment 75 Hin-Chung Lam 2012-08-06 16:02:13 PDT
Dongsung,

Thanks for setting up the repo. To get to parallel image decoding, I think we should do it in two stages; here are some concrete steps:

Stage 1: Asynchronous image decoding

AI 1: Remove unnecessary ops that trigger image decoding: https://bugs.webkit.org/show_bug.cgi?id=93171

AI 2: Modify testing framework, adding more layout test and pixel test for image decoding, add flags to enable asynchronous image decoding in testing framework

AI 3: Break out changes to ImageSource to be asynchronous and review

AI 4: Define and implement AsynchronousImageDecoder, this will just forward to existing synchronous image decoders, enable it on Qt port, test with layout test

AI 5: Break out RetainedModeBitmapImage, add usage to WebCore, test with layout test to ensure no failures

AI 6: Implement AsynchronousImageDecoder with proper asynchronous operations in BitmapImage, test with layout test to ensure no failures

-- WebKit tested and ready to use asynchronous image decoders, at least on Qt port --

Stage 2: Parallel image decoding

AI 7: Implement parallel image decoding with tests


I think AIs 1 to 6 are things that we should both contribute to and cross-verify. I'm going to work on testing to cover cases currently missed; it would be great if you could do AI 3, which is to break out a small patch that refactors ImageSource to be asynchronous, for review. Thanks.
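AI 4's forwarding idea could be prototyped roughly as follows. This is a hypothetical sketch using std::async: `SynchronousDecoder` and the shape of `AsynchronousImageDecoder` are placeholders, not the proposed WebCore API.

```cpp
#include <cstdint>
#include <future>
#include <vector>

// Placeholder for an existing synchronous decoder (e.g. a WebCore ImageDecoder).
struct SynchronousDecoder {
    std::vector<uint8_t> decode(const std::vector<uint8_t>& encoded)
    {
        // Stand-in for real decoding: pretend each encoded byte expands
        // to one 4-byte pixel.
        return std::vector<uint8_t>(encoded.size() * 4, 0);
    }
};

// Hypothetical AsynchronousImageDecoder: it only forwards to the synchronous
// decoder on another thread, matching AI 4 ("this will just forward to
// existing synchronous image decoders").
class AsynchronousImageDecoder {
public:
    std::future<std::vector<uint8_t>> decode(std::vector<uint8_t> encoded)
    {
        // Copy the encoded bytes into the task so the worker owns its input.
        return std::async(std::launch::async, [this, encoded] {
            return m_decoder.decode(encoded);
        });
    }

private:
    SynchronousDecoder m_decoder;
};
```

Keeping the decoder itself synchronous and pushing the asynchrony into a thin wrapper like this is exactly what makes AI 4 testable in isolation before the real parallel path lands.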
Comment 76 Dongseong Hwang 2012-08-06 20:54:38 PDT
(In reply to comment #75)
> Thanks for setting up the repo. To get to parallel image decoding, I think we should do it in two stages; here are some concrete steps:

I made a mistake setting up the repo last Saturday. You can access the repo now.
Thanks for suggesting these concrete steps.

> AI 3: Break out changes to ImageSource to be asynchronous and review

Ok, I will extract an async interface from ImageSource and file the bug.
I want to add unit tests to verify the async interface of ImageSource, but I heard that you and some reviewers were discussing image decoding tests and concluded that DRT should be improved first.
Still, I think we should have some unit tests for image decoding.
Could you explain in detail how we would build up the test frameworks to verify async image decoding? I think you have some plans. :)
Comment 77 Nick Carter 2012-08-09 14:55:50 PDT
I'm a little bit worried about the rendering behavior of the parallel image decoding as proposed here. I haven't interacted with a compiled binary containing these changes, but here is how I expect it to behave. luxtella or skyul, can you respond to the following analysis?

 - The first attempt to draw a BitmapImage essentially sets up a race between the parallel decode, and the draw operation's attempt to read the frame.
 - Given the timings involved, the draw operation will likely proceed before the decode operation really starts in earnest.
 - Usually, we'll expect the draw operations to complete next frame.
 - As a result, we'll visually see a flash of a transparent region while waiting for the image data to become available. That flash is arguably justifiable (?) since images are expected to load progressively anyway.
 - If an <img>'s data is available immediately (e.g. the compressed image data is in cache because it was prefetched via XHR), the racing parallel decode will typically add a frame of latency to the presentation of the image.
 - If the MemoryCache is under extreme pressure and thrashing, what used to be a performance problem (slower paint operations because allocation/decoding happens each draw operation) will devolve to a kind of nondeterministic behavior (some timing-dependent subset of decoded images -- possibly all images -- are never available at draw time, because the parallel decode work is evicted before it's consumed in the next frame).

Incurring flashing or an extra frame of latency for every new <img> element does not seem desirable if serial decoding would take only a small fraction of the frame time.
Comment 78 Dongseong Hwang 2012-08-10 05:22:35 PDT
(In reply to comment #77)
> I'm a little bit worried about the rendering behavior of the parallel image decoding as proposed here. I haven't interacted with a compiled binary containing these changes, but here is how I expect it to behave. luxtella or skyul, can you respond to the following analysis?

Very good questions! I have answered each question below. If you think my answers are not detailed enough, please let me know.

>  - The first attempt to draw a BitmapImage essentially sets up a race between the parallel decode, and the draw operation's attempt to read the frame.

There is no race at all. The parallel image decoder does not access the frame being read by the draw operation. It sends segments of decoded data to the main thread, and the main thread creates the frame by appending these segments.

Please refer to 3.6 Data transfers in the design document.
https://docs.google.com/document/pub?id=12gf7MhNHfupeR3GdRF-h2Vzg86viCbwXq8_JByHKlg0
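The segment transfer described in section 3.6 can be modeled with a standard producer/consumer queue. This is a simplified, hypothetical sketch with no WebKit types; `SegmentQueue` and its methods are illustrative. The point it demonstrates is the one above: the decoder thread never touches the frame the main thread draws from, it only posts segments.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Simplified model of the segment transfer: a decoder thread posts decoded
// byte segments, and the "main thread" appends them to its own frame buffer.
struct SegmentQueue {
    std::mutex mutex;
    std::condition_variable condition;
    std::queue<std::vector<uint8_t>> segments;
    bool finished = false;

    // Called on the decoder thread for each decoded segment.
    void post(std::vector<uint8_t> segment)
    {
        std::lock_guard<std::mutex> lock(mutex);
        segments.push(std::move(segment));
        condition.notify_one();
    }

    // Called on the decoder thread when decoding completes.
    void finish()
    {
        std::lock_guard<std::mutex> lock(mutex);
        finished = true;
        condition.notify_one();
    }

    // Called on the "main thread": append segments to the frame as they arrive.
    std::vector<uint8_t> drainAll()
    {
        std::vector<uint8_t> frame;
        std::unique_lock<std::mutex> lock(mutex);
        for (;;) {
            condition.wait(lock, [this] { return !segments.empty() || finished; });
            while (!segments.empty()) {
                std::vector<uint8_t>& segment = segments.front();
                frame.insert(frame.end(), segment.begin(), segment.end());
                segments.pop();
            }
            if (finished)
                return frame;
        }
    }
};
```

In the real design the "drain" step would run incrementally per message rather than blocking, which is what enables rendering partially decoded images.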

>  - Given the timings involved, the draw operation will likely proceed before the decode operation really starts in earnest.

If the parallel image decoder cannot create a frame, ImageSource returns 0 to BitmapImage.  BitmapImage::draw() skips drawing in this case.

void BitmapImage::draw(GraphicsContext* ctxt, const FloatRect& dstRect,
                       const FloatRect& srcRect, ColorSpace colorSpace, CompositeOperator compositeOp)
{
    ...
    NativeImageSkia* bm = nativeImageForCurrentFrame();
    if (!bm)
        return; // It's too early and we don't have an image yet.
    ...
}

>  - Usually, we'll expect the draw operations to complete next frame.

The parallel image decoder does not handle multi-frame images, because the gain is small while the complexity is high. There is no behavior change in this case.

>  - As a result, we'll visually see a flash of a transparent region while waiting for the image data to become available. That flash is arguably justifiable (?) since images are expected to load progressively anyway.

My feeling is that the parallel image decoder does not worsen the flash of a transparent region compared to the current implementation. However, I found one particular case which does cause flashing. I will explain it in more detail at the end of this comment.

>  - If an <img>'s data is available immediately (e.g. the compressed image data is in cache because it was prefetched via XHR), the racing parallel decode will typically add a frame of latency to the presentation of the image.

Even if an <img>'s data is available, the parallel image decoder can still be beneficial. For example, when the encoded data of all images is cached, loading the following site still causes about a 1-second UI hang-up. (http://www.dorothybrowser.com/test/webkitTest/imgdecode/bgimage-png/test.html) But I couldn't feel any hang-up after applying the parallel image decoder.

If ImageSource has the decoded frame, it simply returns the decoded frame without triggering parallel image decoding.
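That short-circuit can be sketched as a check on a frame cache before any work is scheduled (hypothetical helper types, not the real ImageSource API):

```cpp
#include <cassert>
#include <map>
#include <optional>

struct DecodedFrame { int id = 0; };

class FrameCache {
public:
    // Returns the cached frame if present; otherwise schedules an
    // asynchronous decode (elided here) and reports nothing yet.
    std::optional<DecodedFrame> frameAtIndex(int index) {
        auto it = m_frames.find(index);
        if (it != m_frames.end())
            return it->second;          // cache hit: no decode triggered
        scheduleParallelDecode(index);  // cache miss: kick off background work
        return std::nullopt;
    }
    void insert(int index, DecodedFrame frame) { m_frames[index] = frame; }
private:
    void scheduleParallelDecode(int) { /* hand off to the decoder thread pool */ }
    std::map<int, DecodedFrame> m_frames;
};
```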


>  - If the MemoryCache is under extreme pressure and thrashing, what used to be a performance problem (slower paint operations because allocation/decoding happens each draw operation) will devolve to a kind of nondeterministic behavior (some timing-dependent subset of decoded images -- possibly all images -- are never available at draw time, because the parallel decode work is evicted before it's consumed in the next frame).

Good question! We fixed Bug 90721 to avoid this problem. After Bug 90721, MemoryCache no longer removes the decoded data of a CachedImage if the image will be rendered soon. So the parallel image decoding work is never evicted if clients need to render the image.

> 
> Incurring flashing or an extra frame of latency to the visibility of every new <img> element, this does not seem desirable if serial decoding would only take a small fraction of the frame time.

As I mentioned above, I don't think the parallel image decoder worsens the situation here. However, I found your concern makes sense in the following site: http://www.dorothybrowser.com/test/webkitTest/imgdecode/gmarket/kiru.kr/kiru/html/singlelongopst.htm

After applying parallel image decoding, we noticed flashing when we put the mouse cursor over images. When I first saw this, I was very curious why the image behind was flashing. After inspecting the site, I realized the setup is not two stacked images (a transparent image in front and another behind); it is one normal image and one hover image.

When we put the mouse over these images, the normal image is set to 'display:none' and the hover image is set to 'display:block'. So while the parallel image decoder is decoding the hover image, WebKit renders a white background instead, which causes flashing. If the site developer had layered a transparent image on top of the image behind, the flashing would have disappeared. This is one rare case where the parallel image decoder actually causes flashing; I expect we won't encounter this kind of flashing on most sites.

I'd like to hear opinions from everyone on whether this behavior is acceptable or not.
Comment 79 Dongseong Hwang 2012-08-10 05:37:30 PDT
(In reply to comment #74)
> https://github.com/luxtella/webkit
> git://github.com/luxtella/webkit.git
> parallel-image-decoders branch

Updated the github repository to reflect the latest changes.
Comment 80 Hin-Chung Lam 2012-08-10 16:10:25 PDT
> I want to listen opinions from all folks whether this behavior is acceptable or not.

The same issue of flashing was raised by James on #webkit-dev. My suggestion would be to experiment with starting image decoding during layout time.

Here's an experiment:

In RenderImage::layout(), after the layout rect is computed, you can check whether this->absoluteContentBox() and view()->viewRect() intersect. If they do, you can start image decoding there (given that frameCount() == 1). Then in ParallelImageDecoder, instead of canceling the last asynchronous image decode, it waits until the decode is completed. I suspect this can bring load time down even further for complex sites because decoding starts much earlier.
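The experiment can be sketched with standalone code (hypothetical names; the RenderImage and FrameView internals are elided). Decoding is kicked off once the layout rect is known to intersect the viewport, and the paint path blocks on the in-flight decode instead of canceling it:

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

struct IntRect {
    int x = 0, y = 0, w = 0, h = 0;
    bool intersects(const IntRect& o) const {
        return x < o.x + o.w && o.x < x + w && y < o.y + o.h && o.y < y + h;
    }
};

struct Frame { int width = 0; };

class LayoutTriggeredDecoder {
public:
    // Called from a hypothetical RenderImage::layout() once the content
    // box is known; only single-frame images qualify, per the suggestion.
    void maybeStartDecode(const IntRect& contentBox, const IntRect& viewRect, int frameCount) {
        if (frameCount == 1 && contentBox.intersects(viewRect) && !m_future.valid())
            m_future = std::async(std::launch::async, [] {
                std::this_thread::sleep_for(std::chrono::milliseconds(30)); // simulated decode
                return Frame{64};
            });
    }
    // Paint path: wait for the in-flight decode rather than canceling it.
    // Returns an empty frame if no decode was ever started (offscreen image).
    Frame waitForFrame() {
        return m_future.valid() ? m_future.get() : Frame{};
    }
private:
    std::future<Frame> m_future;
};
```

Since layout runs before paint, the decode gets a head start; in the best case the future is already resolved by paint time and waitForFrame() returns immediately.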
Comment 81 Dongseong Hwang 2012-08-13 03:01:47 PDT
(In reply to comment #80)
> In RenderImage::layout() after layout rect is computed you can see if this->absoluteContentBox() and view()->viewRect() intersects. If they do you can start image decoding there (given that frameCount() == 1). Then in ParallelImageDecoder, instead of cancel the last asynchronous image decoding it waits until the decode is completed. I suspect this can bring down load time even further for complex sites because decoding starts much earlier.

If we should trigger image decoding before painting, I also think layout time would be the best. However, I'm concerned about the following issues.
1. The hooking is needed only by the parallel image decoder, not the deferred image decoder.
2. I think the interval between layout time and painting time is short, so many images will make the parallel image decoder block, which causes synchronization overhead. But we need to test this if we want to discuss it further.
Comment 82 Hin-Chung Lam 2012-08-13 11:02:35 PDT
(In reply to comment #81)
> (In reply to comment #80)
> > In RenderImage::layout() after layout rect is computed you can see if this->absoluteContentBox() and view()->viewRect() intersects. If they do you can start image decoding there (given that frameCount() == 1). Then in ParallelImageDecoder, instead of cancel the last asynchronous image decoding it waits until the decode is completed. I suspect this can bring down load time even further for complex sites because decoding starts much earlier.
> 
> If we should trigger image decoding before painting, I also think layout time would be the best. However, I'm concerned about the following issues.
> 1. the hooking is needed by only parallel image decoder, not deferred image decoder.

Yes, I agree. We won't be able to use NonImmediateBitmapImage, but I think we can live with that.

> 2. I think the interval between layout time and painting time is short. So, many images will make parallel image decoder stop, which causes synchronization overhead. But we need to test it if we want to discuss more about this.

The time may be small, but there should still be material improvements over the existing model. Because we kick-start decoding of multiple images during layout, the overall initial paint time would be limited roughly by the hardest-to-decode image.
Comment 83 Loïc Yhuel 2012-08-13 22:36:51 PDT
(In reply to comment #78)
> >  - If the MemoryCache is under extreme pressure and thrashing, what used to be a performance problem (slower paint operations because allocation/decoding happens each draw operation) will devolve to a kind of nondeterministic behavior (some timing-dependent subset of decoded images -- possibly all images -- are never available at draw time, because the parallel decode work is evicted before it's consumed in the next frame).
> 
> Good question! We fixed Bug 90721 to avoid this problem. After Bug 90721, MemoryCache no longer removes the decoded data of CahcedImage if image will be rendered soon. So the parallel image decoding work is never evicted if clients need to render the image.
> 
So, if I understand correctly, in these cases the number of decoded images in the cache could be a lot higher, so the slow rendering could be replaced by an out-of-memory error on embedded devices. Perhaps it would be better to disable parallel decoding there.
Comment 84 Dongseong Hwang 2012-08-16 02:17:50 PDT
(In reply to comment #83)
> So, if I understand correctly, in these cases the number of decoded images in the cache could be a lot higher, so the slow rendering could be replaced by an out of memory error on embedded devices. Perhaps it would be better to disable parallel decoding.

Yes. Bug 90721 can cause OOM on embedded devices, as you noted, but that patch is an image-decoding policy change regarding space-time trade-offs. It applies equally to the current serial image decoding and to parallel image decoding. The parallel image decoder does not use more memory because of this patch.

Previously, the memory cache caused repeated image decoding when the capacity of MemoryCache was exceeded. Animation and scrolling could slow down dramatically because both could trigger image decoding every frame. Bug 90721 fixed this behavior by preventing decoded images inside the viewport from being destroyed. You can argue about this behavior change, but it is not directly relevant to parallel image decoding.

It is true that the parallel image decoder uses more space: kernel objects to run additional threads, and temporary buffers used to transfer encoded and decoded data between the main thread and the decoder threads. If you want to avoid this additional memory consumption on embedded devices, you can turn the parallel image decoder off.

Currently, MemoryCache removes only the decoded data of live resources. If the encoded data of cached images is large enough, WebKit on embedded devices can still hit an OOM error. We need a more sophisticated policy to handle this situation.
Comment 85 Peter Gal 2013-03-18 03:53:52 PDT
While doing some tests, there were a few times when the JPEG decoder printed errors (like this: "Corrupt JPEG data: 76 extraneous bytes before marker 0xd9") and some of the partially decoded images contained errors.

Did you encounter such errors? Do you have any idea why this is happening or how it can be fixed?