Bug 111179 - [Cairo] Surface pointer passed to asNewNativeImage() might be freed.
Summary: [Cairo] Surface pointer passed to asNewNativeImage() might be freed.
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Platform (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Martin Robinson
URL:
Keywords:
: 16200 48895 80754 (view as bug list)
Depends on:
Blocks:
 
Reported: 2013-03-01 07:12 PST by Zoltan Herczeg
Modified: 2017-03-06 10:38 PST (History)
15 users (show)

See Also:


Attachments
early draft patch (866 bytes, patch)
2013-03-01 07:12 PST, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
patch (2.14 KB, patch)
2013-03-12 03:19 PDT, Zoltan Herczeg
no flags Details | Formatted Diff | Diff
refcount-imageframe-data.patch (2.83 KB, patch)
2014-04-10 10:17 PDT, Ed Catmur
no flags Details | Formatted Diff | Diff
imageframe-use-refcounted-array.patch (2.36 KB, patch)
2014-04-10 10:20 PDT, Ed Catmur
no flags Details | Formatted Diff | Diff
Workaround patch (1.78 KB, patch)
2014-06-16 02:23 PDT, cand
no flags Details | Formatted Diff | Diff
Archive of layout-test-results from webkit-ews-10 for mac-mountainlion-wk2 (598.34 KB, application/zip)
2014-06-16 16:13 PDT, Build Bot
no flags Details
Spinning cube files for demonstrating crashes. (281.72 KB, application/zip)
2015-08-09 09:53 PDT, Jeremy Zerfas
no flags Details
Patch (2.55 KB, patch)
2016-03-07 14:38 PST, Martin Robinson
no flags Details | Formatted Diff | Diff
Patch (2.52 KB, patch)
2016-03-09 19:08 PST, Martin Robinson
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zoltan Herczeg 2013-03-01 07:12:21 PST
Created attachment 190964 [details]
early draft patch

Hi Martin,

you made a patch a year ago which refactored how Cairo wrap surfaces. The side effect of your "just a refactoring" patch is that when the frame data is moved, the cairo surface does not know about it. For example, this might happen in GIFImageDecoder.cpp:90

m_frameBufferCache.resize(reader.images_count); // Reallocates all pixel buffers.

When the following image is loaded:

<html><body>
<img width=90 height=90 src="http://images.animationfactory.com/thw/thw17/AFCT/20090116/FTW/i/12316066.gif?mouse_new_year_2012_confetti_sm_wm">
</body></html>

This was actually a really nasty bug, which took me a week to debug it, because it unfortunately disappears when valgrind or gdb is used. It appears only when WebKit execution is fast and the download speed is slow (some kind of racing condition). So I finally used mprotect to force a crash on memory write.

A draft patch is attached, but I don't think this is the best solution. Any suggestions are welcome.
Comment 1 Martin Robinson 2013-03-01 09:35:03 PST
Thanks for finding this! I think it's probably safe to just rip out NativeImageCairo altogether. That's the patch you are referring to right? I think that we can instead do all per-backend caching via Cairo surface data and PlatformContextCairo.
Comment 2 Zoltan Herczeg 2013-03-01 14:19:09 PST
Sorry, I forgot the bug report:
https://bugs.webkit.org/show_bug.cgi?id=83611

What do you mean by ripping out NativeImageCairo? What should we use instead?
Comment 3 Martin Robinson 2013-03-05 10:25:43 PST
(In reply to comment #2)
> Sorry, I forgot the bug report:
> https://bugs.webkit.org/show_bug.cgi?id=83611
> 
> What do you mean by ripping out NativeImageCairo? What should we use instead?

Essentially I mean that we could replace the NativeImageCairo class with a pointer to a raw cairo_surface_t in the way it was before I introduced NativeImageCairo.
Comment 4 Zoltan Herczeg 2013-03-08 05:55:34 PST
> Essentially I mean that we could replace the NativeImageCairo class with a pointer to a raw cairo_surface_t in the way it was before I introduced NativeImageCairo.

I realized this will not help. The problem is, that ImageFrame has a Vector<PixelData> m_backingStore member, which is reallocated when the frame is copied, so the original buffer becomes invalid. However, the image surface created by the ImageFrame still uses the old buffer. The point is, in case of resize (and only in this case!) the m_backingStore should keep its value.
Comment 5 Zoltan Herczeg 2013-03-12 03:19:35 PDT
Created attachment 192693 [details]
patch

This is the best solution I can come up with. Please review.
Comment 6 Martin Robinson 2013-03-18 10:42:21 PDT
There are a couple things that I don't understand here. The first is why simply reverting my original patch doesn't fix the issue, if it did cause it. The second is why this isn't a problem for other platforms.
Comment 7 Zoltan Herczeg 2013-03-18 11:46:41 PDT
(In reply to comment #6)
> There are a couple things that I don't understand here. The first is why simply reverting my original patch doesn't fix the issue, if it did cause it. The second is why this isn't a problem for other platforms.

Q1: the surface points directly to the data() member of m_backingStore. So it doesn't matter if we have a NativeImage or we simply use a cairo surface, the issue is that the rgba array is freed. I don't think your patch caused it, perhaps just noone noticed that before.

Q2: Because they probably duplicate the buffer. That is safer, but the consumed memory is duplicated as well. I like the approach of cairo, it is clever.
Comment 8 Noam Rosenthal 2013-04-26 02:26:21 PDT
(In reply to comment #6)
> There are a couple things that I don't understand here. The first is why simply reverting my original patch doesn't fix the issue, if it did cause it. The second is why this isn't a problem for other platforms.

The same problem occurs for Qt.
Comment 9 Noam Rosenthal 2013-04-26 02:27:29 PDT
Comment on attachment 192693 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=192693&action=review

How about using a SegmentedVector for m_frameBufferCache instead?

> Source/WebCore/platform/image-decoders/ImageDecoder.h:245
> +#if USE(CAIRO)

Since this occurs for Qt as well, maybe we should have something more general.
Comment 10 Zoltan Herczeg 2013-05-03 03:50:50 PDT
> How about using a SegmentedVector for m_frameBufferCache instead?

Needs a lot of changes. Iterators, missing members (such as resize), etc. Probably most of them are trivial, but there are many of them. Do you think it is worth it?
Comment 11 Allan Sandfeld Jensen 2013-06-13 02:46:37 PDT
As far as I can tell, Qt now does a full copy of the data when creating a QPixmap, so we don't have the same bug currently.
Comment 12 Noam Rosenthal 2013-06-18 03:06:20 PDT
Martin, could you give this another look?
Comment 13 Martin Robinson 2013-06-20 12:27:28 PDT
(In reply to comment #11)
> As far as I can tell, Qt now does a full copy of the data when creating a QPixmap, so we don't have the same bug currently.

If both Qt and CG are doing this, it's probably much safer to follow the crowd. While this might be a minor optimization in Cairo's case, it seems that we should maintain consistent class behavior across platforms. If CG doesn't copy the buffer, we probably need to understand why it isn't an issue there.
Comment 14 Anton Obzhirov 2013-07-08 08:00:39 PDT
I've tried this patch for https://bugs.webkit.org/show_bug.cgi?id=16200 - seems to fix the issue. However I think it might be not the optimal solution as it involves extra copying of the image buffer. Did you guys consider some extra referencing mechanism for the data inside image frame?
Comment 15 Zoltan Herczeg 2013-07-08 23:39:19 PDT
(In reply to comment #14)
> I've tried this patch for https://bugs.webkit.org/show_bug.cgi?id=16200 - seems to fix the issue. However I think it might be not the optimal solution as it involves extra copying of the image buffer. Did you guys consider some extra referencing mechanism for the data inside image frame?

No, it actually makes the extra copy unnecessary, which happens when the frames array is resized. The copy constructor is the not optimal solution in this case.
Comment 16 Andreas Kling 2014-02-05 11:16:48 PST
Comment on attachment 192693 [details]
patch

Clearing review flag on patches from before 2014. If this patch is still relevant, please reset the r? flag.
Comment 17 Ed Catmur 2014-04-10 09:22:28 PDT
Is this patch likely to be accepted? The only alternative I can see is on Cairo (and only on Cairo) to replace m_backingStore with a PassNativeImagePtr that gets refcount-incremented when the ImageFrame is copied (when the frame buffer cache is resized).  I can code up an alternative patch if it would be preferred.
Comment 18 Ed Catmur 2014-04-10 10:17:21 PDT
Created attachment 229057 [details]
refcount-imageframe-data.patch

Alternative 1: use a refcounted Cairo image surface
Comment 19 Martin Robinson 2014-04-10 10:20:16 PDT
This reason this hasn't been fixed is that it's simply missing a minimal reproducible test case. If someone could indicate how to reproduce this issue I'd be happy to take a closer look.
Comment 20 Ed Catmur 2014-04-10 10:20:37 PDT
Created attachment 229059 [details]
imageframe-use-refcounted-array.patch

Alternative 2: use a RefCountedArray instead of Vector to hold the image data

This is probably more straightforward.  copyImageData is a little inefficient (it unnecessarily memsets the array) but we'd need to add to RefCountedArray (maybe a clone() method) to fix that.
Comment 21 Ed Catmur 2014-04-10 10:25:23 PDT
(In reply to comment #19)
> This reason this hasn't been fixed is that it's simply missing a minimal reproducible test case. If someone could indicate how to reproduce this issue I'd be happy to take a closer look.

This page reliably does it for me: http://math.stackexchange.com/questions/733754/visually-stunning-math-concepts-which-are-easy-to-explain

Essentially you need a page with large animations with plenty of frames and where the frame buffer cache gets resized while the animation is drawing.
Comment 22 Martin Robinson 2014-04-10 10:48:00 PDT
(In reply to comment #21)

> Essentially you need a page with large animations with plenty of frames and where the frame buffer cache gets resized while the animation is drawing.

I cannot reproduce just by viewing the page in MiniBrowser. Do I need a particular port or version of Cairo to see the crash?
Comment 23 Ed Catmur 2014-04-10 14:52:28 PDT
(In reply to comment #22)
> I cannot reproduce just by viewing the page in MiniBrowser. Do I need a particular port or version of Cairo to see the crash?

Shouldn't think so - the crash occurs when Cairo reads from freed memory, so it won't make any difference what the version of Cairo is.

It's more likely to depend on what allocator you're using - it might help to switch to the system allocator (#define USE_SYSTEM_MALLOC 1). You also need to give the allocator a chance to return pages to the OS; either that or switch to an allocator that poisons freed memory.
Comment 24 Zoltan Herczeg 2014-04-11 00:12:41 PDT
You need an animated image with more than one frame to see invalid result or crash. The default size of image list is 1, since most images are still images, so two frames are enough. However, if the resize happens in place, or the resized area contains pointers to valid memory regions (already mapped to the process address space) you will not see the crash. In the former case, the engine seems work correctly.

I would suggest a small (8x8) animated gif with 100+ frames (to avoid in place resizes), and do a pixel dump (invalid output can be detected).
Comment 25 cand 2014-06-16 01:19:50 PDT
(In reply to comment #20)
> Created an attachment (id=229059) [details]
> imageframe-use-refcounted-array.patch
> 
> Alternative 2: use a RefCountedArray instead of Vector to hold the image data

I'm using this patch, and google.com still randomly crashes. Doesn't really help that they change the gif each day.

Segfault in ImageFrame::copyBitmapData on the memcpy line. Only happens standalone and in gdb, I could not get it to happen in valgrind.

It's called from the resize line in GIFImageDecoder, but the backtrace is not really useful beyond that.
Comment 26 cand 2014-06-16 01:58:07 PDT
The alternative#2 patch also had visible corruption in the gif before the crash.

The "make a copy" approach from https://bugs.webkit.org/show_bug.cgi?id=16200#c11 works, but drops performance to unacceptably low numbers. There is no corruption or crash, but the gif plays at 1/4 the speed it should. Next I will test alt#1.
Comment 27 cand 2014-06-16 02:02:26 PDT
(In reply to comment #18)
> Created an attachment (id=229057) [details]
> refcount-imageframe-data.patch
> 
> Alternative 1: use a refcounted Cairo image surface

This patch crashes in the same place as alt#2.
Comment 28 cand 2014-06-16 02:04:31 PDT
(In reply to comment #27)
> (In reply to comment #18)
> > Created an attachment (id=229057) [details] [details]
> > refcount-imageframe-data.patch
> > 
> > Alternative 1: use a refcounted Cairo image surface
> 
> This patch crashes in the same place as alt#2.

Sorry, clarifying: it also crashes in memcpy called from resize. Obviously the memcpy is in a different place.
Comment 29 cand 2014-06-16 02:23:48 PDT
Created attachment 233153 [details]
Workaround patch

This is what I will deploy for now. No crashes, and once the gif is loaded, normal performance.

It disables the 5mb/ios 2mb animated limit, removing memcopies from runtime, but using more RAM. The GIF decoder is simply too buggy to handle the runtime resize in its current state.

It also seems to decode all images fully, when only a small part may be needed. But there are other bugs open on the optimization of the gif decoder.
Comment 30 Martin Robinson 2014-06-16 08:46:13 PDT
Comment on attachment 233153 [details]
Workaround patch

View in context: https://bugs.webkit.org/attachment.cgi?id=233153&action=review

> Source/WebCore/platform/image-decoders/cairo/ImageDecoderCairo.cpp:39
> +    cairo_surface_t* surface = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, width(), height());
> +    unsigned char* data = cairo_image_surface_get_data(surface);
> +    memcpy(data, m_bytes, width() * height() * sizeof(PixelData));
> +    cairo_surface_mark_dirty(surface);
> +    return adoptRef(surface);

I don't know what to say about the patch, but I do know this is wrong because you aren't taking into account the image surface stride.
Comment 31 Build Bot 2014-06-16 16:13:17 PDT
Comment on attachment 233153 [details]
Workaround patch

Attachment 233153 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.appspot.com/results/6613769119596544

New failing tests:
compositing/absolute-inside-out-of-view-fixed.html
animations/3d/change-transform-in-end-event.html
accessibility/accessibility-node-memory-management.html
animations/added-while-suspended.html
canvas/philip/tests/2d.canvas.readonly.html
Comment 32 Build Bot 2014-06-16 16:13:22 PDT
Created attachment 233200 [details]
Archive of layout-test-results from webkit-ews-10 for mac-mountainlion-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: webkit-ews-10  Port: mac-mountainlion-wk2  Platform: Mac OS X 10.8.5
Comment 33 Jeremy Zerfas 2015-08-09 09:50:53 PDT
(In reply to comment #22)
> I cannot reproduce just by viewing the page in MiniBrowser. Do I need a
> particular port or version of Cairo to see the crash?

The occurrence of these crashes also seem to be influenced by the bit rate of the GIF, the transfer rate from the server, and whether the GIF is offscreen while it's loading which might explain why you're having trouble reproducing these crashes. At least that seems to be the case based on my testing of WinLauncher.exe from a release build of the Windows Cairo port on Windows 10 x86_64.

The GIF linked to from the original bug is pretty small and doesn't seem to reliably produce any crashes. However if the transfer rate is limited to less than 200 KBps or so it does seem to reliably produce a corrupt animation (images are offset or have random noise in them).

The animated GIFs on the Stack Exchange page won't cause any problems either unless the window size is large enough to make the animated GIF in the first answer visible while the page is loading.

I've made a spinning cube animation that should reliably demonstrate this problem, you can view it at http://www.jeremyzerfas.com/WebKit/Cairo_Animated_GIF_Crashing/Spinning_Cube_500x500.gif . To demonstrate that the crashes only occur when the animated GIFs are onscreen, I've also created a page at http://www.jeremyzerfas.com/WebKit/Cairo_Animated_GIF_Crashing/Spinning_Cube_Page_with_Animation_at_the_Bottom.html that includes the same animation but shows it on a long page that you need to scroll down in order to make the animation visible. I'll attach the two files in case I remove these files later but keep in mind that the crashes also don't seem to occur if the animation is loaded from the file system instead of from a web server.

Zoltan's patch in attachment 192693 [details] seems to work well for fixing this bug and looks like a good fix for now.
Comment 34 Jeremy Zerfas 2015-08-09 09:53:11 PDT
Created attachment 258593 [details]
Spinning cube files for demonstrating crashes.
Comment 35 Martin Robinson 2016-03-07 14:38:52 PST
Created attachment 273214 [details]
Patch
Comment 36 Martin Robinson 2016-03-07 14:39:46 PST
Jeremy, sorry for the super late response. Do you mind testing the patch that I have posted? Copying the memory might be a performance hit, but avoiding the copying is a much more complicated task and it's better to prevent crashes now.
Comment 37 Darin Adler 2016-03-08 09:27:48 PST
Comment on attachment 273214 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=273214&action=review

Seems really sad and bad for performance to be copying the image data all the time instead of being able to reference-count the already allocated buffer so in the normal case no copying is necessary. But that’s presumably a bigger project.

> Source/WebCore/ChangeLog:11
> +        No new tests. This issue is incredibly illusive and I haven't been
> +        able to reproduce it locally. It depends very much on the network speed
> +        and the size of the animated GIF. The code in question seems obviously
> +        wrong.

Typo: we mean "elusive" instead of "illusive"

Sad to hear we weren’t able to create a test.

> Source/WebCore/platform/image-decoders/cairo/ImageDecoderCairo.cpp:36
> +    PixelData* copiedData = static_cast<PixelData*>(fastMalloc(width() * height() * sizeof(PixelData)));
> +    memcpy(copiedData, m_bytes, width() * height() * sizeof(PixelData));

Might consider new/delete instead of fastMalloc/fastFree.

    PixelData* copiedData = new PixelData[width() * height()];

Then, below:

    delete [] static_cast<PixelData*>(data);

Might also consider putting the fastMalloc closer to the fastFree by allocating the copied data *after* creating the nativeImage.

> Source/WebCore/platform/image-decoders/cairo/ImageDecoderCairo.cpp:40
> +    RefPtr<cairo_surface_t> nativeImage = adoptRef(cairo_image_surface_create_for_data(
> +        reinterpret_cast<unsigned char*>(const_cast<PixelData*>(copiedData)),
>          CAIRO_FORMAT_ARGB32, width(), height(), width() * sizeof(PixelData)));

Can this ever return null? The old code would return nullptr, but the new code will crash in this case.

I also suggest auto for the local variable type here rather than explicitly stating RefPtr<cairo_surface_t>.
Comment 38 Martin Robinson 2016-03-08 10:07:29 PST
(In reply to comment #37)
> Comment on attachment 273214 [details]
> Patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=273214&action=review
> 
> Seems really sad and bad for performance to be copying the image data all
> the time instead of being able to reference-count the already allocated
> buffer so in the normal case no copying is necessary. But that’s presumably
> a bigger project.

Thanks for the review. I'm going to hold off committing until I can confirm that this fixes the original issue (probably I should have prevented the r? flag from activating). As a side note, I have investigated using reference counting to preserve the memory. It is possible, but it has potential negatively implications for image decoding performance elsewhere. 

The issue is that currently the image decoder first tries to resize the backing store vector when the frame size changes. I believe that Vector can sometimes do this without reallocating, but if we use a shared block of memory we always need to reallocate. I'm not sure what is the correct performance tradeoff. 
 
> 
> > Source/WebCore/ChangeLog:11
> > +        No new tests. This issue is incredibly illusive and I haven't been
> > +        able to reproduce it locally. It depends very much on the network speed
> > +        and the size of the animated GIF. The code in question seems obviously
> > +        wrong.
> 
> Typo: we mean "elusive" instead of "illusive"

Yikes! Thanks.

> Sad to hear we weren’t able to create a test.
> 
> > Source/WebCore/platform/image-decoders/cairo/ImageDecoderCairo.cpp:36
> > +    PixelData* copiedData = static_cast<PixelData*>(fastMalloc(width() * height() * sizeof(PixelData)));
> > +    memcpy(copiedData, m_bytes, width() * height() * sizeof(PixelData));
> 
> Might consider new/delete instead of fastMalloc/fastFree.
> 
>     PixelData* copiedData = new PixelData[width() * height()];
> 
> Then, below:
> 
>     delete [] static_cast<PixelData*>(data);
> 
> Might also consider putting the fastMalloc closer to the fastFree by
> allocating the copied data *after* creating the nativeImage.
> 
> > Source/WebCore/platform/image-decoders/cairo/ImageDecoderCairo.cpp:40
> > +    RefPtr<cairo_surface_t> nativeImage = adoptRef(cairo_image_surface_create_for_data(
> > +        reinterpret_cast<unsigned char*>(const_cast<PixelData*>(copiedData)),
> >          CAIRO_FORMAT_ARGB32, width(), height(), width() * sizeof(PixelData)));
> 
> Can this ever return null? The old code would return nullptr, but the new
> code will crash in this case.
> 
> I also suggest auto for the local variable type here rather than explicitly
> stating RefPtr<cairo_surface_t>.

Okay. I will make all these changes.
Comment 39 cand 2016-03-08 10:28:20 PST
Martin, I find it funny you complained about my patch not taking stride into account, and now your patch, which is clearly based on mine, does not do so either.

I've used my patch for over a year now without issues.

FWIW, the issue was very easy to trigger for me, taking just seconds and a couple refreshes of google.com.
Comment 40 Martin Robinson 2016-03-08 10:43:26 PST
(In reply to comment #39)
> Martin, I find it funny you complained about my patch not taking stride into
> account, and now your patch, which is clearly based on mine, does not do so
> either.

After looking more closely at this code and understanding the problem more thoroughly, it's clear that strides that are not 32-bits per pixel will fail anyway. We should likely have a compile-time check in the image decoder code for this, if we can. Sorry for the misunderstanding. I'm going to take a look at your reference counting patch again though.
Comment 41 Martin Robinson 2016-03-09 19:02:04 PST
(In reply to comment #37)
> Comment on attachment 273214 [details]
> Patch

> Might also consider putting the fastMalloc closer to the fastFree by
> allocating the copied data *after* creating the nativeImage.

I'm actually not sure this is possible. We need to have a pointer to the data when we call cairo_image_surface_create_for_data to construct the surface. There also is no way (as far as I know) to create the surface and then set the data later.
 
> > Source/WebCore/platform/image-decoders/cairo/ImageDecoderCairo.cpp:40
> > +    RefPtr<cairo_surface_t> nativeImage = adoptRef(cairo_image_surface_create_for_data(
> > +        reinterpret_cast<unsigned char*>(const_cast<PixelData*>(copiedData)),
> >          CAIRO_FORMAT_ARGB32, width(), height(), width() * sizeof(PixelData)));
> 
> Can this ever return null? The old code would return nullptr, but the new
> code will crash in this case.

A creation failure should return a Cairo error surface, so this shouldn't crash.
Comment 42 Martin Robinson 2016-03-09 19:08:23 PST
Created attachment 273528 [details]
Patch
Comment 43 Jeremy Zerfas 2016-03-12 07:36:30 PST
(In reply to comment #36)
> Do you mind testing the patch that I have posted?

I've tested the patches in attachment 273214 [details] and attachment 273528 [details] and both patches do work for fixing the crashes.
Comment 44 Michael Catanzaro 2017-03-06 10:34:40 PST
*** Bug 16200 has been marked as a duplicate of this bug. ***
Comment 45 Michael Catanzaro 2017-03-06 10:34:48 PST
*** Bug 48895 has been marked as a duplicate of this bug. ***
Comment 46 Michael Catanzaro 2017-03-06 10:35:00 PST
*** Bug 80754 has been marked as a duplicate of this bug. ***
Comment 47 Michael Catanzaro 2017-03-06 10:37:53 PST
The patch looks correct. Martin, were you planning to commit this? I guess it's probably rotted a bit by now, but couldn't hurt to try commit-queue...?