Bug 13203 - REGRESSION: Repro crash in -[WebBaseNetscapePluginView(WebNPPCallbacks) destroyStream:reason:] navigating away from page with DivX movie plug-in
Summary: REGRESSION: Repro crash in -[WebBaseNetscapePluginView(WebNPPCallbacks) destr...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 523.x (Safari 3)
Hardware: Mac OS X 10.4
: P1 Major
Assignee: Geoffrey Garen
URL: http://patriciawarwick.com.hosting.do...
Keywords: InRadar, Regression
Depends on:
Blocks:
 
Reported: 2007-03-27 06:31 PDT by Patricia Warwick
Modified: 2007-03-29 12:05 PDT (History)
1 user (show)

See Also:


Attachments
Safari Crash Log (143.61 KB, text/plain)
2007-03-27 08:54 PDT, Patricia Warwick
no flags Details
patch (7.91 KB, patch)
2007-03-27 19:21 PDT, Geoffrey Garen
darin: review-
Details | Formatted Diff | Diff
patch (8.01 KB, patch)
2007-03-28 17:26 PDT, Geoffrey Garen
darin: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Patricia Warwick 2007-03-27 06:31:47 PDT
I am attempting to test embedding a DIVX movie into a web page. This works correctly in Safari, but when I try to load the same page under Webkit two things happen.

1) Webkit loads an older version of the page (presumably from cache) and as a result cannot find the DIVX file.

2) When I click on the Reload button on the toolbar, Webkit crashes.
Comment 1 David Kilzer (:ddkilzer) 2007-03-27 07:59:06 PDT
Hi Patricia, please post the stack trace to this bug.

Also, if possible, please give the URL of the page causing the crash, or the URL of the main web site if it's a password-protected page.  Thanks!

Comment 2 Patricia Warwick 2007-03-27 08:54:44 PDT
Created attachment 13829 [details]
Safari Crash Log
Comment 3 Patricia Warwick 2007-03-27 08:56:41 PDT
The crash occurs when I reload this page: http://patriciawarwick.com.hosting.domaindirect.com/viewmovie.html
Comment 4 Mark Rowe (bdash) 2007-03-27 09:46:53 PDT
I can reproduce this crash with ToT.  With r19968 I do not experience a crash, but after reloading the page then closing the tab Safari starts using a *huge* amount of memory.  I killed it after it reached 1.5GB, but reproduced the behaviour a second time.

I'm seeing what looks to be user-controlled data being dereferenced which triggers the crash.  Patricia's crash log shows several instances of crashing while dereferencing 0x50504320 ("PPC ") and I have seen other values such as 0xf97223d1 or 0x56e58975.  The two most common address I am seeing dereferenced causing the crash are 0xc0000000 and 0x00000020.

I have also seen a third failure mode that appears at random:

2007-03-28 02:27:29.057 Safari[39550:117] *** -[NSConcreteMutableData errorForReason:]: selector not recognized [self = 0x187f6350]
ASSERTION FAILED: Uncaught exception - *** -[NSConcreteMutableData errorForReason:]: selector not recognized [self = 0x187f6350]
0
(WebCore/platform/mac/BlockExceptions.mm:36 void ReportBlockedObjCException(NSException*))

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xbbadbeef
0x0132a9b3 in ReportBlockedObjCException (localException=0x181a6a60) at WebCore/platform/mac/BlockExceptions.mm:36
36	    ASSERT_WITH_MESSAGE(0, "Uncaught exception - %@", localException);
(gdb) 

Comment 5 Mark Rowe (bdash) 2007-03-27 09:48:35 PDT
<rdar://problem/5091330>
Comment 6 Mark Rowe (bdash) 2007-03-27 09:50:17 PDT
And after posting that comment I saw a crash at 0x65707954 ("epyT").  It seems like either we or the divx.com plugin are accessing memory which has been deallocated and repurposed :-(
Comment 7 Geoffrey Garen 2007-03-27 14:38:16 PDT
Changing title to focus on the most important issue -- the crash. Once we fix that, we should re-check for other problems.
Comment 8 Geoffrey Garen 2007-03-27 14:38:41 PDT
I'm working on this.
Comment 9 Geoffrey Garen 2007-03-27 15:15:16 PDT
You need the DivX player and plug-in to repro: http://www.divx.com/divx/mac/
Comment 10 Geoffrey Garen 2007-03-27 15:22:29 PDT
Crash also happens if you just navigate away from the page.
Comment 11 Geoffrey Garen 2007-03-27 15:48:10 PDT
Ironically, MallocScribble fixes this crash, so be sure to turn it off!
Comment 12 Geoffrey Garen 2007-03-27 15:50:39 PDT
Strike that. It's NSZombieEnabled that fixes this crash.
Comment 13 Geoffrey Garen 2007-03-27 15:57:18 PDT
The interesting part of the crash is here:
#0	0x0041a028 in -[WebBaseNetscapePluginView(WebNPPCallbacks) destroyStream:reason:] at WebBaseNetscapePluginView.mm:2323
#1	0x00426ebd in NPN_DestroyStream at npapi.m:110
#2	0x17907c38 in nsDivXBrowserPlugin::cancelDownload
#3	0x17914d5f in DivXBrowserPlugin::Close
#4	0x1791698a in DivXBrowserPlugin::~DivXBrowserPlugin
#5	0x17909596 in nsDivXBrowserPlugin::~nsDivXBrowserPlugin
#6	0x1790dc94 in Mozilla_NPP_Destroy
#7	0x1790e06b in NPP_Destroy
#8	0x0041ab90 in -[WebBaseNetscapePluginView(Internal) _destroyPlugin] at WebBaseNetscapePluginView.mm:2593

When we destroy the DivX plug-in, it cancels its plug-in stream. However, the stream it passes to WebKit to cancel is either already destroyed or complete garbage:

p stream
$10 = (NPStream *) 0x178b5b10
(gdb) p *(WebBaseNetscapePluginStream *)stream->ndata
$11 = {
  {
    isa = 0x0
  }, 
[...many more NULL fields...]
Comment 14 Geoffrey Garen 2007-03-27 16:16:27 PDT
According to MallocScribble, the NPStream* the plug-in passes to NPN_DestroyStream points to freed memory.
Comment 15 Geoffrey Garen 2007-03-27 17:35:30 PDT
gdb confirms that this is an already destroyed stream, not just random garbage:

NPP_DestroyStream: 0x17794180, {pdata = 0x17554e70, ndata = 0x17794160, url = 0x17738d90 "http://patriciawarwick.com.hosting.domaindirect.com/weaving.divx", end = 52260076, lastmodified = 1175003384, notifyData = 0x1}
NPN_DestroyStream: 0x17794180, {pdata = 0x55555555, ndata = 0x55555555, url = 0x55555555 <Address 0x55555555 out of bounds>, end = 1431655765, lastmodified = 1431655765, notifyData = 0x55555555}

So, we notify the plug-in that we're destroying an NPStream (NPP_DestroyStream), and then we destroy it, and subsequently the plug-in asks us to destroy it again (NPN_DestroyStream). This seems like a plug-in bug that we'll have to work around.
Comment 16 Mark Rowe (bdash) 2007-03-27 17:39:01 PDT
Geoff, the initial report indicates this works correctly in Safari 2.0.4.  I recall we had some fixes recently related to freeing plugin streams earlier to prevent holding on to their memory for too long -- perhaps the fix is incorrect, or alternatively the plugin could have been making bogus assumptions about the lifetime of the stream.
Comment 17 Geoffrey Garen 2007-03-27 19:16:00 PDT
Based on Anders' comments, I believe we never experienced this bug before because we never destroyed plug-in streams, period. That's why I think we didn't crash before. 

NPP_DestroyStream is the callback the browser is supposed to make to notify a plug-in that a stream is dead (http://developer.mozilla.org/en/docs/Gecko_Plugin_API_Reference:Plug-in_Side_Plug-in_API). The gdb log I posted above shows that the plug-in makes its NPN_DestroyStream call after the browser has called NPP_DestroyStream (it turns out to be long after, actually), so I think this bug is about a bogus assumption in the plug-in. Specifically, I think the plug-in's destructor unconditionally destroys its stream, without checking whether the stream has been destroyed already.
Comment 18 Geoffrey Garen 2007-03-27 19:21:23 PDT
Created attachment 13839 [details]
patch

This patch guards against bogus NPStreams in plug-in callbacks.
Comment 19 Darin Adler 2007-03-27 21:51:26 PDT
Comment on attachment 13839 [details]
patch

This fix is incorrect. There's no guarantee that we won't create a new stream with the same pointer value as an old stream. While this change greatly decreases the chance of the bug happening, it would be better to find a fix that fixes the problem completely.

To do a correct fix, we need to understand more about exactly what the DivX plug-in is doing. Does it destroy the same stream twice? Does it destroy a stream after we destroy it?

A correct fix for this will presumably involve keeping the NPStream objects around for a longer time.

+            char* filename = reinterpret_cast<char*>(malloc(bufSize));

These kinds of casts are done with static_cast, not reinterpret_cast. The malloc function returns a void* and you can and should cast from one of those without using reinterpret_cast.
Comment 20 Darin Adler 2007-03-27 22:12:53 PDT
(In reply to comment #19)
> (From update of attachment 13839 [details] [edit])
> This fix is incorrect. There's no guarantee that we won't create a new stream
> with the same pointer value as an old stream. While this change greatly
> decreases the chance of the bug happening, it would be better to find a fix
> that fixes the problem completely.
> 
> To do a correct fix, we need to understand more about exactly what the DivX
> plug-in is doing. Does it destroy the same stream twice? Does it destroy a
> stream after we destroy it?
> 
> A correct fix for this will presumably involve keeping the NPStream objects
> around for a longer time.

I'm sorry. I should have read the bug report more carefully before reviewing. Your comments indicate that you consider this a guard against incorrect code in the DivX plug-in. In that context, I suppose this fix might be acceptable. But I'd like to understand why this is not a problem in other browsers.

If this is really a workaround for incorrect plug-in code, then it should write out a log message in debug versions.

I'd also like to understand who destroyed the stream in the first place. Did we destroy the stream? Why? Was that correct? It's possible that failures are not supposed to destroy the stream, and it's our code that's incorrect.

I'm suspicious that we don't yet fully understand what's happening. Protecting against use a bad pointer like this should be a last resort. Do we really know that this is a bug in the plug-in and not something we're doing wrong?

A cleaner way to do the workaround would be to close streams, but not actually deallocate the stream object itself until the plug-in is deallocated. This would have the downside that the stream objects would leak while a plug-in was working, but the upside that it would not be susceptible to aliasing the way the attached fix is. By making a closed stream discard most of its state, the cost for each stream could be a single small memory block.

Or possibly this is the correct semantic. It may be that we are destroying streams when really we should just be putting them in a closed state. It's possible that only the plug-in itself is supposed to actually be able to deallocate a stream by calling NPN_DestroyStream.
Comment 21 Geoffrey Garen 2007-03-27 23:11:02 PDT
> This fix is incorrect. There's no guarantee that we won't create a new stream with the same pointer value as an old stream. 

Streams are per-plug-in. So, in the unlikely aliasing case, a buggy plug-in would just end up canceling one of its own streams. We could go to great lengths to accomodate that sort of bug, but I don't know of any real-world compatibility problems that would solve. I'm sympathetic to your desire to make this patch perfect. I have that desire too. I'm just not sure it's the right thing to do here.

> Does it destroy the same stream twice? 

No.

> Does it destroy a stream after we destroy it?

Yes.

> I'd like to understand why this is not a problem in other browsers.

I didn't see an obvious answer to this question in the FF source code. It could be random chance regarding when other browsers destroy plug-in streams. It's also possible that other browsers simply don't implement NPN_DestroyStream. I know of at least two other browsers that don't.

A simpler solution to this bug might be to remove our implementation of NPN_DestroyStream. I went back through SVN history, and it doesn't look like it was added for any specific reason. I didn't go that route because I thought it was a bit of a blind leap, but it's an option.

> If this is really a workaround for incorrect plug-in code, then it should write out a log message in debug versions.

Sounds good.

> Did we destroy the stream? 

Yes.

> Why? 

Because the user navigated away from the page, we canceled all loaders.

> Was that correct? 

That depends. There's no documentation saying it's incorrect. Certainly the browser needs to cancel loaders at some time. Page tear-down seems like a reasonable time. We could go out of our way to destroy all plug-ins before canceling loaders. That would be another work-around, but I think it would be hard to keep that sort of dependency in sync.

> It's possible that failures are not supposed to destroy the stream, and it's our code that's incorrect.

The Mozilla documentation says this:

"The browser allocates and initializes the NPStream object... The browser cannot delete the object until after it calls NPP_DestroyStream or the plug-in calls NPN_DestroyStream... The browser informs the plug-in when the stream is about to be deleted through NPP_DestroyStream, after which the NPStream object is no longer valid."

This is a browser-created stream. My reading of the documentation is that either the plug-in or the browser can decide when the stream is no longer needed, but it's ultimately up to the browser to decide when to free the memory. According to gdb, we correctly call NPP_DestroyStream to notify that plug-in that the stream is being destroyed. That's why I think the plug-in's later use of the stream is an error.

> Do we really know that this is a bug in the plug-in and not something we're doing wrong?

Yes. A plug-in should not use a stream after it receives the NPP_DestroyStream callback.

> A cleaner way to do the workaround would be to close streams, but not actually deallocate the stream object itself until the plug-in is deallocated.

I'm not sure I would call this implementation "cleaner." I would call it more complicated, with the benefit of guaranteeing that a buggy plug-in couldn't accidentally cancel one of its own streams. Do you think the benefit is worth the cost?

> Or possibly this is the correct semantic. It may be that we are destroying streams when really we should just be putting them in a closed state. 

I don't think the documentation supports that claim.

> It's possible that only the plug-in itself is supposed to actually be able to deallocate a stream by calling NPN_DestroyStream.

I don't think the documentation supports that claim, either.
Comment 22 Darin Adler 2007-03-28 09:45:22 PDT
> > I'd like to understand why this is not a problem in other browsers.
> 
> I didn't see an obvious answer to this question in the FF source code. It could
> be random chance regarding when other browsers destroy plug-in streams.

What you call "random chance" here could be the essence of the real problem.

> A simpler solution to this bug might be to remove our implementation of
> NPN_DestroyStream. I went back through SVN history, and it doesn't look like it
> was added for any specific reason. I didn't go that route because I thought it
> was a bit of a blind leap, but it's an option.

Seems worth considering.

> Because the user navigated away from the page, we canceled all loaders.

Do other browsers destroy streams before destroying plug-ins when navigating away from a page?

> > Was that correct? 
> 
> That depends. There's no documentation saying it's incorrect. Certainly the
> browser needs to cancel loaders at some time. Page tear-down seems like a
> reasonable time. We could go out of our way to destroy all plug-ins before
> canceling loaders. That would be another work-around, but I think it would be
> hard to keep that sort of dependency in sync.

I don't agree with your assessment here. It doesn't sound hard to destroy the plug-ins before destroying the plug-in streams.

> > Do we really know that this is a bug in the plug-in and not something we're doing wrong?
> 
> Yes. A plug-in should not use a stream after it receives the NPP_DestroyStream
> callback.

But perhaps there's an implicit contract that the browser won't do this. And perhaps no other browser does it. In which case I'm not sure I'd call this a bug in the plug-in.

> > A cleaner way to do the workaround would be to close streams, but not actually deallocate the stream object itself until the plug-in is deallocated.
> 
> I'm not sure I would call this implementation "cleaner." I would call it more
> complicated, with the benefit of guaranteeing that a buggy plug-in couldn't
> accidentally cancel one of its own streams. Do you think the benefit is worth
> the cost?

I don't agree with your assessment of the cost. I think having a notion of a closed stream is very simple to implement. I strongly prefer a solution that does not involve a pointer validity check.
Comment 23 Geoffrey Garen 2007-03-28 10:55:25 PDT
Anders found this comment in the Mozilla source code, which seems to confirm that their behavior truly is random. It looks like Firefox may never release browser-created streams:

1180     // FIXME: http://bugzilla.mozilla.org/show_bug.cgi?id=240131
1181     //
1182     // Is it ok to leave pstream->ndata set here, and who releases it
1183     // (or is it even properly ref counted)? And who closes the stream
1184     // etc?

(http://lxr.mozilla.org/seamonkey/source/modules/plugin/base/src/ns4xPlugin.cpp)
Comment 24 Geoffrey Garen 2007-03-28 13:29:34 PDT
After further reflection, I don't think we want to eliminate NPN_DestoryStream. It's an important networking optimization in the case of a plug-in that wants to do its own loading. The plug-in first calls NPN_DestroyStream to cancel the browser's loading. In that case, we don't want to continue the browser's loading, slowing down the real load.
Comment 25 Geoffrey Garen 2007-03-28 13:32:24 PDT
If we decide to keep NPStream objects around for the lifetime of a plug-in, I think we need to answer the question of what to do in the case of, e.g., a Flash-based email client, which might create hundreds or even thousands of streams in one sitting. Do we leak all that memory? Cut it off at some limit? Use a white list or black list?
Comment 26 Geoffrey Garen 2007-03-28 16:44:06 PDT
> I don't agree with your assessment here. It doesn't sound hard to destroy the
> plug-ins before destroying the plug-in streams.

That seems really non-trivial to me. We can't make stream destruction destroy a plug-in because it's normal for a plug-in to outlast a stream. We can't make the code that destroys all streams destroy all plug-ins, since we wouldn't want stopping a page load, for example, to have the side-efect of unloading the page's plug-ins.

It seems to me that we would need to re-order the loader state machine to wait until a page had torn down before canceling its loaders. In addition to being very tricky, that solution would degrade networking performance in the case of a navigation that interrupted another navigation, since both navigations would remain live until the new navigation became committed.
Comment 27 Geoffrey Garen 2007-03-28 17:26:06 PDT
Created attachment 13855 [details]
patch

This patch eschews an "Is this stream valid?" check in favor of a "Does this stream belong to the plug-in using it?" check. So it avoids introducing a function that can create subtle bugs by promising to do more than it actually can, even though It doesn't do much more in the way of guarding against bad behavior.
Comment 28 Darin Adler 2007-03-28 17:40:25 PDT
Comment on attachment 13855 [details]
patch

You wore me down, man. r=me
Comment 29 Geoffrey Garen 2007-03-29 08:35:04 PDT
Committed revision 20571.
Comment 30 Patricia Warwick 2007-03-29 11:47:33 PDT
It is not totally clear from the  Mark's comments about the huge amounts of memory being used ... did that apply to Webkit or Safari? The person for whom I was developing this page tested it with Safari and reported that it caused his system to come to a halt. Could that be a related bug in Safari?
Comment 31 Mark Rowe (bdash) 2007-03-29 12:05:01 PDT
Patricia, my comment was about r19968 of WebKit.  It'll be worth retesting this site with Geoff's fix to see if the large memory increase is still present.  If it is, that should be filed in a new bug report.