Created attachment 241058 [details]
Crash log for the typical crashing case described here
I've been looking into a crash that affects real-world users of my app, MarsEdit. The nuance of this bug is that a customer experiences crashes in my app (which features a WebView-based rich editor) only when they embed HTML code from Microsoft's "OneDrive" service.
I have confirmed that the crash is not specific to my app by reproducing it easily with a trunk build of WebKit, running MiniBrowser in WebKitLegacy mode:
1. Build WebKit from the command line: ./Tools/Scripts/build-webkit --debug
2. Run ./Tools/Scripts/run-minibrowser --debug
3. Open a new window in MiniBrowser and load this URL:
3. Close the window while the view is loading or shortly after it has completed.
The bug reproduces pretty easily upon closing the window, but to make it even more likely to occur, try closing it in the midst of the view's content loading.
I'm attaching a crash log that shows the issue is typically that the WebView associated with the closed window is deallocated by the time an active NSURLConnection's delegate dispatch attempts to call back to it. I'm a little confused actually because the crash often appears to occur upon release after a retain by the NSURLConnection, so it may come down to an issue in NSURLConnection, but I saw a few things in ResourceHandleMac that had me concerned:
1. In ResourceHandle::~ResourceHandle, cancel() is not called. It seems like it should be to ensure that any pending NSURLConnection tracked by m_connection is stopped.
2. I suspect that beyond calling cancel, it may be necessary to mirror the calls to scheduleInRunLoop:... from ResourceHandle::start, with calls to unscheduleFromRunLoop:... documentation in NSURLConnection is a little contradictory but I think the safest best is to unschedule the connection from runloops it was scheduled to, to avoid any unwanted delegate dispatch after the delegate may be gone.
3. I noticed that ResourceHandleMac implements ResourceHandle::schedule and ResourceHandle::unschedule, but these methods are not reached in typical cases I've run in MiniBrowser. I wonder if these are vestigial from a past implementation that didn't rely upon the scheduling that occurs in ::start? If so they can probably be removed to help clarify the obligation of ResourceHandleMac to unschedule m_connection only from the runloops it scheduled in ::start.
Now the bad news: even after adding to ResourceHandle::~ResourceHandle a call to close() and a mirror of the code from ::start that schedules the connection on the runloop, except to unschedule from the run loop, the crashing behavior is still occurring for me. So while I think the analysis above has some truth to it, I clearly don't understand the situation well enough to completely remedy the problem.
I hope my initial work helps somebody more experienced with ResourceHandleMac to speculate about what's going on in this crash and hopefully come up with a fix. If you have advice for me to pursue creation of a patch on my own I'd be happy to help with that.
I forgot to mention as far as reproducing the bug: try opening a new window if it doesn't crash right away. In the worst case, try opening a new window and loading the same URL again, closing, repeat with subtly different timing until it crashes.
I'll need to refresh my memory on how ResourceHandleMac destruction works. In the meanwhile, it appears that there is an over-release of WebView in MiniBrowser that's unrelated to resource loading.
See what happens with zombies:
$ NSZombieEnabled=YES run-minibrowser
2014-11-05 22:13:32.630 MiniBrowser[35650:3221420] *** -[WebView release]: message sent to deallocated instance 0x169bb4f60
#0 0x00007fff8ac34980 in ___forwarding___ ()
#1 0x00007fff8ac345f8 in __forwarding_prep_0___ ()
#2 0x00007fff8c63a91f in (anonymous namespace)::AutoreleasePoolPage::pop(void*) ()
#3 0x00007fff8abc2272 in _CFAutoreleasePoolPop ()
#4 0x00007fff9323914f in -[NSAutoreleasePool drain] ()
#5 0x00007fff9200c041 in -[NSApplication run] ()
#6 0x00007fff91ff7424 in NSApplicationMain ()
#7 0x00000001019e4c02 in main at /Users/ap/Safari/OpenSource/Tools/MiniBrowser/mac/main.m:30
Daniel, are the crashes that you are observing in MarsEdit also happening when closing windows?
In the MiniBrowser case, it appears that WebView is doing pretty crazy things, which stretch my understanding of Objective-C runtime. Here is what happens:
1. The window gets closed, which deallocates WK1BrowserWindowController.
2. This eventually leads to WebView refcount dropping to 0, and -[WebView dealloc] being called from an autorelease pool draining loop.
3. -[WebView dealloc] calls -[WebView close], which calls -[WebView _close].
4. -[WebView _close] does two sketchy things:
- Post a WebViewWillCloseNotification notification. This retains self, which is already in the process of being deallocated. A notification listener can of course perform arbitrary work, although in the case of MiniBrowser, there is no listener.
5. Once done, returns to the autorelease draining loop, and hits an overrelease.
I do not know for sure, but I'm assuming that is illegal to retain an object from inside -dealloc, and that it somehow causes the overrelease (despite all retain/release pairs being balanced).
One way to fix MiniBrowser is to call -[WebView close] before -[WebView release]. But that's certainly fragile, and not what documentation for -close says.
I'm not sure what the real WebKit fix would be.
We can of course add a boolean data member for "isBeingDeallocated", and skip dangerous work in -_close if it's set, but that would result in not sending a notification when closing a MiniBrowser window.
Thanks for taking a look at this, Alexey. Indeed, the crashes I'm having in MarsEdit are also caused when closing windows. When I saw the same crash in MiniBrowser it encouraged me that it's an issue internal to WebView, WebCore, or maybe NSURLConnection/CFNetwork.
But you're right that the zombie element of the equation is perplexing. I don't think the over-releasing of the WebView happens though unless the problematic web content is loaded into the frame. And I am able to reproduce the problem using a new document-based Cocoa app whose document window has a xib-instantiated WebView that has the problematic content loaded in at nibDidLoad time.
This makes me think that something in the WebKit realm or below is over-releasing the WebView, and maybe the perceived messiness in ResourceHandleMac is not really a problem so long as the WebView sticks around for as long as it should.
What's most suspicious about an Instruments "Zombie" audit of the crash is the WebView in question is logged as being released twice by the "CFNetwork" library, but it is never credited for any retains. They are the last two recorded events for the object, culminating in the overrelease, each coming from:
> [NSURLConnectionInternal _withConnectionAndDelegate:onlyActive:]
The history view of Instruments is misleading, one needs to look at actual stack traces in its right pane. Specifically, this release that is credited to NSURLConnectionInternal is actually just draining an autorelease pool, and the -autorelease call happens in the expected correct place.
Yeah, but looking at the code in NSURLConnectionInternal, it's draining an AutoRelease pool that it pushed on the stack itself. It seems to be trying to manage the retain/release of itself and the delegate. I'm going to keep poking around and see if I can wrap my head around the numerous retain/releases and pair them off manually.
Created attachment 241122 [details]
Simple patch to wrap WebView's dealloc method with an autorelease pool
Created attachment 241123 [details]
Created attachment 241124 [details]
Stack trace showing how the same dealloc method series of events leads to autorelease the WebView
OK, I think I've figured out what's going on on a more fine-grained basis. Alexey you were right in general about what was going on at dealloc time, but what I've additionally confirmed (couldn't tell if you already knew this from your description) is that the activity leading to an autorelease of the WebView during -dealloc is caused by the resource loading infrastructure of the WebView itself.
I think this would all be fine except that as a further consequence of shutting down the web view, active requests (of which there is now one) are forcibly shutdown, leading to an immediate, mirrored call to -[WebView _removeObjectForIdentifier:], which upon seeing that the last request has been removed from the identifier map, follows through on its commitment to [self autorelease]. (see attached DeallocAutorelease.txt)
Because this is all happening within the same run loop cycle as the -dealloc method, the WebView instance still exists at the point the retain and autorelease messages are sent. But very shortly the end of the -dealloc method is reached, [super dealloc] is called, and the autorelease pool holding the WebView is not released until after the WebView is well and truly gone.
So in short: WebView's dealloc has complicated enough implications for the active Web runtime that some care must be taken so that the view itself doesn't get put on the autorelease pool, because it won't be drained until after dealloc is done.
One solution that comes to mind is WebView's dealloc method could pop its own autorelease pool onto the stack to capture any of this behavior, releasing before it in turn calls through to -[super dealloc]. I've confirmed that this does alleviate the crashing behavior that led me to report this bug, so I'm attaching as a patch, though I can appreciate you may want to approach the problem in another way.
Created attachment 241125 [details]
Wrap the bulk of WebView's dealloc method with an autorelease pool to prevent self being autoreleased after dealloc
Created attachment 241127 [details]
Proposed fix including ChangeLog update
I am amenable to the idea of writing an API test for this but I have never done so and ran into a snag trying to get TestWebKitAPI to build. I'm probably done for the day but I may find the energy to tackle that tomorrow.
Comment on attachment 241127 [details]
Proposed fix including ChangeLog update
Clearing flags on attachment: 241127
Committed r175733: <http://trac.webkit.org/changeset/175733>
All reviewed patches have been landed. Closing bug.
This is a fine workaround, but I am pretty sure it’s illegal to retain an object that is already inside its dealloc method. So making it non-fatal with this autorelease pool is one thing, but another even better thing to do would be to fix the code path inside dealloc that ends up retaining the WebView. We could debug this by adding an override of retain, an "in dealloc" flag, and asserting that we are not in dealloc inside the retain before calling super. Then find a adjust the code so it doesn't retain/release itself.
There are two known places where we retain the WebView during dealloc:
1. When posting a notification:
[[NSNotificationCenter defaultCenter] postNotificationName:WebViewWillCloseNotification object:self];
It seems challenging to avoid that without changing WebKit API. One approach I can think of is to implicitly call -close when the view is removed from view hierarchy instead of when it's being deallocated. That seems risky - what if the application is just going to re-insert the view elsewhere?
In MiniBrowser case, we only destroy the view when closing the window, so setting view.shouldCloseWithWindow to TRUE would help, but that's not a general solution.
Interesting point, Darin. I guess this is probably a "lesser of two evils" situation, and the "anything could happen" complexity (from my perspective) of the tear down makes me worry that weird edge cases exist where the WebView could get autoreleased.
Although it certainly seems odd and unexpected to retain an object during release, I have to imagine it's pretty innocuous, considering the object is literally on the verge of being eliminated. So long as dealloc doesn't concern itself with the retain count of an object, I think it would be safe to retain it a million times before the end of dealloc, even if it would feel gross.
Alexey, for the sake of brainstorming on the two cases you list:
1. The retain/release pair for the notification posting seem especially innocuous since the post call is synchronous and the retain will be followed very shortly by a release.
The real point is that these call out to the app, which could then retain the WebView. The app has no way of knowing that the WebView is not already retained.
I suggest we find a way to change the semantics of the WebView in the future so that if a client doesn’t explicitly call close the WebViewWillCloseNotification is not sent, and the unload event is not sent to the webpage either.
Then this bug could still come up, but only for old clients, and eventually we can phase the behavior out entirely.
(In reply to comment #20)
> The real point is that these call out to the app, which could then retain
> the WebView. The app has no way of knowing that the WebView is not already
> I suggest we find a way to change the semantics of the WebView in the future
> so that if a client doesn’t explicitly call close the
> WebViewWillCloseNotification is not sent, and the unload event is not sent
> to the webpage either.
We should also figure out why WebViewWillCloseNotification was added, and if anyone is still using it (it's SPI).
> Then this bug could still come up, but only for old clients, and eventually
> we can phase the behavior out entirely.
Agreed. We should also figure out what our story should be for the modern API. Right now we don't have an explicit close method, we just invalidate everything on dealloc.
(FWIW, I think other classes such as NSWindow have hacks in place to handle being retained from inside dealloc).