Bug 52819

Summary: Crash in WebCore::HistoryController::itemsAreClones
Product: WebKit Reporter: Tony Gentilcore <tonyg>
Component: WebCore Misc.Assignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, ap, beidson, commit-queue, creis, eric, fishd, mihaip, ojan, priyajeet.hora, webkit.review.bot
Priority: P2 Keywords: InRadar
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Bug Depends on: 53402    
Bug Blocks:    
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch none

Description Tony Gentilcore 2011-01-20 09:37:56 PST
This is the most popular crasher on the Chromium dev channel. http://trac.webkit.org/changeset/76205 seems to be in the right area and timeframe.

0x021d5555	 [chrome.dll	 - historycontroller.cpp:669]	WebCore::HistoryController::itemsAreClones(WebCore::HistoryItem *,WebCore::HistoryItem *)
0x021d537b	 [chrome.dll	 - historycontroller.cpp:614]	WebCore::HistoryController::recursiveSetProvisionalItem(WebCore::HistoryItem *,WebCore::HistoryItem *,WebCore::FrameLoadType)
0x021d540d	 [chrome.dll	 - historycontroller.cpp:627]	WebCore::HistoryController::recursiveSetProvisionalItem(WebCore::HistoryItem *,WebCore::HistoryItem *,WebCore::FrameLoadType)
0x021d4a1a	 [chrome.dll	 - historycontroller.cpp:248]	WebCore::HistoryController::goToItem(WebCore::HistoryItem *,WebCore::FrameLoadType)
0x024b1b36	 [chrome.dll	 - webframeimpl.cpp:898]	WebKit::WebFrameImpl::loadHistoryItem(WebKit::WebHistoryItem const &)
0x01d373fd	 [chrome.dll	 - render_view.cc:1450]	RenderView::OnNavigate(ViewMsg_Navigate_Params const &)
0x00000005
Comment 1 Mihai Parparita 2011-01-20 09:40:49 PST
I think you meant http://trac.webkit.org/changeset/75336.
Comment 2 Charles Reis 2011-01-20 10:15:23 PST
Interesting.  There's an ASSERT on both arguments in recursiveSetProvisionalItem, but it turns out ASSERTs are compiled out of release builds:
http://webkit.org/coding/assertion-guidelines.html

It's possible the crash happens if both item1 and item2 are null.  We should still see the ASSERT failing even in debug builds if that's the case-- we just have to find out the repro steps.

Here's the crashing code, for reference:

bool HistoryController::itemsAreClones(HistoryItem* item1, HistoryItem* item2) const
{
    // If the item we're going to is a clone of the item we're at, then we do
    // not need to load it again.  The current frame tree and the frame tree
    // snapshot in the item have to match.
    // Note: Some clients treat a navigation to the current history item as
    // a reload.  Thus, if item1 and item2 are the same, we need to create a
    // new document and should not consider them clones.
    // (See http://webkit.org/b/35532 for details.)
    return item1 != item2
        && item1->itemSequenceNumber() == item2->itemSequenceNumber()
        && currentFramesMatchItem(item1)
        && item2->hasSameFrames(item1);
}
Comment 3 Mihai Parparita 2011-01-20 13:56:33 PST
BTW, from looking at about a dozen crash reports, these were all long-lived renderer processes, so this isn't happening during session restore (the comment at http://trac.webkit.org/browser/trunk/Source/WebKit/chromium/src/WebFrameImpl.cpp#L887 suggests that this is one case where currentItem can be null).
Comment 4 Charles Reis 2011-01-21 11:57:41 PST
Here's what we know so far.

The crash happens when recursiveSetProvisionalItem passes null for either "item" or "fromItem" (but not both) to itemsAreClones.  (If both were null, itemsAreClones would just return false before crashing.)

The crash only happens on recursive calls of depth 2 or more, meaning there has to be at least one subframe involved.  For example, we'll see this stack trace if we go back or forward to a page with frames (or on a subframe navigation).

If "item" is null, that means the original destination item had a null value in its HistoryItemVector of children.  Maybe that's possible?

If "fromItem" is null, that means fromItem->childItemWithTarget(frameName) returned null in the previous call.  That seems unlikely, because we know from the previous call to itemsAreClones that the frame trees of the current and destination item match.

Either way, I'm surprised we're not seeing crashes from debug builds due to recursiveSetProvisionalItem's ASSERTs.  I haven't figured out how to reproduce the crash yet, but I could put in some defensive code to help us find it.
Comment 5 Charles Reis 2011-01-21 13:01:24 PST
Just discovered another crash report from the same logic in recursiveGoToItem before I split it out into itemsAreClones in http://trac.webkit.org/changeset/75336.  Still trying to nail down the repro case, but this suggests that the underlying cause could be outside this patch.

My current theory is that we're passing in a corrupt history item with null as one of the child items.  That would lead to the stack trace in this crash, both before and after the 75336 patch.
Comment 6 Charles Reis 2011-01-21 14:02:39 PST
Created attachment 79788 [details]
Patch
Comment 7 WebKit Commit Bot 2011-01-21 15:47:08 PST
Comment on attachment 79788 [details]
Patch

Clearing flags on attachment: 79788

Committed r76406: <http://trac.webkit.org/changeset/76406>
Comment 8 WebKit Commit Bot 2011-01-21 15:47:15 PST
All reviewed patches have been landed.  Closing bug.
Comment 9 Mihai Parparita 2011-01-21 15:59:06 PST
Reopening since this bug isn't actually fixed.

Charlie, you can ping someone (adele?) in #webkit to get Bugzilla editing privileges even before you get committer. Also, for cases like this you can also open another bug with your small change and mark it as blocking this bug (IIRC, there's a historical preference for having one commit per bug).
Comment 10 Charles Reis 2011-01-28 13:08:15 PST
I found a plausible way this bug could be happening, but I haven't found exact repro steps yet.

From the crash dumps, it turns out that item2 (i.e., fromItem) is actually null, not item1.  In comment #4, I noted this was unlikely because the previous call to itemsAreClones told us the frame tree match, but it turns out that logic is buggy.

HistoryItem::hasSameFrames(otherItem) checks that the target name of every child in the current HistoryItem shows up in one of the children of otherItem.  It does not check the reverse, meaning that the current HistoryItem could have two children with target "foo" and otherItem could have one child with "foo" and another with "bar".

If we get into that state, we'll see this crash when we go back in frame "foo".  In that case, we'll try to call recursiveSetProvisionalItem on frame "bar", end up with NULL for fromItem, and crash in itemsAreClones.

Now we just need to figure out how the target of the current HistoryItem's children could get corrupted.
Comment 11 Brady Eidson 2011-01-31 12:31:22 PST
In radar as <rdar://problem/8938557>
Comment 12 Charles Reis 2011-02-01 19:02:20 PST
Created attachment 80874 [details]
Patch
Comment 13 Charles Reis 2011-02-01 19:06:28 PST
Comment on attachment 80874 [details]
Patch

I'm having trouble getting the layout test to reproduce the crash.  Going back twice with the back button reliably crashes in recursiveSetProvisionalItem, but neither queueBackNavigation nor history.back hit that code.  Is there another way I can tell the browser to go back in a layout test?
Comment 14 Darin Fisher (:fishd, Google) 2011-02-01 21:27:09 PST
queueBackNavigation is the layout test specific method for requesting a back navigation.  that will wait until the page is fully loaded before processing the back navigation.  maybe that is related to why this isn't reproducible?
Comment 15 Charles Reis 2011-02-02 13:43:48 PST
(In reply to comment #14)
> queueBackNavigation is the layout test specific method for requesting a back navigation.  that will wait until the page is fully loaded before processing the back navigation.  maybe that is related to why this isn't reproducible?

Actually, it's much more subtle!  It turns out history.back() does get to recursiveSetProvisionalItem, just like the back button.  It even has a memory error, but it happens not to crash.

We already figured out that we crash when we hit the back button because we delete fromItem after calling itemsAreClones (on the m_provisionalItem = item line, since fromItem and m_provisionalItem are the same in this case).  On the recursive call, fromItem is 0-- boom.

If we call history.back() instead, we delete that item *before* we get to itemsAreClones, back in HistoryController::goToItem() (when we call setGlobalHistoryItem).  Even though we're still using a deleted pointer in that case, we don't happen to crash.  Instead, itemsAreClones returns false when it notices fromItem doesn't have any children.  This is still bad, because it doesn't update the provisional item.

Either way, I noticed a flaw in my expectations file-- after going back twice, we should be at the original HistoryItem, not the second one.

All this boils down to the fact that we're not clearing the provisional item after getting to an error page (like the confirm form resubmission page in this test).

Question: what's the intended behavior for an error page?  Should we commit the provisional item and make it the last committed item?  Or should we cancel the provisional item and act like we didn't go anywhere?  (More specifically in this case, if I get an error page when I click back, what's supposed to happen if I click back again?)  I'm guessing we want to commit it, but that isn't currently happening.
Comment 16 Darin Fisher (:fishd, Google) 2011-02-02 13:46:44 PST
In Chromium (in render_view.cc), it takes care to load the error page with the "replace" parameter set to true when the error page should be shown in place of a back/forward entry.  It only sets replace to false when we need to show an error page for a new navigation that failed to complete.

In other words, I believe that in the back/forward case, the error page should take the place of the page you were trying to navigate back to.
Comment 17 Charles Reis 2011-02-02 18:01:41 PST
(In reply to comment #16)
> In Chromium (in render_view.cc), it takes care to load the error page with the "replace" parameter set to true when the error page should be shown in place of a back/forward entry.  It only sets replace to false when we need to show an error page for a new navigation that failed to complete.
> 
> In other words, I believe that in the back/forward case, the error page should take the place of the page you were trying to navigate back to.

Thanks-- that helps.  I was able to get the error page to commit by changing HistoryController::updateForCommit.  (It needed to commit for FrameLoadTypeReplace in addition to the back/forward load types.)

After that, Chrome seems to work properly.  I needed another change to test_shell, so that it doesn't forget the page ID when we do the load (of the error page).  test_shell is now working.

Unfortunately, DumpRenderTree is also still behaving incorrectly-- it doesn't commit when we go back or forward to an error page.  I'll probably need to update that as well.  I'll upload the patch of what I have so far for feedback, and I'll work on DumpRenderTree in the mean time.
Comment 18 Charles Reis 2011-02-02 18:08:19 PST
Created attachment 81018 [details]
Patch
Comment 19 Darin Fisher (:fishd, Google) 2011-02-02 22:02:53 PST
Comment on attachment 81018 [details]
Patch

This change makes sense to me.
Comment 20 Brady Eidson 2011-02-03 08:52:25 PST
Comment on attachment 81018 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=81018&action=review

I understand the goal is to keep certain items Ref'ed, as previously they were being destroyed before last-use.  However this is mis-using (Pass)RefPtrs.

PassRefPtr is for transferring ownership, which you're not actually doing (except for the one case I pointed out).

If that one case is the only one necessary to change, then that's all you should change.

If the other changes are necessary, you need to leave them as raw pointers and simply RefPtr<> them within the function body where you think they might unintentionally be destroyed.

> Source/WebCore/loader/HistoryController.cpp:613
>  {

With item, you are potentially passing ownership, and it makes sense to use a PassRefPtr.  For fromItem, I only see a .get() comparison, so the PassRefPtr isn't needed.

> Source/WebCore/loader/HistoryController.cpp:621
>  

m_provisionalItem is already a RefPtr - Why do you need the extra targetItem RefPtr here?  You're adding ref-count churn which is surprisingly bad for performance.

> Source/WebCore/loader/HistoryController.cpp:640
>  {

Same thing here - PassRefPtrs are for actually passing ownership, and I don't see that you are passing ownership in this method.
Comment 21 Brady Eidson 2011-02-03 08:53:27 PST
(In reply to comment #20)
> (From update of attachment 81018 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=81018&action=review
> 
> I understand the goal is to keep certain items Ref'ed, as previously they were being destroyed before last-use.  However this is mis-using (Pass)RefPtrs.
> 
> PassRefPtr is for transferring ownership, which you're not actually doing (except for the one case I pointed out).
> 
> If that one case is the only one necessary to change, then that's all you should change.
> 
> If the other changes are necessary, you need to leave them as raw pointers and simply RefPtr<> them within the function body where you think they might unintentionally be destroyed.
> 
> > Source/WebCore/loader/HistoryController.cpp:613
> >  {
> 
> With item, you are potentially passing ownership, and it makes sense to use a PassRefPtr.  For fromItem, I only see a .get() comparison, so the PassRefPtr isn't needed.
> 
> > Source/WebCore/loader/HistoryController.cpp:621
> >  
> 
> m_provisionalItem is already a RefPtr - Why do you need the extra targetItem RefPtr here?  You're adding ref-count churn which is surprisingly bad for performance.
> 
> > Source/WebCore/loader/HistoryController.cpp:640
> >  {
> 
> Same thing here - PassRefPtrs are for actually passing ownership, and I don't see that you are passing ownership in this method.

Sorry for the commenting fail here - the inline review comments feature never does what I intend.  sigh.  Hopefully lining up the line numbers manually will make it obvious what I meant.
Comment 22 Eric Seidel (no email) 2011-02-03 12:18:41 PST
Maybe Adam or Ojan can fix the review tool to do what you intend?  Can you be more specific?
Comment 23 Adam Barth 2011-02-03 12:29:46 PST
> Hopefully lining up the line numbers manually will make it obvious what I meant.

You can also use the "view in context" link to have the review tool line up the comments with the lines for you.
Comment 24 Charles Reis 2011-02-03 15:02:52 PST
Comment on attachment 81018 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=81018&action=review

Thanks.  I've got an improved fix but my approach on the layout test isn't going to work.  (DumpRenderTree doesn't show error pages in either Chromium or WebKit.)  Thankfully, there's another set of repro steps in bug 53708.  I think I'll try to fix that issue here, since the layout test for that works here as well.

>>> Source/WebCore/loader/HistoryController.cpp:613
>>>  {
>> 
>> With item, you are potentially passing ownership, and it makes sense to use a PassRefPtr.  For fromItem, I only see a .get() comparison, so the PassRefPtr isn't needed.
> 
> Sorry for the commenting fail here - the inline review comments feature never does what I intend.  sigh.  Hopefully lining up the line numbers manually will make it obvious what I meant.

The bug was that m_provisionalItem and fromItem were the same, so that assigning item to m_provisionalItem deleted the old value (i.e., fromItem) while we still needed it.  I was using PassRefPtr for fromItem to prevent this, but you're right that it's incorrect.  All we need is to change currentItem to a RefPtr in goToItem (above) to ensure it doesn't get deleted here.

As you mentioned, we're potentially passing ownership of item, but we received it as a raw pointer anyway (in goToItem).  In retrospect, I don't think we need PassRefPtr even there.
Comment 25 Charles Reis 2011-02-03 15:49:15 PST
Created attachment 81128 [details]
Patch
Comment 26 Charles Reis 2011-02-03 15:50:32 PST
Comment on attachment 81128 [details]
Patch

This patch fixes the memory error (with a RefPtr in goToItem), the issue that going back to an error page needs to be committed (in updateForCommit), and the issue that going back to a hash in a frame needs to be committed (in updateForSameDocumentNavigation).

As I mentioned, DumpRenderTree doesn't show error pages, so the layout test covers the memory error and the commit of subframe hash navigations.
Comment 27 Charles Reis 2011-02-04 12:47:00 PST
Comment on attachment 81128 [details]
Patch

Anyone have a chance to look at this today?  If it looks good, it'd be great to get the crash fix in before the weekend.
Comment 28 Mihai Parparita 2011-02-04 13:45:30 PST
Comment on attachment 81128 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=81128&action=review

> LayoutTests/fast/history/history-back-forward-within-subframe-hash.html:3
> +onload = function() {

Nit: There isn't a formal style guide for JavaScript used in layout tests, but generally the convention is to follow the WebKit C++ style (four space indents, function braces on their own line).

> LayoutTests/fast/history/resources/history-back-forward-within-subframe-hash-2.html:15
> +      parent.document.getElementById("result").innerText = "First visit";

If you use the js-test framework (see same-document-iframes-changing-fragment.html, history-traversal-is-asynchronous.html, and a few other tests in the directory), then you can turn these into log statements instead of replacing the result text. That makes failures easier to debug, since you can more clearly see at what stage the actual output started to deviate from the expected one.
Comment 29 Charles Reis 2011-02-04 14:32:36 PST
Created attachment 81286 [details]
Patch
Comment 30 WebKit Commit Bot 2011-02-04 16:46:38 PST
The commit-queue encountered the following flaky tests while processing attachment 81286 [details]:

http/tests/websocket/tests/multiple-connections.html bug 53825 (author: abarth@webkit.org)
The commit-queue is continuing to process your patch.
Comment 31 WebKit Commit Bot 2011-02-04 16:48:42 PST
Comment on attachment 81286 [details]
Patch

Clearing flags on attachment: 81286

Committed r77705: <http://trac.webkit.org/changeset/77705>
Comment 32 WebKit Commit Bot 2011-02-04 16:48:50 PST
All reviewed patches have been landed.  Closing bug.
Comment 33 WebKit Review Bot 2011-02-04 18:32:39 PST
http://trac.webkit.org/changeset/77705 might have broken GTK Linux 64-bit Debug