Bug 26044 - Crash at Node::nodeIndex()
Summary: Crash at Node::nodeIndex()
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Text (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: Darin Adler
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2009-05-27 11:24 PDT by Dimitri Glazkov (Google)
Modified: 2009-06-14 22:17 PDT (History)
5 users (show)

See Also:


Attachments
One attempt at a test case. (2.50 KB, application/x-javascript)
2009-06-03 17:25 PDT, Eric Seidel (no email)
no flags Details
Speculative fix (5.16 KB, patch)
2009-06-03 17:41 PDT, Eric Seidel (no email)
no flags Details | Formatted Diff | Diff
fix; needs regression tests and perhaps to be split up since it fixes two bugs (31.56 KB, patch)
2009-06-04 13:49 PDT, Darin Adler
no flags Details | Formatted Diff | Diff
more work in progress (41.83 KB, patch)
2009-06-05 18:06 PDT, Darin Adler
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dimitri Glazkov (Google) 2009-05-27 11:24:05 PDT
Looking at various crash reports, I am seeing a fairly common crash that looks like:

*CRASHED* (EXC_BAD_ACCESS / KERN_INVALID_ADDRESS @0x90e43c50)
 
WebCore::Node::nodeIndex() const
WebCore::Range::textNodesMerged(WebCore::NodeWithIndex&, unsigned int)
WebCore::Document::textNodesMerged(WebCore::Text*, unsigned int)
WebCore::Node::normalize()
WebCore::jsNodePrototypeFunctionNormalize(JSC::ExecState*, JSC::JSObject*, JSC::JSValuePtr, JSC::ArgList const&)
JSC::Interpreter::cti_op_call_NotJSFunction(void*, ...)

or for V8:

*CRASHED* (EXCEPTION_ACCESS_VIOLATION @0x00720078)

WebCore::Node::nodeIndex()
WebCore::RangeBoundaryPoint::offset()
WebCore::boundaryTextNodesMerged
WebCore::Node::normalize()
WebCore::NodeInternal::normalizeCallback
v8::internal::Builtin_HandleApiCall
v8::internal::JavaScriptFrame::is_at_function()

It smells like a crash that occurs while the victim is using an RTE of some sort (TinyMCE or somesuch), given that we die in the .normalize() call.

Sadly, I don't have any URLs or test cases at the moment.

Any points/ideas on what could be happening are appreciated.
Comment 1 Dimitri Glazkov (Google) 2009-05-27 11:27:11 PDT
Corresponding bug in Chromium Bug Tracker: http://crbug.com/9815
Comment 2 Alexey Proskuryakov 2009-05-28 04:24:27 PDT
Any info on when it started? r43809 had some changed to normalize(), but there's not enough data to blame this patch.
Comment 3 Dimitri Glazkov (Google) 2009-05-28 09:27:51 PDT
I did some digging, and found that this crash shows up as early as Chromium 2.0.158.0, which pulls WebKit at r39953.
Comment 4 Alexey Proskuryakov 2009-05-28 09:47:19 PDT
See also: bug 20471.
Comment 5 Alexey Proskuryakov 2009-05-28 09:57:26 PDT
<rdar://problem/6704578>

This (not the trace from the above bug, but this one) is known to happen when editing notes in Google Notebook sometimes.
Comment 6 Dimitri Glazkov (Google) 2009-05-28 10:04:03 PDT
What an awesome time to crash :) I bet users love this.
Comment 7 Eric Seidel (no email) 2009-06-01 16:34:18 PDT
Alexey, can you please attach some of the crash traces from crashreporter.  I find them much more readable than Chrome crash traces.

Without a reproducible case, this bug isn't worth much.  I'll stare at the code again a bit in a minute.
Comment 8 Jon@Chromium 2009-06-02 13:55:39 PDT
I tried playing with Google Notebook but I could not trigger the crash in WebKit nightly or Chromium.  I am sure it crashes because we continue to get lots of reports but I could not find a repro case.  I do know that there does not appear to be any extra DLLs causing trouble.
Comment 9 Eric Seidel (no email) 2009-06-02 17:12:31 PDT
Talked with Alexey.  He doesn't believe more complete stack traces would be of use, and he can't post them here anyway due to sekrit Safari symbols.

These look like the Node pointers are trashed.  Certainly the example:
http://crash/reportdetail?reportid=72f0a87465b7174
where it crashes (with the already pasted above stack trace) at 0x13.  0x13 doesn't even make sense that any of hte "this" in the stack is NULL.  I expect we're dealing with trashed or deleted memory here.
Comment 10 Eric Seidel (no email) 2009-06-03 12:15:35 PDT
WebCore::Node::nodeIndex()
WebCore::RangeBoundaryPoint::offset()
WebCore::boundaryTextNodesMerged
WebCore::Node::normalize()
WebCore::NodeInternal::normalizeCallback

is missing symbols:

normalize is calling:
            document()->textNodesMerged(nextText.get(), offset);

which in turn is calling Range::textNodesMerged and thus boundaryTextNodesMerged
Comment 11 Eric Seidel (no email) 2009-06-03 12:20:50 PDT
Ok, my first theory is that we're not detaching ranges correctly.  Ranges hold onto their Nodes, but Documents do not hold onto their attached ranges.  So if ranges fail to detach themselves, documents will still try to call methods on invalid ranges.

Range::~Range()
{
    if (m_start.container())
        m_ownerDocument->detachRange(this);

It's possible to change m_start using:
void Range::setStart(PassRefPtr<Node> refNode, int offset, ExceptionCode& ec)

I'm not sure yet if it's possible to pass refNode = 0.
Comment 12 Eric Seidel (no email) 2009-06-03 12:21:42 PDT
Nevermind, there is a check for:
    if (!refNode) {
        ec = NOT_FOUND_ERR;
        return;
    }

It still might be possible that something else is modifying m_start, thus causing a detach failure.  Again, this is only one theory.
Comment 13 Eric Seidel (no email) 2009-06-03 13:39:16 PDT
Actually, looks like we always attach ranges,and only detach them when they have a m_start.container().  That seems like this code:

for (var x = 0; x < 10000; x++)
   document.createRange();

would end up with lots of attached Ranges which never got detached.

Going to see if I can make a crashing test case. :)
Comment 14 Eric Seidel (no email) 2009-06-03 13:49:05 PDT
Hum... document.createRange() creates a range which starts and ends at the document node.  We need ranges which have no start in order to trip this bug.  Still investigating how to create those.
Comment 15 Eric Seidel (no email) 2009-06-03 15:23:16 PDT
Ok, I'm still not sure that this crash is caused by a deleted Range ending up left in the Document.  The two places in the code where RangeEndPoint::set() could be called with a null node, seem to be guarded:
void Range::textNodesMerged(NodeWithIndex& oldNode, unsigned offset) -- checks node()->previousSibling()
void Range::textNodeSplit(Text* oldNode) -- checks oldNode->nextSibling()

Then again, these crashes are in a release build.  This would of course be a no-brainer to fix if we had a reproducible case...
Comment 16 Eric Seidel (no email) 2009-06-03 17:25:34 PDT
Created attachment 30934 [details]
One attempt at a test case.
Comment 17 Eric Seidel (no email) 2009-06-03 17:41:56 PDT
Created attachment 30935 [details]
Speculative fix

 4 files changed, 52 insertions(+), 11 deletions(-)
Comment 18 Darin Adler 2009-06-04 09:10:11 PDT
Comment on attachment 30935 [details]
Speculative fix

I have no object to speculative fixes, but I don't see a theory of the bug here. If the theory is that someone m_start.container() becomes 0, could you tell me the hypothetical about how that could happen? Maybe with a little more checking into calls to functions on m_start we can come up with something stronger.
Comment 19 Eric Seidel (no email) 2009-06-04 09:20:02 PDT
(In reply to comment #18)
> (From update of attachment 30935 [details] [review])
> I have no object to speculative fixes, but I don't see a theory of the bug
> here. If the theory is that someone m_start.container() becomes 0, could you
> tell me the hypothetical about how that could happen? Maybe with a little more
> checking into calls to functions on m_start we can come up with something
> stronger.

Sure thing.  I can demonstrate m_start.container() becoming zero.  The attached test case does that by preventing the insertion of the split text node into the document.

// This call will hit an ASSERT due to void Range::textNodeSplit(Text* oldNode)
    // being called with oldNode->nextSibling() == NULL
    // In release builds, the ASSERT won't fire and a Range attached to the document
    // will end up with an invalid start.  Ranges with an invalid start aren't
    // properly detached in ~Range, causing the Document to end up with pointers
    // to deleted ranges.
    text.splitText(1);

Looking at the the code, RangeBoundaryPoint::set() seems like the only way that one can make a RBP have a .container() == NULL.  The only place it's called is from the text node handling code, which ASSERTs that ->nextSibling() and ->previousSibling() are != NULL, but doesn't handle when they are (for the release case).  The only case I was able to come up with was using dom mutation events.  There may be others though, I'm not sure.

I never got my test case to crash due to *this* bug, but I did get it to hit the various ASSERTs which try to guard against RBP's container() ever being NULL.  The test case to just hit those ASSERTs is much simpler than the one attached however. ;)

Making m_start.container() null was the only way I could think of to get a dead range into the Document's m_ranges list.  It's also possible that some unrelated memory smasher has smashed the Document, Node or Range.  But a deleted Range left in the Document's m_ranges list seems to be the most likely theory.
Comment 20 Darin Adler 2009-06-04 09:35:46 PDT
(In reply to comment #19)
> Sure thing. I can demonstrate m_start.container() becoming zero.

This is where we need to concentrate our fixes. We need to fix the mutation hooks to work properly in all cases, with changes either at the call sites or in the Range class.

Should be straightforward. Lets do it.
Comment 21 Darin Adler 2009-06-04 09:37:02 PDT
Comment on attachment 30935 [details]
Speculative fix

Lets fix the problems that cause ranges to get into a bad state instead. We can do this by hardening the DOM mutation calls into Range and/or tightening up the callers.
Comment 22 Eric Seidel (no email) 2009-06-04 09:47:04 PDT
(In reply to comment #20)
> (In reply to comment #19)
> > Sure thing. I can demonstrate m_start.container() becoming zero.
> 
> This is where we need to concentrate our fixes. We need to fix the mutation
> hooks to work properly in all cases, with changes either at the call sites or
> in the Range class.
> 
> Should be straightforward. Lets do it.

I disagree. :)

I did not find it straight-forward yesterday.  I spent a while getting this far.  Although now that I'm pretty sure this is due to dead Ranges hanging around in m_ranges on the Document, getting further should be easier.

I think the real broken-ness is using m_start.container() to mean "attached" or "detached".  Whether m_start.container() ever becomes invalid seems a separate question from "does the document still have a pointer to me".  One causes bad behavior, one causes crashes.  I'm trying to fix the crashes here, and ignoring the bad behavior for the moment.

The code seemed to fan out enough for me to mentally figure out what cases could be putting us into a bad state.  Fixing design not to crash seemed like the right approach.  This bad state is already caught by *numerous* assertions, but none of them are firing in release mode, and we have no reproducible case with which to hit them in debug mode.

Please re-consider.  I don't think patching this one case where someone uses mutation events is likely to fix all the crashes we've seen from this code (since my impression is so few sites use mutation events anyway).  If you stare at the code and see other case besides the mutation event one, I'm happy to fix those too, but no others were obvious to me yesterday.
Comment 23 Darin Adler 2009-06-04 09:47:42 PDT
(In reply to comment #22)
> I think the real broken-ness is using m_start.container() to mean "attached" or
> "detached".

I don't agree. I'll take a cut at fixing this myself.
Comment 24 Eric Seidel (no email) 2009-06-04 09:48:41 PDT
Yes, I expect that m_start.container() being null can probably cause other crashes too...  I guess the real problem is w/o a reproducible case I'm a blind man swatting flies.
Comment 25 Eric Seidel (no email) 2009-06-04 09:51:00 PDT
The only two uses of RangeBoundaryPoint::set():
http://trac.webkit.org/browser/trunk/WebCore/dom/Range.cpp#L1758
http://trac.webkit.org/browser/trunk/WebCore/dom/Range.cpp#L1788
(note that they're at the end of the document, further than Safari will scroll down due to the anchor)
Comment 26 Darin Adler 2009-06-04 10:14:19 PDT
(In reply to comment #24)
> Yes, I expect that m_start.container() being null can probably cause other
> crashes too...  I guess the real problem is w/o a reproducible case I'm a blind
> man swatting flies.

I think your sleuthing on this so far has been great! This won't be hard to fix now.
Comment 27 Darin Adler 2009-06-04 13:49:32 PDT
Created attachment 30960 [details]
fix; needs regression tests and perhaps to be split up since it fixes two bugs
Comment 28 Eric Seidel (no email) 2009-06-04 14:12:24 PDT
Comment on attachment 30960 [details]
fix; needs regression tests and perhaps to be split up since it fixes two bugs

I'm glad you took this on.

A couple comments about your intermediary patch:
I think we might still want to add a Document::isRangeAttached(Range*) function at least in #ifndef NDEBUG mode to make sure that the range is actually detached in the destructor.

It seems like this:
0     if (!isDetached())
99111         m_ownerDocument->detachRange(this);
would now make more sense to just call this->detach(), since after that call isDetached() will still return true!

I'm very glad that you made isDetached() its own function, that makes the code much clearer.  It's still worries me that we'll end up with these notions of attachment out of sync (as demonstrated in the destructor above).


 // because the repercussions of setting a container node of a boundary to 0 are serious.
seems a bit opaque.  We could link to this bug there, or explain that isDetach() depends on that constraint never being violated.

When I was writing the patch I considered documenting in Range.h Range's assumption that m_start and m_end will either both be null or neither be null.  And that null means detached.

Thanks again for taking this on.
Comment 29 Darin Adler 2009-06-05 18:01:17 PDT
> -    if (m_start.container())
> -        m_ownerDocument->detachRange(this);
> +    // We unconditionally detach as a speculative fix for https://bugs.webkit.org/show_bug.cgi?id=26044
> +    m_ownerDocument->detachRange(this);

I've changed my mind. Even though I don't like this comment, I do like this code change. I think we should fix it this way, and by only making this one change to this function. Then we can do the rest of my patch as a separate check-in.
Comment 30 Darin Adler 2009-06-05 18:06:25 PDT
Created attachment 31022 [details]
more work in progress
Comment 31 Eric Seidel (no email) 2009-06-11 14:15:15 PDT
Comment on attachment 30935 [details]
Speculative fix

Darin reviewed this (verbally) at the WebKit party.  Will land with updated comment shortly!
Comment 32 Eric Seidel (no email) 2009-06-11 14:21:21 PDT
Hopefully the adjusted comment meets Darin's expectations.

Darin plans to do more work here, but that can really be broken off into a separate issue.

I'll leave this open for darin to move his patch elsewhere.  I'll clear the review flag though.

Committing to http://svn.webkit.org/repository/webkit/trunk ...
	M	WebCore/ChangeLog
	M	WebCore/dom/Document.cpp
	M	WebCore/dom/Range.cpp
	M	WebCore/dom/RangeBoundaryPoint.h
Committed r44617
Comment 33 Eric Seidel (no email) 2009-06-11 14:21:37 PDT
Comment on attachment 30935 [details]
Speculative fix

Landed in r44617
Comment 34 Darin Adler 2009-06-14 22:17:23 PDT
This bug is fixed now. The work in progress is a fix for a different bug along with cleanup.