Bug 26044 - Crash at Node::nodeIndex()
: Crash at Node::nodeIndex()
Status: RESOLVED FIXED
: WebKit
Text
: 528+ (Nightly build)
: All All
: P2 Normal
Assigned To:
:
: InRadar
:
:
  Show dependency treegraph
 
Reported: 2009-05-27 11:24 PST by
Modified: 2009-06-14 22:17 PST (History)


Attachments
One attempt at a test case. (2.50 KB, application/x-javascript)
2009-06-03 17:25 PST, Eric Seidel
no flags Details
Speculative fix (5.16 KB, patch)
2009-06-03 17:41 PST, Eric Seidel
no flags Review Patch | Details | Formatted Diff | Diff
fix; needs regression tests and perhaps to be split up since it fixes two bugs (31.56 KB, patch)
2009-06-04 13:49 PST, Darin Adler
no flags Review Patch | Details | Formatted Diff | Diff
more work in progress (41.83 KB, patch)
2009-06-05 18:06 PST, Darin Adler
no flags Review Patch | Details | Formatted Diff | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2009-05-27 11:24:05 PST
Looking at various crash reports, I am seeing a fairly common crash that looks like:

*CRASHED* (EXC_BAD_ACCESS / KERN_INVALID_ADDRESS @0x90e43c50)

WebCore::Node::nodeIndex() const
WebCore::Range::textNodesMerged(WebCore::NodeWithIndex&, unsigned int)
WebCore::Document::textNodesMerged(WebCore::Text*, unsigned int)
WebCore::Node::normalize()
WebCore::jsNodePrototypeFunctionNormalize(JSC::ExecState*, JSC::JSObject*, JSC::JSValuePtr, JSC::ArgList const&)
JSC::Interpreter::cti_op_call_NotJSFunction(void*, ...)

or for V8:

*CRASHED* (EXCEPTION_ACCESS_VIOLATION @0x00720078)

WebCore::Node::nodeIndex()
WebCore::RangeBoundaryPoint::offset()
WebCore::boundaryTextNodesMerged
WebCore::Node::normalize()
WebCore::NodeInternal::normalizeCallback
v8::internal::Builtin_HandleApiCall
v8::internal::JavaScriptFrame::is_at_function()

It smells like a crash that occurs while the victim is using an RTE of some sort (TinyMCE or somesuch), given that we die in the .normalize() call.

Sadly, I don't have any URLs or test cases at the moment.

Any points/ideas on what could be happening are appreciated.
------- Comment #1 From 2009-05-27 11:27:11 PST -------
Corresponding bug in Chromium Bug Tracker: http://crbug.com/9815
------- Comment #2 From 2009-05-28 04:24:27 PST -------
Any info on when it started? r43809 had some changed to normalize(), but there's not enough data to blame this patch.
------- Comment #3 From 2009-05-28 09:27:51 PST -------
I did some digging, and found that this crash shows up as early as Chromium 2.0.158.0, which pulls WebKit at r39953.
------- Comment #4 From 2009-05-28 09:47:19 PST -------
See also: bug 20471.
------- Comment #5 From 2009-05-28 09:57:26 PST -------
<rdar://problem/6704578>

This (not the trace from the above bug, but this one) is known to happen when editing notes in Google Notebook sometimes.
------- Comment #6 From 2009-05-28 10:04:03 PST -------
What an awesome time to crash :) I bet users love this.
------- Comment #7 From 2009-06-01 16:34:18 PST -------
Alexey, can you please attach some of the crash traces from crashreporter.  I find them much more readable than Chrome crash traces.

Without a reproducible case, this bug isn't worth much.  I'll stare at the code again a bit in a minute.
------- Comment #8 From 2009-06-02 13:55:39 PST -------
I tried playing with Google Notebook but I could not trigger the crash in WebKit nightly or Chromium.  I am sure it crashes because we continue to get lots of reports but I could not find a repro case.  I do know that there does not appear to be any extra DLLs causing trouble.
------- Comment #9 From 2009-06-02 17:12:31 PST -------
Talked with Alexey.  He doesn't believe more complete stack traces would be of use, and he can't post them here anyway due to sekrit Safari symbols.

These look like the Node pointers are trashed.  Certainly the example:
http://crash/reportdetail?reportid=72f0a87465b7174
where it crashes (with the already pasted above stack trace) at 0x13.  0x13 doesn't even make sense that any of hte "this" in the stack is NULL.  I expect we're dealing with trashed or deleted memory here.
------- Comment #10 From 2009-06-03 12:15:35 PST -------
WebCore::Node::nodeIndex()
WebCore::RangeBoundaryPoint::offset()
WebCore::boundaryTextNodesMerged
WebCore::Node::normalize()
WebCore::NodeInternal::normalizeCallback

is missing symbols:

normalize is calling:
            document()->textNodesMerged(nextText.get(), offset);

which in turn is calling Range::textNodesMerged and thus boundaryTextNodesMerged
------- Comment #11 From 2009-06-03 12:20:50 PST -------
Ok, my first theory is that we're not detaching ranges correctly.  Ranges hold onto their Nodes, but Documents do not hold onto their attached ranges.  So if ranges fail to detach themselves, documents will still try to call methods on invalid ranges.

Range::~Range()
{
    if (m_start.container())
        m_ownerDocument->detachRange(this);

It's possible to change m_start using:
void Range::setStart(PassRefPtr<Node> refNode, int offset, ExceptionCode& ec)

I'm not sure yet if it's possible to pass refNode = 0.
------- Comment #12 From 2009-06-03 12:21:42 PST -------
Nevermind, there is a check for:
    if (!refNode) {
        ec = NOT_FOUND_ERR;
        return;
    }

It still might be possible that something else is modifying m_start, thus causing a detach failure.  Again, this is only one theory.
------- Comment #13 From 2009-06-03 13:39:16 PST -------
Actually, looks like we always attach ranges,and only detach them when they have a m_start.container().  That seems like this code:

for (var x = 0; x < 10000; x++)
   document.createRange();

would end up with lots of attached Ranges which never got detached.

Going to see if I can make a crashing test case. :)
------- Comment #14 From 2009-06-03 13:49:05 PST -------
Hum... document.createRange() creates a range which starts and ends at the document node.  We need ranges which have no start in order to trip this bug.  Still investigating how to create those.
------- Comment #15 From 2009-06-03 15:23:16 PST -------
Ok, I'm still not sure that this crash is caused by a deleted Range ending up left in the Document.  The two places in the code where RangeEndPoint::set() could be called with a null node, seem to be guarded:
void Range::textNodesMerged(NodeWithIndex& oldNode, unsigned offset) -- checks node()->previousSibling()
void Range::textNodeSplit(Text* oldNode) -- checks oldNode->nextSibling()

Then again, these crashes are in a release build.  This would of course be a no-brainer to fix if we had a reproducible case...
------- Comment #16 From 2009-06-03 17:25:34 PST -------
Created an attachment (id=30934) [details]
One attempt at a test case.
------- Comment #17 From 2009-06-03 17:41:56 PST -------
Created an attachment (id=30935) [details]
Speculative fix

 4 files changed, 52 insertions(+), 11 deletions(-)
------- Comment #18 From 2009-06-04 09:10:11 PST -------
(From update of attachment 30935 [details])
I have no object to speculative fixes, but I don't see a theory of the bug here. If the theory is that someone m_start.container() becomes 0, could you tell me the hypothetical about how that could happen? Maybe with a little more checking into calls to functions on m_start we can come up with something stronger.
------- Comment #19 From 2009-06-04 09:20:02 PST -------
(In reply to comment #18)
> (From update of attachment 30935 [details] [review])
> I have no object to speculative fixes, but I don't see a theory of the bug
> here. If the theory is that someone m_start.container() becomes 0, could you
> tell me the hypothetical about how that could happen? Maybe with a little more
> checking into calls to functions on m_start we can come up with something
> stronger.

Sure thing.  I can demonstrate m_start.container() becoming zero.  The attached test case does that by preventing the insertion of the split text node into the document.

// This call will hit an ASSERT due to void Range::textNodeSplit(Text* oldNode)
    // being called with oldNode->nextSibling() == NULL
    // In release builds, the ASSERT won't fire and a Range attached to the document
    // will end up with an invalid start.  Ranges with an invalid start aren't
    // properly detached in ~Range, causing the Document to end up with pointers
    // to deleted ranges.
    text.splitText(1);

Looking at the the code, RangeBoundaryPoint::set() seems like the only way that one can make a RBP have a .container() == NULL.  The only place it's called is from the text node handling code, which ASSERTs that ->nextSibling() and ->previousSibling() are != NULL, but doesn't handle when they are (for the release case).  The only case I was able to come up with was using dom mutation events.  There may be others though, I'm not sure.

I never got my test case to crash due to *this* bug, but I did get it to hit the various ASSERTs which try to guard against RBP's container() ever being NULL.  The test case to just hit those ASSERTs is much simpler than the one attached however. ;)

Making m_start.container() null was the only way I could think of to get a dead range into the Document's m_ranges list.  It's also possible that some unrelated memory smasher has smashed the Document, Node or Range.  But a deleted Range left in the Document's m_ranges list seems to be the most likely theory.
------- Comment #20 From 2009-06-04 09:35:46 PST -------
(In reply to comment #19)
> Sure thing. I can demonstrate m_start.container() becoming zero.

This is where we need to concentrate our fixes. We need to fix the mutation hooks to work properly in all cases, with changes either at the call sites or in the Range class.

Should be straightforward. Lets do it.
------- Comment #21 From 2009-06-04 09:37:02 PST -------
(From update of attachment 30935 [details])
Lets fix the problems that cause ranges to get into a bad state instead. We can do this by hardening the DOM mutation calls into Range and/or tightening up the callers.
------- Comment #22 From 2009-06-04 09:47:04 PST -------
(In reply to comment #20)
> (In reply to comment #19)
> > Sure thing. I can demonstrate m_start.container() becoming zero.
> 
> This is where we need to concentrate our fixes. We need to fix the mutation
> hooks to work properly in all cases, with changes either at the call sites or
> in the Range class.
> 
> Should be straightforward. Lets do it.

I disagree. :)

I did not find it straight-forward yesterday.  I spent a while getting this far.  Although now that I'm pretty sure this is due to dead Ranges hanging around in m_ranges on the Document, getting further should be easier.

I think the real broken-ness is using m_start.container() to mean "attached" or "detached".  Whether m_start.container() ever becomes invalid seems a separate question from "does the document still have a pointer to me".  One causes bad behavior, one causes crashes.  I'm trying to fix the crashes here, and ignoring the bad behavior for the moment.

The code seemed to fan out enough for me to mentally figure out what cases could be putting us into a bad state.  Fixing design not to crash seemed like the right approach.  This bad state is already caught by *numerous* assertions, but none of them are firing in release mode, and we have no reproducible case with which to hit them in debug mode.

Please re-consider.  I don't think patching this one case where someone uses mutation events is likely to fix all the crashes we've seen from this code (since my impression is so few sites use mutation events anyway).  If you stare at the code and see other case besides the mutation event one, I'm happy to fix those too, but no others were obvious to me yesterday.
------- Comment #23 From 2009-06-04 09:47:42 PST -------
(In reply to comment #22)
> I think the real broken-ness is using m_start.container() to mean "attached" or
> "detached".

I don't agree. I'll take a cut at fixing this myself.
------- Comment #24 From 2009-06-04 09:48:41 PST -------
Yes, I expect that m_start.container() being null can probably cause other crashes too...  I guess the real problem is w/o a reproducible case I'm a blind man swatting flies.
------- Comment #25 From 2009-06-04 09:51:00 PST -------
The only two uses of RangeBoundaryPoint::set():
http://trac.webkit.org/browser/trunk/WebCore/dom/Range.cpp#L1758
http://trac.webkit.org/browser/trunk/WebCore/dom/Range.cpp#L1788
(note that they're at the end of the document, further than Safari will scroll down due to the anchor)
------- Comment #26 From 2009-06-04 10:14:19 PST -------
(In reply to comment #24)
> Yes, I expect that m_start.container() being null can probably cause other
> crashes too...  I guess the real problem is w/o a reproducible case I'm a blind
> man swatting flies.

I think your sleuthing on this so far has been great! This won't be hard to fix now.
------- Comment #27 From 2009-06-04 13:49:32 PST -------
Created an attachment (id=30960) [details]
fix; needs regression tests and perhaps to be split up since it fixes two bugs
------- Comment #28 From 2009-06-04 14:12:24 PST -------
(From update of attachment 30960 [details])
I'm glad you took this on.

A couple comments about your intermediary patch:
I think we might still want to add a Document::isRangeAttached(Range*) function at least in #ifndef NDEBUG mode to make sure that the range is actually detached in the destructor.

It seems like this:
0     if (!isDetached())
99111         m_ownerDocument->detachRange(this);
would now make more sense to just call this->detach(), since after that call isDetached() will still return true!

I'm very glad that you made isDetached() its own function, that makes the code much clearer.  It's still worries me that we'll end up with these notions of attachment out of sync (as demonstrated in the destructor above).


 // because the repercussions of setting a container node of a boundary to 0 are serious.
seems a bit opaque.  We could link to this bug there, or explain that isDetach() depends on that constraint never being violated.

When I was writing the patch I considered documenting in Range.h Range's assumption that m_start and m_end will either both be null or neither be null.  And that null means detached.

Thanks again for taking this on.
------- Comment #29 From 2009-06-05 18:01:17 PST -------
> -    if (m_start.container())
> -        m_ownerDocument->detachRange(this);
> +    // We unconditionally detach as a speculative fix for https://bugs.webkit.org/show_bug.cgi?id=26044
> +    m_ownerDocument->detachRange(this);

I've changed my mind. Even though I don't like this comment, I do like this code change. I think we should fix it this way, and by only making this one change to this function. Then we can do the rest of my patch as a separate check-in.
------- Comment #30 From 2009-06-05 18:06:25 PST -------
Created an attachment (id=31022) [details]
more work in progress
------- Comment #31 From 2009-06-11 14:15:15 PST -------
(From update of attachment 30935 [details])
Darin reviewed this (verbally) at the WebKit party.  Will land with updated comment shortly!
------- Comment #32 From 2009-06-11 14:21:21 PST -------
Hopefully the adjusted comment meets Darin's expectations.

Darin plans to do more work here, but that can really be broken off into a separate issue.

I'll leave this open for darin to move his patch elsewhere.  I'll clear the review flag though.

Committing to http://svn.webkit.org/repository/webkit/trunk ...
    M    WebCore/ChangeLog
    M    WebCore/dom/Document.cpp
    M    WebCore/dom/Range.cpp
    M    WebCore/dom/RangeBoundaryPoint.h
Committed r44617
------- Comment #33 From 2009-06-11 14:21:37 PST -------
(From update of attachment 30935 [details])
Landed in r44617
------- Comment #34 From 2009-06-14 22:17:23 PST -------
This bug is fixed now. The work in progress is a fix for a different bug along with cleanup.