Bug 26044

Summary: Crash at Node::nodeIndex()
Product: WebKit Reporter: Dimitri Glazkov (Google) <dglazkov>
Component: TextAssignee: Darin Adler <darin>
Status: RESOLVED FIXED    
Severity: Normal CC: ap, darin, eric, justin.garcia, ojan
Priority: P2 Keywords: InRadar
Version: 528+ (Nightly build)   
Hardware: All   
OS: All   
Attachments:
Description Flags
One attempt at a test case.
none
Speculative fix
none
fix; needs regression tests and perhaps to be split up since it fixes two bugs
none
more work in progress none

Dimitri Glazkov (Google)
Reported 2009-05-27 11:24:05 PDT
Looking at various crash reports, I am seeing a fairly common crash that looks like: *CRASHED* (EXC_BAD_ACCESS / KERN_INVALID_ADDRESS @0x90e43c50) WebCore::Node::nodeIndex() const WebCore::Range::textNodesMerged(WebCore::NodeWithIndex&, unsigned int) WebCore::Document::textNodesMerged(WebCore::Text*, unsigned int) WebCore::Node::normalize() WebCore::jsNodePrototypeFunctionNormalize(JSC::ExecState*, JSC::JSObject*, JSC::JSValuePtr, JSC::ArgList const&) JSC::Interpreter::cti_op_call_NotJSFunction(void*, ...) or for V8: *CRASHED* (EXCEPTION_ACCESS_VIOLATION @0x00720078) WebCore::Node::nodeIndex() WebCore::RangeBoundaryPoint::offset() WebCore::boundaryTextNodesMerged WebCore::Node::normalize() WebCore::NodeInternal::normalizeCallback v8::internal::Builtin_HandleApiCall v8::internal::JavaScriptFrame::is_at_function() It smells like a crash that occurs while the victim is using an RTE of some sort (TinyMCE or somesuch), given that we die in the .normalize() call. Sadly, I don't have any URLs or test cases at the moment. Any points/ideas on what could be happening are appreciated.
Attachments
One attempt at a test case. (2.50 KB, application/x-javascript)
2009-06-03 17:25 PDT, Eric Seidel (no email)
no flags
Speculative fix (5.16 KB, patch)
2009-06-03 17:41 PDT, Eric Seidel (no email)
no flags
fix; needs regression tests and perhaps to be split up since it fixes two bugs (31.56 KB, patch)
2009-06-04 13:49 PDT, Darin Adler
no flags
more work in progress (41.83 KB, patch)
2009-06-05 18:06 PDT, Darin Adler
no flags
Dimitri Glazkov (Google)
Comment 1 2009-05-27 11:27:11 PDT
Corresponding bug in Chromium Bug Tracker: http://crbug.com/9815
Alexey Proskuryakov
Comment 2 2009-05-28 04:24:27 PDT
Any info on when it started? r43809 had some changed to normalize(), but there's not enough data to blame this patch.
Dimitri Glazkov (Google)
Comment 3 2009-05-28 09:27:51 PDT
I did some digging, and found that this crash shows up as early as Chromium 2.0.158.0, which pulls WebKit at r39953.
Alexey Proskuryakov
Comment 4 2009-05-28 09:47:19 PDT
See also: bug 20471.
Alexey Proskuryakov
Comment 5 2009-05-28 09:57:26 PDT
<rdar://problem/6704578> This (not the trace from the above bug, but this one) is known to happen when editing notes in Google Notebook sometimes.
Dimitri Glazkov (Google)
Comment 6 2009-05-28 10:04:03 PDT
What an awesome time to crash :) I bet users love this.
Eric Seidel (no email)
Comment 7 2009-06-01 16:34:18 PDT
Alexey, can you please attach some of the crash traces from crashreporter. I find them much more readable than Chrome crash traces. Without a reproducible case, this bug isn't worth much. I'll stare at the code again a bit in a minute.
Jon@Chromium
Comment 8 2009-06-02 13:55:39 PDT
I tried playing with Google Notebook but I could not trigger the crash in WebKit nightly or Chromium. I am sure it crashes because we continue to get lots of reports but I could not find a repro case. I do know that there does not appear to be any extra DLLs causing trouble.
Eric Seidel (no email)
Comment 9 2009-06-02 17:12:31 PDT
Talked with Alexey. He doesn't believe more complete stack traces would be of use, and he can't post them here anyway due to sekrit Safari symbols. These look like the Node pointers are trashed. Certainly the example: http://crash/reportdetail?reportid=72f0a87465b7174 where it crashes (with the already pasted above stack trace) at 0x13. 0x13 doesn't even make sense that any of hte "this" in the stack is NULL. I expect we're dealing with trashed or deleted memory here.
Eric Seidel (no email)
Comment 10 2009-06-03 12:15:35 PDT
WebCore::Node::nodeIndex() WebCore::RangeBoundaryPoint::offset() WebCore::boundaryTextNodesMerged WebCore::Node::normalize() WebCore::NodeInternal::normalizeCallback is missing symbols: normalize is calling: document()->textNodesMerged(nextText.get(), offset); which in turn is calling Range::textNodesMerged and thus boundaryTextNodesMerged
Eric Seidel (no email)
Comment 11 2009-06-03 12:20:50 PDT
Ok, my first theory is that we're not detaching ranges correctly. Ranges hold onto their Nodes, but Documents do not hold onto their attached ranges. So if ranges fail to detach themselves, documents will still try to call methods on invalid ranges. Range::~Range() { if (m_start.container()) m_ownerDocument->detachRange(this); It's possible to change m_start using: void Range::setStart(PassRefPtr<Node> refNode, int offset, ExceptionCode& ec) I'm not sure yet if it's possible to pass refNode = 0.
Eric Seidel (no email)
Comment 12 2009-06-03 12:21:42 PDT
Nevermind, there is a check for: if (!refNode) { ec = NOT_FOUND_ERR; return; } It still might be possible that something else is modifying m_start, thus causing a detach failure. Again, this is only one theory.
Eric Seidel (no email)
Comment 13 2009-06-03 13:39:16 PDT
Actually, looks like we always attach ranges,and only detach them when they have a m_start.container(). That seems like this code: for (var x = 0; x < 10000; x++) document.createRange(); would end up with lots of attached Ranges which never got detached. Going to see if I can make a crashing test case. :)
Eric Seidel (no email)
Comment 14 2009-06-03 13:49:05 PDT
Hum... document.createRange() creates a range which starts and ends at the document node. We need ranges which have no start in order to trip this bug. Still investigating how to create those.
Eric Seidel (no email)
Comment 15 2009-06-03 15:23:16 PDT
Ok, I'm still not sure that this crash is caused by a deleted Range ending up left in the Document. The two places in the code where RangeEndPoint::set() could be called with a null node, seem to be guarded: void Range::textNodesMerged(NodeWithIndex& oldNode, unsigned offset) -- checks node()->previousSibling() void Range::textNodeSplit(Text* oldNode) -- checks oldNode->nextSibling() Then again, these crashes are in a release build. This would of course be a no-brainer to fix if we had a reproducible case...
Eric Seidel (no email)
Comment 16 2009-06-03 17:25:34 PDT
Created attachment 30934 [details] One attempt at a test case.
Eric Seidel (no email)
Comment 17 2009-06-03 17:41:56 PDT
Created attachment 30935 [details] Speculative fix 4 files changed, 52 insertions(+), 11 deletions(-)
Darin Adler
Comment 18 2009-06-04 09:10:11 PDT
Comment on attachment 30935 [details] Speculative fix I have no object to speculative fixes, but I don't see a theory of the bug here. If the theory is that someone m_start.container() becomes 0, could you tell me the hypothetical about how that could happen? Maybe with a little more checking into calls to functions on m_start we can come up with something stronger.
Eric Seidel (no email)
Comment 19 2009-06-04 09:20:02 PDT
(In reply to comment #18) > (From update of attachment 30935 [details] [review]) > I have no object to speculative fixes, but I don't see a theory of the bug > here. If the theory is that someone m_start.container() becomes 0, could you > tell me the hypothetical about how that could happen? Maybe with a little more > checking into calls to functions on m_start we can come up with something > stronger. Sure thing. I can demonstrate m_start.container() becoming zero. The attached test case does that by preventing the insertion of the split text node into the document. // This call will hit an ASSERT due to void Range::textNodeSplit(Text* oldNode) // being called with oldNode->nextSibling() == NULL // In release builds, the ASSERT won't fire and a Range attached to the document // will end up with an invalid start. Ranges with an invalid start aren't // properly detached in ~Range, causing the Document to end up with pointers // to deleted ranges. text.splitText(1); Looking at the the code, RangeBoundaryPoint::set() seems like the only way that one can make a RBP have a .container() == NULL. The only place it's called is from the text node handling code, which ASSERTs that ->nextSibling() and ->previousSibling() are != NULL, but doesn't handle when they are (for the release case). The only case I was able to come up with was using dom mutation events. There may be others though, I'm not sure. I never got my test case to crash due to *this* bug, but I did get it to hit the various ASSERTs which try to guard against RBP's container() ever being NULL. The test case to just hit those ASSERTs is much simpler than the one attached however. ;) Making m_start.container() null was the only way I could think of to get a dead range into the Document's m_ranges list. It's also possible that some unrelated memory smasher has smashed the Document, Node or Range. But a deleted Range left in the Document's m_ranges list seems to be the most likely theory.
Darin Adler
Comment 20 2009-06-04 09:35:46 PDT
(In reply to comment #19) > Sure thing. I can demonstrate m_start.container() becoming zero. This is where we need to concentrate our fixes. We need to fix the mutation hooks to work properly in all cases, with changes either at the call sites or in the Range class. Should be straightforward. Lets do it.
Darin Adler
Comment 21 2009-06-04 09:37:02 PDT
Comment on attachment 30935 [details] Speculative fix Lets fix the problems that cause ranges to get into a bad state instead. We can do this by hardening the DOM mutation calls into Range and/or tightening up the callers.
Eric Seidel (no email)
Comment 22 2009-06-04 09:47:04 PDT
(In reply to comment #20) > (In reply to comment #19) > > Sure thing. I can demonstrate m_start.container() becoming zero. > > This is where we need to concentrate our fixes. We need to fix the mutation > hooks to work properly in all cases, with changes either at the call sites or > in the Range class. > > Should be straightforward. Lets do it. I disagree. :) I did not find it straight-forward yesterday. I spent a while getting this far. Although now that I'm pretty sure this is due to dead Ranges hanging around in m_ranges on the Document, getting further should be easier. I think the real broken-ness is using m_start.container() to mean "attached" or "detached". Whether m_start.container() ever becomes invalid seems a separate question from "does the document still have a pointer to me". One causes bad behavior, one causes crashes. I'm trying to fix the crashes here, and ignoring the bad behavior for the moment. The code seemed to fan out enough for me to mentally figure out what cases could be putting us into a bad state. Fixing design not to crash seemed like the right approach. This bad state is already caught by *numerous* assertions, but none of them are firing in release mode, and we have no reproducible case with which to hit them in debug mode. Please re-consider. I don't think patching this one case where someone uses mutation events is likely to fix all the crashes we've seen from this code (since my impression is so few sites use mutation events anyway). If you stare at the code and see other case besides the mutation event one, I'm happy to fix those too, but no others were obvious to me yesterday.
Darin Adler
Comment 23 2009-06-04 09:47:42 PDT
(In reply to comment #22) > I think the real broken-ness is using m_start.container() to mean "attached" or > "detached". I don't agree. I'll take a cut at fixing this myself.
Eric Seidel (no email)
Comment 24 2009-06-04 09:48:41 PDT
Yes, I expect that m_start.container() being null can probably cause other crashes too... I guess the real problem is w/o a reproducible case I'm a blind man swatting flies.
Eric Seidel (no email)
Comment 25 2009-06-04 09:51:00 PDT
The only two uses of RangeBoundaryPoint::set(): http://trac.webkit.org/browser/trunk/WebCore/dom/Range.cpp#L1758 http://trac.webkit.org/browser/trunk/WebCore/dom/Range.cpp#L1788 (note that they're at the end of the document, further than Safari will scroll down due to the anchor)
Darin Adler
Comment 26 2009-06-04 10:14:19 PDT
(In reply to comment #24) > Yes, I expect that m_start.container() being null can probably cause other > crashes too... I guess the real problem is w/o a reproducible case I'm a blind > man swatting flies. I think your sleuthing on this so far has been great! This won't be hard to fix now.
Darin Adler
Comment 27 2009-06-04 13:49:32 PDT
Created attachment 30960 [details] fix; needs regression tests and perhaps to be split up since it fixes two bugs
Eric Seidel (no email)
Comment 28 2009-06-04 14:12:24 PDT
Comment on attachment 30960 [details] fix; needs regression tests and perhaps to be split up since it fixes two bugs I'm glad you took this on. A couple comments about your intermediary patch: I think we might still want to add a Document::isRangeAttached(Range*) function at least in #ifndef NDEBUG mode to make sure that the range is actually detached in the destructor. It seems like this: 0 if (!isDetached()) 99111 m_ownerDocument->detachRange(this); would now make more sense to just call this->detach(), since after that call isDetached() will still return true! I'm very glad that you made isDetached() its own function, that makes the code much clearer. It's still worries me that we'll end up with these notions of attachment out of sync (as demonstrated in the destructor above). // because the repercussions of setting a container node of a boundary to 0 are serious. seems a bit opaque. We could link to this bug there, or explain that isDetach() depends on that constraint never being violated. When I was writing the patch I considered documenting in Range.h Range's assumption that m_start and m_end will either both be null or neither be null. And that null means detached. Thanks again for taking this on.
Darin Adler
Comment 29 2009-06-05 18:01:17 PDT
> - if (m_start.container()) > - m_ownerDocument->detachRange(this); > + // We unconditionally detach as a speculative fix for https://bugs.webkit.org/show_bug.cgi?id=26044 > + m_ownerDocument->detachRange(this); I've changed my mind. Even though I don't like this comment, I do like this code change. I think we should fix it this way, and by only making this one change to this function. Then we can do the rest of my patch as a separate check-in.
Darin Adler
Comment 30 2009-06-05 18:06:25 PDT
Created attachment 31022 [details] more work in progress
Eric Seidel (no email)
Comment 31 2009-06-11 14:15:15 PDT
Comment on attachment 30935 [details] Speculative fix Darin reviewed this (verbally) at the WebKit party. Will land with updated comment shortly!
Eric Seidel (no email)
Comment 32 2009-06-11 14:21:21 PDT
Hopefully the adjusted comment meets Darin's expectations. Darin plans to do more work here, but that can really be broken off into a separate issue. I'll leave this open for darin to move his patch elsewhere. I'll clear the review flag though. Committing to http://svn.webkit.org/repository/webkit/trunk ... M WebCore/ChangeLog M WebCore/dom/Document.cpp M WebCore/dom/Range.cpp M WebCore/dom/RangeBoundaryPoint.h Committed r44617
Eric Seidel (no email)
Comment 33 2009-06-11 14:21:37 PDT
Comment on attachment 30935 [details] Speculative fix Landed in r44617
Darin Adler
Comment 34 2009-06-14 22:17:23 PDT
This bug is fixed now. The work in progress is a fix for a different bug along with cleanup.
Note You need to log in before you can comment on or make changes to this bug.