Bug 139929 - AX: Crash: com.apple.WebKit.WebContent at com.apple.WebCore: WebCore::AXObjectCache::clearTextMarkerNodesInUse + 149
Summary: AX: Crash: com.apple.WebKit.WebContent at com.apple.WebCore: WebCore::AXObjec...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Accessibility (show other bugs)
Version: 528+ (Nightly build)
Hardware: All All
: P2 Normal
Assignee: chris fleizach
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2014-12-23 17:12 PST by chris fleizach
Modified: 2015-01-07 08:56 PST (History)
14 users (show)

See Also:


Attachments
patch (3.76 KB, patch)
2014-12-23 17:23 PST, chris fleizach
no flags Details | Formatted Diff | Diff
patch (5.78 KB, patch)
2015-01-04 00:22 PST, chris fleizach
darin: review+
buildbot: commit-queue-
Details | Formatted Diff | Diff
Archive of layout-test-results from ews101 for mac-mountainlion (514.12 KB, application/zip)
2015-01-04 01:08 PST, Build Bot
no flags Details
Archive of layout-test-results from ews106 for mac-mountainlion-wk2 (525.16 KB, application/zip)
2015-01-04 01:12 PST, Build Bot
no flags Details
final patch for running through EWS (9.08 KB, patch)
2015-01-06 18:41 PST, chris fleizach
buildbot: commit-queue-
Details | Formatted Diff | Diff
Archive of layout-test-results from ews100 for mac-mountainlion (624.88 KB, application/zip)
2015-01-06 19:21 PST, Build Bot
no flags Details
Archive of layout-test-results from ews106 for mac-mountainlion-wk2 (918.74 KB, application/zip)
2015-01-06 19:34 PST, Build Bot
no flags Details
final patch with all the files (10.24 KB, patch)
2015-01-06 22:59 PST, chris fleizach
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description chris fleizach 2014-12-23 17:12:46 PST
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   com.apple.WebCore             	0x00007fff90ee7c55 WebCore::AXObjectCache::clearTextMarkerNodesInUse(WebCore::Document*) + 149
1   com.apple.WebCore             	0x00007fff90d34bb6 WebCore::Frame::disconnectOwnerElement() + 54
2   com.apple.WebCore             	0x00007fff90d349e6 WebCore::HTMLFrameOwnerElement::disconnectContentFrame() + 38
3   com.apple.WebCore             	0x00007fff90f9e0cf WebCore::disconnectSubframes(WebCore::ContainerNode&, WebCore::SubframeDisconnectPolicy) + 271
4   com.apple.WebCore             	0x00007fff90c3233e WebCore::Document::prepareForDestruction() + 30
5   com.apple.WebCore             	0x00007fff90babf8e WebCore::Frame::setView(WTF::PassRefPtr<WebCore::FrameView>) + 62
6   com.apple.WebCore             	0x00007fff90babd4c WebCore::Frame::createView(WebCore::IntSize const&, WebCore::Color const&, bool, WebCore::IntSize const&, WebCore::IntRect const&, bool, WebCore::ScrollbarMode, bool, WebCore::ScrollbarMode, bool) + 124
7   com.apple.WebKit              	0x00007fff9a99b059 WebKit::WebFrameLoaderClient::transitionToCommittedForNewPage() + 241
8   com.apple.WebCore             	0x00007fff9113a9c6 WebCore::FrameLoader::transitionToCommitted(WebCore::CachedPage*) + 598
9   com.apple.WebCore             	0x00007fff90bab11a WebCore::FrameLoader::commitProvisionalLoad() + 378
10  com.apple.WebCore             	0x00007fff90c4bc6d WebCore::DocumentLoader::commitLoad(char const*, int) + 77
11  com.apple.WebCore             	0x00007fff90c4b88d WebCore::DocumentLoader::dataReceived(WebCore::CachedResource*, char const*, int) + 413
12  com.apple.WebCore             	0x00007fff90c4b511 WebCore::CachedRawResource::notifyClientsDataWasReceived(char const*, unsigned int) + 177
13  com.apple.WebCore             	0x00007fff90c4b0ea WebCore::CachedRawResource::addDataBuffer(WebCore::ResourceBuffer*) + 170
14  com.apple.WebCore             	0x00007fff90c4adc2 WebCore::SubresourceLoader::didReceiveDataOrBuffer(char const*, int, WTF::PassRefPtr<WebCore::SharedBuffer>, long long, WebCore::DataPayloadType) + 210
15  com.apple.WebCore             	0x00007fff918527a3 WebCore::SubresourceLoader::didReceiveData(char const*, unsigned int, long long, WebCore::DataPayloadType) + 35
16  com.apple.WebKit              	0x00007fff9ab7d3f7 WebKit::WebResourceLoader::didReceiveWebResourceLoaderMessage(IPC::Connection*, IPC::MessageDecoder&) + 463
17  com.apple.WebKit              	0x00007fff9aa2b16a IPC::Connection::dispatchMessage(std::__1::unique_ptr<IPC::MessageDecoder, std::__1::default_delete<IPC::MessageDecoder> >) + 94
Comment 1 chris fleizach 2014-12-23 17:12:55 PST
<rdar://problem/18588631>
Comment 2 chris fleizach 2014-12-23 17:23:24 PST
Created attachment 243715 [details]
patch

Patch is speculative. I was not able to reproduce this crash unfortunately and could not contrive a scenario that would cause it to happen
Comment 3 Darin Adler 2014-12-24 12:12:01 PST
Comment on attachment 243715 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=243715&action=review

I’m going to say review+ but I have some serious reservations with the code, and I think I have a guess what the real problem is, too.

> Source/WebCore/accessibility/AXObjectCache.cpp:1026
> -    cache->setNodeInUse(domNode);
> +    cache->setNodeInUse(domNode, &domNode->document());

It seems like the setNodeInUse function could easily get at the document itself; no good reason to pass it in here. Also, no reason to change it to a Document* before passing it in. I would make the argument a Document&.

> Source/WebCore/accessibility/AXObjectCache.cpp:1050
> +    // [Bug: 139929] Note: this used to ask the node if it was in a document, but there
> +    //    are some cases where nodes are not being cleaned up and that can crash.
> +    //    Instead we can store the document and node, and when we clean up the document we can
> +    //    remove all associated nodes.

This problem with this comment is that having stale pointers to deleted nodes will cause other problems too. Just making this one crash go away doesn’t solve the whole problem. The comment doesn’t make that clear, even though it’s a long comment. I think the comment should focus on “why” rather than reiterating exactly what we do, since you can see the what part from reading the code.

The “there are some cases” phrase here is not a good thing to write. We need to explain why, not simply state this as fact.

The “[Bug: 139929] Note:” is not a format we have used elsewhere in WebKit and I’d prefer not to do that.

> Source/WebCore/accessibility/AXObjectCache.cpp:1055
> +    for (const auto& node : m_textMarkerNodes.keys()) {
> +        if (m_textMarkerNodes.get(node) == document)
>              nodesToDelete.add(node);
>      }

When we see a loop like this, it usually indicates that the data structure is wrong. Normally we’d want to keep a set of nodes indexed by the document so we didn’t have to iterate the hash table just to find the nodes associated with a given document. That is a pre-existing problem with the data structure we were using. A good data structure for this might be a HashMap<Document*, HashSet<Node*>> alongside HashMap<Node*, Document*>. Should be straightforward to keep both up to date.

If we do keep this data structure as is, the new code to iterate it is unnecessarily inefficient in another way. When we iterate a map we can get at both the key and value. Getting just the key and then calling get to see the value causes a hash table lookup each time, greatly increasing execution time. Doing it more efficiently would look like this:

    for (auto& entry : m_textMarkerNodes) {
        if (entry.value == document)
            nodesToDelete.add(entry.key);
    }

> Source/WebCore/accessibility/AXObjectCache.h:227
>      // This is a weak reference cache for knowing if Nodes used by TextMarkers are valid.

I’m not sure I understand what a “weak reference cache” is exactly? There’s nothing here that makes these raw pointers act as safe weak references. It’s not safe to have pointers to nodes unless there is some kind of a lifetime guarantee. I guess that’s the whole point of this patch. Something bad like is happening because we did not plug the node lifetimes into this in an airtight way.

The current implementation is a little messy, but mostly seems sound, although it’s really hard to find all the relevant lines of code. The hole I see is that when a subframe loads a new document, nothing calls clearTextMarkerNodesInUse. That code is in Frame::disconnectOwnerElement, which is called when a frame is being destroyed, not when a frame is loading a new document. This will leave all the nodes from a document in the main frame’s AXObjectCache, and they can then be destroyed. I suspect that’s the real bug here.
Comment 4 Darin Adler 2014-12-24 12:15:56 PST
Comment on attachment 243715 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=243715&action=review

>> Source/WebCore/accessibility/AXObjectCache.h:227
>>      // This is a weak reference cache for knowing if Nodes used by TextMarkers are valid.
> 
> I’m not sure I understand what a “weak reference cache” is exactly? There’s nothing here that makes these raw pointers act as safe weak references. It’s not safe to have pointers to nodes unless there is some kind of a lifetime guarantee. I guess that’s the whole point of this patch. Something bad like is happening because we did not plug the node lifetimes into this in an airtight way.
> 
> The current implementation is a little messy, but mostly seems sound, although it’s really hard to find all the relevant lines of code. The hole I see is that when a subframe loads a new document, nothing calls clearTextMarkerNodesInUse. That code is in Frame::disconnectOwnerElement, which is called when a frame is being destroyed, not when a frame is loading a new document. This will leave all the nodes from a document in the main frame’s AXObjectCache, and they can then be destroyed. I suspect that’s the real bug here.

I would like to learn more about what “if nodes used by text markers are valid” means. There’s no way that a set can tell you if a Node* is valid. If the node has been destroyed, a new node might have been allocated with the same address. So if the true purpose here is “see if a Node* is still good or has already been deleted” then it’s based on a flawed premise. Doing that is simply not possible; we need some other solution.
Comment 5 chris fleizach 2014-12-24 12:59:25 PST
(In reply to comment #4)
> Comment on attachment 243715 [details]
> patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=243715&action=review
> 
> >> Source/WebCore/accessibility/AXObjectCache.h:227
> >>      // This is a weak reference cache for knowing if Nodes used by TextMarkers are valid.
> > 
> > I’m not sure I understand what a “weak reference cache” is exactly? There’s nothing here that makes these raw pointers act as safe weak references. It’s not safe to have pointers to nodes unless there is some kind of a lifetime guarantee. I guess that’s the whole point of this patch. Something bad like is happening because we did not plug the node lifetimes into this in an airtight way.
> > 
> > The current implementation is a little messy, but mostly seems sound, although it’s really hard to find all the relevant lines of code. The hole I see is that when a subframe loads a new document, nothing calls clearTextMarkerNodesInUse. That code is in Frame::disconnectOwnerElement, which is called when a frame is being destroyed, not when a frame is loading a new document. This will leave all the nodes from a document in the main frame’s AXObjectCache, and they can then be destroyed. I suspect that’s the real bug here.
> 
> I would like to learn more about what “if nodes used by text markers are
> valid” means. There’s no way that a set can tell you if a Node* is valid. If
> the node has been destroyed, a new node might have been allocated with the
> same address. So if the true purpose here is “see if a Node* is still good
> or has already been deleted” then it’s based on a flawed premise. Doing that
> is simply not possible; we need some other solution.

Background: The accessibility code exposes something called AXTextMarkers which encode a position and a Node pointer.

Those get passed around to clients like VoiceOver. A problem in the past was that an old AXTextMarker would be used and the Node would be de-referenced directly, leading to a crash.

To fix that we introduced a cache to keep track of which nodes we were using and which were still active.

I want to look into your suggestion as to where certain sub-frame Nodes are not being removed from this cache. Hopefully that can solve the real problem, which I also would much prefer too. This current fix is only trying to paper over the problem.
Comment 6 chris fleizach 2014-12-24 12:59:59 PST
(In reply to comment #4)
> Comment on attachment 243715 [details]
> patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=243715&action=review
> 
> >> Source/WebCore/accessibility/AXObjectCache.h:227
> >>      // This is a weak reference cache for knowing if Nodes used by TextMarkers are valid.
> > 
> > I’m not sure I understand what a “weak reference cache” is exactly? There’s nothing here that makes these raw pointers act as safe weak references. It’s not safe to have pointers to nodes unless there is some kind of a lifetime guarantee. I guess that’s the whole point of this patch. Something bad like is happening because we did not plug the node lifetimes into this in an airtight way.
> > 
> > The current implementation is a little messy, but mostly seems sound, although it’s really hard to find all the relevant lines of code. The hole I see is that when a subframe loads a new document, nothing calls clearTextMarkerNodesInUse. That code is in Frame::disconnectOwnerElement, which is called when a frame is being destroyed, not when a frame is loading a new document. This will leave all the nodes from a document in the main frame’s AXObjectCache, and they can then be destroyed. I suspect that’s the real bug here.
> 
> I would like to learn more about what “if nodes used by text markers are
> valid” means. There’s no way that a set can tell you if a Node* is valid. If
> the node has been destroyed, a new node might have been allocated with the
> same address. So if the true purpose here is “see if a Node* is still good
> or has already been deleted” then it’s based on a flawed premise. Doing that
> is simply not possible; we need some other solution.

Background: The accessibility code exposes something called AXTextMarkers which encode a position and a Node pointer.

Those get passed around to clients like VoiceOver. A problem in the past was that an old AXTextMarker would be used and the Node would be de-referenced directly, leading to a crash.

To fix that we introduced a cache to keep track of which nodes we were using and which were still active.

I want to look into your suggestion as to where certain sub-frame Nodes are not being removed from this cache. Hopefully that can solve the real problem, which I also would much prefer too. This current fix is only trying to paper over the problem.
Comment 7 chris fleizach 2015-01-04 00:22:48 PST
Created attachment 243923 [details]
patch

Based on Darin's comments, I took a new look at it and found definitively that Nodes would be left in the text marker cache when a frame was replaced. 
Putting in changes in setView seems to cover that case
Comment 8 Build Bot 2015-01-04 01:08:26 PST
Comment on attachment 243923 [details]
patch

Attachment 243923 [details] did not pass mac-ews (mac):
Output: http://webkit-queues.appspot.com/results/4651466987929600

New failing tests:
accessibility/frame-disconnect-textmarker-cache-crash.html
Comment 9 Build Bot 2015-01-04 01:08:34 PST
Created attachment 243925 [details]
Archive of layout-test-results from ews101 for mac-mountainlion

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews101  Port: mac-mountainlion  Platform: Mac OS X 10.8.5
Comment 10 Build Bot 2015-01-04 01:12:39 PST
Comment on attachment 243923 [details]
patch

Attachment 243923 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.appspot.com/results/5316678636273664

New failing tests:
accessibility/frame-disconnect-textmarker-cache-crash.html
Comment 11 Build Bot 2015-01-04 01:12:53 PST
Created attachment 243926 [details]
Archive of layout-test-results from ews106 for mac-mountainlion-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: ews106  Port: mac-mountainlion-wk2  Platform: Mac OS X 10.8.5
Comment 12 Darin Adler 2015-01-04 13:48:30 PST
Comment on attachment 243923 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=243923&action=review

I am saying review+ but it looks like the new test is failing on the EWS machine. Please don’t land the patch until we diagnose and solve that problem.

> Source/WebCore/dom/Document.cpp:2084
> +    if (AXObjectCache* cache = existingAXObjectCache())
> +        cache->clearTextMarkerNodesInUse(this);

Is the call to this function in Frame::disconnectOwnerElement still needed, or can that be removed now that we have added this call?

> Source/WebCore/page/FrameView.cpp:300
> +        if (HTMLFrameOwnerElement* owner = frame().ownerElement())
> +            cache->childrenChanged(owner->renderer());

Is this correct even if owner->renderer() is null? Is owner->renderer() guaranteed not to be null?

> Source/WebCore/page/FrameView.cpp:302
>          cache->remove(this);

Maybe it would be slightly more efficient to do these two things in the reverse order?
Comment 13 Darin Adler 2015-01-04 13:50:23 PST
Comment on attachment 243923 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=243923&action=review

>> Source/WebCore/dom/Document.cpp:2084
>> +        cache->clearTextMarkerNodesInUse(this);
> 
> Is the call to this function in Frame::disconnectOwnerElement still needed, or can that be removed now that we have added this call?

It seems wasteful to do this for for the top level document. Maybe this should only be done if this != &topDocument()?
Comment 14 Darin Adler 2015-01-04 13:51:37 PST
We should consider testing in frames that are children of frames that are replaced; so not nodes in the frame that’s replaced itself, but nodes in one of its subframes. The code path there would be slightly different.
Comment 15 chris fleizach 2015-01-04 18:40:45 PST
(In reply to comment #12)
> Comment on attachment 243923 [details]
> patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=243923&action=review
> 
> I am saying review+ but it looks like the new test is failing on the EWS
> machine. Please don’t land the patch until we diagnose and solve that
> problem.
> 

Yep, it looks like my arbitrary wait for a page to load isn't working. I'll need a better way of checking the frame load

> > Source/WebCore/dom/Document.cpp:2084
> > +    if (AXObjectCache* cache = existingAXObjectCache())
> > +        cache->clearTextMarkerNodesInUse(this);
> 
> Is the call to this function in Frame::disconnectOwnerElement still needed,
> or can that be removed now that we have added this call?
> 

I will double check. 

> > Source/WebCore/page/FrameView.cpp:300
> > +        if (HTMLFrameOwnerElement* owner = frame().ownerElement())
> > +            cache->childrenChanged(owner->renderer());
> 
> Is this correct even if owner->renderer() is null? Is owner->renderer()
> guaranteed not to be null?

If owner->renderer() is nil, this operation will not do anything. I verified that the methods inside can handle null correctly. Do you think it's better form to check anyway in case those other methods change one day?

> 
> > Source/WebCore/page/FrameView.cpp:302
> >          cache->remove(this);
> 
> Maybe it would be slightly more efficient to do these two things in the
> reverse order?

I think it might be possible that if we do the remove first, the next childrenChanged call will fail because it will no longer think its a valid element
Comment 16 chris fleizach 2015-01-04 18:41:06 PST
(In reply to comment #13)
> Comment on attachment 243923 [details]
> patch
> 
> View in context:
> https://bugs.webkit.org/attachment.cgi?id=243923&action=review
> 
> >> Source/WebCore/dom/Document.cpp:2084
> >> +        cache->clearTextMarkerNodesInUse(this);
> > 
> > Is the call to this function in Frame::disconnectOwnerElement still needed, or can that be removed now that we have added this call?
> 
> It seems wasteful to do this for for the top level document. Maybe this
> should only be done if this != &topDocument()?

Yes this seems like a good idea
Comment 17 chris fleizach 2015-01-04 18:41:33 PST
(In reply to comment #14)
> We should consider testing in frames that are children of frames that are
> replaced; so not nodes in the frame that’s replaced itself, but nodes in one
> of its subframes. The code path there would be slightly different.

Sounds good. I'll add that into the tests
Comment 18 Darin Adler 2015-01-04 21:03:40 PST
Comment on attachment 243923 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=243923&action=review

>>> Source/WebCore/page/FrameView.cpp:300
>>> +            cache->childrenChanged(owner->renderer());
>> 
>> Is this correct even if owner->renderer() is null? Is owner->renderer() guaranteed not to be null?
> 
> If owner->renderer() is nil, this operation will not do anything. I verified that the methods inside can handle null correctly. Do you think it's better form to check anyway in case those other methods change one day?

No need to check; glad you looked. The renderer being null is not a case that needs to be optimized for performance, I don’t think, so I don’t think you should add the additional check.
Comment 19 chris fleizach 2015-01-05 18:06:20 PST
(In reply to comment #15)
> 
> > > Source/WebCore/dom/Document.cpp:2084
> > > +    if (AXObjectCache* cache = existingAXObjectCache())
> > > +        cache->clearTextMarkerNodesInUse(this);
> > 
> > Is the call to this function in Frame::disconnectOwnerElement still needed,
> > or can that be removed now that we have added this call?
> > 
> 
> I will double check. 
> 

Follow up on this. 

It appears that disconnectOwnerFrame is redundant. I see prepareForDestruction being called in all cases that disconnectOwnerFrame is, as well as many others (during general web browsing)
Comment 20 chris fleizach 2015-01-06 18:41:06 PST
Created attachment 244117 [details]
final patch for running through EWS
Comment 21 Build Bot 2015-01-06 19:21:19 PST
Comment on attachment 244117 [details]
final patch for running through EWS

Attachment 244117 [details] did not pass mac-ews (mac):
Output: http://webkit-queues.appspot.com/results/6607957332590592

New failing tests:
accessibility/frame-disconnect-textmarker-cache-crash.html
Comment 22 Build Bot 2015-01-06 19:21:25 PST
Created attachment 244124 [details]
Archive of layout-test-results from ews100 for mac-mountainlion

The attached test failures were seen while running run-webkit-tests on the mac-ews.
Bot: ews100  Port: mac-mountainlion  Platform: Mac OS X 10.8.5
Comment 23 Build Bot 2015-01-06 19:34:47 PST
Comment on attachment 244117 [details]
final patch for running through EWS

Attachment 244117 [details] did not pass mac-wk2-ews (mac-wk2):
Output: http://webkit-queues.appspot.com/results/6334299196358656

New failing tests:
accessibility/frame-disconnect-textmarker-cache-crash.html
Comment 24 Build Bot 2015-01-06 19:34:51 PST
Created attachment 244126 [details]
Archive of layout-test-results from ews106 for mac-mountainlion-wk2

The attached test failures were seen while running run-webkit-tests on the mac-wk2-ews.
Bot: ews106  Port: mac-mountainlion-wk2  Platform: Mac OS X 10.8.5
Comment 25 chris fleizach 2015-01-06 22:59:05 PST
Created attachment 244141 [details]
final patch with all the files
Comment 26 chris fleizach 2015-01-07 08:56:15 PST
http://trac.webkit.org/changeset/178038