Bug 263974

Summary: REGRESSION(269895@main):[WPE] 52 new API tests failures
Product: WebKit Reporter: Carlos Alberto Lopez Perez <clopez>
Component: WPE WebKitAssignee: Chris Dumez <cdumez>
Status: RESOLVED FIXED    
Severity: Normal CC: bugs-noreply, cdumez, csaavedra, mcatanzaro, vitaly
Priority: P2    
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   

Description Carlos Alberto Lopez Perez 2023-10-31 08:52:14 PDT
Some commit in the last days caused a regression on the WPE API tests

- First good run: 269539@main
https://build.webkit.org/#/builders/40/builds/12470
API tests: Ran 1934 tests of 1934 with 1934 successful


- First bad run: 269937@main
https://build.webkit.org/#/builders/40/builds/12596
API tests: Ran 1936 tests of 1936 with 1884 successful
# 1936−1884 => 52 new failures


There was an issue with the WPE Release test bot in the
last days, that is why the interval is so big because
the bot was broken.

In order to reduce the interval and help bisecting this
check also the results on the (still in WIP) WPE EWS API tests bots:
https://ews-build.webkit.org/#/builders/41

BTW, There are a few more unexpected failures on the EWS
API tests bots that on the release test bot (like WPEQt tests)
it would be great to garden those as well.
Comment 1 Carlos Alberto Lopez Perez 2023-10-31 08:59:43 PDT
Seems GTK API tests are also failing,
This can help to bisect it: https://build.webkit.org/#/builders/57?numbuilds=300
Comment 2 Claudio Saavedra 2023-10-31 09:31:31 PDT
This was introduced with https://github.com/WebKit/WebKit/pull/19674
Comment 3 Chris Dumez 2023-10-31 09:38:32 PDT
Are these crashes? If so, could I get a stack trace?


We didn't see any issues on Cocoa bots so I suspect this is glib-specific somehow.
Comment 4 Chris Dumez 2023-10-31 09:41:31 PDT
(In reply to Chris Dumez from comment #3)
> Are these crashes? If so, could I get a stack trace?
> 
> 
> We didn't see any issues on Cocoa bots so I suspect this is glib-specific
> somehow.

Looks like they are crashes but I couldn't get a stack trace. My bet is that we're ref'ing an object that has started destruction in one of the code path used by this port. It would likely be trivial to fix if I had a stack trace.
Comment 5 Claudio Saavedra 2023-10-31 10:02:28 PDT
Vitaly already has a patch, but he might be off for the day, so I suspect we will have a MR from him tomorrow.
Comment 6 Claudio Saavedra 2023-11-01 05:50:27 PDT
OK, I think he might not be able to upload it today, so here's a stacktrace:

#0  0x00007f5183910b81 in WebKit::WebProcessProxy::protectedProcessPool() const () at /app/webkit/WebKitBuild/GTK/Release/lib/libwebkit2gtk-4.1.so.0
#1  0x00007f5183929a51 in WebKit::WebProcessProxy::~WebProcessProxy() () at /app/webkit/WebKitBuild/GTK/Release/lib/libwebkit2gtk-4.1.so.0
#2  0x00007f518392a509 in WebKit::WebProcessProxy::~WebProcessProxy() () at /app/webkit/WebKitBuild/GTK/Release/lib/libwebkit2gtk-4.1.so.0
#3  0x00007f518385e7e6 in WTF::ThreadSafeRefCounted<WebKit::AuxiliaryProcessProxy, (WTF::DestructionThread)2>::deref() const () at /app/webkit/WebKitBuild/GTK/Release/lib/libwebkit2gtk-4.1.so.0
#4  0x00007f518383910f in WTF::Detail::CallableWrapper<WebKit::AuxiliaryProcessProxy::checkForResponsiveness(WTF::CompletionHandler<void ()>&&, WebKit::AuxiliaryProcessProxy::UseLazyStop)::{lambda()#1}::operator()()::{lambda()#1}, void>::call() () at /app/webkit/WebKitBuild/GTK/Release/lib/libwebkit2gtk-4.1.so.0
#5  0x00007f5180cb0f8a in WTF::RunLoop::performWork() () at /app/webkit/WebKitBuild/GTK/Release/lib/libjavascriptcoregtk-4.1.so.0
#6  0x00007f5180d295c9 in WTF::RunLoop::RunLoop()::{lambda(void*)#1}::_FUN(void*) () at /app/webkit/WebKitBuild/GTK/Release/lib/libjavascriptcoregtk-4.1.so.0
#7  0x00007f5180d2a04f in WTF::RunLoop::{lambda(_GSource*, int (*)(void*), void*)#1}::_FUN(_GSource*, int (*)(void*), void*) () at /app/webkit/WebKitBuild/GTK/Release/lib/libjavascriptcoregtk-4.1.so.0
#8  0x00007f5187153527 in g_main_dispatch (context=0x5634df752db0) at ../glib/gmain.c:3460
#9  g_main_context_dispatch (context=0x5634df752db0) at ../glib/gmain.c:4200
#10 0x00007f51871b0888 in g_main_context_iterate.constprop.0 (context=0x5634df752db0, block=<optimized out>, dispatch=1, self=<optimized out>) at ../glib/gmain.c:4276
#11 0x00007f5187152d7f in g_main_loop_run (loop=0x5634dfa98150) at ../glib/gmain.c:4479
#12 0x00005634de46a61a in testWebViewAuthenticationNoCredential(AuthenticationTest*, void const*) ()
#13 0x00007f518718301e in test_case_run (tc=0x5634df790bf0) at ../glib/gtestutils.c:3114
#14 g_test_run_suite_internal (suite=suite@entry=0x5634df790cc0, path=0x0) at ../glib/gtestutils.c:3209
#15 0x00007f5187182d1b in g_test_run_suite_internal (suite=suite@entry=0x5634df790c60, path=0x0) at ../glib/gtestutils.c:3228
#16 0x00007f5187182d1b in g_test_run_suite_internal (suite=suite@entry=0x5634df790bb0, path=path@entry=0x0) at ../glib/gtestutils.c:3228
#17 0x00007f51871835b2 in g_test_run_suite (suite=0x5634df790bb0) at ../glib/gtestutils.c:3308
#18 0x00007f518717d756 in g_test_run () at ../glib/gtestutils.c:2415
#19 g_test_run () at ../glib/gtestutils.c:2402
#20 0x00005634de47144f in main ()
Comment 7 Michael Catanzaro 2023-11-01 06:33:08 PDT
Well that code looks innocent. There must be a refcounting error somewhere earlier.
Comment 8 Chris Dumez 2023-11-01 07:46:03 PDT
Pull request: https://github.com/WebKit/WebKit/pull/19834
Comment 9 EWS 2023-11-01 07:59:30 PDT
Committed 270054@main (dd5384202bfa): <https://commits.webkit.org/270054@main>

Reviewed commits have been landed. Closing PR #19834 and removing active labels.
Comment 10 Claudio Saavedra 2023-11-01 08:00:10 PDT
Now with symbols:

#0  WTFCrash() () at /app/webkit/Source/WTF/wtf/Assertions.cpp:333
#1  0x00007fb00d7eb15d in WTFCrashWithInfo(int, char const*, char const*, int) () at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/Assertions.h:778
#2  0x00007fb00e95bd08 in WebKit::WebProcessProxy::processPool() const (this=0x7faf56000c00) at /app/webkit/Source/WebKit/UIProcess/WebProcessProxy.cpp:2107
#3  0x00007fb00e95bd38 in WebKit::WebProcessProxy::protectedProcessPool() const (this=0x7faf56000c00) at /app/webkit/Source/WebKit/UIProcess/WebProcessProxy.cpp:2113
#4  0x00007fb00e95058a in WebKit::WebProcessProxy::~WebProcessProxy() (this=0x7faf56000c00, __in_chrg=<optimized out>) at /app/webkit/Source/WebKit/UIProcess/WebProcessProxy.cpp:340
#5  0x00007fb00e950a28 in WebKit::WebProcessProxy::~WebProcessProxy() (this=0x7faf56000c00, __in_chrg=<optimized out>) at /app/webkit/Source/WebKit/UIProcess/WebProcessProxy.cpp:358
#6  0x00007fb00dbdf82c in WTF::ThreadSafeRefCounted<WebKit::AuxiliaryProcessProxy, (WTF::DestructionThread)2>::deref() const::{lambda()#1}::operator()() const (__closure=0x7fafee002408)
    at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/ThreadSafeRefCounted.h:115
#7  0x00007fb00dbf0888 in WTF::Detail::CallableWrapper<WTF::ThreadSafeRefCounted<WebKit::AuxiliaryProcessProxy, (WTF::DestructionThread)2>::deref() const::{lambda()#1}, void>::call() (this=0x7fafee002400)
    at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/Function.h:53
#8  0x00007fb001d72845 in WTF::Function<void ()>::operator()() const (this=0x7ffed4c50010) at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/Function.h:82
#9  0x00007fb0033b2fae in WTF::ensureOnMainRunLoop(WTF::Function<void ()>&&) (function=...) at /app/webkit/Source/WTF/wtf/MainThread.cpp:75
#10 0x00007fb00dbdf890 in WTF::ThreadSafeRefCounted<WebKit::AuxiliaryProcessProxy, (WTF::DestructionThread)2>::deref() const (this=0x7faf56000c0c)
    at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/ThreadSafeRefCounted.h:124
#11 0x00007fb00dbdf660 in WebKit::AuxiliaryProcessProxy::deref() (this=0x7faf56000c00) at /app/webkit/Source/WebKit/UIProcess/AuxiliaryProcessProxy.h:174
#12 0x00007fb00e7bfa42 in WTF::DefaultRefDerefTraits<WebKit::WebProcessProxy>::derefIfNotNull(WebKit::WebProcessProxy*) (ptr=0x7faf56000c00) at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/RefPtr.h:43
#13 0x00007fb00e7b66ee in WTF::RefPtr<WebKit::WebProcessProxy, WTF::RawPtrTraits<WebKit::WebProcessProxy>, WTF::DefaultRefDerefTraits<WebKit::WebProcessProxy> >::~RefPtr()
    (this=0x7ffed4c500b8, __in_chrg=<optimized out>) at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/RefPtr.h:75
#14 0x00007fb00e95a151 in operator()() (__closure=0x7fafee134648) at /app/webkit/Source/WebKit/UIProcess/WebProcessProxy.cpp:1881
#15 0x00007fb00e9787ac in WTF::Detail::CallableWrapper<WebKit::WebProcessProxy::isResponsive(WTF::CompletionHandler<void(bool)>&&)::<lambda()>, void>::call(void) (this=0x7fafee134640)
    at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/Function.h:53
#16 0x00007fb00e1162e1 in WTF::Function<void ()>::operator()() const (this=0x7ffed4c50190) at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/Function.h:82
#17 0x00007fb00e11597a in WTF::CompletionHandler<void ()>::operator()() (this=0x7fafee00d4e8) at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/CompletionHandler.h:75
#18 0x00007fb00e71018a in operator()() (__closure=0x7fafee00d4d8) at /app/webkit/Source/WebKit/UIProcess/AuxiliaryProcessProxy.cpp:479
#19 0x00007fb00e7215fa in WTF::Detail::CallableWrapper<WebKit::AuxiliaryProcessProxy::checkForResponsiveness(WTF::CompletionHandler<void()>&&, UseLazyStop)::<lambda()> mutable::<lambda()>, void>::call(void)
    (this=0x7fafee00d4d0) at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/Function.h:53
#20 0x00007fb001d72845 in WTF::Function<void ()>::operator()() const (this=0x7ffed4c50270) at /app/webkit/WebKitBuild/GTK/Debug/WTF/Headers/wtf/Function.h:82
#21 0x00007fb0033decf7 in WTF::RunLoop::performWork() (this=0x7fafee0240c0) at /app/webkit/Source/WTF/wtf/RunLoop.cpp:147
#22 0x00007fb003493300 in operator()(gpointer) const (__closure=0x0, userData=0x7fafee0240c0) at /app/webkit/Source/WTF/wtf/glib/RunLoopGLib.cpp:80
#23 0x00007fb003493324 in _FUN(gpointer) () at /app/webkit/Source/WTF/wtf/glib/RunLoopGLib.cpp:82
#24 0x00007fb003493293 in operator()(GSource*, GSourceFunc, gpointer) const (__closure=0x0, source=0x56546e248520, callback=0x7fb003493307 <_FUN(gpointer)>, userData=0x7fafee0240c0)
    at /app/webkit/Source/WTF/wtf/glib/RunLoopGLib.cpp:53
#25 0x00007fb0034932e1 in _FUN(GSource*, GSourceFunc, gpointer) () at /app/webkit/Source/WTF/wtf/glib/RunLoopGLib.cpp:56
#26 0x00007fb0159c3527 in g_main_dispatch (context=0x56546e209fb0) at ../glib/gmain.c:3460
#27 g_main_context_dispatch (context=0x56546e209fb0) at ../glib/gmain.c:4200
#28 0x00007fb015a20888 in g_main_context_iterate.constprop.0 (context=0x56546e209fb0, block=<optimized out>, dispatch=1, self=<optimized out>) at ../glib/gmain.c:4276
#29 0x00007fb0159c2d7f in g_main_loop_run (loop=0x56546e336190) at ../glib/gmain.c:4479
#30 0x000056546d46e4be in AuthenticationTest::waitForAuthenticationRequest() (this=0x56546e247fb0) at /app/webkit/Tools/TestWebKitAPI/Tests/WebKitGLib/TestAuthentication.cpp:101
#31 0x000056546d46a092 in testWebViewAuthenticationSuccess(AuthenticationTest*, gconstpointer) (test=0x56546e247fb0) at /app/webkit/Tools/TestWebKitAPI/Tests/WebKitGLib/TestAuthentication.cpp:301
#32 0x00007fb0159f301e in test_case_run (tc=0x56546e247980) at ../glib/gtestutils.c:3114
#33 g_test_run_suite_internal (suite=suite@entry=0x56546e2476a0, path=0x0) at ../glib/gtestutils.c:3209
#34 0x00007fb0159f2d1b in g_test_run_suite_internal (suite=suite@entry=0x56546e247640, path=0x0) at ../glib/gtestutils.c:3228
#35 0x00007fb0159f2d1b in g_test_run_suite_internal (suite=suite@entry=0x56546e247590, path=path@entry=0x0) at ../glib/gtestutils.c:3228
#36 0x00007fb0159f35b2 in g_test_run_suite (suite=0x56546e247590) at ../glib/gtestutils.c:3308
#37 0x00007fb0159ed756 in g_test_run () at ../glib/gtestutils.c:2415
#38 g_test_run () at ../glib/gtestutils.c:2402
#39 0x000056546d476773 in main(int, char**) (argc=1, argv=0x7ffed4c50ab8) at /app/webkit/Tools/TestWebKitAPI/glib/WebKitGLib/TestMain.cpp:146
Comment 11 Chris Dumez 2023-11-01 08:06:01 PDT
Please reopen if my speculative fix didn't do the trick. I reverted my changes to the WebProcessProxy destructor since that's where it was crashing. I don't have an explanation why my changes to the destructor would have caused this though :/
Comment 12 Claudio Saavedra 2023-11-01 08:09:46 PDT
My stacktrace is from before your revert, I just pasted it here in case it helps to fix the issue.

It seems like the specific change that runs into the issue is this:

 #if HAVE(DISPLAY_LINK)
-    processPool().displayLinks().stopDisplayLinks(m_displayLinkClient);
+    protectedProcessPool()->displayLinks().stopDisplayLinks(m_displayLinkClient);
 #endif
Comment 13 Chris Dumez 2023-11-01 08:11:50 PDT
(In reply to Claudio Saavedra from comment #12)
> My stacktrace is from before your revert, I just pasted it here in case it
> helps to fix the issue.
> 
> It seems like the specific change that runs into the issue is this:
> 
>  #if HAVE(DISPLAY_LINK)
> -    processPool().displayLinks().stopDisplayLinks(m_displayLinkClient);
> +   
> protectedProcessPool()->displayLinks().stopDisplayLinks(m_displayLinkClient);
>  #endif

Yes, that is what I reverted.

The crash would make sense if we saw the WebProcessPool destructor in the stack trace (trying to ref an object that started destruction). However, the WebProcessPool destructor doesn't show in the trace. It may be worth investigating even if my partial revert fixes it because the code may already be incorrect in some way (just lucky enough not to usually crash).
Comment 14 Michael Catanzaro 2023-11-01 08:32:17 PDT
That partial revert in 270054@main is not going to fix anything.

(In reply to Chris Dumez from comment #13)
> Yes, that is what I reverted.
> 
> The crash would make sense if we saw the WebProcessPool destructor in the
> stack trace (trying to ref an object that started destruction). However, the
> WebProcessPool destructor doesn't show in the trace. It may be worth
> investigating even if my partial revert fixes it because the code may
> already be incorrect in some way (just lucky enough not to usually crash).

I think you're looking at the first backtrace that Claudio posted, which is lower-quality. The second backtrace shows the problem more clearly. It hits the assertion here:

WebProcessPool& WebProcessProxy::processPool() const
{
    ASSERT(m_processPool);
    return *m_processPool.get();
}

So m_processPool is already nullptr. It's a WeakOrStrongPtr<WebProcessPool>, which I've never seen that before, but clearly it's expected that it may be a weak pointer and therefore it may become nullptr at any time. Looks like we should not call WebProcessProxy::processPool or WebProcessProxy::protectedProcessPool or access m_processPool without first checking to ensure it is non-null. There are missing null checks throughout the entire file.

This is actually good though, because it's not a refcounting error or a dangling pointer, so we don't need to scratch our heads wondering where the bug is.
Comment 15 Chris Dumez 2023-11-01 08:37:00 PDT
(In reply to Michael Catanzaro from comment #14)
> That partial revert in 270054@main is not going to fix anything.
> 
> (In reply to Chris Dumez from comment #13)
> > Yes, that is what I reverted.
> > 
> > The crash would make sense if we saw the WebProcessPool destructor in the
> > stack trace (trying to ref an object that started destruction). However, the
> > WebProcessPool destructor doesn't show in the trace. It may be worth
> > investigating even if my partial revert fixes it because the code may
> > already be incorrect in some way (just lucky enough not to usually crash).
> 
> I think you're looking at the first backtrace that Claudio posted, which is
> lower-quality. The second backtrace shows the problem more clearly. It hits
> the assertion here:
> 
> WebProcessPool& WebProcessProxy::processPool() const
> {
>     ASSERT(m_processPool);
>     return *m_processPool.get();
> }
> 
> So m_processPool is already nullptr. It's a WeakOrStrongPtr<WebProcessPool>,
> which I've never seen that before, but clearly it's expected that it may be
> a weak pointer and therefore it may become nullptr at any time. Looks like
> we should not call WebProcessProxy::processPool or
> WebProcessProxy::protectedProcessPool or access m_processPool without first
> checking to ensure it is non-null. There are missing null checks throughout
> the entire file.
> 
> This is actually good though, because it's not a refcounting error or a
> dangling pointer, so we don't need to scratch our heads wondering where the
> bug is.

How do you explain the regression from my patch though? I didn't remove any null checks AFAICT.
Comment 16 Chris Dumez 2023-11-01 08:37:26 PDT
(In reply to Chris Dumez from comment #15)
> (In reply to Michael Catanzaro from comment #14)
> > That partial revert in 270054@main is not going to fix anything.
> > 
> > (In reply to Chris Dumez from comment #13)
> > > Yes, that is what I reverted.
> > > 
> > > The crash would make sense if we saw the WebProcessPool destructor in the
> > > stack trace (trying to ref an object that started destruction). However, the
> > > WebProcessPool destructor doesn't show in the trace. It may be worth
> > > investigating even if my partial revert fixes it because the code may
> > > already be incorrect in some way (just lucky enough not to usually crash).
> > 
> > I think you're looking at the first backtrace that Claudio posted, which is
> > lower-quality. The second backtrace shows the problem more clearly. It hits
> > the assertion here:
> > 
> > WebProcessPool& WebProcessProxy::processPool() const
> > {
> >     ASSERT(m_processPool);
> >     return *m_processPool.get();
> > }
> > 
> > So m_processPool is already nullptr. It's a WeakOrStrongPtr<WebProcessPool>,
> > which I've never seen that before, but clearly it's expected that it may be
> > a weak pointer and therefore it may become nullptr at any time. Looks like
> > we should not call WebProcessProxy::processPool or
> > WebProcessProxy::protectedProcessPool or access m_processPool without first
> > checking to ensure it is non-null. There are missing null checks throughout
> > the entire file.
> > 
> > This is actually good though, because it's not a refcounting error or a
> > dangling pointer, so we don't need to scratch our heads wondering where the
> > bug is.
> 
> How do you explain the regression from my patch though? I didn't remove any
> null checks AFAICT.

Please, let's not speculatively add null checks without understanding the regression.
Comment 17 Chris Dumez 2023-11-01 08:53:04 PDT
(In reply to Chris Dumez from comment #16)
> (In reply to Chris Dumez from comment #15)
> > (In reply to Michael Catanzaro from comment #14)
> > > That partial revert in 270054@main is not going to fix anything.
> > > 
> > > (In reply to Chris Dumez from comment #13)
> > > > Yes, that is what I reverted.
> > > > 
> > > > The crash would make sense if we saw the WebProcessPool destructor in the
> > > > stack trace (trying to ref an object that started destruction). However, the
> > > > WebProcessPool destructor doesn't show in the trace. It may be worth
> > > > investigating even if my partial revert fixes it because the code may
> > > > already be incorrect in some way (just lucky enough not to usually crash).
> > > 
> > > I think you're looking at the first backtrace that Claudio posted, which is
> > > lower-quality. The second backtrace shows the problem more clearly. It hits
> > > the assertion here:
> > > 
> > > WebProcessPool& WebProcessProxy::processPool() const
> > > {
> > >     ASSERT(m_processPool);
> > >     return *m_processPool.get();
> > > }
> > > 
> > > So m_processPool is already nullptr. It's a WeakOrStrongPtr<WebProcessPool>,
> > > which I've never seen that before, but clearly it's expected that it may be
> > > a weak pointer and therefore it may become nullptr at any time. Looks like
> > > we should not call WebProcessProxy::processPool or
> > > WebProcessProxy::protectedProcessPool or access m_processPool without first
> > > checking to ensure it is non-null. There are missing null checks throughout
> > > the entire file.
> > > 
> > > This is actually good though, because it's not a refcounting error or a
> > > dangling pointer, so we don't need to scratch our heads wondering where the
> > > bug is.
> > 
> > How do you explain the regression from my patch though? I didn't remove any
> > null checks AFAICT.
> 
> Please, let's not speculatively add null checks without understanding the
> regression.

Ok, I believe I have an explanation. I added a couple of `protectedThis` inside checkForResponsiveness() lambdas. This means that `this` (the WebProcessProxy) now outlives the calls to the isResponsive handlers. Also note that we're not protecting the WebProcessPool, so if this is a pre-warmed / cached process, it is possible for the WebProcessProxy to outlive its WebProcessPool, thus hitting the null Deref in the destructor.

We could fix it by tweaking those checkForResponsiveness() lambda to make sure we don't keep `this` alive past calling the isReponsive handlers.

I'm happy to make this change unless someone else wants to look into this. Let me know.
Comment 18 Chris Dumez 2023-11-01 09:12:56 PDT
Pull request: https://github.com/WebKit/WebKit/pull/19837
Comment 19 Michael Catanzaro 2023-11-01 09:15:03 PDT
(In reply to Chris Dumez from comment #17)
> Ok, I believe I have an explanation. I added a couple of `protectedThis`
> inside checkForResponsiveness() lambdas. This means that `this` (the
> WebProcessProxy) now outlives the calls to the isResponsive handlers. Also
> note that we're not protecting the WebProcessPool, so if this is a
> pre-warmed / cached process, it is possible for the WebProcessProxy to
> outlive its WebProcessPool, thus hitting the null Deref in the destructor.
> 
> We could fix it by tweaking those checkForResponsiveness() lambda to make
> sure we don't keep `this` alive past calling the isReponsive handlers.
> 
> I'm happy to make this change unless someone else wants to look into this.
> Let me know.

Interesting!

That said, it won't be sufficient to fix this code, because we still need to make changes in ~WebProcessProxy to expect the WebProcessPool to be already destroyed anyway, because if it were expected for m_processPool to be unconditionally still alive in the destructor, then there would have never been any need to use the weak pointer in the first place. i.e. if you want to guarantee that WebProcessProxy always outlives its WebProcessPool, then it's wrong for m_processPool to use a WeakOrStrongPtr.

I would try:

#if HAVE(DISPLAY_LINK)
    if (m_webProcessPool)
        protectedProcessPool()->displayLinks().stopDisplayLinks(m_displayLinkClient);
#endif

I also noticed we have WebProcessProxy::processPoolIfExists, but that will log an error, which is not desired here.

> Please, let's not speculatively add null checks without understanding the regression.

Adding null checks everywhere should be correct because this is how weak pointers are intended to be used, right? The WebProcessProxy corresponding to a cached or prewarmed web process does not own its WebProcessPool and *expects* the WebProcessPool to disappear out from under it, causing m_processPool to become nullptr. If that expectation is not correct, then it should use a strong reference rather than a weak pointer, but a strong ref wouldn't be appropriate here because the goal is for the cached/prewarmed process to not keep WebProcessPool alive. If there are particular places where we expect m_processPool should still be alive and does not need to be checked, then we should add asserts to guarantee that rather than null checks (although in fairness, that's already done inside WebProcessProxy::processPool). Or we could check that it's a strong ref rather than a weak ref if that's expected in certain places. But in general, surely there should surely be some sort of check before assuming m_processPool is valid.
Comment 20 Chris Dumez 2023-11-01 09:15:37 PDT
(In reply to Michael Catanzaro from comment #19)
> (In reply to Chris Dumez from comment #17)
> > Ok, I believe I have an explanation. I added a couple of `protectedThis`
> > inside checkForResponsiveness() lambdas. This means that `this` (the
> > WebProcessProxy) now outlives the calls to the isResponsive handlers. Also
> > note that we're not protecting the WebProcessPool, so if this is a
> > pre-warmed / cached process, it is possible for the WebProcessProxy to
> > outlive its WebProcessPool, thus hitting the null Deref in the destructor.
> > 
> > We could fix it by tweaking those checkForResponsiveness() lambda to make
> > sure we don't keep `this` alive past calling the isReponsive handlers.
> > 
> > I'm happy to make this change unless someone else wants to look into this.
> > Let me know.
> 
> Interesting!
> 
> That said, it won't be sufficient to fix this code, because we still need to
> make changes in ~WebProcessProxy to expect the WebProcessPool to be already
> destroyed anyway, because if it were expected for m_processPool to be
> unconditionally still alive in the destructor, then there would have never
> been any need to use the weak pointer in the first place. i.e. if you want
> to guarantee that WebProcessProxy always outlives its WebProcessPool, then
> it's wrong for m_processPool to use a WeakOrStrongPtr.
> 
> I would try:
> 
> #if HAVE(DISPLAY_LINK)
>     if (m_webProcessPool)
>        
> protectedProcessPool()->displayLinks().stopDisplayLinks(m_displayLinkClient);
> #endif
> 
> I also noticed we have WebProcessProxy::processPoolIfExists, but that will
> log an error, which is not desired here.
> 
> > Please, let's not speculatively add null checks without understanding the regression.
> 
> Adding null checks everywhere should be correct because this is how weak
> pointers are intended to be used, right? The WebProcessProxy corresponding
> to a cached or prewarmed web process does not own its WebProcessPool and
> *expects* the WebProcessPool to disappear out from under it, causing
> m_processPool to become nullptr. If that expectation is not correct, then it
> should use a strong reference rather than a weak pointer, but a strong ref
> wouldn't be appropriate here because the goal is for the cached/prewarmed
> process to not keep WebProcessPool alive. If there are particular places
> where we expect m_processPool should still be alive and does not need to be
> checked, then we should add asserts to guarantee that rather than null
> checks (although in fairness, that's already done inside
> WebProcessProxy::processPool). Or we could check that it's a strong ref
> rather than a weak ref if that's expected in certain places. But in general,
> surely there should surely be some sort of check before assuming
> m_processPool is valid.

Please check my latest PR
Comment 21 Michael Catanzaro 2023-11-01 09:16:03 PDT
(In reply to Michael Catanzaro from comment #19)
> That said, it won't be sufficient to fix this code, because we still need to
> make changes in ~WebProcessProxy to expect the WebProcessPool to be already
> destroyed anyway

I see you've done this in your pull request.
Comment 22 Chris Dumez 2023-11-01 09:19:24 PDT
(In reply to Michael Catanzaro from comment #19)
> (In reply to Chris Dumez from comment #17)
> > Ok, I believe I have an explanation. I added a couple of `protectedThis`
> > inside checkForResponsiveness() lambdas. This means that `this` (the
> > WebProcessProxy) now outlives the calls to the isResponsive handlers. Also
> > note that we're not protecting the WebProcessPool, so if this is a
> > pre-warmed / cached process, it is possible for the WebProcessProxy to
> > outlive its WebProcessPool, thus hitting the null Deref in the destructor.
> > 
> > We could fix it by tweaking those checkForResponsiveness() lambda to make
> > sure we don't keep `this` alive past calling the isReponsive handlers.
> > 
> > I'm happy to make this change unless someone else wants to look into this.
> > Let me know.
> 
> Interesting!
> 
> That said, it won't be sufficient to fix this code, because we still need to
> make changes in ~WebProcessProxy to expect the WebProcessPool to be already
> destroyed anyway, because if it were expected for m_processPool to be
> unconditionally still alive in the destructor, then there would have never
> been any need to use the weak pointer in the first place. i.e. if you want
> to guarantee that WebProcessProxy always outlives its WebProcessPool, then
> it's wrong for m_processPool to use a WeakOrStrongPtr.
> 
> I would try:
> 
> #if HAVE(DISPLAY_LINK)
>     if (m_webProcessPool)
>        
> protectedProcessPool()->displayLinks().stopDisplayLinks(m_displayLinkClient);
> #endif
> 
> I also noticed we have WebProcessProxy::processPoolIfExists, but that will
> log an error, which is not desired here.
> 
> > Please, let's not speculatively add null checks without understanding the regression.
> 
> Adding null checks everywhere should be correct because this is how weak
> pointers are intended to be used, right? The WebProcessProxy corresponding
> to a cached or prewarmed web process does not own its WebProcessPool and
> *expects* the WebProcessPool to disappear out from under it, causing
> m_processPool to become nullptr. If that expectation is not correct, then it
> should use a strong reference rather than a weak pointer, but a strong ref
> wouldn't be appropriate here because the goal is for the cached/prewarmed
> process to not keep WebProcessPool alive. If there are particular places
> where we expect m_processPool should still be alive and does not need to be
> checked, then we should add asserts to guarantee that rather than null
> checks (although in fairness, that's already done inside
> WebProcessProxy::processPool). Or we could check that it's a strong ref
> rather than a weak ref if that's expected in certain places. But in general,
> surely there should surely be some sort of check before assuming
> m_processPool is valid.

The intention of the weak pointer was not for the WebProcessProxy to outlive the WebProcessPool. It was so that a WebProcessProxy would not necessarily keep the process pool alive (then the destruction of the WebProcessPool would destroy those processes). We had to do this when we introduced non-required WebProcesses (pre-warmed, cached ones) to avoid leaking the process pools.

I do think that the current design is too fragile and that we should give it some thought. However, I don't think we should refactor the code as part of an urgent regression fix.

As a result, I am uploaded a targeted fix which I believe will fix the regression.
We can follow-up to improve on the design.
Comment 23 Chris Dumez 2023-11-01 09:21:15 PDT
(In reply to Michael Catanzaro from comment #21)
> (In reply to Michael Catanzaro from comment #19)
> > That said, it won't be sufficient to fix this code, because we still need to
> > make changes in ~WebProcessProxy to expect the WebProcessPool to be already
> > destroyed anyway
> 
> I see you've done this in your pull request.

Yes, please use your review powers if you approve, to get WPE back in shape quickly.
Comment 24 EWS 2023-11-01 12:19:45 PDT
Committed 270070@main (42d022005f54): <https://commits.webkit.org/270070@main>

Reviewed commits have been landed. Closing PR #19837 and removing active labels.