RESOLVED CONFIGURATION CHANGED 209846
[WPE] Debug bot timeouts with many unresponsive webprocess errors
https://bugs.webkit.org/show_bug.cgi?id=209846
Lauro Moura
Reported 2020-03-31 21:44:04 PDT
Right after starting the tests, many of the workers fail to run their first tests, with messages like:

04:32:07.666 78279 WPEWebProcess is unresponsive, pid = None
04:32:07.668 78279 worker/8 crypto/workers/crypto-gc-worker.html output stderr lines:
04:32:07.668 78279 <unknown> - TestController::run - Failed to reset state to consistent values
04:32:07.668 78279 #PROCESS UNRESPONSIVE - WPEWebProcess
04:32:07.693 78279 "ruby --version" took 0.02s
04:32:07.838 78279 "ruby -I /home/buildbot/wpe/wpe-linux-64-debug-tests/build/Websites/bugs.webkit.org/PrettyPatch /home/buildbot/wpe/wpe-linux-64-debug-tests/build/Websites/bugs.webkit.org/PrettyPatch/prettify.rb /home/buildbot/wpe/wpe-linux-64-debug-tests/build/layout-test-results/crypto/workers/crypto-gc-worker-diff.txt" took 0.14s

It seems to have started around build #3470 (containing revisions r256477 through r256502, Feb 12th). Before that build, this error happened only sporadically (see the list below). Since that build, it has been occurring frequently, right from the start of the run.

Occurrences of this error in the preceding builds:

* 3457 - 0 - Earliest build checked
* ... No occurrences in between
* 3463 - 0
* 3464 - 1 - transitions/negative-delay.html, late in the test run
* 3465 - 1 - storage/indexeddb/pending-version-change-on-exit-private.html, late in the test run
* 3466 - 0
* 3467 - 0
* 3468 - 1 - storage/indexeddb/optional-arguments.html, late in the test run
* 3469 - 1 - imported/w3c/web-platform-tests/html/browsers/history/the-history-interface/history_go_minus.html
* 3470 - 14 - All 10 workers fail in the first test they run, with more failures later, triggering the circuit breaker.

Link to build: https://build.webkit.org/builders/WPE%20Linux%2064-bit%20Debug%20%28Tests%29/builds/3470

I still have not managed to reproduce it locally.
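To tally these errors per build, a log can be scanned for the `#PROCESS UNRESPONSIVE` marker. The following is only a minimal sketch, assuming the line layout seen in the excerpt above (timestamp, pid, message). The function name `count_unresponsive` and the `sample` text are hypothetical and are not part of the WebKit tooling.

```python
import re

# Assumed line layout, based on the log excerpt in this report:
#   HH:MM:SS.mmm <pid> #PROCESS UNRESPONSIVE - <process name>
UNRESPONSIVE_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}\.\d{3} \d+ #PROCESS UNRESPONSIVE - (\S+)$"
)


def count_unresponsive(log_text):
    """Return a dict mapping process name to its number of unresponsive reports."""
    counts = {}
    for line in log_text.splitlines():
        match = UNRESPONSIVE_RE.match(line.strip())
        if match:
            name = match.group(1)
            counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical sample built from lines quoted in this report.
sample = """\
04:32:07.666 78279 WPEWebProcess is unresponsive, pid = None
04:32:07.668 78279 <unknown> - TestController::run - Failed to reset state to consistent values
04:32:07.668 78279 #PROCESS UNRESPONSIVE - WPEWebProcess
04:32:07.693 78279 "ruby --version" took 0.02s
"""

print(count_unresponsive(sample))
```

Only the `#PROCESS UNRESPONSIVE` line is counted, so the accompanying "is unresponsive, pid = None" message for the same event is not counted twice.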
Lauro Moura
Comment 1 2020-03-31 21:54:17 PDT
Correction: Managed to reproduce the issue with jhbuild (which is still used in the bots).
Lauro Moura
Comment 2 2020-04-08 20:15:01 PDT
Hard to reproduce consistently on my setup. It sometimes happens when a lot of parallel tests are run (like 25 instances on my 8-core laptop), but rarely.

Directly on the bot it fails more consistently with the following command line:

$ python ./Tools/Scripts/run-webkit-tests --no-build --no-show-results --no-new-test-results --clobber-old-results --exit-after-n-crashes-or-timeouts 2 --exit-after-n-failures 5 --debug --wpe --results-directory layout-test-results --debug-rwt-logging --child-processes=10 --iterations=20 --fully-parallel --no-http fast/dom/image-object.html

With WEBKIT_DEBUG=all, the failing test gives the output below. Up to the error message, it seems to be similar to what a normal run outputs:

```
UNIMPLEMENTED: ../../Source/WebKit/UIProcess/WebPreferences.cpp(201) : void WebKit::WebPreferences::platformInitializeStore()
UNIMPLEMENTED: ../../Source/WebKit/UIProcess/WebPreferences.cpp(244) : bool WebKit::WebPreferences::platformGetBoolUserValueForKey(const WTF::String&, bool&)
UNIMPLEMENTED: ../../Source/WebKit/UIProcess/WebPreferences.cpp(250) : bool WebKit::WebPreferences::platformGetUInt32UserValueForKey(const WTF::String&, uint32_t&)
UNIMPLEMENTED: ../../Source/WebKit/UIProcess/WebPreferences.cpp(213) : void WebKit::WebPreferences::platformUpdateBoolValueForKey(const WTF::String&, bool)
UNIMPLEMENTED: ../../Source/WebKit/UIProcess/WebPreferences.cpp(208) : void WebKit::WebPreferences::platformUpdateStringValueForKey(const WTF::String&, const WTF::String&)
(Back/Forward) Created WebBackForwardList 0x7fe9502ed3b8
UNIMPLEMENTED: ../../Source/WebKit/UIProcess/wpe/WebPageProxyWPE.cpp(44) : void WebKit::WebPageProxy::platformInitialize()
(NetworkProcess) synchronizing cache
(NetworkProcess) opened cache storage, success 1
(NetworkProcess) blob synchronization completed approximateSize=0
(NetworkProcess) cache synchronization completed size=0 recordCount=0
(NetworkProcess) synchronizing cache
(NetworkProcess) opened cache storage, success 1
(NetworkProcess) blob synchronization completed approximateSize=0
(NetworkProcess) cache synchronization completed size=0 recordCount=0
(ProcessSwapping) Removing process with pid 0 from the origin cache set
WebPageProxy 7 activityStateDidChange - mayHaveChanged loading
WebPageProxy 7 dispatchActivityStateChange - potentiallyChangedActivityStateFlags loading
WebPageProxy 7 dispatchActivityStateChange: state changed from active window, focused, visible, visible or occluded, in-window to active window, focused, visible, visible or occluded, in-window, loading
<unknown> - TestController::run - Failed to reset state to consistent values
#PROCESS UNRESPONSIVE - WPEWebProcess
```
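Since the failure rate seems to depend on how many workers run in parallel, it can help to generate the same command line for several worker counts. This is a rough sketch only: every flag is copied from the command quoted above, while `build_command` and the worker counts tried (10 and 25, the two values mentioned in this comment) are illustrative assumptions.

```python
import shlex

# Flags copied verbatim from the command line quoted in this comment.
BASE_FLAGS = [
    "--no-build", "--no-show-results", "--no-new-test-results",
    "--clobber-old-results",
    "--exit-after-n-crashes-or-timeouts", "2",
    "--exit-after-n-failures", "5",
    "--debug", "--wpe",
    "--results-directory", "layout-test-results",
    "--debug-rwt-logging",
]


def build_command(child_processes, iterations=20,
                  test="fast/dom/image-object.html"):
    """Return the argv list for one reproduction attempt (hypothetical helper)."""
    return (
        ["python", "./Tools/Scripts/run-webkit-tests"]
        + BASE_FLAGS
        + [
            "--child-processes={}".format(child_processes),
            "--iterations={}".format(iterations),
            "--fully-parallel",
            "--no-http",
            test,
        ]
    )


# Print one shell-quoted command per worker count; nothing is executed here.
for workers in (10, 25):
    print(" ".join(shlex.quote(arg) for arg in build_command(workers)))
```

The argv list could be passed to `subprocess.run` from a WebKit checkout, but the sketch only prints the commands so it is safe to run anywhere.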
Lauro Moura
Comment 3 2020-04-09 21:33:08 PDT
Some tests on the wpe-debug-tests bot with different timeout values (the startup timeout is 1/4 of the regular timeout) and 20 processes:

* 30s - Always times out on startup (that means the WebProcess is taking more than 7.5s to start and load the about:blank page)
* 45s - Times out most of the time
* 60s - Never times out

Upping the number of processes to 25 made the 60s timeout trigger the startup issues too.

Some more pairs:

* 30s/10proc - times out
* 30s/5proc - times out sometimes
* 30s/1proc - works

Given we are planning to move the bots to a flatpak-based setup, we could try increasing the timeout limit or reducing the number of parallel jobs to allow the bot to run while a proper fix is found.
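The startup budgets behind those numbers can be worked out directly. The snippet below is a small sketch that assumes only what this comment states, namely that the startup timeout is one quarter of the regular timeout. The names `STARTUP_FRACTION`, `startup_timeout`, and `observations` are made up for illustration.

```python
# Assumption taken from this comment: the startup timeout is one quarter
# of the regular per-test timeout.
STARTUP_FRACTION = 0.25


def startup_timeout(test_timeout_s):
    """Return the startup timeout in seconds for a given test timeout."""
    return test_timeout_s * STARTUP_FRACTION


# Outcomes reported in this comment for 20 worker processes.
observations = {
    30: "always times out",
    45: "times out most of the time",
    60: "never times out",
}

for timeout_s, outcome in sorted(observations.items()):
    print("{:>3}s timeout -> {:.2f}s startup budget: {}".format(
        timeout_s, startup_timeout(timeout_s), outcome))
```

Read this way, the results suggest that with 20 workers the WebProcess often needs more than 11.25s to start, but stayed under 15s in the runs observed.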
Lauro Moura
Comment 4 2020-04-21 11:56:32 PDT
After re-enabling the WPE debug bots with the number of processes reduced from 20 to 10:

* 12 builds (#1700 to #1711)
* 5 ran to completion (between 1h45m and 1h51m)
* 7 exited early
* 8 runs had those timeouts at the beginning (one of them managed to run to the end)
Lauro Moura
Comment 5 2020-08-20 14:44:31 PDT
This has not happened since the move to the Flatpak SDK. Closing for now.