Bug 209846
Field | Value | Field | Value
---|---|---|---
Summary | [WPE] Debug bot timeouts with many unresponsive webprocess errors | |
Product | WebKit | Reporter | Lauro Moura <lmoura>
Component | WPE WebKit | Assignee | Nobody <webkit-unassigned>
Status | RESOLVED CONFIGURATION CHANGED | |
Severity | Normal | CC | bugs-noreply
Priority | P2 | |
Version | Other | |
Hardware | Unspecified | |
OS | Unspecified | |
See Also | https://bugs.webkit.org/show_bug.cgi?id=188048 | |
Lauro Moura
Right after starting the tests, many of the workers fail to run the first tests, with messages like:
04:32:07.666 78279 WPEWebProcess is unresponsive, pid = None
04:32:07.668 78279 worker/8 crypto/workers/crypto-gc-worker.html output stderr lines:
04:32:07.668 78279 <unknown> - TestController::run - Failed to reset state to consistent values
04:32:07.668 78279 #PROCESS UNRESPONSIVE - WPEWebProcess
04:32:07.693 78279 "ruby --version" took 0.02s
04:32:07.838 78279 "ruby -I /home/buildbot/wpe/wpe-linux-64-debug-tests/build/Websites/bugs.webkit.org/PrettyPatch /home/buildbot/wpe/wpe-linux-64-debug-tests/build/Websites/bugs.webkit.org/PrettyPatch/prettify.rb /home/buildbot/wpe/wpe-linux-64-debug-tests/build/layout-test-results/crypto/workers/crypto-gc-worker-diff.txt" took 0.14s
It seems to have started around build #3470 (containing revisions r256477 through r256502, Feb 12th). Before that build, this error happened only sporadically (see list below). From this build on, it has occurred quite frequently, right from the start of the run.
Occurrences of this error in the builds before:
* 3457 - 0 - Earliest build checked
* ... No occurrences in between
* 3463 - 0
* 3464 - 1 - transitions/negative-delay.html, late in the test run
* 3465 - 1 - storage/indexeddb/pending-version-change-on-exit-private.html, late in the test run
* 3466 - 0
* 3467 - 0
* 3468 - 1 - storage/indexeddb/optional-arguments.html, late in the test run
* 3469 - 1 - imported/w3c/web-platform-tests/html/browsers/history/the-history-interface/history_go_minus.html
* 3470 - 14 - All 10 workers fail in the first test they run, with more failures later, triggering the circuit breaker.
Link to build: https://build.webkit.org/builders/WPE%20Linux%2064-bit%20Debug%20%28Tests%29/builds/3470
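Per-build counts like the ones above could be tallied by scanning a test-run log for the "is unresponsive" marker. This is a minimal sketch, assuming the log lines have the format shown in the excerpt at the top of this report; `count_unresponsive` is a hypothetical helper, not part of run-webkit-tests.

```python
import re

# Marker printed when the web process stops responding, e.g.
# "04:32:07.666 78279 WPEWebProcess is unresponsive, pid = None".
# The "#PROCESS UNRESPONSIVE" stderr line is deliberately not matched,
# so each event is counted once.
UNRESPONSIVE = re.compile(r"\bWPEWebProcess is unresponsive\b")


def count_unresponsive(log_text):
    """Return the number of log lines reporting an unresponsive WPEWebProcess."""
    return sum(1 for line in log_text.splitlines() if UNRESPONSIVE.search(line))


sample = """\
04:32:07.666 78279 WPEWebProcess is unresponsive, pid = None
04:32:07.668 78279 worker/8 crypto/workers/crypto-gc-worker.html output stderr lines:
04:32:07.668 78279 <unknown> - TestController::run - Failed to reset state to consistent values
04:32:07.668 78279 #PROCESS UNRESPONSIVE - WPEWebProcess
"""
print(count_unresponsive(sample))
```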
Still have not managed to reproduce it locally.
Lauro Moura
Correction: Managed to reproduce the issue with jhbuild (which is still used in the bots).
Lauro Moura
It is hard to reproduce consistently on my setup. It sometimes, though rarely, happens when many tests are run in parallel (e.g. 25 instances on my 8-core laptop).
Directly on the bot it is failing more consistently with the following command line:
$ python ./Tools/Scripts/run-webkit-tests --no-build --no-show-results --no-new-test-results --clobber-old-results --exit-after-n-crashes-or-timeouts 2 --exit-after-n-failures 5 --debug --wpe --results-directory layout-test-results --debug-rwt-logging --child-processes=10 --iterations=20 --fully-parallel --no-http fast/dom/image-object.html
With WEBKIT_DEBUG=all, the crashing test gives this output, which, up to the error message, looks similar to the output of a normal run:
```
UNIMPLEMENTED:
../../Source/WebKit/UIProcess/WebPreferences.cpp(201) : void WebKit::WebPreferences::platformInitializeStore()
UNIMPLEMENTED:
../../Source/WebKit/UIProcess/WebPreferences.cpp(244) : bool WebKit::WebPreferences::platformGetBoolUserValueForKey(const WTF::String&, bool&)
UNIMPLEMENTED:
../../Source/WebKit/UIProcess/WebPreferences.cpp(250) : bool WebKit::WebPreferences::platformGetUInt32UserValueForKey(const WTF::String&, uint32_t&)
UNIMPLEMENTED:
../../Source/WebKit/UIProcess/WebPreferences.cpp(213) : void WebKit::WebPreferences::platformUpdateBoolValueForKey(const WTF::String&, bool)
UNIMPLEMENTED:
../../Source/WebKit/UIProcess/WebPreferences.cpp(208) : void WebKit::WebPreferences::platformUpdateStringValueForKey(const WTF::String&, const WTF::String&)
(Back/Forward) Created WebBackForwardList 0x7fe9502ed3b8
UNIMPLEMENTED:
../../Source/WebKit/UIProcess/wpe/WebPageProxyWPE.cpp(44) : void WebKit::WebPageProxy::platformInitialize()
(NetworkProcess) synchronizing cache
(NetworkProcess) opened cache storage, success 1
(NetworkProcess) blob synchronization completed approximateSize=0
(NetworkProcess) cache synchronization completed size=0 recordCount=0
(NetworkProcess) synchronizing cache
(NetworkProcess) opened cache storage, success 1
(NetworkProcess) blob synchronization completed approximateSize=0
(NetworkProcess) cache synchronization completed size=0 recordCount=0
(ProcessSwapping) Removing process with pid 0 from the origin cache set
WebPageProxy 7 activityStateDidChange - mayHaveChanged loading
WebPageProxy 7 dispatchActivityStateChange - potentiallyChangedActivityStateFlags loading
WebPageProxy 7 dispatchActivityStateChange: state changed from active window, focused, visible, visible or occluded, in-window to active window, focused, visible, visible or occluded, in-window, loading
<unknown> - TestController::run - Failed to reset state to consistent values
#PROCESS UNRESPONSIVE - WPEWebProcess
```
Lauro Moura
Some tests in the wpe-debug-tests bot with different timeout values (startup timeout is 1/4 of the regular timeout) and 20 processes:
30s - Always times out on startup (that means the WebProcess is taking more than 7.5s to start and load the about:blank page).
45s - Times out most of the time
60s - Never times out
Raising the number of processes to 25 made the startup issues appear with the 60s timeout too.
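To make the arithmetic explicit: with the startup timeout at 1/4 of the regular timeout, each setting gives the WebProcess a different budget to start. This is a minimal sketch; `STARTUP_FRACTION` and `startup_timeout` are hypothetical names used for illustration, not identifiers from the WebKit test harness.

```python
# Assumption taken from the comment above: the startup timeout is
# one quarter of the regular per-test timeout.
STARTUP_FRACTION = 0.25


def startup_timeout(test_timeout_s):
    """Return the startup budget in seconds for a given per-test timeout."""
    return test_timeout_s * STARTUP_FRACTION


for t in (30, 45, 60):
    print(f"{t}s test timeout -> {startup_timeout(t)}s startup timeout")
```

Read against the results above, this suggests that with 20 processes the WebProcess on the bot usually needs more than 11.25s but less than 15s to start and load about:blank.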
Some more pairs:
30s/10proc - timeout
30s/5proc - timeout sometimes
30s/1proc - works
Given that we are planning to move the bots to a flatpak-based setup, we could try increasing the timeout limit or reducing the number of parallel jobs to keep the bot running while a proper fix is found.
Lauro Moura
After re-enabling the WPE debug bots with the number of processes reduced from 20 to 10:
12 builds (#1700 to #1711)
5 ran to completion (taking between 1h45m and 1h51m)
7 exited early.
8 runs had those timeouts at the beginning (one of them still managed to run to the end).
Lauro Moura
This has not happened since the move to the Flatpak SDK. Closing for now.