Checking the testing EWS for GTK WK2 layout test I noticed most of them were failing to complete the layout-test step and they were hanging at the end. Example: https://ews-build.webkit-uat.org/#/builders/34/builds/34722 09:54:19.733 3 worker/5 imported/w3c/web-platform-tests/service-workers/service-worker/fetch-cors-xhr.https.html passed 09:54:19.748 3 worker/15 imported/w3c/web-platform-tests/streams/piping/flow-control.any.worker.html passed 09:54:19.778 3 worker/6 imported/w3c/web-platform-tests/streams/transform-streams/lipfuzz.any.worker.html passed 09:54:19.780 3 worker/2 "ruby -I /app/webkit/Websites/bugs.webkit.org/PrettyPatch /app/webkit/Websites/bugs.webkit.org/PrettyPatch/prettify.rb /home/ews/worker/GTK-WK2-Tests-EWS/build/layout-test-results/imported/w3c/web-platform-tests/streams/readable-streams/general.any-diff.txt" took 0.18s 09:54:19.780 3 [42759/54367] imported/w3c/web-platform-tests/streams/readable-streams/general.any.html failed unexpectedly (text diff) 09:54:20.079 3 worker/2 imported/w3c/web-platform-tests/streams/readable-streams/general.any.html failed: 09:54:20.079 3 worker/2 text diff 09:54:20.128 3 Some workers failed to gracefully shut down, but in-flight exception taking precedence command timed out: 1200 seconds without output running ['python3', 'Tools/Scripts/run-webkit-tests', '--no-build', '--no-show-results', '--no-new-test-results', '--clobber-old-results', '--release', '--gtk', '--results-directory', 'layout-test-results', '--debug-rwt-logging', '--exit-after-n-failures', '30', '--skip-failing-tests'], attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=1895.925172 The issue happens because a combination of two things: - The GTK workers are still using python2 for running the layout tests. This is because the shebang on the scripts points to "python" that defaults to python2. And even if you run the script with "python3" like in "python3 Tools/Scripts/run-webkit-tests" you actually end running it with python2 because of the flatpak-sdk and how the process are re-executed inside the container. We need to change the shebangs here. - A race condition happens when the layout tests are run with "--exit-after-n-failures N" and more than N unexpected failures happen when RWT is executed with python2. When "--exit-after-n-failures" triggers because more unexpected failures than N happen, then a TestRunInterruptedException() exception is raised from the worker and then on the exit handler for the task-queue the workers are terminated via os.kill(). Then the queue of workers is closed, but that causes a hang with python2 when waiting for the threads to be joined. This issue is not reproducible with python3. A way to reproduce this that worked for me reliable is: # Invalidate all the expected results for the css3 tests (truncate them to 10 chars) $ find LayoutTests/css3 -name \*expected.txt -exec truncate -s10 '{}' \; # Run the tests with python2 $ python2 Tools/Scripts/run-webkit-tests --no-build --release --gtk --debug-rwt-logging --exit-after-n-failures 100 fast css3
Created attachment 432954 [details] Patch
(In reply to Carlos Alberto Lopez Perez from comment #0) > > The issue happens because a combination of two things: > > - The GTK workers are still using python2 for running the layout tests. > This is because the shebang on the scripts points to "python" that defaults > to python2. And even if you run the script with "python3" like in "python3 > Tools/Scripts/run-webkit-tests" you actually end running it with python2 > because of the flatpak-sdk and how the process are re-executed inside the > container. We need to change the shebangs here. Seems running with python2 is still the default everywhere but in the UAT? I got confused because layout tests are running with a python3 prefix on https://ews-build.webkit-uat.org/ but not on https://ews-build.webkit.org/
(In reply to Carlos Alberto Lopez Perez from comment #2) > Seems running with python2 is still the default everywhere but in the UAT? That's right. Jonathan was using UAT for testing Python 3 change. It has not yet landed. See https://bugs.webkit.org/show_bug.cgi?id=226658
(In reply to Aakash Jain from comment #3) > (In reply to Carlos Alberto Lopez Perez from comment #2) > > Seems running with python2 is still the default everywhere but in the UAT? > That's right. Jonathan was using UAT for testing Python 3 change. It has not > yet landed. See https://bugs.webkit.org/show_bug.cgi?id=226658 I see. Then this issue potentially affects all EWS runners because they run with '--exit-after-n-failures 30'. In the event that more than 30 unexpected failures happen then run-webkit-tests will hang (at least on Linux)
(In reply to Carlos Alberto Lopez Perez from comment #0) > And even if you run the script with "python3" like in "python3 > Tools/Scripts/run-webkit-tests" you actually end running it with python2 > because of the flatpak-sdk and how the process are re-executed inside the > container. We need to change the shebangs here. Note: in freedesktop-sdk, /usr/bin/python is python3. It's only python2 in our runtime because we build our own python2. I believe that is done specifically for it to be used by run-webkit-tests, but maybe we have other developer scripts that still require python2? Phil would probably know for sure.
https://bugs.webkit.org/show_bug.cgi?id=221043#c7
Can someone please review this? Now this is causing issues on our Debug bot: See https://build.webkit.org/#/builders/63/builds/1638 15:27:52.105 3 [199/58328] accessibility/adjacent-continuations-cause-assertion-failure.html failed unexpectedly (WebKitWebProcess crashed [pid=11420]) 15:27:53.613 3 worker/0 killing driver 15:27:53.613 3 worker/0 accessibility/adjacent-continuations-cause-assertion-failure.html failed: 15:27:53.613 3 worker/0 WebKitWebProcess crashed [pid=11420] 15:27:53.664 3 Some workers failed to gracefully shut down, but in-flight exception taking precedence command timed out: 1200 seconds without output running ['python', './Tools/Scripts/run-webkit-tests', '--no-build', '--no-show-results', '--no-new-test-results', '--clobber-old-results', '--builder-name', 'GTK-Linux-64-bit-Debug-Tests', '--build-number', '1638', '--buildbot-worker', 'gtk-linux-bot-7', '--buildbot-master', 'build.webkit.org', '--report', 'https://results.webkit.org', '--exit-after-n-crashes-or-timeouts', '50', '--exit-after-n-failures', '500', '--debug', '--gtk', '--results-directory', 'layout-test-results', '--debug-rwt-logging'], attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=1635.329821 It seems some commit caused more than 500 failures and the bot should be exiting early, but instead of exiting early is hanging due to this bug.
(In reply to Carlos Alberto Lopez Perez from comment #7) > It seems some commit caused more than 500 failures and the bot should be > exiting early, but instead of exiting early is hanging due to this bug. Actually is crashing and exiting due to more than 50 crashes :\
Created attachment 433190 [details] Patch Just an edit on the comment
Committed r279775 (239544@main): <https://commits.webkit.org/239544@main> All reviewed patches have been landed. Closing bug and clearing flags on attachment 433190 [details].
<rdar://problem/80364971>