Bug 227715 - [webkitcorepy] run-webkit-tests may hang with python2 after r271683
Summary: [webkitcorepy] run-webkit-tests may hang with python2 after r271683
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Carlos Alberto Lopez Perez
URL:
Keywords: InRadar
Depends on:
Blocks:
 
Reported: 2021-07-06 11:24 PDT by Carlos Alberto Lopez Perez
Modified: 2021-07-09 01:11 PDT (History)
8 users (show)

See Also:


Attachments
Patch (1.86 KB, patch)
2021-07-06 11:38 PDT, Carlos Alberto Lopez Perez
no flags Details | Formatted Diff | Diff
Patch (1.83 KB, patch)
2021-07-08 17:45 PDT, Carlos Alberto Lopez Perez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Carlos Alberto Lopez Perez 2021-07-06 11:24:26 PDT
Checking the testing EWS for GTK WK2 layout test I noticed most of them were failing to complete the layout-test step and they were hanging at the end.

Example: https://ews-build.webkit-uat.org/#/builders/34/builds/34722

09:54:19.733 3 worker/5 imported/w3c/web-platform-tests/service-workers/service-worker/fetch-cors-xhr.https.html passed
09:54:19.748 3 worker/15 imported/w3c/web-platform-tests/streams/piping/flow-control.any.worker.html passed
09:54:19.778 3 worker/6 imported/w3c/web-platform-tests/streams/transform-streams/lipfuzz.any.worker.html passed
09:54:19.780 3 worker/2 "ruby -I /app/webkit/Websites/bugs.webkit.org/PrettyPatch /app/webkit/Websites/bugs.webkit.org/PrettyPatch/prettify.rb /home/ews/worker/GTK-WK2-Tests-EWS/build/layout-test-results/imported/w3c/web-platform-tests/streams/readable-streams/general.any-diff.txt" took 0.18s
09:54:19.780 3 [42759/54367] imported/w3c/web-platform-tests/streams/readable-streams/general.any.html failed unexpectedly (text diff)
09:54:20.079 3 worker/2 imported/w3c/web-platform-tests/streams/readable-streams/general.any.html failed:
09:54:20.079 3 worker/2 text diff
09:54:20.128 3 Some workers failed to gracefully shut down, but in-flight exception taking precedence
command timed out: 1200 seconds without output running ['python3', 'Tools/Scripts/run-webkit-tests', '--no-build', '--no-show-results', '--no-new-test-results', '--clobber-old-results', '--release', '--gtk', '--results-directory', 'layout-test-results', '--debug-rwt-logging', '--exit-after-n-failures', '30', '--skip-failing-tests'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1895.925172



The issue happens because a combination of two things:

 - The GTK workers are still using python2 for running the layout tests. This is because the shebang on the scripts points to "python" that defaults to python2. And even if you run the script with "python3" like in "python3 Tools/Scripts/run-webkit-tests" you actually end running it with python2 because of the flatpak-sdk and how the process are re-executed inside the container. We need to change the shebangs here.
 - A race condition happens when the layout tests are run with "--exit-after-n-failures N" and more than N unexpected failures happen when RWT is executed with python2. When "--exit-after-n-failures" triggers because more unexpected failures than N happen, then a TestRunInterruptedException() exception is raised from the worker and then on the exit handler for the task-queue the workers are terminated via os.kill(). Then the queue of workers is closed, but that causes a hang with python2 when waiting for the threads to be joined. This issue is not reproducible with python3.


A way to reproduce this that worked for me reliable is:

# Invalidate all the expected results for the css3 tests (truncate them to 10 chars)
$ find LayoutTests/css3 -name \*expected.txt -exec truncate -s10 '{}' \;
# Run the tests with python2
$ python2 Tools/Scripts/run-webkit-tests --no-build --release --gtk  --debug-rwt-logging --exit-after-n-failures 100 fast css3
Comment 1 Carlos Alberto Lopez Perez 2021-07-06 11:38:04 PDT
Created attachment 432954 [details]
Patch
Comment 2 Carlos Alberto Lopez Perez 2021-07-06 11:47:18 PDT
(In reply to Carlos Alberto Lopez Perez from comment #0)
> 
> The issue happens because a combination of two things:
> 
>  - The GTK workers are still using python2 for running the layout tests.
> This is because the shebang on the scripts points to "python" that defaults
> to python2. And even if you run the script with "python3" like in "python3
> Tools/Scripts/run-webkit-tests" you actually end running it with python2
> because of the flatpak-sdk and how the process are re-executed inside the
> container. We need to change the shebangs here.

Seems running with python2 is still the default everywhere but in the UAT?

I got confused because layout tests are running with a python3 prefix on  https://ews-build.webkit-uat.org/ but not on https://ews-build.webkit.org/
Comment 3 Aakash Jain 2021-07-06 12:12:16 PDT
(In reply to Carlos Alberto Lopez Perez from comment #2)
> Seems running with python2 is still the default everywhere but in the UAT?
That's right. Jonathan was using UAT for testing Python 3 change. It has not yet landed. See https://bugs.webkit.org/show_bug.cgi?id=226658
Comment 4 Carlos Alberto Lopez Perez 2021-07-06 16:43:30 PDT
(In reply to Aakash Jain from comment #3)
> (In reply to Carlos Alberto Lopez Perez from comment #2)
> > Seems running with python2 is still the default everywhere but in the UAT?
> That's right. Jonathan was using UAT for testing Python 3 change. It has not
> yet landed. See https://bugs.webkit.org/show_bug.cgi?id=226658

I see.

Then this issue potentially affects all EWS runners because they run with '--exit-after-n-failures 30'. In the event that more than 30 unexpected failures happen then run-webkit-tests will hang (at least on Linux)
Comment 5 Michael Catanzaro 2021-07-08 04:24:14 PDT
(In reply to Carlos Alberto Lopez Perez from comment #0)
> And even if you run the script with "python3" like in "python3
> Tools/Scripts/run-webkit-tests" you actually end running it with python2
> because of the flatpak-sdk and how the process are re-executed inside the
> container. We need to change the shebangs here.

Note: in freedesktop-sdk, /usr/bin/python is python3. It's only python2 in our runtime because we build our own python2. I believe that is done specifically for it to be used by run-webkit-tests, but maybe we have other developer scripts that still require python2? Phil would probably know for sure.
Comment 6 Philippe Normand 2021-07-08 04:50:17 PDT
https://bugs.webkit.org/show_bug.cgi?id=221043#c7
Comment 7 Carlos Alberto Lopez Perez 2021-07-08 17:22:20 PDT
Can someone please review this?

Now this is causing issues on our Debug bot:

See https://build.webkit.org/#/builders/63/builds/1638

15:27:52.105 3 [199/58328] accessibility/adjacent-continuations-cause-assertion-failure.html failed unexpectedly (WebKitWebProcess crashed [pid=11420])
15:27:53.613 3 worker/0 killing driver
15:27:53.613 3 worker/0 accessibility/adjacent-continuations-cause-assertion-failure.html failed:
15:27:53.613 3 worker/0 WebKitWebProcess crashed [pid=11420]
15:27:53.664 3 Some workers failed to gracefully shut down, but in-flight exception taking precedence
command timed out: 1200 seconds without output running ['python', './Tools/Scripts/run-webkit-tests', '--no-build', '--no-show-results', '--no-new-test-results', '--clobber-old-results', '--builder-name', 'GTK-Linux-64-bit-Debug-Tests', '--build-number', '1638', '--buildbot-worker', 'gtk-linux-bot-7', '--buildbot-master', 'build.webkit.org', '--report', 'https://results.webkit.org', '--exit-after-n-crashes-or-timeouts', '50', '--exit-after-n-failures', '500', '--debug', '--gtk', '--results-directory', 'layout-test-results', '--debug-rwt-logging'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1635.329821

It seems some commit caused more than 500 failures and the bot should be exiting early, but instead of exiting early is hanging due to this bug.
Comment 8 Carlos Alberto Lopez Perez 2021-07-08 17:41:55 PDT
(In reply to Carlos Alberto Lopez Perez from comment #7)
> It seems some commit caused more than 500 failures and the bot should be
> exiting early, but instead of exiting early is hanging due to this bug.

Actually is crashing and exiting due to more than 50 crashes :\
Comment 9 Carlos Alberto Lopez Perez 2021-07-08 17:45:27 PDT
Created attachment 433190 [details]
Patch

Just an edit on the comment
Comment 10 EWS 2021-07-09 01:10:26 PDT
Committed r279775 (239544@main): <https://commits.webkit.org/239544@main>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 433190 [details].
Comment 11 Radar WebKit Bug Importer 2021-07-09 01:11:17 PDT
<rdar://problem/80364971>