Bug 229758 - [GTK] The Xvfb display server may fail to start sometimes causing tests to randomly crash
Summary: [GTK] The Xvfb display server may fail to start sometimes causing tests to ra...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: WebKit Nightly Build
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Carlos Alberto Lopez Perez
URL:
Keywords: InRadar
Depends on: 229990
Blocks:
  Show dependency treegraph
 
Reported: 2021-09-01 04:33 PDT by Carlos Alberto Lopez Perez
Modified: 2021-09-22 04:16 PDT (History)
16 users (show)

See Also:


Attachments
Patch (21.56 KB, patch)
2021-09-01 10:47 PDT, Carlos Alberto Lopez Perez
no flags Details | Formatted Diff | Diff
Patch (21.56 KB, patch)
2021-09-01 12:16 PDT, Carlos Alberto Lopez Perez
no flags Details | Formatted Diff | Diff
Patch (22.29 KB, patch)
2021-09-06 06:32 PDT, Carlos Alberto Lopez Perez
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Carlos Alberto Lopez Perez 2021-09-01 04:33:53 PDT
I have observed this here: https://build.webkit.org/#/builders/199/builds/1062
 - If you take a look at that build there are 11 unexpected crashes which only happened on that build (didn't happened not on the build before or after)
 - There are no crash log for any of the crashes, this is weird.. it is an evidence it was not an actual crash but a failure to start
 - And if you take a closer look to the stdio log you see the crashes happened on a few workers. As soon as this workers were started all the tests on that workers crashed with a common stderr message: "Gtk-WARNING **: 15:52:29.549: cannot open display: :$X". And as soon the driver was restarted the tests started to work back again.

So, I think that what happened here was that the Xvfb driver failed to start on this workers, leading to crashes on all tests until it was restarted.

We need to make the Xvfb driver more robust and deal better with an eventual failure to start Xvfb.
Comment 1 Carlos Alberto Lopez Perez 2021-09-01 09:24:30 PDT
Looks like this was more common than I expected.. I have been able to reproduce the issue of Xvfb not starting properly locally when testing this. Patch incoming
Comment 2 Carlos Alberto Lopez Perez 2021-09-01 10:47:18 PDT
Created attachment 437043 [details]
Patch
Comment 3 Philippe Normand 2021-09-01 12:05:23 PDT
EWS is red, python2 test failure: webkitpy.port.xvfbdriver_unittest.XvfbDriverTest.test_xvfb_not_replying

I'm all-in-favor for fixing this, but it begs the question, should we soon migrate the bots to wayland and consider X as a legacy thing?
Comment 4 Carlos Alberto Lopez Perez 2021-09-01 12:16:07 PDT
Created attachment 437051 [details]
Patch
Comment 5 Carlos Alberto Lopez Perez 2021-09-01 12:24:33 PDT
(In reply to Philippe Normand from comment #3)
> EWS is red, python2 test failure:
> webkitpy.port.xvfbdriver_unittest.XvfbDriverTest.test_xvfb_not_replying
> 
> I'm all-in-favor for fixing this, but it begs the question, should we soon
> migrate the bots to wayland and consider X as a legacy thing?

Running on a real wayland display is a major issue because it complicates a lot the hardware maintenance and deployment (makes difficult to use containers, also depending on a specific graphics hardware drivers makes failures less reproducible, etc).

I would be all in favour of migrating to wayland if there is a way of having the tests to run in a virtualized wayland environment. Fortunately we have that, currently we can pass --display-server=weston to run-webkit-tests which runs the tests inside weston inside Xvfb. Is a bit hackish but it works.

This patch also fixes the startup of Xvfb for the --display-server=weston option (weston in Xvfb--- fixes the Xvfb part). I plan to also fix the weston startup part (likely there are race conditions there as well) on a follow-up patch soon.

Then we would need to check how different are the results from the Weston driver compared to the Xvfb one.

Also I'm not really found of this solution of Weston inside Xvfb.. if there is a better way of achieving a virtualized display for wayland please let me know that.
Comment 6 Philippe Normand 2021-09-01 12:47:39 PDT
> if there is a better way of achieving a virtualized display for wayland please let me know that.

Maybe there's a way to run weston in swrast mode? I don't know, I haven't done research in that field :) But yeah it's a bit sad to run weston within xvfb :(
Comment 7 Michael Catanzaro 2021-09-01 12:51:17 PDT
(In reply to Carlos Alberto Lopez Perez from comment #5)
> Also I'm not really found of this solution of Weston inside Xvfb.. if there
> is a better way of achieving a virtualized display for wayland please let me
> know that.

I think we can do:

$ weston --backend=headless-backend.so

If that doesn't work, mutter has a headless mode added in https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1698. You could try:

$ mutter --wayland --headless --virtual-monitor 1280x720

I know wlroots also has a headless mode.
Comment 8 Carlos Alberto Lopez Perez 2021-09-01 13:20:55 PDT
Comment on attachment 437051 [details]
Patch

Clearing flags on attachment: 437051

Committed r281870 (241200@main): <https://commits.webkit.org/241200@main>
Comment 9 Carlos Alberto Lopez Perez 2021-09-01 13:21:01 PDT
All reviewed patches have been landed.  Closing bug.
Comment 10 Radar WebKit Bug Importer 2021-09-01 13:21:27 PDT
<rdar://problem/82643101>
Comment 11 Carlos Alberto Lopez Perez 2021-09-01 13:52:57 PDT
This broke GTK API tests :\ ... it seems the GTK API test runner script runs with python2 and the code I added is not working properly there .. I will revert this.
Comment 12 Carlos Alberto Lopez Perez 2021-09-01 13:56:24 PDT
Reverted r281870 for reason:

It

Committed r281874 (241204@main): <https://commits.webkit.org/241204@main>
Comment 13 Jonathan Bedard 2021-09-01 14:06:18 PDT
(In reply to Carlos Alberto Lopez Perez from comment #11)
> This broke GTK API tests :\ ... it seems the GTK API test runner script runs
> with python2 and the code I added is not working properly there .. I will
> revert this.

On this note, we've started changing invocations of all Apple's Python scripts to Python3. Other ports will get some of this work for free (already happened with run-webkit-tests, for example) but Apple folks won't make that change for scripts we don't own or partially own
Comment 14 Carlos Alberto Lopez Perez 2021-09-01 17:28:38 PDT
(In reply to Jonathan Bedard from comment #13)
> (In reply to Carlos Alberto Lopez Perez from comment #11)
> > This broke GTK API tests :\ ... it seems the GTK API test runner script runs
> > with python2 and the code I added is not working properly there .. I will
> > revert this.
> 
> On this note, we've started changing invocations of all Apple's Python
> scripts to Python3. Other ports will get some of this work for free (already
> happened with run-webkit-tests, for example) but Apple folks won't make that
> change for scripts we don't own or partially own

That's for the note. I think that is fair and makes sense. Thanks for working on this, btw.

I opened bug 229782 for porting the GTK/WPE API test runner to python3
Also opened bug 229783 for run-perf-tests
Comment 15 Carlos Alberto Lopez Perez 2021-09-06 06:21:40 PDT
I converted run-gtk-tests, run-wpe-tests and run-perf-tests to use python3. But I realized that there are still more scripts running with python2 like run-webdriver-tests and maybe even others which I still didn't realized about.

So I will upload now a new version of this patch that is compatible with python2, to not block here on migrating every consumer of it to python3.
Comment 16 Carlos Alberto Lopez Perez 2021-09-06 06:32:18 PDT
Created attachment 437409 [details]
Patch

Make code compatible with python2. This is the diff against the previous patch: http://sprunge.us/BfNZVc?diff
Comment 17 Carlos Alberto Lopez Perez 2021-09-06 12:48:39 PDT
Comment on attachment 437409 [details]
Patch

Clearing flags on attachment: 437409

Committed r282064 (241364@main): <https://commits.webkit.org/241364@main>
Comment 18 Carlos Alberto Lopez Perez 2021-09-06 12:48:45 PDT
All reviewed patches have been landed.  Closing bug.
Comment 19 WebKit Commit Bot 2021-09-07 03:28:32 PDT
Re-opened since this is blocked by bug 229990
Comment 20 Carlos Alberto Lopez Perez 2021-09-07 07:02:23 PDT
(In reply to WebKit Commit Bot from comment #19)
> Re-opened since this is blocked by bug 229990

An update of what happened here.. On r282064 I messed things up and by mistake I redirected stderr to stdout on the subprocess call to check the screen size of the Xvfb display, so when for any reason the program that checks the size prints anything on stderr (like a gtk/glib warning), then it would mix stderr with stdout and it will make the check fail (since the screen dimensions of the display it reports on stdout are compared with the expected ones)

In the reported case this was happening:

04:13:30.337 2 worker/5 Waiting for Xvfb display server to be ready.
04:13:30.338 2 worker/1 The queried Xvfb screen size "(print-screen-size:258): dbind-WARNING **: 04:13:30.304: Couldn't connect to accessibility bus: Failed to connect to socket /tmp/dbus-WDVOI6JrL3: No such file or directory
1024x768" does not match the expectation "1024x768".

The string """(print-screen-size:258): dbind-WARNING **: 04:13:30.304: Couldn't connect to accessibility bus: Failed to connect to socket /tmp/dbus-WDVOI6JrL3: No such file or directory""" should belong only to stderr.

This patch on top fixes the issue: http://sprunge.us/r865Ui

I will re-land it with the above fix
Comment 21 Carlos Alberto Lopez Perez 2021-09-07 07:14:01 PDT
Committed r282082 (241382@main): <https://commits.webkit.org/241382@main>