Bug 153217 - Layout tests get waiting forever when the abort early count triggers.
Status: NEW
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: Other
Hardware: Unspecified Unspecified
Importance: P1 Major
Assignee: Carlos Alberto Lopez Perez
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-01-18 18:24 PST by Carlos Alberto Lopez Perez
Modified: 2016-02-01 16:41 PST
CC List: 3 users

See Also:


Attachments

Description Carlos Alberto Lopez Perez 2016-01-18 18:24:15 PST
When the layout tests abort early because of either of these options:

--exit-after-n-failures 500 
--exit-after-n-crashes-or-timeouts 50

run-webkit-tests doesn't exit; instead, it stalls and waits forever.

This can be seen for example on this log: https://build.webkit.org/builders/GTK%20Linux%2064-bit%20Debug%20%28Tests%29/builds/6641/steps/layout-test/logs/stdio

[...]
16:38:47.334 13734 worker/2 cleaning up
16:38:47.334 13734 worker/2 cleaning up
16:38:47.339 13734 worker/3 cleaning up
16:38:47.339 13734 worker/3 cleaning up
16:38:47.345 13734 worker/4 cleaning up
16:38:47.345 13734 worker/4 cleaning up
16:38:47.350 13734 worker/5 cleaning up
16:38:47.350 13734 worker/5 cleaning up
16:38:47.363 13734 Exiting early after 20 crashes and 30 timeouts. 31033 tests run.   <--- this is where --exit-after-n-crashes-or-timeouts 50 triggers
16:38:47.363 13734 Stopping HTTP server ...
16:38:47.364 13734 Attempting to shut down httpd server at pid 13767
16:38:47.439 13734 Waiting for action: <function <lambda> at 0x7ffdc8347050>
16:38:48.440 13734 httpd server at pid 13767 stopped
16:38:48.440 13734 Stopping WebSocket server ...
16:38:48.440 13734 Attempting to shut down pywebsocket server at pid 13773
16:38:48.440 13734 Waiting for action: <bound method PyWebSocket._check_and_kill of <webkitpy.layout_tests.servers.websocket_server.PyWebSocket object at 0x7ffdc83284d0>>
16:38:49.440 13734 Waiting for action: <bound method PyWebSocket._check_and_kill of <webkitpy.layout_tests.servers.websocket_server.PyWebSocket object at 0x7ffdc83284d0>>
16:38:50.441 13734 pywebsocket server at pid 13773 stopped
16:38:50.441 13734 Stopping Web Platform Test server ...
16:38:50.441 13734 Attempting to shut down wptwk server at pid 13775
16:38:50.441 13734 Stopping wptwk server
16:38:50.441 13734 Cleaning WPT resources files
16:38:50.442 13734 Cleaning WPT web platform server config.json
16:38:50.456 13734 wptwk server at pid 13775 stopped
16:38:50.469 13734 Flushing stdout
16:38:50.469 13734 Flushing stderr
16:38:50.469 13734 Stopping helper
16:38:50.469 13734 Cleaning up port
16:38:50.486 13734 Restoring module-stream-restore failed
[...]


------------------------------------------------------------------------------

command timed out: 1200 seconds without output, killing pid 13733                   <--- it gets killed once the 1200-second maximum timeout for any step on the bots is reached
process killed by signal 9
program finished with exit code -1
elapsedTime=3368.105984


This can be easily reproduced by running the tests with either of:

--exit-after-n-failures 2 
--exit-after-n-crashes-or-timeouts 2

It happens both with release and debug builds.

It affects the GTK port, and probably all the other ports (not tested).

This is causing unusually long test runs on the GTK debug test bot.
Comment 1 Carlos Alberto Lopez Perez 2016-01-19 04:27:09 PST
I will be looking into this this week.
Comment 2 Carlos Alberto Lopez Perez 2016-01-22 08:25:47 PST
An important observation: This is only reproducible when running the full layout tests.  That is, when you don't pass any test name or directory in the arguments.
Comment 3 Carlos Alberto Lopez Perez 2016-01-26 04:49:08 PST
I'm not sure if this is a bug in webkitpy or in python-2.7 itself.

The issue is that it deadlocks in the atexit handlers when sys.exit() is invoked.

[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[DEBUG/MainProcess] running the remaining "atexit" finalizers
[SUBDEBUG/MainProcess] calling <Finalize object, callback=_finalize_join, args=[<weakref at 0x7f48dacbdc00; to 'Thread' at 0x7f48dacce5d0>], exitprority=-5>
[SUBDEBUG/MainProcess] finalizer calling <function _finalize_join at 0x7f48dacc7320> with args [<weakref at 0x7f48dacbdc00; to 'Thread' at 0x7f48dacce5d0>] and kwargs {}
[DEBUG/MainProcess] joining queue thread


I have identified three ways of making this issue go away:

 * Reverting this specific changeset on the python 2.7 branch: https://hg.python.org/cpython/rev/d316315a8781 
 * Using os._exit() instead of sys.exit() in run_webkit_tests.py
 * Calling self._messages_to_worker.cancel_join_thread() in message_pool.py.
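
The effect of the third option can be illustrated outside webkitpy. What follows is a minimal sketch, not the webkitpy code: it is written for Python 3 (where get_context() exists) rather than the python-2.7 discussed here, it assumes a platform with the "fork" start method such as Linux, and the names producer and exits_within are made up for the illustration. A child process puts more data on a multiprocessing.Queue than the pipe can buffer and nobody reads it, so at exit the "joining queue thread" step has nothing to wait for but a blocked feeder thread, unless cancel_join_thread() was called.

```python
import multiprocessing

# Assumption: the "fork" start method is available (e.g. Linux).
ctx = multiprocessing.get_context("fork")


def producer(queue, cancel_join):
    # Hypothetical worker: queue more data than the underlying pipe can
    # buffer, so the queue's feeder thread stays blocked until someone
    # reads from the other end (nobody does in this sketch).
    queue.put("x" * (4 * 1024 * 1024))
    if cancel_join:
        # Do not join the feeder thread at process exit. Data that has
        # not been flushed to the pipe yet may be lost.
        queue.cancel_join_thread()


def exits_within(cancel_join, timeout=3.0):
    """Return True if the producer process exits on its own in time."""
    queue = ctx.Queue()
    proc = ctx.Process(target=producer, args=(queue, cancel_join))
    proc.start()
    proc.join(timeout)
    exited = not proc.is_alive()
    if not exited:
        # The process is stuck joining the feeder thread; kill it.
        proc.terminate()
        proc.join()
    return exited


if __name__ == "__main__":
    print("default exit path:", exits_within(False))
    print("with cancel_join_thread():", exits_within(True))
```

The trade-off is the one the Python documentation warns about: with cancel_join_thread() the process no longer waits for queued data to be flushed, so this is only safe when losing pending messages is acceptable, as it would be on an early abort.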


As I'm not sure whether either of the suggested webkitpy changes would be right, the first option looks like the sanest for the moment, because that changeset was only applied to the 2.7 branch of python and may be wrong.

So I have locally reverted that changeset in the python-2.7 of the GTK+ Debug test bot. This should make the issue go away there, which was the main priority.

In the mid term I would like to either come up with a patch for webkitpy or with a simple test case so I can report the issue to python upstream.
Comment 4 Alexey Proskuryakov 2016-01-26 09:24:56 PST
Longer term, webkitpy should move away from the multiprocessing module, which is intrinsically buggy.

The way it works is by forking processes without exec, which is incompatible with any system library code that needs per-process initialization.
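
The difference between fork without exec and a freshly exec'ed process can be seen with a small sketch. It is only an illustration under assumptions: Python 3 on a platform with the "fork" start method, with INIT_PID as a made-up stand-in for library state that is initialized once per process, and hypothetical helper names. The forked child inherits the parent's already-initialized state, while the exec'ed child runs its initialization again.

```python
import multiprocessing
import os
import subprocess
import sys

# Stand-in for per-process library state: records which process ran
# the initialization.
INIT_PID = os.getpid()


def report(conn):
    # Runs in the forked child: send the inherited state and the real pid.
    conn.send((INIT_PID, os.getpid()))
    conn.close()


def forked_child_view():
    """Return (INIT_PID, pid) as seen by a child forked without exec."""
    ctx = multiprocessing.get_context("fork")  # assumption: fork is available
    parent_end, child_end = ctx.Pipe()
    proc = ctx.Process(target=report, args=(child_end,))
    proc.start()
    result = parent_end.recv()
    proc.join()
    return result


def execed_child_view():
    """Return (INIT_PID, pid) as seen by a child started with fork + exec."""
    code = "import os; INIT_PID = os.getpid(); print(INIT_PID, os.getpid())"
    out = subprocess.check_output([sys.executable, "-c", code])
    init_pid, child_pid = out.split()
    return int(init_pid), int(child_pid)


if __name__ == "__main__":
    print("forked child:", forked_child_view())
    print("exec'ed child:", execed_child_view())
```

In the forked child the two values differ, because the state still belongs to the parent; in the exec'ed child they match. Library code that caches per-process resources at initialization time is affected in the same way.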