Bug 37987

Summary: [chromium] new-run-webkit-tests hangs on Chromium Bots (OS X and Linux)
Product: WebKit
Reporter: Yuzo Fujishima <yuzo>
Component: Tools / Tests
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED
Severity: Normal
CC: cjerdonek, commit-queue, dpranke, eric, jamesr, ojan, tkent, tony
Priority: P2
Version: 528+ (Nightly build)
Hardware: Mac (Intel)
OS: OS X 10.5
Bug Depends on: 38011
Bug Blocks: 38300
Attachments:
  stack dump (flags: none)
  Patch (flags: none)
  Patch (flags: none)
  Add missing import (flags: none)

Description Yuzo Fujishima 2010-04-22 05:22:44 PDT
Revision: 58075 (and later)

When invoked from <CHROMIUM>/src/webkit/tools/layout_tests/run_webkit_tests.py,
WebKitTools/Scripts/new-run-webkit-tests hangs until it times out after 600 seconds.

I observed this both in my local (Chromium) environment:
  ./webkit/tools/layout_tests/run_webkit_tests.py  --debug fast/[a-b]* --verbose
and in Chromium's canary bots:
  http://build.chromium.org/buildbot/waterfall.fyi/waterfall?builder=Webkit+Mac+(webkit.org)
  (Sometimes the tests go through, sometimes hang.)

As far as I tested, the hang is not caused by any specific test -- the test being run by the
hanging thread was different on each of the few dozen attempts I made.

-- Local Environment --

Traceback (collected by signaling):

  File "/Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/new-run-webkit-tests", line 55, in <module>
    sys.exit(run_webkit_tests.main(options, args))
  File "/Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 1628, in main
    num_unexpected_results = test_runner.run(result_summary, print_results)
  File "/Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 678, in run
    self._run_tests(self._test_files_list, result_summary))
  File "/Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 622, in _run_tests
    thread.join(0.1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 569, in join
    self.__block.wait(delay)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 233, in wait
    _sleep(delay)

Processes:

 7163 ttys000    0:00.03 /System/Library/Frameworks/Python.framework/Versions/2.5/Resources/Python.app/Contents/MacOS/Python ./webkit/tools/layout_tests/run_webkit_tests.py --debug fast/[a-b]* --verbose
 7165 ttys000    0:07.77 /System/Library/Frameworks/Python.framework/Versions/2.5/Resources/Python.app/Contents/MacOS/Python /Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/new-run-webkit-tests --chromium --debug fast/[a-b]* --verbose
 7170 ttys000    0:00.02 /Users/yuzo/src/chromium/src/xcodebuild/Debug/layout_test_helper
 7183 ttys000    0:01.58 /Users/yuzo/src/chromium/src/xcodebuild/Debug/TestShell.app/Contents/MacOS/TestShell --layout-tests --pixel-tests=/Users/yuzo/src/chromium/src/webkit/Debug/layout-test-results/png_result11.png

-- Chromium Canary Bot --

http://build.chromium.org/buildbot/waterfall.fyi/builders/Webkit%20Mac%20(webkit.org)/builds/18643/steps/webkit_tests/logs/stdio

...
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-1 fast/events/onunload.html passed
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-2 editing/selection/thai-word-at-document-end.html passed
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-2 editing/selection/toString-1.html passed
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-2 editing/selection/toString.html passed
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-1 fast/events/open-window-from-another-frame.html passed

command timed out: 600 seconds without output, killing pid 20832
process killed by signal 9
program finished with exit code -1
elapsedTime=840.133609
Comment 1 Eric Seidel (no email) 2010-04-22 11:58:21 PDT
We need a stack trace from the hung threads to diagnose.

I'm attempting to build chromium now so that I can reproduce.
Comment 2 Eric Seidel (no email) 2010-04-22 12:54:20 PDT
OK.  I'm not able to reproduce this (yet).

If you can reproduce this, please add the following code to your copy (any file that gets run is fine):

import signal
import sys  # needed: the handler below uses sys._current_frames()
import threading
import traceback

def dumpstacks(signum, frame):  # first argument renamed so it doesn't shadow the signal module
    code = []
    for thread_id, stack in sys._current_frames().items():
        code.append("\n# Thread: %d" % thread_id)
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % line.strip())
    print "\n".join(code)

signal.signal(signal.SIGQUIT, dumpstacks)

and then use:

kill -QUIT PROCESS_ID

And attach the dump of the stacks to this bug.  Once we have that information, we can solve this.
Comment 3 Eric Seidel (no email) 2010-04-22 12:55:24 PDT
It would be possible to add that sort of thread dumping code to run-webkit-tests for normal --verbose mode operation and have it run any time it gets a SIGTERM (what it should be getting from the bots).
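A rough sketch of what that could look like, reusing the dumpstacks() handler from comment 2 (illustrative only; _dump_and_exit is a made-up name, and this is not the code that was eventually landed):

# Sketch: dump every thread's stack when the bot sends SIGTERM, then exit.
import signal

def _dump_and_exit(signum, frame):
    dumpstacks(signum, frame)   # print all thread stacks (see comment 2)
    raise SystemExit(1)         # then let the harness shut down

signal.signal(signal.SIGTERM, _dump_and_exit)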
Comment 4 Eric Seidel (no email) 2010-04-22 14:07:33 PDT
So far James and I have not been able to reproduce this failure locally.  Yuzo said he could last night.

Unfortunately, we can't easily debug on the bots, because BuildBot sends SIGKILL to kill the process instead of SIGTERM, so there is no debug logging we can do without writing additional watchdog timers or wrapper scripts.
http://buildbot.net/trac/ticket/751
Comment 5 Eric Seidel (no email) 2010-04-22 14:12:43 PDT
I'm not sure how:
./webkit/tools/layout_tests/run_webkit_tests.py --debug fast/[a-b]* --verbose

works.

My shell would try to expand the "*", and even passing just "fast" doesn't find any tests.
Comment 6 Eric Seidel (no email) 2010-04-22 16:08:33 PDT
Adding some debugging code to help with this in bug 38011.
Comment 7 Kent Tamura 2010-04-22 22:59:24 PDT
Created attachment 54132 [details]
stack dump

I have a similar problem.
If I run "new-run-webkit-tests --chromium --debug --use-drt --experimental-fully-parallel", 2 DumpRenderTree processes work correctly and 14 DumpRenderTree processes freeze.  I had no problem with the source as of two days ago, but I do have the problem with today's source.

I am attaching the stack dump requested in comment #2.
Comment 8 Eric Seidel (no email) 2010-04-23 00:20:12 PDT
We have testing code running on the canary bots.  We just need one of the stacktrace sets from that testing code as soon as someone notices a bot hang.
Comment 9 Kent Tamura 2010-04-23 01:01:40 PDT
(In reply to comment #7)

For my case,
r58044 ("Unreviewed, just adding missing ":" in python file.") was fine, and
r58050 ("Unreviewed. Attempt one more time to fix NRWT for chromium.") introduced the problem.
Comment 10 Eric Seidel (no email) 2010-04-23 12:54:24 PDT
I theorized with tony that we could be passing non-UTF-8 strings to the test_shell process and causing it to hang.

http://trac.webkit.org/changeset/58050 is not related to test_shell communication at all, so I don't see why it would cause hangs in reading from test_shell.
Comment 11 Eric Seidel (no email) 2010-04-23 13:05:56 PDT
(In reply to comment #7)

How do I build and test this configuration?  I would like to have a local reproduction of this bug.
Comment 12 Eric Seidel (no email) 2010-04-23 13:08:44 PDT
(In reply to comment #11)

Also, DumpRenderTree uses a totally different Driver than test_shell, meaning your hang (since it was in WebKitDriver code) has little to do with Yuzo's hang (which didn't use that code).

I suspect your DRT hang is related to how I have been fixing the WebKit Driver code to match Mac DRT's behavior, including reading from stderr and assuming that the text output is UTF-8.  It's possible that Chromium DRT is not emitting enough #EOF markers?
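For context, here is a rough sketch of an "#EOF"-delimited read loop of the kind a driver would use (hypothetical names, not the actual webkitpy code): if the child process never prints the marker, the read blocks forever and the worker thread looks hung.

# Hypothetical sketch of reading one test's output from a DRT-style child
# process that terminates each test's output with an "#EOF" line.
def read_one_test_output(proc):
    lines = []
    while True:
        line = proc.stdout.readline()
        if not line:                 # pipe closed: the child exited or crashed
            break
        if line.rstrip() == "#EOF":  # end-of-test marker
            break
        lines.append(line)
    return "".join(lines)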
Comment 13 Eric Seidel (no email) 2010-04-23 13:14:53 PDT
James and I have *not* seen this hang on the chromium bots after adding the stack dumping code.

The bots should no longer kill the process for lack of output after 10 minutes, since it now emits thread dumps every minute.  However, if this were a true hang, they should still end up killing the process for taking too long overall.

Something is fishy here.
Comment 14 Dirk Pranke 2010-04-23 14:05:57 PDT
(In reply to comment #13)

Note that the stack dumping will only work if the main thread can call that routine. If it can't (e.g., because it's blocked in a join(), or for some other reason), you will still have a hang.
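One way around that limitation is to do the dumping from a dedicated watchdog thread instead of a signal handler, which is roughly what comment 13 describes the bots as doing after bug 38011.  A minimal sketch, not the landed code:

# Minimal watchdog sketch (illustrative only): a daemon thread that prints all
# thread stacks once a minute, so output still appears even if the main thread
# is stuck in join().
import sys
import threading
import time
import traceback

def _watchdog(interval=60):
    while True:
        time.sleep(interval)
        for thread_id, frame in sys._current_frames().items():
            print "\n# Thread: %d" % thread_id
            print "".join(traceback.format_stack(frame))

_t = threading.Thread(target=_watchdog)
_t.setDaemon(True)   # don't keep the process alive just for the watchdog
_t.start()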
Comment 15 Eric Seidel (no email) 2010-04-23 15:34:53 PDT
James, Dirk, and I have all tried and failed to reproduce this locally.  If anyone succeeds, I would love to know.
Comment 16 Eric Seidel (no email) 2010-04-24 00:10:57 PDT
@tkent:  We should break your hang off into a separate bug.

I plan to roll out some of the wdiff changes I made, in hopes that they make the failures on the bots go away -- mostly on the off chance that we're triggering some Python bug and the rollout makes it disappear.

We saw this reproduce on the bots a couple of times today, but the dumping code I wrote did not run as expected.  It is definitely working -- you can see its output earlier in the run -- but once the harness hangs, it appears to be a hard hang in which the main thread never returns from .join().

It could also be crashing somehow, with buildbot detecting the crash as a hang.  Either way, it sounds like a Python bug.

We need local reproduction of the hang to really move forward.
Comment 17 Eric Seidel (no email) 2010-04-25 11:43:06 PDT
OK.  I'm going to roll out:
http://trac.webkit.org/changeset/58062
http://trac.webkit.org/changeset/58060
http://trac.webkit.org/changeset/58059
http://trac.webkit.org/changeset/58055
http://trac.webkit.org/changeset/58054
and parts of:
http://trac.webkit.org/changeset/58050

We'll see if that fixes this mysterious hang.  I realized that although we could (and possibly should) treat the output from DumpRenderTree/test_shell as UTF-8, there are exceptions to the rule that it is always UTF-8.  So, to keep things simple, I'm restoring the previous behavior of just treating the output as binary.

Additionally, I learned while fixing bug 38027 that diff output must be treated as binary as well.  These reverts fix that too.
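In Python 2 terms, "treating the output as binary" just means keeping the raw str read from the pipe and decoding only at the point where a caller genuinely needs text.  A tiny illustration (not the actual patch; proc is a placeholder for the driver's child process):

# Illustration only: keep driver output as raw bytes (a plain Python 2 str).
raw = proc.stdout.read()
# Decode only where text is genuinely required, and never let a stray
# non-UTF-8 byte raise an exception inside a worker thread:
text = raw.decode("utf-8", "replace")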
Comment 18 Eric Seidel (no email) 2010-04-25 14:30:37 PDT
My suspicion is that the hang may be http://bugs.python.org/issue2320.  I'm not sure why my changes would have triggered it, though, unless the previous wdiff code was somehow safe and the new run_command code is not.
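For reference, the usual mitigation for that class of subprocess race on POSIX is to pass close_fds=True so a child spawned from one thread cannot inherit pipe descriptors created by another; the later patch on this bug takes the same approach (see the close_fds line quoted in comment 29).  A hedged sketch (cmd is a placeholder command list):

# Sketch of the usual POSIX mitigation.  close_fds cannot be combined with
# redirected standard handles on Windows in Python 2.5, hence the platform check.
import subprocess
import sys

close_fds = sys.platform not in ('win32', 'cygwin')
proc = subprocess.Popen(cmd,
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE,
                        close_fds=close_fds)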
Comment 19 Eric Seidel (no email) 2010-04-25 18:29:46 PDT
Created attachment 54251 [details]
Patch
Comment 20 Eric Seidel (no email) 2010-04-25 18:31:42 PDT
(In reply to comment #19)
> Created an attachment (id=54251) [details]
> Patch

I've run test-webkitpy --all, as well as new-run-webkit-tests and run_webkit_tests.py (with a chromium checkout).

I have no idea whether this will fix the hangs, but it definitely improves our handling of diffs, and treating DRT output as binary is probably best for now.
Comment 21 Kent Tamura 2010-04-25 18:58:32 PDT
(In reply to comment #11)
> How do I build and test this configuration?  I would like to have a local
> reproduction of this bug.

% update-webkit --chromium
% build-dumprendertree --debug --chromium
% new-run-webkit-tests --debug --chromium --use-drt


> Created an attachment (id=54251) [details]
> Patch

I tried this patch, but I still have the problem.
Comment 22 Eric Seidel (no email) 2010-04-25 19:09:01 PDT
OK.  Testing tkent's setup now (which is a separate bug from the hang seen on the Chromium builders).
Comment 23 Adam Barth 2010-04-26 16:16:55 PDT
Comment on attachment 54251 [details]
Patch

Ok.  The comments below are minor.

WebKitTools/Scripts/webkitpy/layout_tests/driver_test.py:62
 +      driver.stop()
Do we need a finally to make sure we run this line of code?

WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:569
 +                  wdiff = wdiff.replace('##WDIFF_DEL##', '<span class=del>')
Yuck

WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:735
 +          browser_wrapper = os.environ.get("BROWSER_WRAPPER", None)
I think this "None" is superfluous.

WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:740
 +              _log.error("BROWSER_WRAPPER will be removed any time after June 1st 2010 and your scripts will break.")
Yay.  Environment variables are evil.
Comment 24 WebKit Commit Bot 2010-04-26 19:33:12 PDT
Comment on attachment 54251 [details]
Patch

Clearing flags on attachment: 54251

Committed r58279: <http://trac.webkit.org/changeset/58279>
Comment 25 WebKit Commit Bot 2010-04-26 19:33:20 PDT
All reviewed patches have been landed.  Closing bug.
Comment 26 Eric Seidel (no email) 2010-04-27 00:55:54 PDT
I believe this is likely still happening.  Preparing a new fix.
Comment 27 Eric Seidel (no email) 2010-04-27 01:26:08 PDT
Created attachment 54394 [details]
Patch
Comment 28 Eric Seidel (no email) 2010-04-27 01:28:32 PDT
Created attachment 54395 [details]
Add missing import
Comment 29 Adam Barth 2010-04-27 01:48:37 PDT
Comment on attachment 54395 [details]
Add missing import

WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:128
 +      # FIXME: Callers should use scm.py instead.
This comment seems unrelated, but ok.

WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:139
 +                                        shell=(sys.platform == 'win32'),
I thought we were against shell=True?

WebKitTools/Scripts/webkitpy/layout_tests/port/server_process.py:73
 +          close_fds = sys.platform not in ('win32', 'cygwin')
This seems to be a repeat of the code in Executive.  Is there a way for them to share?
Comment 30 Eric Seidel (no email) 2010-04-27 01:59:10 PDT
(In reply to comment #29)
> (From update of attachment 54395 [details])
> WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:128
>  +      # FIXME: Callers should use scm.py instead.
> This comment seems unrelated, but ok.

Yeah, I just noticed more of the scm duplication while in the file.

> WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:139
>  +                                        shell=(sys.platform == 'win32'),
> I thought we were against shell=True?

We are.

> WebKitTools/Scripts/webkitpy/layout_tests/port/server_process.py:73
>  +          close_fds = sys.platform not in ('win32', 'cygwin')
> This seems to be a repeat of the code in Executive.  Is there a way for them to
> share?

server_process keeps a long-running Popen().  Executive does not yet have any sort of wrapper for that.
Comment 31 WebKit Commit Bot 2010-04-27 10:50:13 PDT
Comment on attachment 54395 [details]
Add missing import

Clearing flags on attachment: 54395

Committed r58314: <http://trac.webkit.org/changeset/58314>
Comment 32 WebKit Commit Bot 2010-04-27 10:50:21 PDT
All reviewed patches have been landed.  Closing bug.
Comment 33 Eric Seidel (no email) 2010-04-28 17:41:32 PDT
This is still happening.  I'll open a new bug.
Comment 34 Eric Seidel (no email) 2010-04-28 17:54:53 PDT
Opened bug 38300 and bug 38298 to cover remaining hangs.