Bug 37987

Summary: [chromium] new-run-webkit-tests hangs on Chromium Bots (OS X and Linux)
Product: WebKit
Reporter: Yuzo Fujishima <yuzo>
Component: Tools / Tests
Assignee: Nobody <webkit-unassigned>
Status: RESOLVED FIXED
Severity: Normal
CC: cjerdonek, commit-queue, dpranke, eric, jamesr, ojan, tkent, tony
Priority: P2
Version: 528+ (Nightly build)
Hardware: Mac (Intel)
OS: OS X 10.5
Bug Depends on: 38011
Bug Blocks: 38300
Attachments:
  stack dump (flags: none)
  Patch (flags: none)
  Patch (flags: none)
  Add missing import (flags: none)

Description Yuzo Fujishima 2010-04-22 05:22:44 PDT
Revision: 58075 (and later)

When invoked from <CHROMIUM>/src/webkit/tools/layout_tests/run_webkit_tests.py,
WebKitTools/Scripts/new-run-webkit-tests hangs until it times out after 600 seconds.

I observed this both in my local (Chromium) environment:
  ./webkit/tools/layout_tests/run_webkit_tests.py  --debug fast/[a-b]* --verbose
and in Chromium's canary bots:
  http://build.chromium.org/buildbot/waterfall.fyi/waterfall?builder=Webkit+Mac+(webkit.org)
  (Sometimes the tests go through, sometimes hang.)

As far as I tested, the hang is not caused by any specific test -- the test being run by the
hanging thread was different on each of the few dozen attempts I made.

-- Local Environment --

Traceback (collected by signaling):

  File "/Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/new-run-webkit-tests", line 55, in <module>
    sys.exit(run_webkit_tests.main(options, args))
  File "/Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 1628, in main
    num_unexpected_results = test_runner.run(result_summary, print_results)
  File "/Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 678, in run
    self._run_tests(self._test_files_list, result_summary))
  File "/Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 622, in _run_tests
    thread.join(0.1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 569, in join
    self.__block.wait(delay)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 233, in wait
    _sleep(delay)

Processes:

 7163 ttys000    0:00.03 /System/Library/Frameworks/Python.framework/Versions/2.5/Resources/Python.app/Contents/MacOS/Python ./webkit/tools/layout_tests/run_webkit_tests.py --debug fast/[a-b]* --verbose
 7165 ttys000    0:07.77 /System/Library/Frameworks/Python.framework/Versions/2.5/Resources/Python.app/Contents/MacOS/Python /Users/yuzo/src/chromium/src/third_party/WebKit/WebKitTools/Scripts/new-run-webkit-tests --chromium --debug fast/[a-b]* --verbose
 7170 ttys000    0:00.02 /Users/yuzo/src/chromium/src/xcodebuild/Debug/layout_test_helper
 7183 ttys000    0:01.58 /Users/yuzo/src/chromium/src/xcodebuild/Debug/TestShell.app/Contents/MacOS/TestShell --layout-tests --pixel-tests=/Users/yuzo/src/chromium/src/webkit/Debug/layout-test-results/png_result11.png

-- Chromium Canary Bot --

http://build.chromium.org/buildbot/waterfall.fyi/builders/Webkit%20Mac%20(webkit.org)/builds/18643/steps/webkit_tests/logs/stdio

...
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-1 fast/events/onunload.html passed
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-2 editing/selection/thai-word-at-document-end.html passed
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-2 editing/selection/toString-1.html passed
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-2 editing/selection/toString.html passed
100422 03:48:12 dump_render_tree_thread.py:348  DEBUG Thread-1 fast/events/open-window-from-another-frame.html passed

command timed out: 600 seconds without output, killing pid 20832
process killed by signal 9
program finished with exit code -1
elapsedTime=840.133609
Comment 1 Eric Seidel (no email) 2010-04-22 11:58:21 PDT
We need a stack trace from the hung threads to diagnose.

I'm attempting to build chromium now so that I can reproduce.
Comment 2 Eric Seidel (no email) 2010-04-22 12:54:20 PDT
OK.  I'm not able to reproduce this (yet).

If you can reproduce this, please add the following code to your copy (any file that gets run is fine):

import signal
import sys  # needed: the handler below uses sys._current_frames()
import threading
import traceback

def dumpstacks(signum, frame):  # first argument renamed so it doesn't shadow the signal module
    code = []
    for thread_id, stack in sys._current_frames().items():
        code.append("\n# Thread: %d" % thread_id)
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % line.strip())
    print "\n".join(code)

signal.signal(signal.SIGQUIT, dumpstacks)

and then use:

kill -QUIT PROCESS_ID

And attach the dump of the stacks to this bug.  Once we have that information, we can solve this.
Comment 3 Eric Seidel (no email) 2010-04-22 12:55:24 PDT
It would be possible to add that sort of thread dumping code to run-webkit-tests for normal --verbose mode operation and have it run any time it gets a SIGTERM (what it should be getting from the bots).
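A rough sketch of what that could look like, reusing the dumpstacks() handler from comment 2 (illustrative only; _dump_and_exit is a made-up name, and this is not the code that was eventually landed):

# Sketch: dump every thread's stack when the bot sends SIGTERM, then exit.
import signal

def _dump_and_exit(signum, frame):
    dumpstacks(signum, frame)   # print all thread stacks (see comment 2)
    raise SystemExit(1)         # then let the harness shut down

signal.signal(signal.SIGTERM, _dump_and_exit)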
Comment 4 Eric Seidel (no email) 2010-04-22 14:07:33 PDT
So far James and I have not been able to reproduce this failure locally.  Yuzo said he could last night.

Unfortunately, we can't easily debug on the bots, because BuildBot sends SIGKILL to kill the process instead of SIGTERM, so there is no debug logging we can do without writing additional watchdog timers or wrapper scripts.
http://buildbot.net/trac/ticket/751
Comment 5 Eric Seidel (no email) 2010-04-22 14:12:43 PDT
I'm not sure how:
./webkit/tools/layout_tests/run_webkit_tests.py --debug fast/[a-b]* --verbose

works.

My shell would try to expand the "*", and even passing just "fast" doesn't find any tests.
Comment 6 Eric Seidel (no email) 2010-04-22 16:08:33 PDT
Adding some debugging code to help with this in bug 38011.
Comment 7 Kent Tamura 2010-04-22 22:59:24 PDT
Created attachment 54132 [details]
stack dump

I have a similar problem.
If I run "new-run-webkit-tests --chromium --debug --use-drt --experimental-fully-parallel", 2 DumpRenderTree processes work correctly and 14 DumpRenderTree processes freeze.  I had no problem with the source as of two days ago, but I do have the problem with today's source.

I am attaching the stack dump requested in comment #2.
Comment 8 Eric Seidel (no email) 2010-04-23 00:20:12 PDT
We have testing code running on the canary bots.  We just need one of the stacktrace sets from that testing code as soon as someone notices a bot hang.
Comment 9 Kent Tamura 2010-04-23 01:01:40 PDT
(In reply to comment #7)

For my case,
r58044 ("Unreviewed, just adding missing ":" in python file.") was fine, and
r58050 ("Unreviewed. Attempt one more time to fix NRWT for chromium.") introduced the problem.
Comment 10 Eric Seidel (no email) 2010-04-23 12:54:24 PDT
I theorized with tony that we could be passing non-UTF-8 strings to the test_shell process and causing it to hang.

http://trac.webkit.org/changeset/58050 is not related to test_shell communication at all, so I don't see why it would cause hangs in reading from test_shell.
Comment 11 Eric Seidel (no email) 2010-04-23 13:05:56 PDT
(In reply to comment #7)

How do I build and test this configuration?  I would like to have a local reproduction of this bug.
Comment 12 Eric Seidel (no email) 2010-04-23 13:08:44 PDT
(In reply to comment #11)

Also, DumpRenderTree uses a totally different Driver than test_shell, meaning your hang (since it was in WebKitDriver code) has little to do with Yuzo's hang (which didn't use that code).

I suspect your DRT hang is related to how I have been fixing the WebKit Driver code to match Mac DRT's behavior, including reading from stderr and assuming that the text output is UTF-8.  It's possible that Chromium DRT is not emitting enough #EOF markers?
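For context, here is a rough sketch of an "#EOF"-delimited read loop of the kind a driver would use (hypothetical names, not the actual webkitpy code): if the child process never prints the marker, the read blocks forever and the worker thread looks hung.

# Hypothetical sketch of reading one test's output from a DRT-style child
# process that terminates each test's output with an "#EOF" line.
def read_one_test_output(proc):
    lines = []
    while True:
        line = proc.stdout.readline()
        if not line:                 # pipe closed: the child exited or crashed
            break
        if line.rstrip() == "#EOF":  # end-of-test marker
            break
        lines.append(line)
    return "".join(lines)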
Comment 13 Eric Seidel (no email) 2010-04-23 13:14:53 PDT
James and I have *not* seen this hang on the chromium bots after adding the stack dumping code.

The bots should no longer kill the process for lack of output after 10 minutes, since it now emits thread dumps every minute.  However, if this were a true hang, they should still end up killing the process for taking too long overall.

Something is fishy here.
Comment 14 Dirk Pranke 2010-04-23 14:05:57 PDT
(In reply to comment #13)

Note that the stack dumping will only work if the main thread can call that routine. If it can't (e.g., because it's blocked in a join(), or for some other reason), you will still have a hang.
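One way around that limitation is to do the dumping from a dedicated watchdog thread instead of a signal handler, which is roughly what comment 13 describes the bots as doing after bug 38011.  A minimal sketch, not the landed code:

# Minimal watchdog sketch (illustrative only): a daemon thread that prints all
# thread stacks once a minute, so output still appears even if the main thread
# is stuck in join().
import sys
import threading
import time
import traceback

def _watchdog(interval=60):
    while True:
        time.sleep(interval)
        for thread_id, frame in sys._current_frames().items():
            print "\n# Thread: %d" % thread_id
            print "".join(traceback.format_stack(frame))

_t = threading.Thread(target=_watchdog)
_t.setDaemon(True)   # don't keep the process alive just for the watchdog
_t.start()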
Comment 15 Eric Seidel (no email) 2010-04-23 15:34:53 PDT
James, Dirk, and I have all tried and failed to reproduce this locally.  If anyone succeeds, I would love to know.
Comment 16 Eric Seidel (no email) 2010-04-24 00:10:57 PDT
@tkent:  We should break your hang off into a separate bug.

I plan to roll out some of the wdiff changes I made, in hopes that they make the failures on the bots go away -- mostly on the off chance that we're triggering some Python bug and the rollout makes it disappear.

We saw this reproduce on the bots a couple of times today, but the dumping code I wrote did not run as expected.  It is definitely working -- you can see its output earlier in the run -- but once the harness hangs, it appears to be a hard hang in which the main thread never returns from .join().

It could also be crashing somehow, with buildbot detecting the crash as a hang.  Either way, it sounds like a Python bug.

We need local reproduction of the hang to really move forward.
Comment 17 Eric Seidel (no email) 2010-04-25 11:43:06 PDT
OK.  I'm going to roll out:
http://trac.webkit.org/changeset/58062
http://trac.webkit.org/changeset/58060
http://trac.webkit.org/changeset/58059
http://trac.webkit.org/changeset/58055
http://trac.webkit.org/changeset/58054
and parts of:
http://trac.webkit.org/changeset/58050

We'll see if that fixes this mysterious hang.  I realized that although we could (and possibly should) treat the output from DumpRenderTree/test_shell as UTF-8, there are exceptions to the rule that it is always UTF-8.  So, to keep things simple, I'm restoring the previous behavior of just treating the output as binary.

Additionally, I learned while fixing bug 38027 that diff output must be treated as binary as well.  These reverts fix that too.
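In Python 2 terms, "treating the output as binary" just means keeping the raw str read from the pipe and decoding only at the point where a caller genuinely needs text.  A tiny illustration (not the actual patch; proc is a placeholder for the driver's child process):

# Illustration only: keep driver output as raw bytes (a plain Python 2 str).
raw = proc.stdout.read()
# Decode only where text is genuinely required, and never let a stray
# non-UTF-8 byte raise an exception inside a worker thread:
text = raw.decode("utf-8", "replace")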
Comment 18 Eric Seidel (no email) 2010-04-25 14:30:37 PDT
My suspicion is that the hang may be http://bugs.python.org/issue2320.  I'm not sure why my changes would have triggered it, though, unless the previous wdiff code was somehow safe and the new run_command code is not.
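For reference, the usual mitigation for that class of subprocess race on POSIX is to pass close_fds=True so a child spawned from one thread cannot inherit pipe descriptors created by another; the later patch on this bug takes the same approach (see the close_fds line quoted in comment 29).  A hedged sketch (cmd is a placeholder command list):

# Sketch of the usual POSIX mitigation.  close_fds cannot be combined with
# redirected standard handles on Windows in Python 2.5, hence the platform check.
import subprocess
import sys

close_fds = sys.platform not in ('win32', 'cygwin')
proc = subprocess.Popen(cmd,
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE,
                        close_fds=close_fds)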
Comment 19 Eric Seidel (no email) 2010-04-25 18:29:46 PDT
Created attachment 54251 [details]
Patch
Comment 20 Eric Seidel (no email) 2010-04-25 18:31:42 PDT
(In reply to comment #19)
> Created an attachment (id=54251) [details]
> Patch

I've run test-webkitpy --all, as well as new-run-webkit-tests and run_webkit_tests.py (with a chromium checkout).

I have no idea whether this will fix the hangs, but it definitely improves our handling of diffs, and treating DRT output as binary is probably best for now.
Comment 21 Kent Tamura 2010-04-25 18:58:32 PDT
(In reply to comment #11)
> How do I build and test this configuration?  I would like to have a local
> reproduction of this bug.

% update-webkit --chromium
% build-dumprendertree --debug --chromium
% new-run-webkit-tests --debug --chromium --use-drt


> Created an attachment (id=54251) [details]
> Patch

I tried this patch, but I still have the problem.
Comment 22 Eric Seidel (no email) 2010-04-25 19:09:01 PDT
OK.  Testing tkent's setup now (which is a separate bug from the hang seen on the Chromium builders).
Comment 23 Adam Barth 2010-04-26 16:16:55 PDT
Comment on attachment 54251 [details]
Patch

Ok.  The comments below are minor.

WebKitTools/Scripts/webkitpy/layout_tests/driver_test.py:62
 +      driver.stop()
Do we need a finally to make sure we run this line of code?

WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:569
 +                  wdiff = wdiff.replace('##WDIFF_DEL##', '<span class=del>')
Yuck

WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:735
 +          browser_wrapper = os.environ.get("BROWSER_WRAPPER", None)
I think this "None" is superfluous.

WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:740
 +              _log.error("BROWSER_WRAPPER will be removed any time after June 1st 2010 and your scripts will break.")
Yay.  Environment variables are evil.
Comment 24 WebKit Commit Bot 2010-04-26 19:33:12 PDT
Comment on attachment 54251 [details]
Patch

Clearing flags on attachment: 54251

Committed r58279: <http://trac.webkit.org/changeset/58279>
Comment 25 WebKit Commit Bot 2010-04-26 19:33:20 PDT
All reviewed patches have been landed.  Closing bug.
Comment 26 Eric Seidel (no email) 2010-04-27 00:55:54 PDT
I believe this is likely still happening.  Preparing a new fix.
Comment 27 Eric Seidel (no email) 2010-04-27 01:26:08 PDT
Created attachment 54394 [details]
Patch
Comment 28 Eric Seidel (no email) 2010-04-27 01:28:32 PDT
Created attachment 54395 [details]
Add missing import
Comment 29 Adam Barth 2010-04-27 01:48:37 PDT
Comment on attachment 54395 [details]
Add missing import

WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:128
 +      # FIXME: Callers should use scm.py instead.
This comment seems unrelated, but ok.

WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:139
 +                                        shell=(sys.platform == 'win32'),
I thought we were against shell=True?

WebKitTools/Scripts/webkitpy/layout_tests/port/server_process.py:73
 +          close_fds = sys.platform not in ('win32', 'cygwin')
This seems to be a repeat of the code in Executive.  Is there a way for them to share?
Comment 30 Eric Seidel (no email) 2010-04-27 01:59:10 PDT
(In reply to comment #29)
> (From update of attachment 54395 [details])
> WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:128
>  +      # FIXME: Callers should use scm.py instead.
> This comment seems unrelated, but ok.

Yeah, I just noticed more of the scm duplication while in the file.

> WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:139
>  +                                        shell=(sys.platform == 'win32'),
> I thought we were against shell=True?

We are.

> WebKitTools/Scripts/webkitpy/layout_tests/port/server_process.py:73
>  +          close_fds = sys.platform not in ('win32', 'cygwin')
> This seems to be a repeat of the code in Executive.  Is there a way for them to
> share?

server_process keeps a long-running Popen().  Executive does not yet have any sort of wrapper for that.
Comment 31 WebKit Commit Bot 2010-04-27 10:50:13 PDT
Comment on attachment 54395 [details]
Add missing import

Clearing flags on attachment: 54395

Committed r58314: <http://trac.webkit.org/changeset/58314>
Comment 32 WebKit Commit Bot 2010-04-27 10:50:21 PDT
All reviewed patches have been landed.  Closing bug.
Comment 33 Eric Seidel (no email) 2010-04-28 17:41:32 PDT
This is still happening.  I'll open a new bug.
Comment 34 Eric Seidel (no email) 2010-04-28 17:54:53 PDT
Opened bug 38300 and bug 38298 to cover remaining hangs.