Summary: | [chromium] new-run-webkit-tests hangs on Chromium Bots (OS X and Linux)
---|---
Product: | WebKit
Component: | Tools / Tests
Reporter: | Yuzo Fujishima <yuzo>
Assignee: | Nobody <webkit-unassigned>
Status: | RESOLVED FIXED
Severity: | Normal
Priority: | P2
Version: | 528+ (Nightly build)
Hardware: | Mac (Intel)
OS: | OS X 10.5
CC: | cjerdonek, commit-queue, dpranke, eric, jamesr, ojan, tkent, tony
Bug Depends on: | 38011
Bug Blocks: | 38300
Description
Yuzo Fujishima
2010-04-22 05:22:44 PDT
We need a stack trace from the hung threads to diagnose. I'm attempting to build chromium now so that I can reproduce.

OK. I'm not able to reproduce this (yet). Someone who is, please add the following code to your copy (any file that gets run is fine):

```python
import signal
import sys  # needed by sys._current_frames() below; missing from the original paste
import threading
import traceback

def dumpstacks(signal, frame):
    code = []
    for threadId, stack in sys._current_frames().items():
        code.append("\n# Thread: %d" % threadId)
        for filename, lineno, name, line in traceback.extract_stack(stack):
            code.append('File: "%s", line %d, in %s' % (filename, lineno, name))
            if line:
                code.append("  %s" % (line.strip()))
    print "\n".join(code)

signal.signal(signal.SIGQUIT, dumpstacks)
```

and then use:

```
kill -QUIT PROCESS_ID
```

and attach the dump of the stacks to this bug. Once we have that information, we can solve this. It would be possible to add that sort of thread dumping code to run-webkit-tests for normal --verbose mode operation and have it run any time it gets a SIGTERM (which is what it should be getting from the bots).

So far James and I have not been able to reproduce this failure locally. Yuzo said he could last night. Unfortunately, we can't easily debug on the bots, because BuildBot sends SIGKILL to kill the process instead of SIGTERM, so there is no debug logging we can do without writing additional watch timers or wrapper scripts. http://buildbot.net/trac/ticket/751

I'm not sure how:

```
./webkit/tools/layout_tests/run_webkit_tests.py --debug fast/[a-b]* --verbose
```

works. My shell would try to eat the "*", and even passing just "fast" doesn't find any tests.

Adding some debugging code to help with this in bug 38011.

Created attachment 54132 [details]
stack dump

I have a similar problem. If I run "new-run-webkit-tests --chromium --debug --use-drt --experimental-fully-parallel", 2 DumpRenderTree processes work correctly and 14 DumpRenderTree processes freeze. I had no problem with the source as of two days ago, and have the problem with today's source.
I attach an output for Comment #2.

We have testing code running on the canary bots. We just need one of the stacktrace sets from that testing code as soon as someone notices a bot hang.

(In reply to comment #7)
> Created an attachment (id=54132) [details]
> stack dump
>
> I have a similar problem.
> If I run "new-run-webkit-tests --chromium --debug --use-drt
> --experimental-fully-parallel", 2 DumpRenderTree processes work correctly and
> 14 DumpRenderTree processes freeze. I had no problem with the source as of two
> days ago, and have the problem with today's source.
>
> I attach an output for Comment #2.

For my case, r58044 "Unreviewed, just adding missing ":" in python file." was OK, and r58050 "Unreviewed. Attempt one more time to fix NRWT for chromium." introduced this problem.

I theorized with tony that we could be passing non-utf8 strings to the test_shell process and getting it to hang.

http://trac.webkit.org/changeset/58050 is not related to test_shell communication at all, so I don't see why it would cause hangs in reading from test_shell.

(In reply to comment #7)
> I have a similar problem.
> If I run "new-run-webkit-tests --chromium --debug --use-drt
> --experimental-fully-parallel", 2 DumpRenderTree processes work correctly and
> 14 DumpRenderTree processes freeze. I had no problem with the source as of two
> days ago, and have the problem with today's source.

How do I build and test this configuration? I would like to have a local reproduction of this bug.

(In reply to comment #11)
> How do I build and test this configuration? I would like to have a local
> reproduction of this bug.

Also, DumpRenderTree uses a totally different Driver than test_shell, meaning your hang (since it was in WebKitDriver code) has little to do with Yuzo's hang (which didn't use that code). I suspect your DRT hang is related to how I have been fixing the WebKit Driver code to have proper expectations about Mac DRT, including reading from stderr as well as assuming that the text output is utf-8. It's possible that Chromium DRT is not spitting out enough #EOFs?

James and I have *not* seen this hang on the chromium bots after adding the stack dumping code.

The bots are expected to no longer kill the process for lack of output after 10m, since it now outputs thread dumps every minute. However, if it were a true hang it should still end up killing the process for taking too long.

Something is fishy here.

(In reply to comment #13)
> Something is fishy here.

Note that the stack dumping will only work if the main thread can call that routine. If it can't (e.g., because it's blocked in a join(), or for some other reason), you will still have a hang.

James, Dirk and I have all tried and failed to repro this locally. Anyone who is successful, I would love to know.

@tkent: We should break your hang off into a separate bug.

I plan to roll out some of the wdiff changes I made in hopes they make the failures on the bots go away. Mostly on the off chance that we're triggering some python bug and that makes it go away.

We saw this reproduce on the bots a couple of times today, but the dumping code I wrote did not run as expected. It is definitely working, as you see the output earlier in the run, but once it hangs it appears to be some sort of hard hang whereby the main thread is never returning from .join(). It could also be crashing somehow, with buildbot detecting the crash as a hang. Either way, it sounds like a python bug. We need local reproduction of the hang to really move forward.

OK. I'm going to roll out:
http://trac.webkit.org/changeset/58062
http://trac.webkit.org/changeset/58060
http://trac.webkit.org/changeset/58059
http://trac.webkit.org/changeset/58055
http://trac.webkit.org/changeset/58054
and parts of:
http://trac.webkit.org/changeset/58050
We'll see if that fixes this mysterious hang.

I realized that although we could (and possibly should) treat the output from DumpRenderTree/test_shell as utf-8, there are some exceptions to the rule that it is always utf-8. So to make things simple, I'm restoring the previous behavior of just treating the output as binary. Additionally, I learned in fixing bug 38027 that diff output must be treated as binary as well. So these reverts fix that too.

My suspicion is that the hang may be http://bugs.python.org/issue2320. I'm not sure why the changes would have triggered it, though. Unless somehow the previous wdiff code was safe and the new run_command code is not.

Created attachment 54251 [details]
Patch
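The comments above note both that the bots now emit thread dumps every minute and that a signal-based dumper cannot fire while the main thread is blocked in a join(). A dedicated daemon thread sidesteps that limitation, since it keeps running no matter what the main thread is doing. This is an illustrative sketch in modern Python 3 (the scripts in this bug are Python 2), not the code that ran on the bots; `format_all_stacks` and `start_stack_dumper` are hypothetical names.

```python
import sys
import threading
import time
import traceback

def format_all_stacks():
    # Snapshot every thread's current frame -- the same data the
    # SIGQUIT handler earlier in this bug prints.
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append("# Thread: %d\n" % thread_id)
        chunks.append("".join(traceback.format_stack(frame)))
    return "".join(chunks)

def start_stack_dumper(interval_seconds=60, emit=print):
    # Daemon thread: keeps dumping while the main thread is stuck in
    # join(), and does not keep the process alive by itself.
    def loop():
        while True:
            time.sleep(interval_seconds)
            emit(format_all_stacks())
    t = threading.Thread(target=loop, name="stack-dumper", daemon=True)
    t.start()
    return t
```

Calling `start_stack_dumper(60)` early in the run produces a dump every minute, which keeps BuildBot's no-output watchdog quiet while still leaving the overall-runtime kill in place for true hangs.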
(In reply to comment #19)
> Created an attachment (id=54251) [details]
> Patch

I've run test-webkitpy --all, as well as new-run-webkit-tests and run_webkit_tests.py (with a chromium checkout). I have no idea if this will fix the hangs, but it definitely makes our handling of diffs better, and treating DRT output as binary is probably best for now.

(In reply to comment #11)
> How do I build and test this configuration? I would like to have a local
> reproduction of this bug.

% update-webkit --chromium
% build-dumprendertree --debug --chromium
% new-run-webkit-tests --debug --chromium --use-drt

> Created an attachment (id=54251) [details]
> Patch

I tried this patch, but I still have the problem.

OK. Testing tkent's setup now (which is a separate bug from the hang seen on the cr builders).

Comment on attachment 54251 [details]
Patch
Ok. The comments below are minor.
WebKitTools/Scripts/webkitpy/layout_tests/driver_test.py:62
+ driver.stop()
Do we need a finally to make sure we run this line of code?
WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:569
+ wdiff = wdiff.replace('##WDIFF_DEL##', '<span class=del>')
Yuck
WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:735
+ browser_wrapper = os.environ.get("BROWSER_WRAPPER", None)
I think this "None" is superfluous.
WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:740
+ _log.error("BROWSER_WRAPPER will be removed any time after June 1st 2010 and your scripts will break.")
Yay. Environment variables are evil.
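The reviewer's question about driver.stop() above is the classic try/finally cleanup pattern. A minimal sketch of that suggestion, using a stand-in FakeDriver rather than the real webkitpy driver class (both names here are illustrative, not the committed code):

```python
def run_with_driver(driver, test):
    # Guarantee driver.stop() runs even if the test body raises.
    try:
        return test(driver)
    finally:
        driver.stop()

class FakeDriver:
    # Stand-in for the real layout-test driver; records stop() calls.
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True
```

With this shape, a crash mid-test still tears down the DumpRenderTree process instead of leaking it, which matters when fourteen of them are running in parallel.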
Comment on attachment 54251 [details]
Patch

Clearing flags on attachment: 54251

Committed r58279: <http://trac.webkit.org/changeset/58279>

All reviewed patches have been landed. Closing bug.

I believe this is likely still happening. Preparing a new fix.

Created attachment 54394 [details]
Patch
Created attachment 54395 [details]
Add missing import
Comment on attachment 54395 [details]
Add missing import
WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:128
+ # FIXME: Callers should use scm.py instead.
This comment seems unrelated, but ok.
WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:139
+ shell=(sys.platform == 'win32'),
I thought we were against shell=True?
WebKitTools/Scripts/webkitpy/layout_tests/port/server_process.py:73
+ close_fds = sys.platform not in ('win32', 'cygwin')
This seems to be a repeat of the code in Executive. Is there a way for them to share?
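The last two review comments suggest two small patterns, sketched below with illustrative names (this is not the committed WebKit code): replacing shell=True with an argv list so no shell ever parses the command, and hoisting the close_fds platform test into one shared helper that both Executive and server_process could call.

```python
import subprocess
import sys

def should_close_fds(platform=None):
    # One shared home for the platform check, instead of repeating it
    # in Executive and server_process. (Windows-era Pythons could not
    # combine close_fds with redirected stdio, hence the exclusions.)
    platform = platform or sys.platform
    return platform not in ('win32', 'cygwin')

def run(argv):
    # argv list + shell=False (the default): no shell parsing, so no
    # quoting or injection pitfalls on any platform.
    return subprocess.check_output(argv, close_fds=should_close_fds())
```

For example, `run([sys.executable, '-c', "print('ok')"])` executes the interpreter directly without ever invoking a shell.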
(In reply to comment #29)
> (From update of attachment 54395 [details])
> WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:128
> + # FIXME: Callers should use scm.py instead.
> This comment seems unrelated, but ok.

Yeah, I just noticed more of the scm duplication while in the file.

> WebKitTools/Scripts/webkitpy/layout_tests/layout_package/json_results_generator.py:139
> + shell=(sys.platform == 'win32'),
> I thought we were against shell=True?

We are.

> WebKitTools/Scripts/webkitpy/layout_tests/port/server_process.py:73
> + close_fds = sys.platform not in ('win32', 'cygwin')
> This seems to be a repeat of the code in Executive. Is there a way for them to share?

server_process keeps a long-running Popen(). Executive does not yet have any sort of wrapper for that.

Comment on attachment 54395 [details]
Add missing import

Clearing flags on attachment: 54395

Committed r58314: <http://trac.webkit.org/changeset/58314>

All reviewed patches have been landed. Closing bug.

This is still happening. I'll open a new bug.