92299 – [Chromium-Android] NRWT better handling of DRT deadlocks and crashes

Xianzhu Wang

Reported 2012-07-25 14:14:17 PDT

The current chromium-android NRWT driver has some problems handling deadlocks and crashes: 1. deadlock detector thread kills the driver: The _deadlock_detector() method running in the deadlock detector thread holds a reference of the driver. When the driver exists before deadlock detector thread exits, the driver won't be destroyed because of the reference held by the deadlock detector thread. When the thread exits, stop() will be called from __del__ of the driver which runs 'adb force-stop org.chromium.native_tests'. This causes the current running drt app to be unexpectedly killed. 2. The retry-after-killed logic is not good: the related code is scattered, and the retry counter is global instead of counting the current retries.

Xianzhu Wang

Comment 1 2012-07-25 14:32:07 PDT

Created attachment 154448 [details] Patch

Xianzhu Wang

Comment 2 2012-07-26 13:23:33 PDT

Dirk, could you review the patch? Thanks!

Dirk Pranke

Comment 3 2012-07-26 13:34:45 PDT

Comment on attachment 154448 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=154448&action=review > Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:402 > + _log.debug('Run adb result: ' + result[:80]) nit: any particular reason to limit this to [:80], instead of the whole output? > Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:516 > time.sleep(1) typically in a slow-spin look like this I'll key this off of time instead of a count, e.g. deadline = time.time() + DRT_START_STO_SECS: while time.time() < deadline: if ...: break time.sleep(0.1) the advantage here is that your sleep granularity isn't quite so coarse; also, I think the logic can be a bit clearer. You have these slow-spins in several places in the code; it would probably be a good idea to wrap them into a loop_with_timeout() function or something to reuse the code to ensure consistency. > Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:548 > + time.sleep(1) same comment re: using deadlines instead of a range and a fixed sleep interval ... > Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:592 > + raise AssertionError('DumpRenderTree is killed by Android for too many times!') why do you need to retry the test? is there anything else running on the android device that might be causing OOMs and/or causing the DRT to get killed? It seems like you should just treat this as a crash and return and let the rest of the infrastructure in NRWT deal with handling potential flakiness? > Tools/Scripts/webkitpy/layout_tests/port/server_process.py:234 > + _log.debug('unexpected EOF of stdout') why did you change these?

Xianzhu Wang

Comment 4 2012-07-26 14:14:08 PDT

Comment on attachment 154448 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=154448&action=review >> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:402 >> + _log.debug('Run adb result: ' + result[:80]) > > nit: any particular reason to limit this to [:80], instead of the whole output? Some outputs of commands like 'adb logcat -d', 'cat /data/tombstones/tombstones01' are very long, and normally these outputs will be logged after the run_adb_command() call. So I limit the length of output here. Will add comment here. >> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:516 >> time.sleep(1) > > typically in a slow-spin look like this I'll key this off of time instead of a count, e.g. > deadline = time.time() + DRT_START_STO_SECS: > while time.time() < deadline: > if ...: > break > time.sleep(0.1) > > the advantage here is that your sleep granularity isn't quite so coarse; also, I think the logic can be a bit clearer. > > You have these slow-spins in several places in the code; it would probably be a good idea to wrap them into a loop_with_timeout() function or something to reuse the code to ensure consistency. Good suggestion. Will make the change. >> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:592 >> + raise AssertionError('DumpRenderTree is killed by Android for too many times!') > > why do you need to retry the test? is there anything else running on the android device that might be causing OOMs and/or causing the DRT to get killed? It seems like you should just treat this as a crash and return and let the rest of the infrastructure in NRWT deal with handling potential flakiness? The Android system kills processes frequently when needed so it doesn't always mean a failure of the test. Hao added the loop when he observed some unexpected failures. However, now we run DRT as an apk so the situation might be different. Perhaps we can try to remove the loop and see how it works. >> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:234 >> + _log.debug('unexpected EOF of stdout') > > why did you change these? With _log.warning() the message will be printed on each crashing test on Android, which looks not good. I think a crashing test is a normal case for the driver script, so changed it to debug.

Xianzhu Wang

Comment 5 2012-07-26 14:15:05 PDT

Hao, could you provide more information about retry on killed by the system? How often did it occur?

Dirk Pranke

Comment 6 2012-07-26 14:17:41 PDT

(In reply to comment #4) > >> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:234 > >> + _log.debug('unexpected EOF of stdout') > > > > why did you change these? > > With _log.warning() the message will be printed on each crashing test on Android, which looks not good. I think a crashing test is a normal case for the driver script, so changed it to debug. I was thinking this might have to do with crashes; if this happens after a crash, I frankly don't care at all. Can you leave this as _log.warning() but only log it if the subprocess is still alive?

Xianzhu Wang

Comment 7 2012-07-26 14:37:11 PDT

(In reply to comment #6) > I was thinking this might have to do with crashes; if this happens after a crash, I frankly don't care at all. Can you leave this as _log.warning() but only log it if the subprocess is still alive? It is almost always the case on Android. The pipe almost always terminates before the 'adb shell' process when the test crashes. On Android with the _log.warning() (even with 'and not self.poll()') condition, every crash will output the message. However, this is not a big issue, and I'd be glad to keep the warning if it's useful for other ports.

Dirk Pranke

Comment 8 2012-07-26 14:43:37 PDT

(In reply to comment #7) > (In reply to comment #6) > > I was thinking this might have to do with crashes; if this happens after a crash, I frankly don't care at all. Can you leave this as _log.warning() but only log it if the subprocess is still alive? > > It is almost always the case on Android. The pipe almost always terminates before the 'adb shell' process when the test crashes. On Android with the _log.warning() (even with 'and not self.poll()') condition, every crash will output the message. > > However, this is not a big issue, and I'd be glad to keep the warning if it's useful for other ports. Sorry, maybe I wasn't clear. If we hit this condition, and the test didn't crash (on any port), this indicates a bug. If the test did crash, then I don't care about this condition at all (because I wouldn't expect there to be a #EOF). So can you leave this as a warning, but only print it if the test didn't crash?

Xianzhu Wang

Comment 9 2012-07-26 14:55:52 PDT

(In reply to comment #8) > Sorry, maybe I wasn't clear. > If we hit this condition, and the test didn't crash (on any port), this indicates a bug. This is not true on Android, because the condition itself is our method to detect crash. The 'adb shell' subprocess will be almost always live when a test crashes. When we detect EOF, we know the crash and will kill the 'adb shell' subprocess. > If the test did crash, then I don't care about this condition at all (because I wouldn't expect there to be a #EOF). So can you leave this as a warning, but only print it if the test didn't crash?

Dirk Pranke

Comment 10 2012-07-26 15:07:28 PDT

(In reply to comment #9) > (In reply to comment #8) > > Sorry, maybe I wasn't clear. > > If we hit this condition, and the test didn't crash (on any port), this indicates a bug. > > This is not true on Android, because the condition itself is our method to detect crash. The 'adb shell' subprocess will be almost always live when a test crashes. When we detect EOF, we know the crash and will kill the 'adb shell' subprocess. Oh, I see. Well, the warning indicates a bug on the other platforms, so either we need to provide a way to override this for android, or leave it alone :).

Xianzhu Wang

Comment 11 2012-07-26 15:50:34 PDT

Created attachment 154775 [details] Patch

Dirk Pranke

Comment 12 2012-07-26 15:58:51 PDT

Comment on attachment 154775 [details] Patch much cleaner. thanks!

WebKit Review Bot

Comment 13 2012-07-26 20:33:36 PDT

Comment on attachment 154775 [details] Patch Rejecting attachment 154775 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 2 Last 500 characters of output: Kit/chromium/third_party/yasm/source/patched-yasm --revision 134927 --non-interactive --force --accept theirs-conflict --ignore-externals' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium' 48>At revision 134927. ________ running '/usr/bin/python tools/clang/scripts/update.py --mac-only' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium' ________ running '/usr/bin/python gyp_webkit' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium' Updating webkit projects from gyp files... Full output: http://queues.webkit.org/results/13378119

Xianzhu Wang

Comment 14 2012-07-26 20:58:09 PDT

Comment on attachment 154775 [details] Patch OOPS! Won't CQ update the OOPS line? Try again.

WebKit Review Bot

Comment 15 2012-07-26 22:10:49 PDT

Comment on attachment 154775 [details] Patch Rejecting attachment 154775 [details] from commit-queue. Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 2 Last 500 characters of output: Kit/chromium/third_party/yasm/source/patched-yasm --revision 134927 --non-interactive --force --accept theirs-conflict --ignore-externals' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium' 48>At revision 134927. ________ running '/usr/bin/python tools/clang/scripts/update.py --mac-only' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium' ________ running '/usr/bin/python gyp_webkit' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium' Updating webkit projects from gyp files... Full output: http://queues.webkit.org/results/13373177

Adam Barth

Comment 16 2012-07-26 22:36:47 PDT

Comment on attachment 154775 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=154775&action=review > Tools/ChangeLog:14 > + Additional information of the change such as approach, rationale. Please add per-function descriptions below (OOPS!). You need to remove this line yourself.

Xianzhu Wang

Comment 17 2012-07-27 09:04:51 PDT

Created attachment 154967 [details] Patch

Xianzhu Wang

Comment 18 2012-07-27 09:05:40 PDT

(In reply to comment #16) > (From update of attachment 154775 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=154775&action=review > > > Tools/ChangeLog:14 > > + Additional information of the change such as approach, rationale. Please add per-function descriptions below (OOPS!). > > You need to remove this line yourself. Oh, I missed that OOPS. I should have removed it when editing ChangeLog.

WebKit Review Bot

Comment 19 2012-07-27 11:42:53 PDT

Comment on attachment 154967 [details] Patch Clearing flags on attachment: 154967 Committed r123900: <http://trac.webkit.org/changeset/123900>

WebKit Review Bot

Comment 20 2012-07-27 11:42:58 PDT

All reviewed patches have been landed. Closing bug.

Attachments
Patch (12.24 KB, patch) 2012-07-25 14:32 PDT, Xianzhu Wang	no flags	Details Formatted Diff Diff
Patch (11.56 KB, patch) 2012-07-26 15:50 PDT, Xianzhu Wang	no flags	Details Formatted Diff Diff
Patch (11.46 KB, patch) 2012-07-27 09:04 PDT, Xianzhu Wang	no flags	Details Formatted Diff Diff
Show Obsolete (2) View All Add attachment proposed patch, testcase, etc.