Bug 92299

Summary: [Chromium-Android] NRWT better handling of DRT deadlocks and crashes
Product: WebKit Reporter: Xianzhu Wang <wangxianzhu>
Component: Tools / TestsAssignee: Xianzhu Wang <wangxianzhu>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, dpranke, ojan, peter, webkit.review.bot, zhenghao
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description Flags
Patch
none
Patch
none
Patch none

Description Xianzhu Wang 2012-07-25 14:14:17 PDT
The current chromium-android NRWT driver has some problems handling deadlocks and crashes:

1. deadlock detector thread kills the driver: The _deadlock_detector() method running in the deadlock detector thread holds a reference of the driver. When the driver exists before deadlock detector thread exits, the driver won't be destroyed because of the reference held by the deadlock detector thread. When the thread exits, stop() will be called from __del__ of the driver which runs 'adb force-stop org.chromium.native_tests'. This causes the current running drt app to be unexpectedly killed.

2. The retry-after-killed logic is not good: the related code is scattered, and the retry counter is global instead of counting the current retries.
Comment 1 Xianzhu Wang 2012-07-25 14:32:07 PDT
Created attachment 154448 [details]
Patch
Comment 2 Xianzhu Wang 2012-07-26 13:23:33 PDT
Dirk, could you review the patch? Thanks!
Comment 3 Dirk Pranke 2012-07-26 13:34:45 PDT
Comment on attachment 154448 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=154448&action=review

> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:402
> +        _log.debug('Run adb result: ' + result[:80])

nit: any particular reason to limit this to [:80], instead of the whole output?

> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:516
>              time.sleep(1)

typically in a slow-spin look like this I'll key this off of time instead of a count, e.g.
deadline = time.time() + DRT_START_STO_SECS:
while time.time() < deadline:
   if ...:
     break
   time.sleep(0.1)

the advantage here is that your sleep granularity isn't quite so coarse; also, I think the logic can be a bit clearer. 

You have these slow-spins in several places in the code; it would probably be a good idea to wrap them into a loop_with_timeout() function or something to reuse the code to ensure consistency.

> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:548
> +                time.sleep(1)

same comment re: using deadlines instead of a range and a fixed sleep interval ...

> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:592
> +        raise AssertionError('DumpRenderTree is killed by Android for too many times!')

why do you need to retry the test? is there anything else running on the android device that might be causing OOMs and/or causing the DRT to get killed? It seems like you should just treat this as a crash and return and let the rest of the infrastructure in NRWT deal with handling potential flakiness?

> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:234
> +                    _log.debug('unexpected EOF of stdout')

why did you change these?
Comment 4 Xianzhu Wang 2012-07-26 14:14:08 PDT
Comment on attachment 154448 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=154448&action=review

>> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:402
>> +        _log.debug('Run adb result: ' + result[:80])
> 
> nit: any particular reason to limit this to [:80], instead of the whole output?

Some outputs of commands like 'adb logcat -d', 'cat /data/tombstones/tombstones01' are very long, and normally these outputs will be logged after the run_adb_command() call. So I limit the length of output here.

Will add comment here.

>> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:516
>>              time.sleep(1)
> 
> typically in a slow-spin look like this I'll key this off of time instead of a count, e.g.
> deadline = time.time() + DRT_START_STO_SECS:
> while time.time() < deadline:
>    if ...:
>      break
>    time.sleep(0.1)
> 
> the advantage here is that your sleep granularity isn't quite so coarse; also, I think the logic can be a bit clearer. 
> 
> You have these slow-spins in several places in the code; it would probably be a good idea to wrap them into a loop_with_timeout() function or something to reuse the code to ensure consistency.

Good suggestion. Will make the change.

>> Tools/Scripts/webkitpy/layout_tests/port/chromium_android.py:592
>> +        raise AssertionError('DumpRenderTree is killed by Android for too many times!')
> 
> why do you need to retry the test? is there anything else running on the android device that might be causing OOMs and/or causing the DRT to get killed? It seems like you should just treat this as a crash and return and let the rest of the infrastructure in NRWT deal with handling potential flakiness?

The Android system kills processes frequently when needed so it doesn't always mean a failure of the test. Hao added the loop when he observed some unexpected failures. However, now we run DRT as an apk so the situation might be different. Perhaps we can try to remove the loop and see how it works.

>> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:234
>> +                    _log.debug('unexpected EOF of stdout')
> 
> why did you change these?

With _log.warning() the message will be printed on each crashing test on Android, which looks not good. I think a crashing test is a normal case for the driver script, so changed it to debug.
Comment 5 Xianzhu Wang 2012-07-26 14:15:05 PDT
Hao, could you provide more information about retry on killed by the system? How often did it occur?
Comment 6 Dirk Pranke 2012-07-26 14:17:41 PDT
(In reply to comment #4)
> >> Tools/Scripts/webkitpy/layout_tests/port/server_process.py:234
> >> +                    _log.debug('unexpected EOF of stdout')
> > 
> > why did you change these?
> 
> With _log.warning() the message will be printed on each crashing test on Android, which looks not good. I think a crashing test is a normal case for the driver script, so changed it to debug.

I was thinking this might have to do with crashes; if this happens after a crash, I frankly don't care at all. Can you leave this as _log.warning() but only log it if the subprocess is still alive?
Comment 7 Xianzhu Wang 2012-07-26 14:37:11 PDT
(In reply to comment #6)
> I was thinking this might have to do with crashes; if this happens after a crash, I frankly don't care at all. Can you leave this as _log.warning() but only log it if the subprocess is still alive?

It is almost always the case on Android. The pipe almost always terminates before the 'adb shell' process when the test crashes. On Android with the _log.warning() (even with 'and not self.poll()') condition, every crash will output the message.

However, this is not a big issue, and I'd be glad to keep the warning if it's useful for other ports.
Comment 8 Dirk Pranke 2012-07-26 14:43:37 PDT
(In reply to comment #7)
> (In reply to comment #6)
> > I was thinking this might have to do with crashes; if this happens after a crash, I frankly don't care at all. Can you leave this as _log.warning() but only log it if the subprocess is still alive?
> 
> It is almost always the case on Android. The pipe almost always terminates before the 'adb shell' process when the test crashes. On Android with the _log.warning() (even with 'and not self.poll()') condition, every crash will output the message.
> 
> However, this is not a big issue, and I'd be glad to keep the warning if it's useful for other ports.

Sorry, maybe I wasn't clear. If we hit this condition, and the test didn't crash (on any port), this indicates a bug. If the test did crash, then I don't care about this condition at all (because I wouldn't expect there to be a #EOF). So can you leave this as a warning, but only print it if the test didn't crash?
Comment 9 Xianzhu Wang 2012-07-26 14:55:52 PDT
(In reply to comment #8)
> Sorry, maybe I wasn't clear.
> If we hit this condition, and the test didn't crash (on any port), this indicates a bug.

This is not true on Android, because the condition itself is our method to detect crash. The 'adb shell' subprocess will be almost always live when a test crashes. When we detect EOF, we know the crash and will kill the 'adb shell' subprocess. 

> If the test did crash, then I don't care about this condition at all (because I wouldn't expect there to be a #EOF). So can you leave this as a warning, but only print it if the test didn't crash?
Comment 10 Dirk Pranke 2012-07-26 15:07:28 PDT
(In reply to comment #9)
> (In reply to comment #8)
> > Sorry, maybe I wasn't clear.
> > If we hit this condition, and the test didn't crash (on any port), this indicates a bug.
> 
> This is not true on Android, because the condition itself is our method to detect crash. The 'adb shell' subprocess will be almost always live when a test crashes. When we detect EOF, we know the crash and will kill the 'adb shell' subprocess. 


Oh, I see. Well, the warning indicates a bug on the other platforms, so either we need to provide a way to override this for android, or leave it alone :).
Comment 11 Xianzhu Wang 2012-07-26 15:50:34 PDT
Created attachment 154775 [details]
Patch
Comment 12 Dirk Pranke 2012-07-26 15:58:51 PDT
Comment on attachment 154775 [details]
Patch

much cleaner. thanks!
Comment 13 WebKit Review Bot 2012-07-26 20:33:36 PDT
Comment on attachment 154775 [details]
Patch

Rejecting attachment 154775 [details] from commit-queue.

Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 2

Last 500 characters of output:
Kit/chromium/third_party/yasm/source/patched-yasm --revision 134927 --non-interactive --force --accept theirs-conflict --ignore-externals' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium'
48>At revision 134927.

________ running '/usr/bin/python tools/clang/scripts/update.py --mac-only' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium'

________ running '/usr/bin/python gyp_webkit' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium'
Updating webkit projects from gyp files...

Full output: http://queues.webkit.org/results/13378119
Comment 14 Xianzhu Wang 2012-07-26 20:58:09 PDT
Comment on attachment 154775 [details]
Patch

OOPS! Won't CQ update the OOPS line? Try again.
Comment 15 WebKit Review Bot 2012-07-26 22:10:49 PDT
Comment on attachment 154775 [details]
Patch

Rejecting attachment 154775 [details] from commit-queue.

Failed to run "['/mnt/git/webkit-commit-queue/Tools/Scripts/webkit-patch', '--status-host=queues.webkit.org', '-..." exit_code: 2

Last 500 characters of output:
Kit/chromium/third_party/yasm/source/patched-yasm --revision 134927 --non-interactive --force --accept theirs-conflict --ignore-externals' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium'
48>At revision 134927.

________ running '/usr/bin/python tools/clang/scripts/update.py --mac-only' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium'

________ running '/usr/bin/python gyp_webkit' in '/mnt/git/webkit-commit-queue/Source/WebKit/chromium'
Updating webkit projects from gyp files...

Full output: http://queues.webkit.org/results/13373177
Comment 16 Adam Barth 2012-07-26 22:36:47 PDT
Comment on attachment 154775 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=154775&action=review

> Tools/ChangeLog:14
> +        Additional information of the change such as approach, rationale. Please add per-function descriptions below (OOPS!).

You need to remove this line yourself.
Comment 17 Xianzhu Wang 2012-07-27 09:04:51 PDT
Created attachment 154967 [details]
Patch
Comment 18 Xianzhu Wang 2012-07-27 09:05:40 PDT
(In reply to comment #16)
> (From update of attachment 154775 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=154775&action=review
> 
> > Tools/ChangeLog:14
> > +        Additional information of the change such as approach, rationale. Please add per-function descriptions below (OOPS!).
> 
> You need to remove this line yourself.

Oh, I missed that OOPS. I should have removed it when editing ChangeLog.
Comment 19 WebKit Review Bot 2012-07-27 11:42:53 PDT
Comment on attachment 154967 [details]
Patch

Clearing flags on attachment: 154967

Committed r123900: <http://trac.webkit.org/changeset/123900>
Comment 20 WebKit Review Bot 2012-07-27 11:42:58 PDT
All reviewed patches have been landed.  Closing bug.