Bug 83076

Summary: [Chromium] Lots of timeouts causing Mac10.6 to exit early.
Product: WebKit Reporter: Ojan Vafai <ojan>
Component: Tools / Tests Assignee: Dirk Pranke <dpranke>
Status: RESOLVED WORKSFORME    
Severity: Normal CC: abarth, apavlov, dglazkov, dpranke, hans, hclam, rniwa, tony, webkit.review.bot, zmo
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Patch (flags: none)
add Changelog, port same logic to apple mac (flags: none)

Ojan Vafai
Reported 2012-04-03 14:49:33 PDT
http://build.chromium.org/p/chromium.webkit/builders/Webkit%20Mac10.6/builds/14439/steps/webkit_tests/logs/stdio
Exiting early after 0 crashes and 20 timeouts. 21784 tests run.
Regressions: Unexpected tests timed out : (20)
animations/cross-fade-background-image.html = TIMEOUT
compositing/geometry/empty-embed-rects.html = TIMEOUT
compositing/self-painting-layers.html = TIMEOUT
compositing/transitions/scale-transition-no-start.html = TIMEOUT
css1/basic/class_as_selector.html = TIMEOUT
css1/box_properties/acid_test.html = TIMEOUT
css1/cascade/cascade_order.html = TIMEOUT
css1/classification/display.html = TIMEOUT
css1/color_and_background/background.html = TIMEOUT
css1/conformance/forward_compatible_parsing.html = TIMEOUT
css1/font_properties/font.html = TIMEOUT
css1/pseudo/anchor.html = TIMEOUT
fast/forms/search-rtl.html = TIMEOUT
fast/images/embed-does-not-propagate-dimensions-to-object-ancestor.html = TIMEOUT
fast/loader/local-CSS-from-local.html = TIMEOUT
fast/table/invisible-cell-background.html = TIMEOUT
fast/text/international/plane2.html = TIMEOUT
fast/text/justify-ideograph-complex.html = TIMEOUT
fast/workers/storage/interrupt-database.html = TIMEOUT
http/tests/appcache/remove-cache.html = TIMEOUT
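The "Exiting early" line suggests the harness gives up once unexpected crashes and timeouts reach some limit. A minimal sketch of that kind of check follows, assuming a single combined limit of 20 (a value read off this log only). The function and parameter names are hypothetical and are not the webkitpy API.

```python
def should_exit_early(crashes, timeouts, limit=20):
    """Return True once unexpected crashes plus timeouts reach the limit.

    Hypothetical helper: the limit of 20 is an assumption inferred from
    the log in this report, not a value taken from the real harness.
    """
    return crashes + timeouts >= limit


def summarize(crashes, timeouts, tests_run):
    # Mirrors the wording of the log line quoted in this report.
    return "Exiting early after %d crashes and %d timeouts. %d tests run." % (
        crashes, timeouts, tests_run)
```

With a rule like this, 20 tests slowed down by one system-wide problem are enough to abort the whole run, which matches the run quoted above stopping after 21784 tests.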
Attachments
Patch (1.71 KB, patch)
2012-04-03 18:21 PDT, Dirk Pranke
no flags
add Changelog, port same logic to apple mac (3.41 KB, patch)
2012-04-03 18:29 PDT, Dirk Pranke
no flags
Dirk Pranke
Comment 1 2012-04-03 15:01:34 PDT
I'm on this one ...
Dirk Pranke
Comment 2 2012-04-03 18:21:47 PDT
Eric Seidel (no email)
Comment 3 2012-04-03 18:23:43 PDT
Comment on attachment 135477 [details] Patch
Bleh. Why not just up the amount of ram we expect child processes to take? That seems like a less gross hack.
Dirk Pranke
Comment 4 2012-04-03 18:29:52 PDT
Created attachment 135478 [details] add Changelog, port same logic to apple mac
Dirk Pranke
Comment 5 2012-04-03 18:32:54 PDT
(In reply to comment #3)
> (From update of attachment 135477 [details])
> Bleh. Why not just up the amount of ram we expect child processes to take? That seems like a less gross hack.
At the moment, at least on the Chromium SL bot, it doesn't look ram-related. It looks like we're thrashing on something else, but have plenty of RAM free.
Dirk Pranke
Comment 6 2012-04-03 18:35:01 PDT
I'm going to land this as-is, so that I can get the bot back online and we can get more data. Unfortunately, it's been flaky and aborting early for so long that I can't easily reproduce things or debug it (I've tried rolling back builds on that bot and have run into a sordid list of issues that are blocking me, which I need to work through in parallel).
Dirk Pranke
Comment 7 2012-04-03 18:35:24 PDT
I'll be happy to roll this out if there are other issues or if we really think this is the wrong thing to do.
Dirk Pranke
Comment 8 2012-04-03 18:40:54 PDT
Dirk Pranke
Comment 9 2012-04-03 18:41:27 PDT
re-opening, I don't consider this fixed yet.
Dirk Pranke
Comment 10 2012-05-03 18:21:12 PDT
Note that we're seeing this quite a bit lately, even after the patch (see, e.g., http://build.chromium.org/p/chromium.webkit/waterfall?builder=Webkit%20Mac10.6&last_time=1336069767 ) ... it's possible that r115490 has made things worse, but I don't know what else might be contributing.
Dirk Pranke
Comment 11 2012-05-04 11:58:19 PDT
It seems like we're frequently seeing many of the same tests timing out this week, so I'm going to start marking them as flaky timeouts here and we'll see if this contains the problem, or if we're seeing systemic flakiness. Here's the first batch:
compositing/geometry/outline-change.html
css3/selectors3/xml/css3-modsel-161.xml
css3/selectors3/xml/css3-modsel-166.xml
css3/selectors3/xml/css3-modsel-166a.xml
editing/deleting/delete-3857753-fix.html
editing/deleting/delete-3865854-fix.html
editing/deleting/delete-3928305-fix.html
editing/execCommand/4747450.html
editing/execCommand/4786404-1.html
editing/execCommand/4786404-2.html
editing/execCommand/4916235.html
editing/input/caret-at-the-edge-of-input.html
editing/execCommand/format-block-with-trailing-br.html
editing/execCommand/format-block-without-body-crash.html
editing/execCommand/format-block.html
editing/execCommand/forward-delete-no-scroll.html
editing/execCommand/hilitecolor.html
editing/input/emacs-ctrl-o.html
editing/input/div-first-child-rule-input.html
editing/input/div-first-child-rule-textarea.html
editing/input/ime-composition-clearpreedit.html
editing/input/insert-wrapping-space-in-textarea.html
editing/input/option-page-up-down.html
editing/input/page-up-down-scrolls.html
editing/inserting/12882.html
editing/inserting/4278698.html
http/tests/history/back-with-fragment-change.php
http/tests/history/cross-origin-replace-history-object.html
http/tests/history/history-navigations-set-referrer.html
http/tests/history/popstate-fires-with-pending-requests.html
http/tests/history/redirect-200-refresh-0-seconds.pl
http/tests/history/redirect-200-refresh-2-seconds.pl
http/tests/history/redirect-301.html
Dirk Pranke
Comment 12 2012-05-04 12:00:12 PDT
rniwa - it looks like maybe these editing tests started being flaky earlier this week. Can you take a look?
Ryosuke Niwa
Comment 13 2012-05-04 12:05:41 PDT
Are you sure they're really timing out? Aren't they just slow?
I don't see any changes that can cause things to time out: http://trac.webkit.org/log/trunk/Source/WebCore/editing
Dirk Pranke
Comment 14 2012-05-04 12:11:09 PDT
(In reply to comment #13)
> Are you sure they're really timing out? Aren't they just slow?
Well, by definition they're timing out, but it could be because they're slow and should just be marked as slow :). If you think we should try marking them as slow instead that's fine.
> I don't see any changes that can cause things to timeout:
> http://trac.webkit.org/log/trunk/Source/WebCore/editing
Yeah, I didn't either, but I don't tend to like to mark tests as slow unless I'm familiar with them and would expect them to take a while to run.
Ryosuke Niwa
Comment 15 2012-05-04 12:12:18 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > Are you sure they're really timing out? Aren't they just slow?
>
> Well, by definition they're timing out, but it could be because they're slow and should just be marked as slow :). If you think we should try marking them as slow instead that's fine.
I don't mind marking the entire "editing" directory as "slow" for that matter. Many of the editing tests are integration tests and take a long time to run.
Dirk Pranke
Comment 16 2012-05-04 12:14:29 PDT
(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #13)
> > > Are you sure they're really timing out? Aren't they just slow?
> >
> > Well, by definition they're timing out, but it could be because they're slow and should just be marked as slow :). If you think we should try marking them as slow instead that's fine.
>
> I don't mind marking the entire "editing" directory as "slow" for that matter. Many of editing tests are integration tests and take a long time to run.
Okay, I'll update the expectations for editing tests. Thanks!
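For readers unfamiliar with the distinction discussed above: marking a test or directory as "slow" typically means giving it a longer deadline rather than accepting a timeout as an expected result. A rough sketch of that idea, where the base timeout, the multiplier, and the function name are all assumptions made for illustration (none are taken from webkitpy):

```python
DEFAULT_TIMEOUT_MS = 6000   # assumed per-test timeout, for illustration only
SLOW_MULTIPLIER = 5         # assumed factor applied to tests marked slow


def timeout_for(test_name, slow_prefixes):
    """Return the timeout in ms, extended if the test is under a slow prefix."""
    for prefix in slow_prefixes:
        directory = prefix.rstrip("/") + "/"
        # Match the exact test, or anything inside the named directory.
        if test_name == prefix or test_name.startswith(directory):
            return DEFAULT_TIMEOUT_MS * SLOW_MULTIPLIER
    return DEFAULT_TIMEOUT_MS
```

Marking the whole "editing" directory as slow would then correspond to something like `timeout_for(name, ["editing"])`, which extends the deadline for every test under that directory and leaves the rest unchanged.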
Dirk Pranke
Comment 17 2012-05-04 12:45:11 PDT
Here's some more ... I'm not filled with confidence in this approach:
fast/workers/storage/multiple-databases-garbage-collection.html = TIMEOUT
fast/workers/storage/multiple-transactions-on-different-handles-sync.html = TIMEOUT
http/tests/history/redirect-302.html = TIMEOUT
http/tests/history/redirect-303.html = TIMEOUT
http/tests/misc/object-embedding-svg-delayed-size-negotiation.xhtml = TIMEOUT
platform/chromium/virtual/gpu/canvas/philip/tests/2d.text-custom-font-load-crash.html = TIMEOUT
platform/chromium/virtual/gpu/fast/canvas/2d.text.draw.fill.maxWidth.gradient.html = TIMEOUT
Ryosuke Niwa
Comment 18 2012-05-04 12:53:54 PDT
Maybe something in webkitpy is affecting the timing?
Dirk Pranke
Comment 19 2012-05-04 12:59:45 PDT
(In reply to comment #18)
> Maybe something in webkitpy is affecting the timing?
It's possible, but I don't know what it would be. I will probably let this approach go for the afternoon or so to get more data on the flakiness, and if it doesn't clear up I will try going back to --test-shell mode.
As I've noted elsewhere, one aspect of using DRT mode is that NRWT itself enforces the timeout and kills DRT when the test times out; maybe this is leaving something in an unhappy state w/ the O/S, or we're leaving things locked somewhere, and that's causing things to go downhill.
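To illustrate the pattern described here, where the harness rather than the test binary enforces the deadline, the following is a generic sketch using Python's standard `subprocess` module. It is not the NRWT implementation; `run_with_timeout` and its result strings are hypothetical names chosen for this example.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; kill it and report TIMEOUT if it exceeds the deadline."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        proc.communicate(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # The harness, not the child, decides the test is over.
        proc.kill()
        proc.communicate()  # reap the child so it does not linger as a zombie
        return "TIMEOUT"
    return "PASS" if proc.returncode == 0 else "FAIL"
```

A hard kill like this gives the child no chance to release locks or clean up helper processes it spawned, which is one way a killed test could plausibly leave things in the kind of unhappy state speculated about above.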
Tony Chang
Comment 20 2012-05-04 13:32:25 PDT
I feel like we're playing whack-a-mole and even if we find the culprit, tests we mark as timeout/slow now will be forgotten. I would feel better about reverting changes until the bots improve. Once the bots improve, we can reland patches (maybe with speculative fixes) to isolate the cause. I.e., I would handle unknown flakiness the same way we handle perf regressions.
Dirk Pranke
Comment 21 2012-05-04 13:35:58 PDT
(In reply to comment #20)
> I feel like we're playing whack-a-mole and even if we find the culprit, tests we mark as timeout/slow now will be forgotten.
This is a valid concern.
> I would feel better about reverting changes until the bots improve. Once the bots improve, we can reland patches (maybe with speculative fixes) to isolate the cause. I.e., I would handle unknown flakiness the same way we handle perf regressions.
Apart from the one python change -- which I'm already planning to revert to see if it helps -- any suggestions for what other changes to revert?
Dirk Pranke
Comment 22 2012-05-04 13:46:12 PDT
Okay, I've switched back to "test shell" mode on SL in http://trac.webkit.org/changeset/116161 . Let's see what happens now.
Tony Chang
Comment 23 2012-05-04 14:17:47 PDT
Looking at the waterfall, it looks like the set of failing tests isn't at all consistent. I doubt adding suppressions will green the tree.
Here's the first set of timeouts I see. It's from the beginning of Wednesday.
http://build.chromium.org/p/chromium.webkit/builders/Webkit%20Mac10.6/builds/15522/steps/webkit_tests/logs/stdio
But zmo said the flakiness started earlier, maybe last Friday? Here are the changes that touch NRWT code around that time:
115377
115452
115490
115729?
None of the changes look that suspect, but I don't know of any other way to determine the cause of the regression.
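One way to check whether the timing-out set is consistent across builds is to compare the timeout lists from several logs. The sketch below assumes each log contains one `<test> = TIMEOUT` line per test, as in the results summary in the original report; the helper names are made up for this example.

```python
def parse_timeouts(log_text):
    """Collect test names from lines of the form '<test> = TIMEOUT'."""
    names = set()
    for line in log_text.splitlines():
        name, sep, outcome = line.partition(" = ")
        if sep and outcome.strip() == "TIMEOUT":
            names.add(name.strip())
    return names


def compare_builds(logs):
    """Split timeouts into those seen in every build and those seen in only some."""
    per_build = [parse_timeouts(text) for text in logs]
    if not per_build:
        return set(), set()
    in_all = set.intersection(*per_build)
    in_some = set.union(*per_build) - in_all
    return in_all, in_some
```

If `in_all` stays small while `in_some` keeps growing from build to build, that points toward systemic slowness rather than a fixed set of bad tests, and suppressing individual tests is unlikely to help.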
Dirk Pranke
Comment 24 2012-05-04 14:28:18 PDT
(In reply to comment #23)
> Looking at the waterfall, it looks like the set of failing tests isn't at all consistent. I doubt adding suppressions will green the tree.
>
> Here's the first set of timeouts I see. It's from the beginning of Wednesday.
> http://build.chromium.org/p/chromium.webkit/builders/Webkit%20Mac10.6/builds/15522/steps/webkit_tests/logs/stdio
There are definitely timeouts earlier, e.g.: http://build.chromium.org/p/chromium.webkit/waterfall?last_time=1335833009&show=Webkit%20Mac10.6
> But zmo said the flakiness started earlier, maybe last Friday? Here are NRWT changes that touch NRWT code around that time:
>
> 115377
> 115452
> 115490
> 115729?
>
> None of the changes look that suspect, but I don't know of any other way to determine the cause of the regression.
Well, 115490 is definitely suspicious (and already disabled, so now we're just waiting). You can see a marked uptick in flakiness in the first build after that change: http://build.chromium.org/p/chromium.webkit/waterfall?force=true&last_time=1335569069&show=Webkit%20Mac10.6 (see build 15326, in particular).
Dirk Pranke
Comment 25 2012-05-18 15:10:43 PDT
Closing this as WORKSFORME (the status is debatable; it probably could be WONTFIX or FIXED as well). For whatever reason, our old Xserves appear to be flaky in the release build. Since we haven't seen this issue anywhere else, and we've migrated off of the Xserves, we're gonna ignore this.