Bug 83076

Summary: [Chromium] Lots of timeouts causing Mac10.6 to exit early.
Product: WebKit Reporter: Ojan Vafai <ojan>
Component: Tools / Tests Assignee: Dirk Pranke <dpranke>
Status: RESOLVED WORKSFORME    
Severity: Normal CC: abarth, apavlov, dglazkov, dpranke, hans, hclam, rniwa, tony, webkit.review.bot, zmo
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Attachments:
Description                                    Flags
Patch                                          none
add Changelog, port same logic to apple mac    none

Description Ojan Vafai 2012-04-03 14:49:33 PDT
http://build.chromium.org/p/chromium.webkit/builders/Webkit%20Mac10.6/builds/14439/steps/webkit_tests/logs/stdio
Exiting early after 0 crashes and 20 timeouts. 21784 tests run.

Regressions: Unexpected tests timed out : (20)
  animations/cross-fade-background-image.html = TIMEOUT
  compositing/geometry/empty-embed-rects.html = TIMEOUT
  compositing/self-painting-layers.html = TIMEOUT
  compositing/transitions/scale-transition-no-start.html = TIMEOUT
  css1/basic/class_as_selector.html = TIMEOUT
  css1/box_properties/acid_test.html = TIMEOUT
  css1/cascade/cascade_order.html = TIMEOUT
  css1/classification/display.html = TIMEOUT
  css1/color_and_background/background.html = TIMEOUT
  css1/conformance/forward_compatible_parsing.html = TIMEOUT
  css1/font_properties/font.html = TIMEOUT
  css1/pseudo/anchor.html = TIMEOUT
  fast/forms/search-rtl.html = TIMEOUT
  fast/images/embed-does-not-propagate-dimensions-to-object-ancestor.html = TIMEOUT
  fast/loader/local-CSS-from-local.html = TIMEOUT
  fast/table/invisible-cell-background.html = TIMEOUT
  fast/text/international/plane2.html = TIMEOUT
  fast/text/justify-ideograph-complex.html = TIMEOUT
  fast/workers/storage/interrupt-database.html = TIMEOUT
  http/tests/appcache/remove-cache.html = TIMEOUT
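
For readers unfamiliar with the early-exit behavior in the log above, here is a minimal sketch of how a test runner might stop after a fixed number of unexpected crashes or timeouts. The class name, field names, and the threshold of 20 are assumptions chosen to match the log message; this is not the actual NRWT code.

```python
from dataclasses import dataclass


@dataclass
class EarlyExitTracker:
    """Counts unexpected crashes and timeouts and decides when to abort a run."""

    max_crashes_or_timeouts: int = 20  # assumed threshold, chosen to match the log
    crashes: int = 0
    timeouts: int = 0

    def record(self, result: str) -> None:
        # Only unexpected CRASH and TIMEOUT results count toward the limit.
        if result == "CRASH":
            self.crashes += 1
        elif result == "TIMEOUT":
            self.timeouts += 1

    def should_exit_early(self) -> bool:
        return self.crashes + self.timeouts >= self.max_crashes_or_timeouts

    def message(self, tests_run: int) -> str:
        return (f"Exiting early after {self.crashes} crashes and "
                f"{self.timeouts} timeouts. {tests_run} tests run.")
```

With a limit like this, a bot that is thrashing gives up once the limit is reached instead of grinding through the remaining tests, which is why the run above stops at exactly 20 timeouts.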
Comment 1 Dirk Pranke 2012-04-03 15:01:34 PDT
I'm on this one ...
Comment 2 Dirk Pranke 2012-04-03 18:21:47 PDT
Created attachment 135477 [details]
Patch
Comment 3 Eric Seidel (no email) 2012-04-03 18:23:43 PDT
Comment on attachment 135477 [details]
Patch

Bleh.  Why not just up the amount of ram we expect child processes to take?  That seems like a less gross hack.
Comment 4 Dirk Pranke 2012-04-03 18:29:52 PDT
Created attachment 135478 [details]
add Changelog, port same logic to apple mac
Comment 5 Dirk Pranke 2012-04-03 18:32:54 PDT
(In reply to comment #3)
> (From update of attachment 135477 [details])
> Bleh.  Why not just up the amount of ram we expect child processes to take?  That seems like a less gross hack.

At the moment, at least on the Chromium SL bot, it doesn't look ram-related. It looks like we're thrashing on something else, but have plenty of RAM free.
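
To make the alternative from comment 3 concrete, here is a rough sketch of sizing the worker pool from free memory. The patch itself is not shown in this thread, so the function name, its parameters, and the per-child memory figure are all hypothetical.

```python
def default_child_processes(cpu_count: int, free_bytes: int,
                            bytes_per_child: int) -> int:
    """Return how many test workers to start, capped by CPUs and free memory."""
    if bytes_per_child <= 0:
        raise ValueError("bytes_per_child must be positive")
    by_memory = free_bytes // bytes_per_child
    # Always run at least one worker, even on a memory-starved machine.
    return max(1, min(cpu_count, by_memory))
```

Raising `bytes_per_child` lowers the worker count only when memory is the binding limit. If the bot has plenty of RAM free, as described above, the CPU cap wins and the change would have no effect.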
Comment 6 Dirk Pranke 2012-04-03 18:35:01 PDT
I'm going to land this as-is, so that I can get the bot back online and we can get more data. Unfortunately, it's been flaky and aborting early for so long that I can't easily reproduce things or figure out how to debug it (I've tried rolling back builds on that bot and have run into a sordid list of issues that are stopping me, which I need to work through in parallel).
Comment 7 Dirk Pranke 2012-04-03 18:35:24 PDT
I'll be happy to roll this out if there are other issues or if we really think this is the wrong thing to do.
Comment 8 Dirk Pranke 2012-04-03 18:40:54 PDT
Committed r113122: <http://trac.webkit.org/changeset/113122>
Comment 9 Dirk Pranke 2012-04-03 18:41:27 PDT
re-opening, I don't consider this fixed yet.
Comment 10 Dirk Pranke 2012-05-03 18:21:12 PDT
Note that we're seeing this quite a bit lately, even after the patch (see, e.g., http://build.chromium.org/p/chromium.webkit/waterfall?builder=Webkit%20Mac10.6&last_time=1336069767 ) ... it's possible that r115490 has made things worse, but I don't know what else might be contributing.
Comment 11 Dirk Pranke 2012-05-04 11:58:19 PDT
It seems like we're frequently seeing many of the same tests timing out this week, so I'm going to start marking them as flaky timeouts here and we'll see if this contains the problem, or if we're seeing systemic flakiness.

Here's the first batch:

compositing/geometry/outline-change.html

css3/selectors3/xml/css3-modsel-161.xml
css3/selectors3/xml/css3-modsel-166.xml
css3/selectors3/xml/css3-modsel-166a.xml

editing/deleting/delete-3857753-fix.html
editing/deleting/delete-3865854-fix.html
editing/deleting/delete-3928305-fix.html
editing/execCommand/4747450.html
editing/execCommand/4786404-1.html
editing/execCommand/4786404-2.html
editing/execCommand/4916235.html
editing/input/caret-at-the-edge-of-input.html
editing/execCommand/format-block-with-trailing-br.html
editing/execCommand/format-block-without-body-crash.html
editing/execCommand/format-block.html
editing/execCommand/forward-delete-no-scroll.html
editing/execCommand/hilitecolor.html
editing/input/emacs-ctrl-o.html
editing/input/div-first-child-rule-input.html
editing/input/div-first-child-rule-textarea.html
editing/input/ime-composition-clearpreedit.html
editing/input/insert-wrapping-space-in-textarea.html
editing/input/option-page-up-down.html
editing/input/page-up-down-scrolls.html
editing/inserting/12882.html
editing/inserting/4278698.html

http/tests/history/back-with-fragment-change.php
http/tests/history/cross-origin-replace-history-object.html
http/tests/history/history-navigations-set-referrer.html
http/tests/history/popstate-fires-with-pending-requests.html
http/tests/history/redirect-200-refresh-0-seconds.pl
http/tests/history/redirect-200-refresh-2-seconds.pl
http/tests/history/redirect-301.html
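
As an illustration of what "marking them as flaky timeouts" could look like, the sketch below formats one expectation line per test. The line layout (bug tag, platform modifier, `= PASS TIMEOUT`) is an assumption modeled on the expectations files of the time, and the `BUGWK83076` and `SNOWLEOPARD` defaults are illustrative only.

```python
def flaky_timeout_lines(tests, bug="BUGWK83076", platform="SNOWLEOPARD"):
    """Format one expectation line per test, allowing either PASS or TIMEOUT."""
    # Sorting and de-duplicating keeps the generated block stable between runs.
    return [f"{bug} {platform} : {test} = PASS TIMEOUT"
            for test in sorted(set(tests))]


if __name__ == "__main__":
    batch = [
        "compositing/geometry/outline-change.html",
        "css3/selectors3/xml/css3-modsel-161.xml",
    ]
    print("\n".join(flaky_timeout_lines(batch)))
```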
Comment 12 Dirk Pranke 2012-05-04 12:00:12 PDT
rniwa - it looks like maybe these editing tests started being flaky earlier this week. Can you take a look?
Comment 13 Ryosuke Niwa 2012-05-04 12:05:41 PDT
Are you sure they're really timing out? Aren't they just slow?

I don't see any changes that can cause things to timeout:
http://trac.webkit.org/log/trunk/Source/WebCore/editing
Comment 14 Dirk Pranke 2012-05-04 12:11:09 PDT
(In reply to comment #13)
> Are you sure they're really timing out? Aren't they just slow?

Well, by definition they're timing out, but it could be because they're slow and should just be marked as slow :). If you think we should try marking them as slow instead that's fine.

> 
> I don't see any changes that can cause things to timeout:
> http://trac.webkit.org/log/trunk/Source/WebCore/editing

Yeah, I didn't either, but I don't tend to like to mark tests as slow unless I'm familiar with them and would expect them to take a while to run.
Comment 15 Ryosuke Niwa 2012-05-04 12:12:18 PDT
(In reply to comment #14)
> (In reply to comment #13)
> > Are you sure they're really timing out? Aren't they just slow?
> 
> Well, by definition they're timing out, but it could be because they're slow and should just be marked as slow :). If you think we should try marking them as slow instead that's fine.

I don't mind marking the entire "editing" directory as "slow" for that matter. Many of the editing tests are integration tests and take a long time to run.
Comment 16 Dirk Pranke 2012-05-04 12:14:29 PDT
(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #13)
> > > Are you sure they're really timing out? Aren't they just slow?
> > 
> > Well, by definition they're timing out, but it could be because they're slow and should just be marked as slow :). If you think we should try marking them as slow instead that's fine.
> 
> I don't mind marking the entire "editing" directory as "slow" for that matter. Many of the editing tests are integration tests and take a long time to run.

Okay, I'll update the expectations for editing tests. Thanks!
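
The difference between "timing out" and "slow" discussed above can be sketched as follows: a test marked slow gets a longer deadline, so the same run time can be either a PASS or a TIMEOUT. The timeout value, the multiplier, and the function names are assumptions for illustration, not the values the harness actually uses.

```python
DEFAULT_TIMEOUT_MS = 6000  # assumed value, for illustration only
SLOW_MULTIPLIER = 5        # assumed value, for illustration only


def timeout_ms_for(test, slow_prefixes):
    """Return the deadline for a test, extended if it sits under a slow prefix."""
    is_slow = any(test == prefix or test.startswith(prefix.rstrip("/") + "/")
                  for prefix in slow_prefixes)
    return DEFAULT_TIMEOUT_MS * SLOW_MULTIPLIER if is_slow else DEFAULT_TIMEOUT_MS


def classify(test, elapsed_ms, slow_prefixes):
    """Classify a finished run as PASS or TIMEOUT based on its deadline."""
    return "TIMEOUT" if elapsed_ms > timeout_ms_for(test, slow_prefixes) else "PASS"
```

Marking the whole `editing` directory slow trades faster failure detection for fewer spurious timeouts: a genuinely hung editing test would then take several times longer to be reported.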
Comment 17 Dirk Pranke 2012-05-04 12:45:11 PDT
Here's some more ... I'm not filled with confidence in this approach:


  fast/workers/storage/multiple-databases-garbage-collection.html = TIMEOUT
  fast/workers/storage/multiple-transactions-on-different-handles-sync.html = TIMEOUT
  http/tests/history/redirect-302.html = TIMEOUT
  http/tests/history/redirect-303.html = TIMEOUT
  http/tests/misc/object-embedding-svg-delayed-size-negotiation.xhtml = TIMEOUT
  platform/chromium/virtual/gpu/canvas/philip/tests/2d.text-custom-font-load-crash.html = TIMEOUT
  platform/chromium/virtual/gpu/fast/canvas/2d.text.draw.fill.maxWidth.gradient.html = TIMEOUT
Comment 18 Ryosuke Niwa 2012-05-04 12:53:54 PDT
Maybe something in webkitpy is affecting the timing?
Comment 19 Dirk Pranke 2012-05-04 12:59:45 PDT
(In reply to comment #18)
> Maybe something in webkitpy is affecting the timing?

It's possible, but I don't know what it would be. I will probably let this approach go for the afternoon or so to get more data on the flakiness, and if it doesn't clear up I will try going back to --test-shell mode.

As I've noted elsewhere, one aspect of using DRT mode is that NRWT itself enforces the timeout and kills DRT when the test times out; maybe this is leaving something in an unhappy state w/ the O/S, or we're leaving things locked somewhere, and that's causing things to go downhill.
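
A minimal sketch of the pattern described above, where the harness rather than the child enforces the deadline and kills the child when it expires. It uses only the Python standard library; the function name and return shape are hypothetical, and this is not how NRWT actually drives DRT.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv and return (status, returncode); status is PASS, FAIL, or TIMEOUT."""
    proc = subprocess.Popen(argv, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The harness enforces the deadline: kill the child, then reap it.
        proc.kill()
        proc.wait()
        return "TIMEOUT", None
    return ("PASS" if returncode == 0 else "FAIL"), returncode


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(sleeper, timeout_s=0.2))
```

On POSIX, `Popen.kill()` sends SIGKILL, so the child gets no chance to release locks or clean up temporary state. That is the kind of leftover state speculated about above.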
Comment 20 Tony Chang 2012-05-04 13:32:25 PDT
I feel like we're playing whack-a-mole and even if we find the culprit, tests we mark as timeout/slow now will be forgotten.

I would feel better about reverting changes until the bots improve.  Once the bots improve, we can reland patches (maybe with speculative fixes) to isolate the cause.  I.e., I would handle unknown flakiness the same way we handle perf regressions.
Comment 21 Dirk Pranke 2012-05-04 13:35:58 PDT
(In reply to comment #20)
> I feel like we're playing whack-a-mole and even if we find the culprit, tests we mark as timeout/slow now will be forgotten.
>

This is a valid concern.
 
> I would feel better about reverting changes until the bots improve.  Once the bots improve, we can reland patches (maybe with speculative fixes) to isolate the cause.  I.e., I would handle unknown flakiness the same way we handle perf regressions.

Apart from the one python change -- which I'm already planning to revert to see if it helps -- any suggestions for what other changes to revert?
Comment 22 Dirk Pranke 2012-05-04 13:46:12 PDT
Okay, I've switched back to "test shell" mode on SL in http://trac.webkit.org/changeset/116161 . Let's see what happens now.
Comment 23 Tony Chang 2012-05-04 14:17:47 PDT
Looking at the waterfall, it looks like the set of failing tests isn't at all consistent.  I doubt adding suppressions will green the tree.
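
One way to put a number on "isn't at all consistent" is to compare the timeout sets of two builds with a Jaccard overlap score. This is only a sketch; the function name is hypothetical and nothing like it is known to exist in the tooling.

```python
def jaccard(a, b):
    """Return |a & b| / |a | b| for two collections of test names (1.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A score near 1.0 means the same tests keep timing out, so suppressions might help. A score near 0.0 means the timeouts move around, which points at the machine or the harness rather than at individual tests. For example, the 20 timeouts in the description and the 7 listed in comment 17 have no test in common.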

Here's the first set of timeouts I see. It's from the beginning of Wednesday.
http://build.chromium.org/p/chromium.webkit/builders/Webkit%20Mac10.6/builds/15522/steps/webkit_tests/logs/stdio

But zmo said the flakiness started earlier, maybe last Friday?  Here are NRWT changes that touch NRWT code around that time:

115377
115452
115490
115729?

None of the changes look that suspect, but I don't know of any other way to determine the cause of the regression.
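
If reverting and relanding is the only way to find the cause, the search over the candidate revisions can be organized as a bisection. The sketch below assumes the candidates are sorted and that every revision after the first bad one is also bad; `is_bad` is a hypothetical callback that would have to build and run the tests at a given revision.

```python
def first_bad_revision(candidates, is_bad):
    """Binary-search sorted candidates for the first revision where is_bad is true."""
    lo, hi = 0, len(candidates)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid       # the first bad revision is at mid or earlier
        else:
            lo = mid + 1   # the first bad revision is after mid
    return candidates[lo] if lo < len(candidates) else None
```

Because the failures here are flaky, a single run per revision would not be a reliable `is_bad`. In practice each revision would need several runs and a threshold on the timeout count before it is called good or bad.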
Comment 24 Dirk Pranke 2012-05-04 14:28:18 PDT
(In reply to comment #23)
> Looking at the waterfall, it looks like the set of failing tests isn't at all consistent.  I doubt adding suppressions will green the tree.
> 
> Here's the first set of timeouts I see. It's from the beginning of Wednesday.
> http://build.chromium.org/p/chromium.webkit/builders/Webkit%20Mac10.6/builds/15522/steps/webkit_tests/logs/stdio
> 

There are definitely timeouts earlier, e.g.: http://build.chromium.org/p/chromium.webkit/waterfall?last_time=1335833009&show=Webkit%20Mac10.6

> But zmo said the flakiness started earlier, maybe last Friday?  Here are NRWT changes that touch NRWT code around that time:
> 
> 115377
> 115452
> 115490
> 115729?
> 
> None of the changes look that suspect, but I don't know of any other way to determine the cause of the regression.

Well, 115490 is definitely suspicious (and already disabled, so now we're just waiting). You can see a marked uptick in flakiness in the first build after that change:

http://build.chromium.org/p/chromium.webkit/waterfall?force=true&last_time=1335569069&show=Webkit%20Mac10.6 (see build 15326, in particular).
Comment 25 Dirk Pranke 2012-05-18 15:10:43 PDT
closing this as WORKSFORME (the status is debatable; it probably could be WONTFIX or FIXED as well).

For whatever reason, our old Xserves appear to be flaky in the release build. Since we haven't seen this issue anywhere else, and we've migrated off of the Xserves, we're gonna ignore this.