Bug 96398 - [Master][Chromium] Bring layout tests on Android on par with other Chromium platforms
Summary: [Master][Chromium] Bring layout tests on Android on par with other Chromium p...
Status: RESOLVED WONTFIX
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Peter Beverloo
URL:
Keywords:
Depends on: 96382 96399 96492 97261 98647 98762 98763 98765 98766 98767 98768 98769 98770 98771 98772 98773 98774 98775 98776 98778 98779 98780 98781 98782 98783 98815 98816 98900 98901 98902 101330
Blocks:
  Show dependency treegraph
 
Reported: 2012-09-11 09:18 PDT by Peter Beverloo
Modified: 2013-04-08 11:06 PDT (History)
7 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Beverloo 2012-09-11 09:18:40 PDT
There are a lot of failing tests right now, many of which are Android-specific. We should triage them and work towards getting the bot green.
Comment 1 Xianzhu Wang 2012-09-11 10:07:54 PDT
Can we see the failures online?
Comment 2 Peter Beverloo 2012-09-11 10:10:06 PDT
I linked to them in the specific bugs. The results are available offline, the tests aren't.
Comment 3 Xianzhu Wang 2012-09-11 10:11:24 PDT
Never mind. I think I found the way :)
Comment 4 Peter Beverloo 2012-09-25 10:21:25 PDT
We're not in bad shape actually:
http://peter.sh/files/layout-test-results/results.html

In short: 1816 failing tests and 18 crashes. Most failures are due to:

(1) Scrollbars (a) have a difference appearance and (b) overlay the content, and therefore don't take space.
(2) Text selection color is a lighter shade of blue.
(3) Android uses a lower quality image scaling algorithm, which in some cases is visually worse on desktop, but fine on mobile.

As a plan of action for layout tests, I'd like to propose this:

- Send an announcement to webkit-dev about 1816 new baselines being added, as it will cause quite some commit/update traffic.
- Start uploading new baselines for all "failures" which actually are fine, i.e. fit in the three categories mentioned above.
- Keep the bot red until we file bugs for the remaining failures and add them to TestExpectations.

After TestWebKitAPI and webkit_unit_tests also work:
- Integrate with garden-o-matic to keep us green.

Does this seem reasonable?


NB: The bot is crashy right now due to a JNI issue, which is fixed per http://codereview.chromium.org/10980016/. That just needs to roll into WebKit, which will certainly happen before the rebaselining.
Comment 5 Ojan Vafai 2012-09-25 10:32:02 PDT
This plan sounds good to me. You don't need to email webkit-dev though. Port-specific mass rebaselines happen all the time. Unless there were something unusual about this case, it's not email worthy.
Comment 6 Peter Beverloo 2012-09-25 12:13:33 PDT
Committed r129538: <http://trac.webkit.org/changeset/129538>
Comment 7 Peter Beverloo 2012-09-25 12:32:37 PDT
(In reply to comment #5)
> This plan sounds good to me. You don't need to email webkit-dev though. Port-specific mass rebaselines happen all the time. Unless there were something unusual about this case, it's not email worthy.

Cool, cheers :-). I just landed a first baseline, for the css1/ directory. There weren't any differences except for the different width (for tests with scrollbars) and the scrollbar rendering.

Because of the larger rendering area, .txt baselines were necessary as well. Following this first commit, we now have 1722 failing layout tests (minus the crashing ones).

I assume it's fine to summarize the description as I did in the commit? It saves a lot of spam on the commit overviews.

http://trac.webkit.org/changeset/129538

I'll work tomorrow to get us closer to 500, making sure that the tester doesn't abort early anymore :-). To confirm: yes, I'm checking every result manually.
Comment 8 Ojan Vafai 2012-09-25 13:30:59 PDT
Unfortunately, I have bad news for you. We bail early after 20 crashes, so we're only running a small percentage of the tests. Looking at http://build.webkit.org/builders/Chromium%20Android%20Release%20%28Tests%29/builds/770/steps/layout-test/logs/stdio
=> Results: 1180/24298 tests passed (4.9%)

You should setup a bot on the chromium webkit canary waterfall. We use much higher limits there before we bail early (e.g. 100 crashes). You'll need a bot there for garden-o-matic anyways. Also, then you could modify garden-o-matic locally so that you can use it for your android rebaselines.

(In reply to comment #7)
> Cool, cheers :-). I just landed a first baseline, for the css1/ directory. There weren't any differences except for the different width (for tests with scrollbars) and the scrollbar rendering.
>
> I assume it's fine to summarize the description as I did in the commit? It saves a lot of spam on the commit overviews.

Would be nice to include your one sentence description above about what the differences are. Consider someone a year from now trying to understand the history of why a given test has a given baseline. All they have to go on is ChangeLog entries and bug numbers. You don't need per-test descriptions for large rebaselines like this though. 

Also, we have a loose policy of only doing one patch per bug. It's not super-strict, but it confuses code reviews and the tooling.

For unreviewed patches like this, it's OK to just not file a bug. See http://trac.webkit.org/changeset/128909 as an example.

Finally, for any patches where you do hundreds of rebaselines, delete the list of changed files from the ChangeLog entry, as you did, is correct.
Comment 9 Peter Beverloo 2012-09-25 13:37:30 PDT
(In reply to comment #8)
> Unfortunately, I have bad news for you. We bail early after 20 crashes, so we're only running a small percentage of the tests. Looking at http://build.webkit.org/builders/Chromium%20Android%20Release%20%28Tests%29/builds/770/steps/layout-test/logs/stdio
> => Results: 1180/24298 tests passed (4.9%)

As said, that's because of a Chromium issue already fixed (http://codereview.chromium.org/10980016/). I'll roll DEPS tomorrow.
 
> You should setup a bot on the chromium webkit canary waterfall. We use much higher limits there before we bail early (e.g. 100 crashes). You'll need a bot there for garden-o-matic anyways. Also, then you could modify garden-o-matic locally so that you can use it for your android rebaselines.

Hardware has been requested a while ago but is not available yet. I'm aware of this limitation, and will address it as soon as the hardware is there.
 
> (In reply to comment #7)
> > Cool, cheers :-). I just landed a first baseline, for the css1/ directory. There weren't any differences except for the different width (for tests with scrollbars) and the scrollbar rendering.
> >
> > I assume it's fine to summarize the description as I did in the commit? It saves a lot of spam on the commit overviews.
> 
> Would be nice to include your one sentence description above about what the differences are. Consider someone a year from now trying to understand the history of why a given test has a given baseline. All they have to go on is ChangeLog entries and bug numbers. You don't need per-test descriptions for large rebaselines like this though.
> 
> Also, we have a loose policy of only doing one patch per bug. It's not super-strict, but it confuses code reviews and the tooling.
> 
> For unreviewed patches like this, it's OK to just not file a bug. See http://trac.webkit.org/changeset/128909 as an example.

I chose to divert from the policy as this bug contains that information, and there will be several commits with new baselines. Repeating the same sentence(s) with every batch seems redundant if we can group it here. Given that the patches aren't being uploaded to this bug either, would not mentioning it be preferred? The net difference is (a) a mention of the bug in the description, and (b) a comment such as comment #6.

One thing I do plan on doing is separately mention baselines which look suspicious at first in the ChangeLog.
Comment 10 Ojan Vafai 2012-09-25 13:50:27 PDT
(In reply to comment #9)
> (In reply to comment #8)
> > Unfortunately, I have bad news for you. We bail early after 20 crashes, so we're only running a small percentage of the tests. Looking at http://build.webkit.org/builders/Chromium%20Android%20Release%20%28Tests%29/builds/770/steps/layout-test/logs/stdio
> > => Results: 1180/24298 tests passed (4.9%)
> 
> As said, that's because of a Chromium issue already fixed (http://codereview.chromium.org/10980016/). I'll roll DEPS tomorrow.

I see. Missed that.

> I chose to divert from the policy as this bug contains that information, and there will be several commits with new baselines. Repeating the same sentence(s) with every batch seems redundant if we can group it here. Given that the patches aren't being uploaded to this bug either, would not mentioning it be preferred? The net difference is (a) a mention of the bug in the description, and (b) a comment such as comment #6.

Either one is fine as this is an unusual case.
Comment 11 Peter Beverloo 2012-09-26 11:26:51 PDT
Four commits with rebaselines landed today, bringing the amount of failing tests down to roughly 825. I've uploaded the latest complete results here:
http://peter.sh/files/layout-test-results2/results.html

http://trac.webkit.org/changeset/129618
http://trac.webkit.org/changeset/129619
http://trac.webkit.org/changeset/129642
http://trac.webkit.org/changeset/129659

The tester is now able to run 17,000 tests before aborting, which brings us significantly closer to doing a full run. Unfortunately, cycle time has gotten significantly worse as well: a run now takes 40 minutes, and will probably be an hour when we run everything.

I'll be filing bugs for the issues I ran into today. Most notably, there are a lot of compositing failures, some color issues and we have very subtle differences in how we render buttons and select boxes.
Comment 12 Xianzhu Wang 2012-09-26 11:57:29 PDT
(In reply to comment #11)
> The tester is now able to run 17,000 tests before aborting, which brings us significantly closer to doing a full run. Unfortunately, cycle time has gotten significantly worse as well: a run now takes 40 minutes, and will probably be an hour when we run everything.
> 

How many devices are you using to run the layout tests? In downstream, multiple devices get good scalability for shortening the run time.
Comment 13 Peter Beverloo 2012-09-26 12:00:57 PDT
(In reply to comment #12)
> How many devices are you using to run the layout tests? In downstream, multiple devices get good scalability for shortening the run time.

Four Galaxy Nexus phones are attached. I imagine a large part of the slower runs is caused by us actually running pixel tests.

I'll re-assess the bots' performances after they're able to run all of WebKit's tests. If this proves to be too slow then we'll just buy new devices, that's not a problem.
Comment 14 Xianzhu Wang 2012-09-26 12:05:38 PDT
(In reply to comment #13)
> Four Galaxy Nexus phones are attached. I imagine a large part of the slower runs is caused by us actually running pixel tests.
>

I think so. We don't run pixel tests in downstream.
Comment 15 Peter Beverloo 2013-04-08 11:06:03 PDT
Resolving as WontFix given that Chromium moved to Blink.