96398 – [Master][Chromium] Bring layout tests on Android on par with other Chromium platforms

Peter Beverloo

Reported 2012-09-11 09:18:40 PDT

There are a lot of failing tests right now, many of which are Android-specific. We should triage them and work towards getting the bot green.

Xianzhu Wang

Comment 1 2012-09-11 10:07:54 PDT

Can we see the failures online?

Peter Beverloo

Comment 2 2012-09-11 10:10:06 PDT

I linked to them in the specific bugs. The results are available offline, the tests aren't.

Xianzhu Wang

Comment 3 2012-09-11 10:11:24 PDT

Never mind. I think I found the way :)

Peter Beverloo

Comment 4 2012-09-25 10:21:25 PDT

We're not in bad shape actually: http://peter.sh/files/layout-test-results/results.html In short: 1816 failing tests and 18 crashes. Most failures are due to: (1) Scrollbars (a) have a difference appearance and (b) overlay the content, and therefore don't take space. (2) Text selection color is a lighter shade of blue. (3) Android uses a lower quality image scaling algorithm, which in some cases is visually worse on desktop, but fine on mobile. As a plan of action for layout tests, I'd like to propose this: - Send an announcement to webkit-dev about 1816 new baselines being added, as it will cause quite some commit/update traffic. - Start uploading new baselines for all "failures" which actually are fine, i.e. fit in the three categories mentioned above. - Keep the bot red until we file bugs for the remaining failures and add them to TestExpectations. After TestWebKitAPI and webkit_unit_tests also work: - Integrate with garden-o-matic to keep us green. Does this seem reasonable? NB: The bot is crashy right now due to a JNI issue, which is fixed per http://codereview.chromium.org/10980016/. That just needs to roll into WebKit, which will certainly happen before the rebaselining.

Ojan Vafai

Comment 5 2012-09-25 10:32:02 PDT

This plan sounds good to me. You don't need to email webkit-dev though. Port-specific mass rebaselines happen all the time. Unless there were something unusual about this case, it's not email worthy.

Peter Beverloo

Comment 6 2012-09-25 12:13:33 PDT

Committed r129538: <http://trac.webkit.org/changeset/129538>

Peter Beverloo

Comment 7 2012-09-25 12:32:37 PDT

(In reply to comment #5) > This plan sounds good to me. You don't need to email webkit-dev though. Port-specific mass rebaselines happen all the time. Unless there were something unusual about this case, it's not email worthy. Cool, cheers :-). I just landed a first baseline, for the css1/ directory. There weren't any differences except for the different width (for tests with scrollbars) and the scrollbar rendering. Because of the larger rendering area, .txt baselines were necessary as well. Following this first commit, we now have 1722 failing layout tests (minus the crashing ones). I assume it's fine to summarize the description as I did in the commit? It saves a lot of spam on the commit overviews. http://trac.webkit.org/changeset/129538 I'll work tomorrow to get us closer to 500, making sure that the tester doesn't abort early anymore :-). To confirm: yes, I'm checking every result manually.

Ojan Vafai

Comment 8 2012-09-25 13:30:59 PDT

Unfortunately, I have bad news for you. We bail early after 20 crashes, so we're only running a small percentage of the tests. Looking at http://build.webkit.org/builders/Chromium%20Android%20Release%20%28Tests%29/builds/770/steps/layout-test/logs/stdio => Results: 1180/24298 tests passed (4.9%) You should setup a bot on the chromium webkit canary waterfall. We use much higher limits there before we bail early (e.g. 100 crashes). You'll need a bot there for garden-o-matic anyways. Also, then you could modify garden-o-matic locally so that you can use it for your android rebaselines. (In reply to comment #7) > Cool, cheers :-). I just landed a first baseline, for the css1/ directory. There weren't any differences except for the different width (for tests with scrollbars) and the scrollbar rendering. > > I assume it's fine to summarize the description as I did in the commit? It saves a lot of spam on the commit overviews. Would be nice to include your one sentence description above about what the differences are. Consider someone a year from now trying to understand the history of why a given test has a given baseline. All they have to go on is ChangeLog entries and bug numbers. You don't need per-test descriptions for large rebaselines like this though. Also, we have a loose policy of only doing one patch per bug. It's not super-strict, but it confuses code reviews and the tooling. For unreviewed patches like this, it's OK to just not file a bug. See http://trac.webkit.org/changeset/128909 as an example. Finally, for any patches where you do hundreds of rebaselines, delete the list of changed files from the ChangeLog entry, as you did, is correct.

Peter Beverloo

Comment 9 2012-09-25 13:37:30 PDT

(In reply to comment #8) > Unfortunately, I have bad news for you. We bail early after 20 crashes, so we're only running a small percentage of the tests. Looking at http://build.webkit.org/builders/Chromium%20Android%20Release%20%28Tests%29/builds/770/steps/layout-test/logs/stdio > => Results: 1180/24298 tests passed (4.9%) As said, that's because of a Chromium issue already fixed (http://codereview.chromium.org/10980016/). I'll roll DEPS tomorrow. > You should setup a bot on the chromium webkit canary waterfall. We use much higher limits there before we bail early (e.g. 100 crashes). You'll need a bot there for garden-o-matic anyways. Also, then you could modify garden-o-matic locally so that you can use it for your android rebaselines. Hardware has been requested a while ago but is not available yet. I'm aware of this limitation, and will address it as soon as the hardware is there. > (In reply to comment #7) > > Cool, cheers :-). I just landed a first baseline, for the css1/ directory. There weren't any differences except for the different width (for tests with scrollbars) and the scrollbar rendering. > > > > I assume it's fine to summarize the description as I did in the commit? It saves a lot of spam on the commit overviews. > > Would be nice to include your one sentence description above about what the differences are. Consider someone a year from now trying to understand the history of why a given test has a given baseline. All they have to go on is ChangeLog entries and bug numbers. You don't need per-test descriptions for large rebaselines like this though. > > Also, we have a loose policy of only doing one patch per bug. It's not super-strict, but it confuses code reviews and the tooling. > > For unreviewed patches like this, it's OK to just not file a bug. See http://trac.webkit.org/changeset/128909 as an example. I chose to divert from the policy as this bug contains that information, and there will be several commits with new baselines. Repeating the same sentence(s) with every batch seems redundant if we can group it here. Given that the patches aren't being uploaded to this bug either, would not mentioning it be preferred? The net difference is (a) a mention of the bug in the description, and (b) a comment such as comment #6. One thing I do plan on doing is separately mention baselines which look suspicious at first in the ChangeLog.

Ojan Vafai

Comment 10 2012-09-25 13:50:27 PDT

(In reply to comment #9) > (In reply to comment #8) > > Unfortunately, I have bad news for you. We bail early after 20 crashes, so we're only running a small percentage of the tests. Looking at http://build.webkit.org/builders/Chromium%20Android%20Release%20%28Tests%29/builds/770/steps/layout-test/logs/stdio > > => Results: 1180/24298 tests passed (4.9%) > > As said, that's because of a Chromium issue already fixed (http://codereview.chromium.org/10980016/). I'll roll DEPS tomorrow. I see. Missed that. > I chose to divert from the policy as this bug contains that information, and there will be several commits with new baselines. Repeating the same sentence(s) with every batch seems redundant if we can group it here. Given that the patches aren't being uploaded to this bug either, would not mentioning it be preferred? The net difference is (a) a mention of the bug in the description, and (b) a comment such as comment #6. Either one is fine as this is an unusual case.

Peter Beverloo

Comment 11 2012-09-26 11:26:51 PDT

Four commits with rebaselines landed today, bringing the amount of failing tests down to roughly 825. I've uploaded the latest complete results here: http://peter.sh/files/layout-test-results2/results.html http://trac.webkit.org/changeset/129618 http://trac.webkit.org/changeset/129619 http://trac.webkit.org/changeset/129642 http://trac.webkit.org/changeset/129659 The tester is now able to run 17,000 tests before aborting, which brings us significantly closer to doing a full run. Unfortunately, cycle time has gotten significantly worse as well: a run now takes 40 minutes, and will probably be an hour when we run everything. I'll be filing bugs for the issues I ran into today. Most notably, there are a lot of compositing failures, some color issues and we have very subtle differences in how we render buttons and select boxes.

Xianzhu Wang

Comment 12 2012-09-26 11:57:29 PDT

(In reply to comment #11) > The tester is now able to run 17,000 tests before aborting, which brings us significantly closer to doing a full run. Unfortunately, cycle time has gotten significantly worse as well: a run now takes 40 minutes, and will probably be an hour when we run everything. > How many devices are you using to run the layout tests? In downstream, multiple devices get good scalability for shortening the run time.

Peter Beverloo

Comment 13 2012-09-26 12:00:57 PDT

(In reply to comment #12) > How many devices are you using to run the layout tests? In downstream, multiple devices get good scalability for shortening the run time. Four Galaxy Nexus phones are attached. I imagine a large part of the slower runs is caused by us actually running pixel tests. I'll re-assess the bots' performances after they're able to run all of WebKit's tests. If this proves to be too slow then we'll just buy new devices, that's not a problem.

Xianzhu Wang

Comment 14 2012-09-26 12:05:38 PDT

(In reply to comment #13) > Four Galaxy Nexus phones are attached. I imagine a large part of the slower runs is caused by us actually running pixel tests. > I think so. We don't run pixel tests in downstream.

Peter Beverloo

Comment 15 2013-04-08 11:06:03 PDT

Resolving as WontFix given that Chromium moved to Blink.