Bug 127095 - run-webkit-tests should support assert-only js tests w/o expected.txt files
Summary: run-webkit-tests should support assert-only js tests w/o expected.txt files
Status: RESOLVED WONTFIX
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-01-16 03:06 PST by youenn fablet
Modified: 2015-07-31 09:23 PDT
CC List: 4 users

See Also:


Attachments

Description youenn fablet 2014-01-16 03:06:36 PST
Consider merging Blink patches from issue 268711: https://code.google.com/p/chromium/issues/detail?id=268711
Issue description:
//////////////////////////////////////////
This comes up in the context of being able to import and run the W3C's tests, but is applicable to the existing tests in Blink as well.

Many of our JS-based (text-only) tests produce output in the form of lines starting with either "PASS" or "FAIL" (in addition to informational output). 

Programmatically, one can determine if a test is passing by simply asserting that there are no lines that start with "FAIL" in the output. In this case, the -expected.txt is kind of unnecessary. 

In the W3C tests' cases, -expected.txt files don't exist, which means that we need to generate and synthesize false baselines (and then review them) as part of the import. There are > 6000 JS-based tests in the W3C repos, so this is unfortunate :).

I think we should just modify run-webkit-tests to treat a test as passing if an -expected.txt file doesn't exist and the text output isn't a render tree, contains at least one "PASS", and doesn't contain any "FAIL" or "TIMEOUT" lines.

Note that the downside to such an approach is that you can't easily track partial failures (where you pass some assertions but fail others), but we can continue to use -expected.txt for those cases.

Note that of the existing ~21,000 text-only tests in Blink, about 9200 of them are tests that contain one or more passing assertions and zero failing assertions. In other words, given the above definition, we could delete ~9200 -expected.txt files and not really miss anything.

Anyone object (to adding support, not deleting the existing -expected.txt, that's a separate issue)?

I've posted a patch indicating roughly how this would work here:

https://codereview.chromium.org/22352002/
//////////////////////////////////////////
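
For illustration, here is a rough sketch of what the heuristic described above could look like on the run-webkit-tests side. This is not the Blink patch linked above; the function name and the crude render-tree check are assumptions made for the example:

    def looks_like_passing_js_test(output_text, has_expected_txt):
        """Hypothetical check: treat a test without an -expected.txt as
        passing when its text output is not a render tree, has at least
        one PASS line, and has no FAIL or TIMEOUT lines."""
        if has_expected_txt:
            return False  # fall back to the normal baseline comparison
        lines = output_text.splitlines()
        # Crude render-tree detection: render tree text dumps start with
        # "layer at (x,y) size ..." lines.
        if any(line.startswith('layer at ') for line in lines):
            return False
        has_pass = any(line.startswith('PASS') for line in lines)
        has_fail = any(line.startswith(('FAIL', 'TIMEOUT')) for line in lines)
        return has_pass and not has_fail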
Comment 1 Ryosuke Niwa 2014-01-16 03:26:01 PST
I don't think we should do this.  Even if there is no explicit FAIL and all the PASS lines are present, it's possible for a test to spit out an erroneous line or not run part of the test.  Having -expected.txt catches these bugs.
Comment 2 youenn fablet 2014-01-18 13:38:47 PST
(In reply to comment #1)
> I don't think we should do this.  Even if there is no explicit FAIL and all the PASS lines are present, it's possible for a test to spit out an erroneous line or not run part of the test.  Having -expected.txt catches these bugs.

I would restrict this approach to testharness.js-based tests.
Probably to imported W3C tests initially, since this is where it is most valuable.

This approach would limit the number of tests to check manually.
If this number is limited, the manual checking (and the related -expected.txt files) would probably be more reliable.

For those tests, the matching pattern would be something like:
- empty line(s)
- line(s) starting with PASS
- empty line(s)
Any test result that does not match this pattern would require a matching -expected file to pass. This should catch any output that contains spurious lines.
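
A rough sketch of how that pattern could be checked (a hypothetical helper, not existing webkitpy code):

    import re

    # Optional blank lines, then one or more lines starting with PASS,
    # then optional blank lines, and nothing else.
    _ALL_PASS_PATTERN = re.compile(r'\A\n*(?:PASS[^\n]*\n)+\n*\Z')

    def matches_all_pass_pattern(output_text):
        if not output_text.endswith('\n'):
            output_text += '\n'  # tolerate a missing trailing newline
        return bool(_ALL_PASS_PATTERN.match(output_text))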

To handle the case  of testharness.js based tests that would only partially run, the situation is less clear.
For existing webkit tests, we already have -expected.txt files.
For new webkit tests, a parameter could be added to identify the number of expected tests.
For existing W3C tests, I do not know what can be done on that. Manual checking may not be the best solution anyway. And more tests (even if partially run) seems better than fewer tests.

I would restrict this strategy to imported/w3c tests initially.
If that proves to work, we may envision to apply it to other testharness.js layout tests.
To identify which tests are testharness.js based, testharnessreport.js could be tweaked to output a line stating something like "testharness.js based test: ..."

Would that address your concerns?
Comment 3 Ryosuke Niwa 2014-01-18 14:08:58 PST
(In reply to comment #2)
> (In reply to comment #1)
> > I don't think we should do this.  Even if there is no explicit FAIL and all the PASS lines are present, it's possible for a test to spit out an erroneous line or not run part of the test.  Having -expected.txt catches these bugs.
> 
> I would restrict this approach to testharness.js-based tests.
> Probably to imported W3C tests initially, since this is where it is most valuable.
> 
> This approach would limit the number of tests to check manually.
> If this number is limited, the manual checking (and the related -expected.txt files) would probably be more reliable.
> 
> For those tests, the matching pattern would be something like:
> - empty line(s)
> - line(s) starting with PASS
> - empty line(s)
> Any test result that does not match this pattern would require a matching -expected file to pass. This should catch any output that contains spurious lines.

That doesn't catch the case where one of the test cases didn't get to run.

> To handle the case  of testharness.js based tests that would only partially run, the situation is less clear.
> For existing webkit tests, we already have -expected.txt files.
> For new webkit tests, a parameter could be added to identify the number of expected tests.
> For existing W3C tests, I do not know what can be done on that. Manual checking may not be the best solution anyway. And more tests (even if partially run) seems better than fewer tests.

Checking in -expected.txt doesn't mean we can import fewer tests.  Also, we can import tests with FAIL expectations as well.  That's how we've been importing tests.

> I would restrict this strategy to imported/w3c tests initially.
> If that proves to work, we may envision to apply it to other testharness.js layout tests.
> To identify which tests are testharness.js based, testharnessreport.js could be tweaked to output a line stating something like "testharness.js based test: ..."

I still strongly object to this approach.
Comment 4 youenn fablet 2014-01-18 15:09:27 PST
> > To handle the case  of testharness.js based tests that would only partially run, the situation is less clear.
> > For existing webkit tests, we already have -expected.txt files.
> > For new webkit tests, a parameter could be added to identify the number of expected tests.
> > For existing W3C tests, I do not know what can be done on that. Manual checking may not be the best solution anyway. And more tests (even if partially run) seems better than fewer tests.
> 
> Checking in -expected.txt doesn't mean we can import fewer tests.  Also, we can import tests with FAIL expectations as well.  That's how we've been importing tests.

Agreed on FAIL expectations. -expected files should be kept for sure.

Checking in -expected.txt files does not guarantee that all tests are actually run though.
To guarantee that, one needs to carefully read each test.
This obviously takes time, hence fewer imported tests at the end of the day.

When importing tests, I would prefer adding the number of expected tests directly in the HTML file compared to counting the number of PASS in the expected file.
First, tests are more often read than expected files, giving more chances to catch errors in the number of tests.
Second, this may be pushed to the W3C repository.
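
To make the idea concrete, here is a hypothetical sketch of such a check (the marker name and where the check would live are made up for the example; nothing like this exists in run-webkit-tests today):

    import re

    # Hypothetical marker a test file could carry, e.g.:
    #   <!-- webkit-expected-test-count: 12 -->
    _COUNT_MARKER = re.compile(r'webkit-expected-test-count:\s*(\d+)')

    def ran_expected_number_of_tests(test_source, output_text):
        match = _COUNT_MARKER.search(test_source)
        if not match:
            return True  # no declaration, nothing to enforce
        expected = int(match.group(1))
        # Each subtest is assumed to report exactly one PASS or FAIL line.
        actual = sum(1 for line in output_text.splitlines()
                     if line.startswith(('PASS', 'FAIL')))
        return actual == expected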

Would you still object to the approach if the checking of the number of tests is made mandatory?
Comment 5 Ryosuke Niwa 2014-01-18 16:28:10 PST
(In reply to comment #4)
> > > To handle the case  of testharness.js based tests that would only partially run, the situation is less clear.
> > > For existing webkit tests, we already have -expected.txt files.
> > > For new webkit tests, a parameter could be added to identify the number of expected tests.
> > > For existing W3C tests, I do not know what can be done on that. Manual checking may not be the best solution anyway. And more tests (even if partially run) seems better than fewer tests.
> > 
> > Checking in -expected.txt doesn't mean we can import fewer tests.  Also, we can import tests with FAIL expectations as well.  That's how we've been importing tests.
> 
> Agreed on FAIL expectations. -expected files should be kept for sure.
> 
> Checking in -expected.txt files does not guarantee that all tests are actually run though.
> To guarantee that, one needs to carefully read each test.
> This obviously takes time, hence fewer imported tests at the end of the day.

We need to do this regardless.  Each test we import needs to be manually reviewed like any other test being added to WebKit.

> When importing tests, I would prefer adding the number of expected tests directly in the HTML file compared to counting the number of PASS in the expected file.

That sounds like an improvement.

> First, tests are more often read than expected files, giving more chances to catch errors in the number of tests.

I disagree. Whenever a test fails, what I'd like to see is the -expected.txt diff.  I don't want to manually examine the test to figure out what might be failing.

> Second, this may be pushed to the W3C repository.

Adding the number of test cases in each test would definitely need to be done in the W3C repository, not in WebKit.

> Would you still object to the approach if the checking of the number of tests is made mandatory?

I still don't think it's a good idea.
Comment 6 youenn fablet 2014-01-24 01:30:12 PST
Since the end goal is to enable regular testing of web platform tests, an alternative may be to generate js test baselines by disabling assert() in testharness.js.

The process would be:
- import the latest version of w3c tests in an imported/temp folder
- generate baselines by running all js tests from imported/temp with the assert-disabled testharness.js
- run all js tests with the regular testharness.js
- check results: flag lines in the generated baselines that do not start with PASS, compare generated baselines with the existing imported/w3c baselines...
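
As an illustration, the last step could be something along these lines (file layout and helper names are assumptions for the sketch):

    import os

    def suspicious_lines(baseline_path):
        # Lines of a generated baseline that do not start with PASS
        # (blank lines ignored); these need a human look before keeping.
        with open(baseline_path) as f:
            return [line for line in f
                    if line.strip() and not line.startswith('PASS')]

    def differing_baselines(generated_dir, imported_dir):
        # Generated baselines that differ from the ones already checked
        # in under imported/w3c.
        differing = []
        for root, _, files in os.walk(generated_dir):
            for name in files:
                if not name.endswith('-expected.txt'):
                    continue
                generated = os.path.join(root, name)
                relative = os.path.relpath(generated, generated_dir)
                existing = os.path.join(imported_dir, relative)
                if os.path.exists(existing):
                    with open(generated) as g, open(existing) as e:
                        if g.read() != e.read():
                            differing.append(relative)
        return differing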

We would get an overall picture of how aligned WebKit is with the w3c tests.
We would also check whether imported tests are still functionally in sync or not.

Any thoughts?
Comment 7 Ryosuke Niwa 2014-01-24 01:34:41 PST
We should simply commit tests and -expected.txt files instead.
Comment 8 Ryosuke Niwa 2014-01-26 22:53:26 PST
Again, I don't think we can avoid generating lines just because they start with "PASS".
Comment 9 youenn fablet 2014-01-27 00:36:54 PST
(In reply to comment #8)
> Again, I don't think we can avoid generating lines just because they start with "PASS".

OK, it is probably safer to keep it the way it is for layout tests.

To help with the task of importing W3C tests and ensuring that imported tests remain consistent with the W3C repository, a complementary tool may be helpful though.

Such a tool (based on import-w3c-tests and run-webkit-tests) could regularly and automatically compute:
- Which tests have been imported and are in sync with the W3C repository
- Which tests have been imported and are out of sync with the W3C repository
  - How badly this affects the test result (unchanged, probably switched to passing, switched to failing…)
- Which tests remain to be imported (partially imported folders or newly added tests) and their state:
  - Tests that are crashing/hanging (good to fix!)
  - Tests that are failing (ideally with a link to a related bug entry if any)
  - Tests that are potentially passing
  - Tests that were potentially passing and are no longer passing
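
A very rough sketch of the first two bullets, just to show the kind of comparison such a tool could do (the paths and the simple file-level comparison are assumptions, not an existing script):

    import filecmp
    import os

    def classify_imported_tests(imported_dir, w3c_checkout_dir):
        # Classify files as in sync / out of sync / not imported by
        # comparing the imported copy against a local W3C checkout.
        in_sync, out_of_sync, not_imported = [], [], []
        for root, _, files in os.walk(w3c_checkout_dir):
            for name in files:
                upstream = os.path.join(root, name)
                relative = os.path.relpath(upstream, w3c_checkout_dir)
                imported = os.path.join(imported_dir, relative)
                if not os.path.exists(imported):
                    not_imported.append(relative)
                elif filecmp.cmp(upstream, imported, shallow=False):
                    in_sync.append(relative)
                else:
                    out_of_sync.append(relative)
        return in_sync, out_of_sync, not_imported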

That said, this is clearly less a priority than doing the actual job of importing W3C tests :)
Comment 10 Ryosuke Niwa 2014-01-27 01:08:29 PST
I don't think we should merge the Blink patch.  Adding a tool to verify/get statistics on how many W3C tests we're actually passing seems like a good idea, but that shouldn't be restricted to tests that use testharness.js.  We should be able to automate that for ref tests as well, so we should probably file a separate bug for that.
Comment 11 James Graham 2015-07-31 09:23:28 PDT
FWIW the way that Firefox deals with this is:

* Every test that has an expectation of some sort other than PASS gets a corresponding .ini file with details about the expected result
* All tests must match the expectations in their ini file or PASS, otherwise it's treated as an error.
* When new tests are imported these ini files are autogenerated from a test run using a known-good version of the code.
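
For anyone not familiar with them, those ini files look roughly like this (illustrative shape only, not copied from the real metadata tree):

    [some-test.html]
      expected: TIMEOUT
      [name of a failing subtest]
        expected: FAIL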

I strongly disagree with the premise that all imported tests must be given detailed review before they are used. In practice it's highly unlikely that anyone will actually do this work, so as a policy that's exactly equivalent to "we don't run W3C tests". It's also highly unlikely that such a review would catch a meaningful number of issues compared to the cost of doing it. With tests the most important consideration is "is this stable", which reviewers are rather bad at noticing, unless it's doing some very obviously bad things like using setTimeout with short timeouts. Other issues e.g. incorrectness are much easier to spot once the test is actually running.

I agree that the above policy theoretically allows bugs like "test X is expected to run but actually doesn't run". However in practice these are rare, so I am not that concerned. You can of course do better there if you are really worried about it.

I strongly urge that you adopt a process that allows almost-fully automatic imports of the tests from upstream. In Firefox the whole process looks something like:

./mach web-platform-tests-update --sync
git push try
fetchlogs.py try [sha1]
./mach web-platform-tests-update *.log
git rebase -i inbound/tip
git push

That has a couple of manual steps that prevent full automation, but updates are nevertheless trivial and as a consequence we are able to run 95% of the tests with a update latency of about a week.