Bug 38879 - new-run-webkit-tests: implement "--reset-results"
Summary: new-run-webkit-tests: implement "--reset-results"
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: New Bugs (show other bugs)
Version: 528+ (Nightly build)
Hardware: Other OS X 10.5
: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-10 18:06 PDT by Dirk Pranke
Modified: 2010-05-18 17:52 PDT (History)
7 users (show)

See Also:


Attachments
Patch (20.28 KB, patch)
2010-05-10 18:07 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
cleaned-up patch, still no tests (9.00 KB, patch)
2010-05-11 14:22 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
Patch (14.94 KB, patch)
2010-05-11 16:03 PDT, Dirk Pranke
ojan: review-
Details | Formatted Diff | Diff
update w/ feedback from ojan, eseidel, abarth (20.26 KB, patch)
2010-05-17 16:18 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
change chromium expectation from WONTFIX SKIP = FAIL to WONTFIX = MISSING (20.26 KB, patch)
2010-05-17 16:43 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
add test for platform-specific baseline, refactor into two unit tests, one for each switch (20.11 KB, patch)
2010-05-18 15:37 PDT, Dirk Pranke
no flags Details | Formatted Diff | Diff
okay, use a separate layout_tests_dir under webkitpy/layout_tests/data to isolate unit test data from actual layout tests (34.71 KB, patch)
2010-05-18 17:11 PDT, Dirk Pranke
ojan: review+
Details | Formatted Diff | Diff
lint (35.03 KB, patch)
2010-05-18 17:13 PDT, Dirk Pranke
ojan: review+
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dirk Pranke 2010-05-10 18:06:19 PDT
new-run-webkit-tests: add an '--update-baseline' switch
Comment 1 Dirk Pranke 2010-05-10 18:07:19 PDT
Created attachment 55631 [details]
Patch
Comment 2 Dirk Pranke 2010-05-10 18:08:21 PDT
Note that the patch I just uploaded does not have any unit tests yet and so it isn't ready to land, but the change itself is complete.
Comment 3 Dirk Pranke 2010-05-11 14:22:08 PDT
Created attachment 55755 [details]
cleaned-up patch, still no tests
Comment 4 Dirk Pranke 2010-05-11 16:03:29 PDT
Created attachment 55772 [details]
Patch
Comment 5 Dirk Pranke 2010-05-11 16:04:15 PDT
Okay, the latest patch has unit tests and is ready for review.
Comment 6 Ojan Vafai 2010-05-13 16:20:20 PDT
Comment on attachment 55772 [details]
Patch

The code logic looks correct and I agree that this is necessary functionality. r- mostly just due to naming nits.
---------------------------------
http://wkrietveld.appspot.com/38879/diff/1/5
File WebKitTools/Scripts/webkitpy/layout_tests/port/test.py (left):

http://wkrietveld.appspot.com/38879/diff/1/5#oldcode141
WebKitTools/Scripts/webkitpy/layout_tests/port/test.py:141: 
did you mean to delete these two blank lines?

http://wkrietveld.appspot.com/38879/diff/1/5
File WebKitTools/Scripts/webkitpy/layout_tests/port/test.py (right):

http://wkrietveld.appspot.com/38879/diff/1/5#newcode31
WebKitTools/Scripts/webkitpy/layout_tests/port/test.py:31: from __future__ import with_statement
Is this just to satisfy the linter? __future__ is new to me.

http://wkrietveld.appspot.com/38879/diff/1/6
File WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py (right):

http://wkrietveld.appspot.com/38879/diff/1/6#newcode1527
WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py:1527: "generated results"),
This description makes it sound like it will update every expected result for this test, when, in practice, it will update only the result that happens to be used on this platform.

--new-baseline and --update-baseline are too similar of names. We shouldn't need the help text to remember which is which. How about:

--new-baseline --> --update-platform-baseline
--update-baseline --> --update-used-baseline

It's a bit more wordy, but at least it's clear which is which. We could add short versions for easier typing (e.g. -b and -u?).

http://wkrietveld.appspot.com/38879/diff/1/7
File WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py (right):

http://wkrietveld.appspot.com/38879/diff/1/7#newcode106
WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py:106: return original_open(name, mode, encoding)
Should we make sure the mode here is 'r' to avoid ever writing to disk?

http://wkrietveld.appspot.com/38879/diff/1/7#newcode124
WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py:124: 'fast/canvas/canvas-zoom.html'])
Should we also test a case where --update-baseline would update the results in platform/mac?

http://wkrietveld.appspot.com/38879/diff/1/9
File WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py (right):

http://wkrietveld.appspot.com/38879/diff/1/9#newcode97
WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py:97: new_baseline=True):
docstring?

I don't like this argument name. It reads like this is the actual new baseline, not like a boolean. How about save_new_baseline?

Also, if we end up changing the name of the command-line flag, I think this would be clearer. This could be update_platform_baseline.
Comment 7 Adam Barth 2010-05-13 17:01:51 PDT
How does this differ from --reset-results?
Comment 8 Dirk Pranke 2010-05-13 18:07:35 PDT
Okay, I attempted to use rietveld, but how do you publish the comments?

Repeating them here in line ...

(In reply to comment #6)
> (From update of attachment 55772 [details])
> The code logic looks correct and I agree that this is necessary functionality. r- mostly just due to naming nits.
> ---------------------------------
> http://wkrietveld.appspot.com/38879/diff/1/5
> File WebKitTools/Scripts/webkitpy/layout_tests/port/test.py (left):
> 
> http://wkrietveld.appspot.com/38879/diff/1/5#oldcode141
> WebKitTools/Scripts/webkitpy/layout_tests/port/test.py:141: 
> did you mean to delete these two blank lines?

No. good catch.

> 
> http://wkrietveld.appspot.com/38879/diff/1/5
> File WebKitTools/Scripts/webkitpy/layout_tests/port/test.py (right):
> 
> http://wkrietveld.appspot.com/38879/diff/1/5#newcode31
> WebKitTools/Scripts/webkitpy/layout_tests/port/test.py:31: from __future__ import with_statement
> Is this just to satisfy the linter? __future__ is new to me.
> 

You need "from __future__" to be able to use the "with" statement in Python 2.5.

> http://wkrietveld.appspot.com/38879/diff/1/6
> File WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py (right):
> 
> http://wkrietveld.appspot.com/38879/diff/1/6#newcode1527
> WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py:1527: "generated results"),
> This description makes it sound like it will update every expected result for this test, when, in practice, it will update only the result that happens to be used on this platform.
> 
> --new-baseline and --update-baseline are too similar of names. We shouldn't need the help text to remember which is which. How about:
> 
> --new-baseline --> --update-platform-baseline
> --update-baseline --> --update-used-baseline
> 
> It's a bit more wordy, but at least it's clear which is which. We could add short versions for easier typing (e.g. -b and -u?).

I will take another spin at the names and the help text. I don't think these features are used enough to justify short versions.

> 
> http://wkrietveld.appspot.com/38879/diff/1/7
> File WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py (right):
> 
> http://wkrietveld.appspot.com/38879/diff/1/7#newcode106
> WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py:106: return original_open(name, mode, encoding)
> Should we make sure the mode here is 'r' to avoid ever writing to disk?

No, you need to be able to write some files (e.g., the images so they can be diffed, or the results files).

> 
> http://wkrietveld.appspot.com/38879/diff/1/7#newcode124
> WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py:124: 'fast/canvas/canvas-zoom.html'])
> Should we also test a case where --update-baseline would update the results in platform/mac?

I don't see how this tests something interestingly different?

> 
> http://wkrietveld.appspot.com/38879/diff/1/9
> File WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py (right):
> 
> http://wkrietveld.appspot.com/38879/diff/1/9#newcode97
> WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py:97: new_baseline=True):
> docstring?
> 
> I don't like this argument name. It reads like this is the actual new baseline, not like a boolean. How about save_new_baseline?
> 
> Also, if we end up changing the name of the command-line flag, I think this would be clearer. This could be update_platform_baseline.

how about "use_platform_dir"?
Comment 9 Dirk Pranke 2010-05-13 18:15:11 PDT
(In reply to comment #7)
> How does this differ from --reset-results?

Hmm. I was not familiar with --reset-results in old-run-webkit-tests (new-run-webkit-tests doesn't have such a flag).

The intended behavior of this new flag is to update the existing baseline wherever it may be found, rather than just writing a new baseline into the platform dir. Is that what reset-results is supposed to do? If so, I'd be happy to use that flag name instead of creating new ones.

It looks like ORWT also has --add-platform-exceptions and --new-test-results, which look like they're similar to --new-baseline in NRWT; I'm not sure quite what the difference is between the two flags, though.
Comment 10 Dirk Pranke 2010-05-17 16:18:39 PDT
Created attachment 56287 [details]
update w/ feedback from ojan, eseidel, abarth

okay, this changes the flag name from --update-baseline to --reset-results (for consistency with old-run-webkit-tests), and adds tests for what happens when the baseline is missing because the file is "new".
Comment 11 Dirk Pranke 2010-05-17 16:20:00 PDT
change bug subject line from "add an --update-baseline switch" to "implement '--reset-results'"
Comment 12 Eric Seidel (no email) 2010-05-17 16:25:47 PDT
Although fast/harness has been used for testing DumpRenderTree bugs in the past.  It seems rather strange to add this test here since you had to make so many Skipped edits as well as hack the path into run-webkit-tests itself!   Seems this test would make more sense as a python unit test.
Comment 13 Dirk Pranke 2010-05-17 16:40:58 PDT
(In reply to comment #12)
> Although fast/harness has been used for testing DumpRenderTree bugs in the past.  It seems rather strange to add this test here since you had to make so many Skipped edits as well as hack the path into run-webkit-tests itself!   Seems this test would make more sense as a python unit test.

I don't think it's actually possible to do this as a python unit test, at least not any more cleanly than this (you'd end up duplicating logic to mock expectations all over the place).

I would eventually like to have a set of tests that are supposed to provoke expected behavior in DRT (one that crashes, one that produces an IMAGE diff, etc) that are actually run. Unfortunately, there's no way with the Skipped files to indicate that. 

Thinking about it, though, I need to change the expectation in the chromium test_expectations from WONTFIX SKIP = FAIL to WONTFIX = MISSING.

As far as hacking the file into test.py itself, that had to be done because there's no way for the test code to know whether there should be an image file or not if the expectation doesn't exist. A more general solution would be a way to mock expected results into test.py, but I didn't want to clutter this particular CL with that change.
Comment 14 Dirk Pranke 2010-05-17 16:43:04 PDT
Created attachment 56289 [details]
change chromium expectation from WONTFIX SKIP = FAIL to WONTFIX = MISSING
Comment 15 Ojan Vafai 2010-05-17 16:59:14 PDT
(In reply to comment #13)
> Thinking about it, though, I need to change the expectation in the chromium test_expectations from WONTFIX SKIP = FAIL to WONTFIX = MISSING.

I'm confused, shouldn't we be skipping anything in the harness directory? These are only for the python tests, right?

WebKitTools/Scripts/webkitpy/layout_tests/port/test.py:146
 +          if uri.find('fast/harness/missing-expectation') != -1:
No need to put this under fast. I think just a top-level harness directory makes sense if we do need this.

WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py:1527
 +                   "generated results"),
If we're going to go the route of matching old-run-webkit-tests, then you should change --new-baseline to --add-platform-exceptions.

(In reply to comment #8)
> Okay, I attempted to use rietveld, but how do you publish the comments?

1. Click on the rietveld review link.
2. Leave comments. 
3. Click submit.

It's a bit confusing because you go to the review page in order to respond to review feedback. Not sure how to make it better.

> > http://wkrietveld.appspot.com/38879/diff/1/7#newcode124
> > WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py:124: 'fast/canvas/canvas-zoom.html'])
> > Should we also test a case where --update-baseline would update the results in platform/mac?
> 
> I don't see how this tests something interestingly different?

It tests that we don't always just update the results next to the test when using --reset-results.

> > http://wkrietveld.appspot.com/38879/diff/1/9
> > File WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py (right):
> > 
> > http://wkrietveld.appspot.com/38879/diff/1/9#newcode97
> > WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py:97: new_baseline=True):
> > docstring?
> > 
> > I don't like this argument name. It reads like this is the actual new baseline, not like a boolean. How about save_new_baseline?
> > 
> > Also, if we end up changing the name of the command-line flag, I think this would be clearer. This could be update_platform_baseline.
> 
> how about "use_platform_dir"?

Sure.
Comment 16 Dirk Pranke 2010-05-17 17:05:44 PDT
(In reply to comment #15)
> (In reply to comment #13)
> > Thinking about it, though, I need to change the expectation in the chromium test_expectations from WONTFIX SKIP = FAIL to WONTFIX = MISSING.
> 
> I'm confused, shouldn't we be skipping anything in the harness directory? These are only for the python tests, right?
> 
> WebKitTools/Scripts/webkitpy/layout_tests/port/test.py:146
>  +          if uri.find('fast/harness/missing-expectation') != -1:
> No need to put this under fast. I think just a top-level harness directory makes sense if we do need this.
> 

I'm reusing the existing fast/harness directory, which is used for some DRT-related tests as well (per comments from eseidel). I can certainly create a separate directory if that would be better.
Comment 17 Dirk Pranke 2010-05-17 17:09:14 PDT
(In reply to comment #15)
> WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py:1527
>  +                   "generated results"),
> If we're going to go the route of matching old-run-webkit-tests, then you should change --new-baseline to --add-platform-exceptions.
>

It looked like --add-platform-exceptions did something slightly different; I'm not sure what the difference between --add-platform-exceptions and --new-test-results is in old-run-webkit-tests.

> > > http://wkrietveld.appspot.com/38879/diff/1/7#newcode124
> > > WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py:124: 'fast/canvas/canvas-zoom.html'])
> > > Should we also test a case where --update-baseline would update the results in platform/mac?
> > 
> > I don't see how this tests something interestingly different?
> 
> It tests that we don't always just update the results next to the test when using --reset-results.

I gotcha. Good idea; I'll add that.

> 
> > > http://wkrietveld.appspot.com/38879/diff/1/9
> > > File WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py (right):
> > > 
> > > http://wkrietveld.appspot.com/38879/diff/1/9#newcode97
> > > WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py:97: new_baseline=True):
> > > docstring?
> > > 
> > > I don't like this argument name. It reads like this is the actual new baseline, not like a boolean. How about save_new_baseline?
> > > 
> > > Also, if we end up changing the name of the command-line flag, I think this would be clearer. This could be update_platform_baseline.
> > 
> > how about "use_platform_dir"?
> 
> Sure.

I ended up with generate_new_baseline, which was close to your save_new_baseline suggestion.
Comment 18 Ojan Vafai 2010-05-17 17:13:44 PDT
(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #13)
> > > Thinking about it, though, I need to change the expectation in the chromium test_expectations from WONTFIX SKIP = FAIL to WONTFIX = MISSING.
> > 
> > I'm confused, shouldn't we be skipping anything in the harness directory? These are only for the python tests, right?
> > 
> > WebKitTools/Scripts/webkitpy/layout_tests/port/test.py:146
> >  +          if uri.find('fast/harness/missing-expectation') != -1:
> > No need to put this under fast. I think just a top-level harness directory makes sense if we do need this.
> > 
> 
> I'm reusing the existing fast/harness directory, which is used for some DRT-related tests as well (per comments from eseidel). I can certainly create a separate directory if that would be better.

Oh. NM. I didn't realize the directory already existed.

Anyways, Eric and I talked about this in person. Instead of using codecs, could you create a python unittest by doing the following?

1. Create a temp directory and populate it with a few dummy tests that test the cases you care about.
2. Have the port look at the temp directory instead of LayoutTests.

Then you could actually just have unittests. They write to the real file system, but they don't need to do anything involving the real checkout.


> > WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py:1527
> >  +                   "generated results"),
> > If we're going to go the route of matching old-run-webkit-tests, then you should change --new-baseline to --add-platform-exceptions.
> 
> It looked like --add-platform-exceptions did something slightly different; I'm not sure what the difference between --add-platform-exceptions and --new-test-results is in old-run-webkit-tests.

I think --new-test-results is whether to generate new results if you run a test and it's missing expectations. old-run-webkit-tests does this by default. It's like --reset-results, but only for tests that are currently missing expectations.
Comment 19 Dirk Pranke 2010-05-17 17:34:26 PDT
(In reply to comment #18)
> > I'm reusing the existing fast/harness directory, which is used for some DRT-related tests as well (per comments from eseidel). I can certainly create a separate directory if that would be better.
> 
> Oh. NM. I didn't realize the directory already existed.
> 
> Anyways, Eric and I talked about this in person. Instead of using codecs, could you create a python unittest by doing the following?
> 
> 1. Create a temp directory and populate it with a few dummy tests that test the cases you care about.
> 2. Have the port look at the temp directory instead of LayoutTests.
> 
> Then you could actually just have unittests. They write to the real file system, but they don't need to do anything involving the real checkout.
>

I could do this, but (a) this would involve writing more code in the unit tests module to do setup and cleanup, and (b) it would be slower - you'd actually be creating, writing, and deleting files. Mocking out codecs.open() seems pretty straightforward, so I'm not seeing much upside to your idea. Am I missing something?
 
> 
> > > WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py:1527
> > >  +                   "generated results"),
> > > If we're going to go the route of matching old-run-webkit-tests, then you should change --new-baseline to --add-platform-exceptions.
> > 
> > It looked like --add-platform-exceptions did something slightly different; I'm not sure what the difference between --add-platform-exceptions and --new-test-results is in old-run-webkit-tests.
> 
> I think --new-test-results is whether to generate new results if you run a test and it's missing expectations. old-run-webkit-tests does this by default. It's like --reset-results, but only for tests that are currently missing expectations.

Okay. We can certainly rename --new-baseline to --add-platform-exceptions, although I think this name is not an improvement in clarity (nor is --reset-results). And, in this case, there may be Chromium people used to --new-baseline so there's a minor compatibility question.

How about you and Eric reach a consensus and I'll do whatever you two say?
Comment 20 Dirk Pranke 2010-05-17 17:52:43 PDT
Thinking about this a bit more ...

If the primary objection is having to put test expectations into the real tree for this, we could create a separate tree outside of WebKit/LayoutTests, and point to that instead of WebKit/LayoutTests/* (i.e., test.py would override layout_tests_dir()). The files would still be checked in.

Also, you'd probably still want to mock out the file write() calls since overwriting existing files could be fragile.
Comment 21 Dirk Pranke 2010-05-18 15:37:14 PDT
Created attachment 56416 [details]
add test for platform-specific baseline, refactor into two unit tests, one for each switch
Comment 22 Dirk Pranke 2010-05-18 17:11:26 PDT
Created attachment 56428 [details]
okay, use a separate layout_tests_dir under webkitpy/layout_tests/data to isolate unit test data from actual layout tests
Comment 23 Dirk Pranke 2010-05-18 17:13:13 PDT
Created attachment 56430 [details]
lint
Comment 24 Ojan Vafai 2010-05-18 17:23:58 PDT
Comment on attachment 56428 [details]
okay, use a separate layout_tests_dir under webkitpy/layout_tests/data to isolate unit test data from actual layout tests

WebKitTools/Scripts/webkitpy/layout_tests/port/test.py:115
 +          with codecs.open(expectations_path, "r", "utf-8") as file:
can we just use regular open here?

WebKitTools/Scripts/webkitpy/layout_tests/test_types/image_diff.py:84
 +            generate_new_baseline: is this a new baseline or an updated one?
This description is wrong. Makes it sound like this is actually a baseline, not a boolean. The description you have below in test_type_base.py would be fine. Although, I still prefer these descriptions to be statements, not questions, e.g. "whether to generate a new baseline or update an existing one".
Comment 25 Ojan Vafai 2010-05-18 17:24:54 PDT
Comment on attachment 56430 [details]
lint

r=me, but please address the two comments for the previous patch.
Comment 26 Dirk Pranke 2010-05-18 17:31:17 PDT
(In reply to comment #24)
> (From update of attachment 56428 [details])
> WebKitTools/Scripts/webkitpy/layout_tests/port/test.py:115
>  +          with codecs.open(expectations_path, "r", "utf-8") as file:
> can we just use regular open here?
> 

We probably could, but the other files use codecs.open() and I thought eric wanted us to consistently use codecs.open() all over the place. I'll leave it like this when I commit the change; if eric indicates that it's okay to change this to a regular file open I'll do a separate change for that.

> WebKitTools/Scripts/webkitpy/layout_tests/test_types/image_diff.py:84
>  +            generate_new_baseline: is this a new baseline or an updated one?
> This description is wrong. Makes it sound like this is actually a baseline, not a boolean. The description you have below in test_type_base.py would be fine. Although, I still prefer these descriptions to be statements, not questions, e.g. "whether to generate a new baseline or update an existing one".

I will update the comment.
Comment 27 Dirk Pranke 2010-05-18 17:52:32 PDT
Committed r59725: <http://trac.webkit.org/changeset/59725>