Bug 202993

Summary: Add the timeout-factor parameter in browserperfdash-benchmark
Product: WebKit Reporter: Pablo Saavedra <psaavedra>
Component: Tools / TestsAssignee: Pablo Saavedra <psaavedra>
Status: REOPENED ---    
Severity: Normal CC: clopez, dewei_zhu, ews-watchlist, glenn, jbedard, rniwa, webkit-bug-importer
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=248498
Attachments:
Description Flags
patch
none
patch
none
patch rniwa: review-

Description Pablo Saavedra 2019-10-15 10:08:42 PDT
The timeout-factor is useful to run the tests in slower devices where the default timeout value defined in the .plan files is not high enough to let the tests end.
Comment 1 Pablo Saavedra 2019-10-15 10:12:18 PDT
Created attachment 380997 [details]
patch
Comment 2 Pablo Saavedra 2019-10-16 00:06:26 PDT
Created attachment 381058 [details]
patch
Comment 3 Carlos Alberto Lopez Perez 2019-10-18 06:24:56 PDT
Comment on attachment 381058 [details]
patch

View in context: https://bugs.webkit.org/attachment.cgi?id=381058&action=review

> Tools/Scripts/webkitpy/browserperfdash/browserperfdash_runner.py:58
> +    parser.add_argument('--timeout-factor', dest='timeoutFactorOverride', type=int, help='Timeout factor useful to run test in slower devices. e.g. 10')

As a user reading the help, I would find the explanation about what timeout-factor is a bit confusing. I would write instead something like:
"Multiply the test timeout by the given factor, useful to run tests in slower devices, e.g.: 10'
Comment 4 Pablo Saavedra 2019-10-18 06:32:46 PDT
Created attachment 381298 [details]
patch
Comment 5 Carlos Alberto Lopez Perez 2019-10-18 07:14:03 PDT
I wonder if it would make sense to also add this parameter to Tools/Scripts/run-benchmark? I guess that if you are running the tests manually one by one with run-benchmark on a slow device it may be useful as well
Comment 6 Ryosuke Niwa 2019-10-18 15:42:44 PDT
Comment on attachment 381298 [details]
patch

The right fix is to create a new plan file for those slower devices.
Comment 7 Carlos Alberto Lopez Perez 2019-10-19 08:11:58 PDT
(In reply to Ryosuke Niwa from comment #6)
> Comment on attachment 381298 [details]
> patch
> 
> The right fix is to create a new plan file for those slower devices.

I don't think that is the right fix.

On our bots we run all the plans, by passing the argument "--allplans" to the tool.

Duplicating each one of the plans to add different timeouts parameters is totally incompatible with using "--allplans" since you will be running twice the same tests just with different timeout parameters.
Comment 8 Ryosuke Niwa 2019-10-19 14:11:01 PDT
(In reply to Carlos Alberto Lopez Perez from comment #7)
> (In reply to Ryosuke Niwa from comment #6)
> > Comment on attachment 381298 [details]
> > patch
> > 
> > The right fix is to create a new plan file for those slower devices.
> 
> I don't think that is the right fix.
> 
> On our bots we run all the plans, by passing the argument "--allplans" to
> the tool.

The design of run-benchmark is to use --plan and specify a plan file for each platform, device, etc... configuration with customized parameters.

> Duplicating each one of the plans to add different timeouts parameters is
> totally incompatible with using "--allplans" since you will be running twice
> the same tests just with different timeout parameters.

It looks like --allplans https://trac.webkit.org/changeset/197505 added this option but that was probably a mistake in the first place. This is not the way run-benchmark is supposed to be used in the first place so it's natural that a fix would require even more code change that's incompatible with the design of the script.
Comment 9 Ryosuke Niwa 2019-10-19 14:12:14 PDT
The only acceptable alternative solution here (other than creating & specifying individual plan files), in my view, is to add --timeout option which overrides the default timeout of each plan.
Comment 10 Pablo Saavedra 2019-10-21 10:37:11 PDT
(In reply to Ryosuke Niwa from comment #9)
> The only acceptable alternative solution here (other than creating &
> specifying individual plan files), in my view, is to add --timeout option
> which overrides the default timeout of each plan.

I'd like to still keep the timeout-factor but if there  is no other choice and you are ok with the --timeout, then I could modify the bug to do this in your suggested way
Comment 11 Carlos Alberto Lopez Perez 2019-10-21 11:14:37 PDT
(In reply to Ryosuke Niwa from comment #9)
> The only acceptable alternative solution here (other than creating &
> specifying individual plan files), in my view, is to add --timeout option
> which overrides the default timeout of each plan.

Would it be acceptable to keep --timeout-factor only for the browserperfdash-benchmark runner like the current patch does? It doesn't expose --timeout-factor to the run-benchmark script.
Comment 12 Ryosuke Niwa 2019-10-21 16:01:16 PDT
It would be not acceptable to introduce this concept into webkitpy.benchmark_runner classes. If you can figure out a way to do that, I don't really care because this is mostly a maintenance cost issue.
Comment 13 Carlos Alberto Lopez Perez 2019-10-21 18:01:44 PDT
(In reply to Ryosuke Niwa from comment #12)
> It would be not acceptable to introduce this concept into
> webkitpy.benchmark_runner classes. If you can figure out a way to do that, I
> don't really care because this is mostly a maintenance cost issue.

Ok. Then I guess that a possibility is that in the script browserperfdash-runner we create a new class (for example: BrowserperfdashRunner) that inherits from the BenchmarkRunner class and use there this timeout-factor option.
Comment 14 Ryosuke Niwa 2019-10-22 11:49:45 PDT
(In reply to Carlos Alberto Lopez Perez from comment #13)
> (In reply to Ryosuke Niwa from comment #12)
> > It would be not acceptable to introduce this concept into
> > webkitpy.benchmark_runner classes. If you can figure out a way to do that, I
> > don't really care because this is mostly a maintenance cost issue.
> 
> Ok. Then I guess that a possibility is that in the script
> browserperfdash-runner we create a new class (for example:
> BrowserperfdashRunner) that inherits from the BenchmarkRunner class and use
> there this timeout-factor option.

I don't think that satisfies the goal of not increasing the maintenance cost of webkitpy.benchmark_runner class. If you did subclass, then the future refactoring will have to be constrained by the existence of this flag.
Comment 15 Pablo Saavedra 2019-10-23 00:11:35 PDT
(In reply to Ryosuke Niwa from comment #14)
> (In reply to Carlos Alberto Lopez Perez from comment #13)
> > (In reply to Ryosuke Niwa from comment #12)
> > > It would be not acceptable to introduce this concept into
> > > webkitpy.benchmark_runner classes. If you can figure out a way to do that, I
> > > don't really care because this is mostly a maintenance cost issue.
> > 
> > Ok. Then I guess that a possibility is that in the script
> > browserperfdash-runner we create a new class (for example:
> > BrowserperfdashRunner) that inherits from the BenchmarkRunner class and use
> > there this timeout-factor option.
> 
> I don't think that satisfies the goal of not increasing the maintenance cost
> of webkitpy.benchmark_runner class. If you did subclass, then the future
> refactoring will have to be constrained by the existence of this flag.

Just to be in the same page. What you suggest is change the proposed `--timeout-factor` in favour of an '--timeout`, right?

What I don't see is the point of how this satisfies much better the goal of not increasing the maintenance cost ... because basically implies the same changes in the code (the same changes in the constructor methods, similar addition in the ArgParser ...). Did I miss something?
Comment 16 Ryosuke Niwa 2019-10-23 01:43:33 PDT
(In reply to Pablo Saavedra from comment #15)
> (In reply to Ryosuke Niwa from comment #14)
> > (In reply to Carlos Alberto Lopez Perez from comment #13)
> > > (In reply to Ryosuke Niwa from comment #12)
> > >
> > > Ok. Then I guess that a possibility is that in the script
> > > browserperfdash-runner we create a new class (for example:
> > > BrowserperfdashRunner) that inherits from the BenchmarkRunner class and use
> > > there this timeout-factor option.
> > 
> > I don't think that satisfies the goal of not increasing the maintenance cost
> > of webkitpy.benchmark_runner class. If you did subclass, then the future
> > refactoring will have to be constrained by the existence of this flag.
> 
> Just to be in the same page. What you suggest is change the proposed
> `--timeout-factor` in favour of an '--timeout`, right?
> 
> What I don't see is the point of how this satisfies much better the goal of
> not increasing the maintenance cost ... because basically implies the same
> changes in the code (the same changes in the constructor methods, similar
> addition in the ArgParser ...). Did I miss something?

Because --timeout is something I can see ourselves using in our port, and it's an option supported by run-webkit-tests and run-perf-tests.
Comment 17 Carlos Alberto Lopez Perez 2019-10-23 04:38:45 PDT
(In reply to Ryosuke Niwa from comment #16)
> (In reply to Pablo Saavedra from comment #15)
> > (In reply to Ryosuke Niwa from comment #14)
> > > (In reply to Carlos Alberto Lopez Perez from comment #13)
> > > > (In reply to Ryosuke Niwa from comment #12)
> > > >
> > > > Ok. Then I guess that a possibility is that in the script
> > > > browserperfdash-runner we create a new class (for example:
> > > > BrowserperfdashRunner) that inherits from the BenchmarkRunner class and use
> > > > there this timeout-factor option.
> > > 
> > > I don't think that satisfies the goal of not increasing the maintenance cost
> > > of webkitpy.benchmark_runner class. If you did subclass, then the future
> > > refactoring will have to be constrained by the existence of this flag.
> > 
> > Just to be in the same page. What you suggest is change the proposed
> > `--timeout-factor` in favour of an '--timeout`, right?
> > 
> > What I don't see is the point of how this satisfies much better the goal of
> > not increasing the maintenance cost ... because basically implies the same
> > changes in the code (the same changes in the constructor methods, similar
> > addition in the ArgParser ...). Did I miss something?
> 
> Because --timeout is something I can see ourselves using in our port, and
> it's an option supported by run-webkit-tests and run-perf-tests.

Currently the run-benchmark script has already options that are not supported by run-webkit-tests and run-perf-tests (and that are also useful to only one port) like:

  --device-id DEVICE_ID
                        Undocumented option for mobile device testing.


Its even an "Undocumented option". And that is fine for us.

So I hope this is not about what you find useful on your port and what not, because that argument doesn't look like a great way of collaborating on a project like WebKit that is all about ports.


Speaking about the the two options:

--timeout: i can see how this is useful for the run-benchmark script for engineers running manually tests and wanting to run them with a bigger timeout than the plan, and I agree it can be a good idea to add it to the script

--timeout-factor: for the browserperfdash-benchmark script it is definitively more useful than the "--timeout" factor. Because this script is not intended to be ran manually, but as part of a CI, where it runs all the plans. So it allows the device to have a multiplier for the plan timeout instead of hardcoding a value. With a timeout-factor of 4, the kraken plan would have 300*4=1200 and the jetstream2 would have 1200*4=4800. However, if we have to hardcode a 4800 value as timeout, that means kraken would have also 4800. Which means that jetstream2 would run with a multiplier of 4 and kraken with one of 16. :\

We can understand that the timeout-factor can be not all that useful for run-benchmark for your use-case (for us it definitively would be), so we have offered to add it only to the browserperfdash-benchmark script. Even by creating a new class to avoid touching the run-benchmark code.

The proposed code to add to run-benchmark for timeout-factor (basically the patch in this bug) is just 2 added lines and an option, so I don't see how that can be a real concern regarding maintenance.

In any case, if you still think that it will be a maintenance cost that deservers this discussion, then the only way forward I see is that we subclass, and we carry on with the maintenance burden. If you want to re-factor, you can do that ignoring browserperdash-runner, we will try follow-up with any fix on our side after that.
Comment 18 Ryosuke Niwa 2019-10-23 14:21:37 PDT
(In reply to Carlos Alberto Lopez Perez from comment #17)> 
> So I hope this is not about what you find useful on your port and what not,
> because that argument doesn't look like a great way of collaborating on a
> project like WebKit that is all about ports.

I find it extremely frustrating that you seem to always on implementing the very first & best solution you come up with instead of listening to the main maintainer of the relevant codebase. This is highly problematic and contrary to the way we normally operate.

If the main maintainer of the code r-es a patch and suggests an alternative, then the usual path forward is to work with the said reviewer to come up with an acceptable solution.

> --timeout-factor: for the browserperfdash-benchmark script it is
> definitively more useful than the "--timeout" factor. Because this script is
> not intended to be ran manually, but as part of a CI, where it runs all the
> plans. So it allows the device to have a multiplier for the plan timeout
> instead of hardcoding a value. With a timeout-factor of 4, the kraken plan
> would have 300*4=1200 and the jetstream2 would have 1200*4=4800. However, if
> we have to hardcode a 4800 value as timeout, that means kraken would have
> also 4800. Which means that jetstream2 would run with a multiplier of 4 and
> kraken with one of 16.

I worked fully time on the perf bots & maintainable for four years, and my experience is that no device will be faster or slower at a constant rate across multitude of benchmarks. For example, MotionMark's performance would radically depend on GPU performance whereas Speedometer doesn't depend on GPU perf at all.

The performance characteristics of Speedometer and JetStream are also radically different with regards to the number of CPU cores. When there are more cores, JetStream tends to scale & its runtime tends to shrink. Speedometer is primarily single core benchmark and it would not exhibit such a speed or reduction in its runtime. Furthermore, benchmarks like MotionMark has a fixed runtime regardless of how slow a device is.

> We can understand that the timeout-factor can be not all that useful for
> run-benchmark for your use-case (for us it definitively would be), so we
> have offered to add it only to the browserperfdash-benchmark script. Even by
> creating a new class to avoid touching the run-benchmark code.

I fundamentally object to your premise that this is a useful option even in bot environment.

> In any case, if you still think that it will be a maintenance cost that
> deservers this discussion, then the only way forward I see is that we
> subclass, and we carry on with the maintenance burden. If you want to
> re-factor, you can do that ignoring browserperdash-runner, we will try
> follow-up with any fix on our side after that.

This is not an acceptable path forward in my view.
Comment 19 Ryosuke Niwa 2019-10-23 14:26:55 PDT
Let me also clarify. run-bencharmk is one of the most mission critical script for Apple's WebKit team. 

Any code change to it must be vetted very thoroughly because any brokerage or amendment can totally ruin our effort to defend the performance of WebKit and the rest of the operating systems Apple is working on.
Comment 20 Carlos Alberto Lopez Perez 2019-10-25 06:02:25 PDT
(In reply to Ryosuke Niwa from comment #18)
> (In reply to Carlos Alberto Lopez Perez from comment #17)> 
> > So I hope this is not about what you find useful on your port and what not,
> > because that argument doesn't look like a great way of collaborating on a
> > project like WebKit that is all about ports.
> 
> I find it extremely frustrating that you seem to always on implementing the
> very first & best solution you come up with instead of listening to the main
> maintainer of the relevant codebase. This is highly problematic and contrary
> to the way we normally operate.
> 

I'm also kind of frustrated with this.

And I don't know why you seem to think that I'm not listening what you propose.
I have listened carefully to your arguments and to the alternatives proposed.

And I still think the best technical solution for the use case we want to solve is the --timeout-factor and not the other alternatives proposed.


> > --timeout-factor: for the browserperfdash-benchmark script it is
> > definitively more useful than the "--timeout" factor. Because this script is
> > not intended to be ran manually, but as part of a CI, where it runs all the
> > plans. So it allows the device to have a multiplier for the plan timeout
> > instead of hardcoding a value. With a timeout-factor of 4, the kraken plan
> > would have 300*4=1200 and the jetstream2 would have 1200*4=4800. However, if
> > we have to hardcode a 4800 value as timeout, that means kraken would have
> > also 4800. Which means that jetstream2 would run with a multiplier of 4 and
> > kraken with one of 16.
> 
> I worked fully time on the perf bots & maintainable for four years, and my
> experience is that no device will be faster or slower at a constant rate
> across multitude of benchmarks. For example, MotionMark's performance would
> radically depend on GPU performance whereas Speedometer doesn't depend on
> GPU perf at all.
> 
> The performance characteristics of Speedometer and JetStream are also
> radically different with regards to the number of CPU cores. When there are
> more cores, JetStream tends to scale & its runtime tends to shrink.
> Speedometer is primarily single core benchmark and it would not exhibit such
> a speed or reduction in its runtime. 

I agree with the idea that comparing the runtime (or performance values) of different benchmarks on different machines is not as simple as multiplying for a constant.

However we are talking about timeouts, and the timeout value of a test it is not some scientific constant that needs to be carefully obtained.
It is an approximation of the maximum time that seems reasonable to wait for a test to finish.
And that time is usually highly dependent on the computing power of the machine doing the test.
So multiplying that timeout value for a constant when the machine is slower seems like another reasonable approximation to me.


> Furthermore, benchmarks like MotionMark
> has a fixed runtime regardless of how slow a device is.
> 

You have a good point here: --timeout-factor is a bad idea if the test has a fixed runtime.
Maybe the plan file could include a new "timeout_type" key indicating if the timeout is fixed or depends on the computing power ? That way --timeout-factor would only apply when the runtime is not fixed.

> > We can understand that the timeout-factor can be not all that useful for
> > run-benchmark for your use-case (for us it definitively would be), so we
> > have offered to add it only to the browserperfdash-benchmark script. Even by
> > creating a new class to avoid touching the run-benchmark code.
> 
> I fundamentally object to your premise that this is a useful option even in
> bot environment.
> 

??

> > In any case, if you still think that it will be a maintenance cost that
> > deservers this discussion, then the only way forward I see is that we
> > subclass, and we carry on with the maintenance burden. If you want to
> > re-factor, you can do that ignoring browserperdash-runner, we will try
> > follow-up with any fix on our side after that.
> 
> This is not an acceptable path forward in my view.

Ok, I'm giving up with trying to convince you that --timeout-factor is a good idea.

Perhaps going with --timeout may also work for our use case.
Comment 21 Pablo Saavedra 2019-12-23 08:12:15 PST
Set as resolved by mistake
Comment 22 Radar WebKit Bug Importer 2019-12-23 08:12:19 PST
<rdar://problem/58161884>