Bug 136722 - Blink Merge: Remove 'http_lock' code from webkitpy
Summary: Blink Merge: Remove 'http_lock' code from webkitpy
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Brent Fulgham
URL:
Keywords:
Depends on: 136711
Blocks: 138958
  Show dependency treegraph
 
Reported: 2014-09-10 16:25 PDT by Brent Fulgham
Modified: 2014-12-09 11:32 PST (History)
4 users (show)

See Also:


Attachments
WIP patch (26.48 KB, patch)
2014-11-21 03:24 PST, Csaba Osztrogonác
no flags Details | Formatted Diff | Diff
Patch (30.22 KB, patch)
2014-12-09 05:03 PST, Csaba Osztrogonác
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Brent Fulgham 2014-09-10 16:25:06 PDT
Merge this webkitpy change into our repository: <https://chromium.googlesource.com/chromium/blink/+/196f8146a948275c2f1594b13e30ab19a6e6fd66>
Comment 1 Csaba Osztrogonác 2014-09-11 05:46:41 PDT
Why? Did it caused any problem?
Comment 2 Brent Fulgham 2014-09-11 09:19:49 PDT
(In reply to comment #1)
> Why? Did it caused any problem?

From the original Blink comments:

"We used to allow running run-webkit-tests concurrently and they would
coordinate amongst themselves when running the http tests using file
locks. This was used by the Qt port on their buildbots, but it's not
clear that this was needed anywhere else.

This patch removes all of that code, and deletes the generic file lock
module we had as it was now only being used when parsing ASAN logs,
which is only needed on mac and linux (where we can just use flock).

This patch also moves starting and stopping the servers from the
layout_test_runner to the manager, where we can wait to stop the servers
until after retrying any tests. This keeps us from having to start the
servers twice if we need to retry an HTTP test. This will speed up
HTTP retries a tiny bit, but slow down the overall run-webkit-tests
execution a bit in the normal case (no HTTP retries). This tradeoff
seems worth it in order to simplify the logic overall."

Since we no longer support the Qt port, it make sense to get rid of this complexity if it's not actively being used.
Comment 3 Csaba Osztrogonác 2014-11-20 04:24:51 PST
I know it was introduced for Qt buildbots to be able more run-webkit-tests
on the same machine. Know there is no Qt port, and it seems there is no
real need to be able to do it.

I still think it can be useful if you have only one shared tester 
machine to be able run tests on a WebKit port you don't work primarily.
Or if you work on two different bugs parallely, and you might would
like to run a test until you wait a full test run in the background.
Without this mechanism, the second RWT would kill/restart the apache
started by the first RWT and tests would fail.

If this useful mechanism causes problems anywhere, I prefer fixing it.
If it isn't possible easily, it's OK for me to get rid of it. But in
this case we would need some mechanism to warn or prevent the user
to try to run more than one RWT in the same time.
Comment 4 Alexey Proskuryakov 2014-11-20 09:39:01 PST
This code is quite complicated and confusing. Even if it doesn't actively break things, it definitely makes hacking run-webkit-tests harder.

Also, I'm not really sure what the benefit of running multiple run-webkit-test instances in parallel is. Why not just run them serially?
Comment 5 Csaba Osztrogonác 2014-11-20 09:54:49 PST
(In reply to comment #4)
> This code is quite complicated and confusing. Even if it doesn't actively
> break things, it definitely makes hacking run-webkit-tests harder.
Is there a concrete issue which can be solved harder because of the 
existance of httpd locking?
 
> Also, I'm not really sure what the benefit of running multiple
> run-webkit-test instances in parallel is. Why not just run them serially?

Here is a simple use case for it: I think that I fixed a huge bug, but I
want to make sure if all tests pass. Run all tests can take at least 15-30
minutes in release more, more in debug mode. Until the tests run in the
background, I'm working an other bug, maybe I would like to run
some (1-2-10) tests too during debugging.

Without this httpd locking mechanism, a newly started run-webkit-tests simply
kills the running apache and breaks the first test run. It isn't good at all.

If we stated that running more run-webkit-tests paralelly isn't supported
anymore, we should prevent killing the previously started apache accidentally.
Comment 6 Alexey Proskuryakov 2014-11-20 10:17:34 PST
> Is there a concrete issue which can be solved harder because of the 
existance of httpd locking?

I wasted 10 minutes yesterday trying to figure out how this affects what I'm doing to run http tests in parallel. And this all turned out to be unused code.

> Here is a simple use case for it

This case doesn't work now, because the second instance will be simply blocked waiting for the lock.

> we should prevent killing the previously started apache accidentally

Do you have a suggestion for how to achieve this? We need to kill runaway httpd processes (most notably when there is a forgotten run-webkit-httpd script in one of the tabs, but also for any other possible problems).
Comment 7 Csaba Osztrogonác 2014-11-20 10:45:00 PST
(In reply to comment #6)
> > Is there a concrete issue which can be solved harder because of the 
> existance of httpd locking?
> 
> I wasted 10 minutes yesterday trying to figure out how this affects what I'm
> doing to run http tests in parallel. And this all turned out to be unused
> code.

As for me, I run http tests parallel long ago on EFL port and they are quit
stable. It is hard coded to 1 thread in base.py, I don't know the reason,
maybe there were/are issues on other ports.

Running parrallel normal and locked (http + svg-perf) threads can be
configured with:
WEBKIT_TEST_CHILD_PROCESSES env and --child-processes
WEBKIT_TEST_MAX_LOCK_SHARDS env and --max-locked-shards

It is very good to fine tune parallelism or debug issues related
to parallel test running. Or simple set the best and fastest setting
locally. (On Linux WEBKIT_TEST_CHILD_PROCESSES is necessary for users
who build with icecc on 30 threads, but run tests only on 4-8 threads)

> This case doesn't work now, because the second instance will be simply
> blocked waiting for the lock.
It is blocked if http test running is in progress. After the last test
fininshed, the lock is freed and the second instance can start running
http tests.
 
> > we should prevent killing the previously started apache accidentally
> Do you have a suggestion for how to achieve this? We need to kill runaway
> httpd processes (most notably when there is a forgotten run-webkit-httpd
> script in one of the tabs, but also for any other possible problems).

I can imagine a friendly question before killing the httpd, at least in
interactive mode. It can be optional, opt-in or opt-out, it's all the same.
Comment 8 Csaba Osztrogonác 2014-11-20 10:47:08 PST
(In reply to comment #7)
> WEBKIT_TEST_MAX_LOCK_SHARDS env and --max-locked-shards
typo: WEBKIT_TEST_MAX_LOCK_SHARDS --> WEBKIT_TEST_MAX_LOCKED_SHARDS
Comment 9 Alexey Proskuryakov 2014-11-20 10:50:10 PST
> I don't know the reason, maybe there were/are issues on other ports.

Once again, I'm talking about code complexity. This feature is extremely invasive for something that is unused.

> It is blocked if http test running is in progress.

But aren't they always in progress? On my machine, http tests run longer than all other tests combined.
Comment 10 Csaba Osztrogonác 2014-11-20 12:22:40 PST
(In reply to comment #9)
> > I don't know the reason, maybe there were/are issues on other ports.
> Once again, I'm talking about code complexity. This feature is extremely
> invasive for something that is unused.
I see, I don't say we shouldn't remove this complexity. But before doing
it, we could increase the maximum number of locked shards in base.py:
----------------------------------------------------------------------------
    def default_child_processes(self):
        """Return the number of DumpRenderTree instances to use for this port."""
        return self._executive.cpu_count()

    def default_max_locked_shards(self):
        """Return the number of "locked" shards to run in parallel (like the http tests)."""
        return 1
----------------------------------------------------------------------------
Let's start experimenting with the same numbers:
def default_max_locked_shards(self):
    return self.default_child_processes()

Note: default_child_processes() is hard coded to 2 for win, mac 
has a sophisticated algorithm, EFL and GTK use this default value.

If all ports are happy and stable for a time with same unlocked and 
locked thread number, than we can remove all of these complex code paths. ;-)

> > It is blocked if http test running is in progress.
> But aren't they always in progress? On my machine, http tests run longer
> than all other tests combined.

It can be true if only 1 thread is enabled for http tests. I ran a quick test,
https tests took 4.5 minutes (on 4 threads), all tests took 22.5 minutes.
(EFL / Linux)
Comment 11 Alexey Proskuryakov 2014-11-20 13:42:49 PST
> But before doing it, we could increase the maximum number of locked shards in base.py

I don't understand how this is related to removing the http lock.

I am now actively working on making http tests parallel. That doesn't involve any experimenting, I'm fixing actual issues that prevent parallelism. But that is orthogonal to running multiple run-webkit-tests instances, because multiple instances would still fight for port 8000 and for temporary files created by PHP and Perl scripts.
Comment 12 Alexey Proskuryakov 2014-11-20 13:47:19 PST
>> But aren't they always in progress? On my machine, http tests run longer
>> than all other tests combined.

> It can be true if only 1 thread is enabled for http tests.

HTTP tests are sequential now, there is always one thread. Trying to run more will make tests flaky, for multiple reasons that I'm fixing now.

You are suggesting a new feature that hasn't been requested before we started talking about removing unused code. This doesn't strike me as a high level of interest from the community.
Comment 13 Csaba Osztrogonác 2014-11-21 03:16:00 PST
(In reply to comment #11)
> > But before doing it, we could increase the maximum number of locked shards in base.py
> I don't understand how this is related to removing the http lock.

I thought you suggested removing the whole locking shards mechanism. It would
really decrease the complexity of the code. But removing only http locking
wouldn't decrease any complexity. http locking adds only 1-1 lines to
this more complex locked sharding mechanism:
 - self._port.acquire_http_lock() before starting httpd
 - self._port.release_http_lock() after stopping httpd

I still don't have objection against removing it, just asked to have 
a graceful mechanism not kill running httpds unasked and agressively.

And if we are doing cleanups, go forward, and remove the whole locking
shards complexity and run https tests parallel when all ports are ready
for it and they are very stable.

> I am now actively working on making http tests parallel. That doesn't
> involve any experimenting, I'm fixing actual issues that prevent
> parallelism. But that is orthogonal to running multiple run-webkit-tests
> instances, because multiple instances would still fight for port 8000 and
> for temporary files created by PHP and Perl scripts.

As I said previously, parallel http testing is stable for long time ago,
at least on EFL port. I already have WEBKIT_TEST_MAX_LOCK_SHARDS=4 in my
environment for 2 years and haven't seen any flakiness related to it.
Sharding guarantess that tests in the same directory aren't run parallel.
Of course rare flakiness can happen if php and perl scripts are fighting
for the same temporary files and full parallelism is enabled. Or if two
different tests in different directories want to use the same temporary
file in the same time. But this is a bug in the test infrastructure and 
in the tests too. To avoid all of these possible parallelism issues, we
shouldn't let any test to write a hard coded file name on the filesystem. 
Now the run-webkit-test creates an own temporary directory for each DRT/WTR
with <DumpRenderTree|WebkitTestRunner>-XXXXX . DRT/WTR know the path of this
temporary directory, it can (and should) pass it to the perl/php scripts and
they should use only this path for temporary files.

(In reply to comment #12)
> >> But aren't they always in progress? On my machine, http tests run longer
> >> than all other tests combined.
> > It can be true if only 1 thread is enabled for http tests.
> HTTP tests are sequential now, there is always one thread. Trying to run
> more will make tests flaky, for multiple reasons that I'm fixing now.

Just out of curiosity, I tried parallel running on Mac too.
(On a week old build, not checked trunk after your fixes.)
I didn't get any flakiness with 4 parallel threads and http
tests ran in 2.5 minutes. With 8 threads: 2 minutes, no flakiness.
 
> You are suggesting a new feature that hasn't been requested before we
> started talking about removing unused code. This doesn't strike me as a high
> level of interest from the community.
I suggested a new feature (not to kill httpd agressively without asking) 
after you stated http locking unused code. It is unused on Mac, because
you have overriden the acquire_http_lock/release_http_lock methods. But
it is used for other ports, and useful in the use case I wrote before.
Comment 14 Csaba Osztrogonác 2014-11-21 03:17:32 PST
(In reply to comment #0)
> Merge this webkitpy change into our repository:
> <https://chromium.googlesource.com/chromium/blink/+/
> 196f8146a948275c2f1594b13e30ab19a6e6fd66>

file lock can't be removed, it is still used by xvfbdriver.py,
changed the title accordingly.
Comment 15 Csaba Osztrogonác 2014-11-21 03:24:46 PST
Created attachment 242035 [details]
WIP patch

WIP patch, it removes only the lock, not move start/stop servers from the layout_test_runner to the manager. In my opinion it is useful to stop httpd when it isn't needed anymore. It doesn't make sense to start the httpd at the beginning and only stop at the end. TODO: find a way not to kill already runnning httpd agressively without asking the user.
Comment 16 Alexey Proskuryakov 2014-12-03 11:22:52 PST
> I thought you suggested removing the whole locking shards mechanism.

What I'm saying is that this code is so hopelessly confusing that I can't even figure out how to make http tests parallel. I've been testing with "-f", but how do I make them normal tests, like those in fast/ directory?

When I try to change anything, it either breaks locking for perf tests, or refuses to launch httpd (even if I change Manager.needs_servers() to return True).

Then again, I'm not sure if locking does the right thing for perf tests either. Do we still run them in parallel with all other tests?
Comment 17 Csaba Osztrogonác 2014-12-03 12:16:20 PST
(In reply to comment #16)
> > I thought you suggested removing the whole locking shards mechanism.
> 
> What I'm saying is that this code is so hopelessly confusing that I can't
> even figure out how to make http tests parallel. I've been testing with
> "-f", but how do I make them normal tests, like those in fast/ directory?
>
> When I try to change anything, it either breaks locking for perf tests, or
> refuses to launch httpd (even if I change Manager.needs_servers() to return
> True).
> 
> Then again, I'm not sure if locking does the right thing for perf tests
> either. Do we still run them in parallel with all other tests?

Making http tests to normal tests should be easy. Just remove the is_http_test
from _test_requires_lock in manager.py. But in this case starting servers should
be unconditional too. (in layout_test_runner.py) Now httpd are stopped after
the last locked test. It would be broken after making http tests normal test.

The blink change solves this problem with starting httpd at the beginning
and stopping it at the end. And leave locking mechanism for the 24 tests
in LayoutTests/perf directory. But locking is absolutely useless here,
these tests are in one directory, it is one shard, they can't run in
parallel with each other. But they can run parallel with other tests.
Comment 18 Alexey Proskuryakov 2014-12-03 12:38:18 PST
> But they can run parallel with other tests.

Can they? I think that's a mistake, perf tests need to run exclusively - it does not matter whether they run in parallel with other tests, or with other perf tests, the impact is the same.
Comment 19 Csaba Osztrogonác 2014-12-03 13:40:34 PST
(In reply to comment #18)
> > But they can run parallel with other tests.
> 
> Can they? I think that's a mistake, perf tests need to run exclusively - it
> does not matter whether they run in parallel with other tests, or with other
> perf tests, the impact is the same.

Yes, they can. They are handled as locked test same as http test. I checked
the code a little bit deeper now. There are two numbers, number of workers
and number of locked shards. Tests are sharded always by leaf subdirectory,
and then _resize_shards integrates locked shards not to have more than the
max number of locked shard. And then all shards are pushed to the pool,
first the locked shard, second the unlocked shards. It doesn't guarantee
that locked and unlocked shard can't run in parallel. It guarantees that
only max_locked_shards can run parallel. But unlocked shards can run too
if the number of workers are bigger than max_locked_shards.

LayoutTests/perf tests aren't a real performance tests, their runtime
doesn't matter, only their magnitude is important. They are to guarantee
a feature has linear/logarithmic/O(n^2) magnitude. But it seems they are
flakey and timeouts regularly. They are skipped on mac, many of them are
skipped on GTK/EFL/Win.

I'm afraid there isn't an easy way to guarantee perf tests run first without
any parallelism and then all the other tests. Or what if we don't run them
during layout testing, but in a separated buildstep - "run-webkit-tests perf"
Comment 20 Alexey Proskuryakov 2014-12-03 13:52:14 PST
Right. This means that locking is already broken for perf tests, because the idea is that they run exclusively, when the machine is otherwise idle:

        """Return True if the test needs to be locked when
        running multiple copies of NRWTs. Perf tests are locked
        because heavy load caused by running other tests in parallel
        might cause some of them to timeout."""

The comment isn't helping either, because this affects threading within a single NRWT, not (or not only?) multiple ones.

Do you plan to finish the WIP patch? Removing the code would make it more feasible for me to hack on what remains.
Comment 21 Alexey Proskuryakov 2014-12-04 12:16:13 PST
> Right. This means that locking is already broken for perf tests

Posted a patch to remove that in bug 139264.
Comment 22 Csaba Osztrogonác 2014-12-09 05:03:52 PST
Created attachment 242909 [details]
Patch

Sorry for the delay, let's remove it. I updated the patch after r176814 and r176830.
Comment 23 Alexey Proskuryakov 2014-12-09 09:39:17 PST
Comment on attachment 242909 [details]
Patch

Thank you! rs=me
Comment 24 WebKit Commit Bot 2014-12-09 11:32:19 PST
Comment on attachment 242909 [details]
Patch

Clearing flags on attachment: 242909

Committed r177028: <http://trac.webkit.org/changeset/177028>
Comment 25 WebKit Commit Bot 2014-12-09 11:32:22 PST
All reviewed patches have been landed.  Closing bug.