Bug 63116

Summary: nrwt: allow for multiple http shards
Product: WebKit Reporter: Dirk Pranke <dpranke>
Component: New BugsAssignee: Dirk Pranke <dpranke>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, abecsi, aroben, eric, galpeter, kkristof, ojan, ossy, rgabor, tony
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: PC   
OS: All   
Bug Depends on: 63112, 64092    
Bug Blocks: 75958    
Attachments:
Description Flags
checkpoint work; needs more testing
none
clean up logic, add tests
none
update w/ review feedback from tony
none
try to make _resize_shards clearer
none
update w/ more feedback from tony
none
patch for landing after rebasing to HEAD none

Description Dirk Pranke 2011-06-21 20:52:44 PDT
nrwt: allow for multiple http shards
Comment 1 Dirk Pranke 2011-06-21 20:55:36 PDT
Created attachment 98112 [details]
checkpoint work; needs more testing
Comment 2 Dirk Pranke 2011-06-21 21:10:49 PDT
FYI, the final version of the new sharding logic will probably look something like this, although this needs more tests and I think I can clean it up more.

The basic idea is to limit the # of concurrent HTTP tests to something sane like max(# cpus/4, 1) so that we don't hammer the server too much and increase flakiness. I think even splitting the HTTP tests up into 4 chunks should be enough to make them not on the critical path, but I haven't tested the perf improvements yet.
Comment 3 Tony Chang 2011-06-22 10:39:45 PDT
Are both lighttpd and apache2 configured to handle a reasonable number of concurrent connections (maybe 3 times the number of DRT processes)?  Are there any tests that try to set/get cookies that might start failing?
Comment 4 Dirk Pranke 2011-06-22 10:56:49 PDT
(In reply to comment #3)
> Are both lighttpd and apache2 configured to handle a reasonable number of concurrent connections (maybe 3 times the number of DRT processes)?  Are there any tests that try to set/get cookies that might start failing?

Good questions. I don't yet know the answers. Presumably the answer to the cookie question is yes :).

However, given that we've had --experimental-fully-parallel forever and I've not noticed any real problems with http specifically, I think we're probably fine, but definitely more testing is needed before we turn this on by default.
Comment 5 Eric Seidel (no email) 2011-06-22 10:58:58 PDT
There are php-based cookie tests which are known to be flaky in our current setup.  I could forsee them failing in a sharded environment.  Unclear.

See https://bugs.webkit.org/show_bug.cgi?id=51613 for one example.
Comment 6 Dirk Pranke 2011-06-22 20:04:50 PDT
Created attachment 98302 [details]
clean up logic, add tests
Comment 7 Tony Chang 2011-06-24 15:16:26 PDT
Comment on attachment 98302 [details]
clean up logic, add tests

View in context: https://bugs.webkit.org/attachment.cgi?id=98302&action=review

> Tools/ChangeLog:20
> +        nrwt: make sharding tests needing locks less hard-coded
> +        https://bugs.webkit.org/show_bug.cgi?id=63112

Is this ChangeLog entry supposed to be here?

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:543
>          Return:
>              Two lists of lists of TestInput objects. The first list should
>              only be run under the server lock, the second can be run whenever.

This patch might be the right time to make this return type into a small class.

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:629
> +    def _resize_shards(self, shards, max_shards):

What is |shards|?

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:633
> +        i = 1
> +        j = 1

Please use more descriptive variable names.  It looks like i is current directory count and j is number of shards.

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:637
> +            tests.extend(shard[1])

It's not clear to me what shard[1].  I assume it's part of a tuple (so it's always valid), but I can't tell.

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:640
> +                new_shards.append(('locked_shard_%d' % j, tests))

Can we compute j based on the size of new_shards?
Comment 8 Ojan Vafai 2011-06-24 15:28:13 PDT
View in context: https://bugs.webkit.org/attachment.cgi?id=98302&action=review

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:554
> +        if num_workers == 1:
> +            return self._shard_in_two(test_files)
> +        elif fully_parallel:
> +            return self._shard_every_file(test_files)
> +        else:
> +            return self._shard_by_directory(test_files, num_workers)

No need for the else's since you're early returning as per http://www.webkit.org/coding/coding-style.html.

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:630
> +        dirs_per_new_shard = math.ceil(len(shards) / max_shards * 1.0)

Why the "* 1.0"?

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:636
> +        while shards:
> +            shard = shards.pop()

Can you just do "for shard in shards:"?

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:640
> +                new_shards.append(('locked_shard_%d' % j, tests))

Instead of manually keeping track of "j", can you just use len(new_shards)?

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:645
> +        if tests:
> +            new_shards.append(('locked_shard_%d' % j, tests))

I think you could simplify this to not need this last case if you did something like:
for shard in shards:
    if not tests:
        new_shards.append(...)
    tests.extend(...)
    i += 1
    if i == dirs_per_new_shard:
        tests = []
Comment 9 Dirk Pranke 2011-06-24 16:26:08 PDT
Comment on attachment 98302 [details]
clean up logic, add tests

View in context: https://bugs.webkit.org/attachment.cgi?id=98302&action=review

>> Tools/ChangeLog:20
>> +        https://bugs.webkit.org/show_bug.cgi?id=63112
> 
> Is this ChangeLog entry supposed to be here?

No. I think this maybe crept in in some merge/rebasing effort. I'll remove it.

>> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:543
>>              only be run under the server lock, the second can be run whenever.
> 
> This patch might be the right time to make this return type into a small class.

I started to do this as part of revising the patch per your feedback, but it turns out that this just makes things more complicated: you end up creating a class that holds two fields, each of which is a list, and really all you want to do is extract each list and then add them together, the calling code gets much worse.

However, I think partially this code looks complex because each shard itself is an anonymous data structure. I've changed that so that there is now a TestShard class (a named list of TestInputs) and I think that ends up making things clearer. Let me know what you think.

>> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:629
>> +    def _resize_shards(self, shards, max_shards):
> 
> What is |shards|?

I will rewrite everything to be clearer.

>> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:633
>> +        j = 1
> 
> Please use more descriptive variable names.  It looks like i is current directory count and j is number of shards.

Will do.

>> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:637
>> +            tests.extend(shard[1])
> 
> It's not clear to me what shard[1].  I assume it's part of a tuple (so it's always valid), but I can't tell.

This should become clearer.

>> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:640
>> +                new_shards.append(('locked_shard_%d' % j, tests))
> 
> Can we compute j based on the size of new_shards?

Yeah.
Comment 10 Dirk Pranke 2011-06-24 16:31:12 PDT
Created attachment 98560 [details]
update w/ review feedback from tony
Comment 11 Dirk Pranke 2011-06-24 16:36:58 PDT
(In reply to comment #8)
> View in context: https://bugs.webkit.org/attachment.cgi?id=98302&action=review
> 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:554
> > +        if num_workers == 1:
> > +            return self._shard_in_two(test_files)
> > +        elif fully_parallel:
> > +            return self._shard_every_file(test_files)
> > +        else:
> > +            return self._shard_by_directory(test_files, num_workers)
> 
> No need for the else's since you're early returning as per http://www.webkit.org/coding/coding-style.html.
> 

True.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:630
> > +        dirs_per_new_shard = math.ceil(len(shards) / max_shards * 1.0)
> 
> Why the "* 1.0"?
> 

Otherwise you get an integer division and it truncates down, making the ceil() call useless. 

> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:636
> > +        while shards:
> > +            shard = shards.pop()
> 
> Can you just do "for shard in shards:"?
> 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:640
> > +                new_shards.append(('locked_shard_%d' % j, tests))
> 
> Instead of manually keeping track of "j", can you just use len(new_shards)?
> 

Yeah, I made these two changes in the new patch.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:645
> > +        if tests:
> > +            new_shards.append(('locked_shard_%d' % j, tests))
> 
> I think you could simplify this to not need this last case if you did something like:
> for shard in shards:
>     if not tests:
>         new_shards.append(...)
>     tests.extend(...)
>     i += 1
>     if i == dirs_per_new_shard:
>         tests = []

Hm. I will ponder this one :)
Comment 12 Dirk Pranke 2011-06-24 19:31:58 PDT
Created attachment 98571 [details]
try to make _resize_shards clearer
Comment 13 Tony Chang 2011-06-27 09:41:08 PDT
Comment on attachment 98571 [details]
try to make _resize_shards clearer

View in context: https://bugs.webkit.org/attachment.cgi?id=98571&action=review

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:241
> +    # FIXME: make this class visible, used by workers as well.

Nit: make -> Make

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:244
> +    def __init__(self, name, test_inputs):
> +        self.name = name
> +        self.test_inputs = test_inputs

Nit: Maybe document that test_inputs is a list of TestInputs?

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:610
> +        locked_shards = []
> +        unlocked_shards = []

Nit: I would move these lines between the for loop to so it's closer to where it's first used.

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:629
> +        locked_shards.sort(key=lambda shard: shard.name)
> +        unlocked_shards.sort(key=lambda shard: shard.name)

Should we implement __cmp__ for TestShard?

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:657
> +            return int(math.ceil(numerator * 1.0 / divisor))

Nit: I think float(numerator) is more explicit than * 1.0.

> Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:759
> +            # FIXME: change 'test_list' to 'shard', make sharding public

Nit: change -> Change and public -> public.
Comment 14 Dirk Pranke 2011-06-27 15:33:06 PDT
(In reply to comment #13)
> (From update of attachment 98571 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=98571&action=review
> 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:241
> > +    # FIXME: make this class visible, used by workers as well.
> 
> Nit: make -> Make
> 

Done.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:244
> > +    def __init__(self, name, test_inputs):
> > +        self.name = name
> > +        self.test_inputs = test_inputs
> 
> Nit: Maybe document that test_inputs is a list of TestInputs?
>

There's a class-level docstring right above the __init__ that says that the object is a named list of TestInputs; between that and the variable/field names that seemed to cover things well enough, I thought.
 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:610
> > +        locked_shards = []
> > +        unlocked_shards = []
> 
> Nit: I would move these lines between the for loop to so it's closer to where it's first used.
> 
> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:629
> > +        locked_shards.sort(key=lambda shard: shard.name)
> > +        unlocked_shards.sort(key=lambda shard: shard.name)
> 
> Should we implement __cmp__ for TestShard?
>

I wouldn't just yet; I'm not sure __cmp__ makes a lot of sense as a generic operation, and there's no obvious need for it.

> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:657
> > +            return int(math.ceil(numerator * 1.0 / divisor))
> 
> Nit: I think float(numerator) is more explicit than * 1.0.
> 

Good call. Done

> > Tools/Scripts/webkitpy/layout_tests/layout_package/manager.py:759
> > +            # FIXME: change 'test_list' to 'shard', make sharding public
> 
> Nit: change -> Change and public -> public.

Done.
Comment 15 Dirk Pranke 2011-06-27 15:35:54 PDT
Created attachment 98804 [details]
update w/ more feedback from tony
Comment 16 Dirk Pranke 2011-06-30 19:55:43 PDT
Created attachment 99420 [details]
patch for landing after rebasing to HEAD
Comment 17 Dirk Pranke 2011-07-08 14:12:54 PDT
Fixed in r90419.
Comment 18 Eric Seidel (no email) 2012-01-10 10:31:12 PST
IMO the right way to fix all of this http sharding is to use port redirection in the DRT instance, and to start the http server with a custom port.  That's more work, but is a better long term solution than any of these locks or hacks we've added in the past. :)
Comment 19 Dirk Pranke 2012-01-10 11:28:52 PST
(In reply to comment #18)
> IMO the right way to fix all of this http sharding is to use port redirection in the DRT instance, and to start the http server with a custom port.  That's more work, but is a better long term solution than any of these locks or hacks we've added in the past. :)

Eric, did you mean for this comment to be on bug 75958 instead? 

At any rate, I think you're conflating two things ... this feature is about being able to run multiple http tests in parallel in a single NRWT run; the reason we don't do that today is because we're concerned about overloading the http server and/or inter-test dependencies. I don't know if running on different http ports would make this better or worse, but it shouldn't be necessary, regardless.

I think you're talking about running multiple http servers on different ports so that multiple separate nrwt runs can occur in parallel, right?