WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
79179
webkitpy: committers_unittest fuzzy matching is really slow
https://bugs.webkit.org/show_bug.cgi?id=79179
Summary
webkitpy: committers_unittest fuzzy matching is really slow
Dirk Pranke
Reported
2012-02-21 18:43:42 PST
webkitpy: committers_unittest fuzzy matching is really slow
Attachments
Patch
(5.28 KB, patch)
2012-02-21 18:53 PST
,
Dirk Pranke
no flags
Details
Formatted Diff
Diff
Patch
(22.07 KB, patch)
2012-02-24 13:29 PST
,
Dirk Pranke
rniwa
: review+
Details
Formatted Diff
Diff
Show Obsolete
(1)
View All
Add attachment
proposed patch, testcase, etc.
Dirk Pranke
Comment 1
2012-02-21 18:53:44 PST
Created
attachment 128105
[details]
Patch
Dirk Pranke
Comment 2
2012-02-21 18:55:52 PST
rniwa ... let me know what you think of this approach. I think I've convinced myself it's safe. Assuming you like it, I will rework the actual code so that instead of calling assert_fuzzy_match() over and over again, we just pass in a list of test cases (much like I did in
https://bugs.webkit.org/show_bug.cgi?id=79159
); I left things this way to minimize the diff so you could better understand that change.
Dirk Pranke
Comment 3
2012-02-21 18:56:32 PST
We can probably do something similar for webkitpy.tool.steps.preparechangelog_unittest.PrepareChangeLogTest.test_fuzzy_reviewer_match as well (haven't really looked yet, but that test is also quite slow).
Ryosuke Niwa
Comment 4
2012-02-21 19:17:18 PST
Comment on
attachment 128105
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=128105&action=review
> Tools/Scripts/webkitpy/common/config/committers_unittest.py:117 > + contributors_used_in_tests = [all_committers.contributor_by_name(c) for c in self._contributors if c]
I don't think this change is safe. A big part of this test is so that we don't mismatch new contributors to alias of old contributors. e.g. darin shouldn't match Darin Fisher.
Dirk Pranke
Comment 5
2012-02-21 19:46:05 PST
Comment on
attachment 128105
[details]
Patch View in context:
https://bugs.webkit.org/attachment.cgi?id=128105&action=review
>> Tools/Scripts/webkitpy/common/config/committers_unittest.py:117 >> + contributors_used_in_tests = [all_committers.contributor_by_name(c) for c in self._contributors if c] > > I don't think this change is safe. A big part of this test is so that we don't mismatch new contributors to alias of old contributors. e.g. darin shouldn't match Darin Fisher.
If I understand you correctly, you want to ensure that 'Darin' will always match 'Darin Adler' and not 'Darin Fisher' as long as 'Darin Adler' as they are both still in the contributor list, right? And since 'Darin Fisher' is not actually listed as one of the tests, below, this case is not adequately covered by my optimization? Presumably we could address that by adding a self._fuzz_match('Darin Fisher', 'Darin Fisher', 0) test case, or have some other mechanism to include "interesting" names in the list to ensure we were covering them. That said, it seems then that we're actually testing two different things here (which is what I feared). The first, is correctness of the fuzzy-matching algorithm (regardless of the contents of committers.py). This could be done with presumably far fewer test cases than we have. The second is testing for specific resolutions based on the actual (current) contents of the list of contributors, right? It seems like, if that was the case, it would be a lot clearer to break those out into individual (and different) test cases to be clearer about that (and in that case we should probably list the disambiguation we're worried about). Does that make sense?
Ryosuke Niwa
Comment 6
2012-02-21 22:33:30 PST
(In reply to
comment #5
)
> (From update of
attachment 128105
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=128105&action=review
> > >> Tools/Scripts/webkitpy/common/config/committers_unittest.py:117 > >> + contributors_used_in_tests = [all_committers.contributor_by_name(c) for c in self._contributors if c] > > > > I don't think this change is safe. A big part of this test is so that we don't mismatch new contributors to alias of old contributors. e.g. darin shouldn't match Darin Fisher. > > If I understand you correctly, you want to ensure that 'Darin' will always match 'Darin Adler' and not 'Darin Fisher' as long as 'Darin Adler' as they are both still in the contributor list, right? And since 'Darin Fisher' is not actually listed as one of the tests, below, this case is not adequately covered by my optimization? Presumably we could address that by adding a self._fuzz_match('Darin Fisher', 'Darin Fisher', 0) test case, or have some other mechanism to include "interesting" names in the list to ensure we were covering them.
Right. That sounds like a reasonable solution to me.
> That said, it seems then that we're actually testing two different things here (which is what I feared). The first, is correctness of the fuzzy-matching algorithm (regardless of the contents of committers.py). This could be done with presumably far fewer test cases than we have. > > The second is testing for specific resolutions based on the actual (current) contents of the list of contributors, right? > > It seems like, if that was the case, it would be a lot clearer to break those out into individual (and different) test cases to be clearer about that (and in that case we should probably list the disambiguation we're worried about).
Right. But testing that properly involves including all contributors in the list, no?
Dirk Pranke
Comment 7
2012-02-22 09:05:00 PST
(In reply to
comment #6
)
> > The second is testing for specific resolutions based on the actual (current) contents of the list of contributors, right? > > > > It seems like, if that was the case, it would be a lot clearer to break those out into individual (and different) test cases to be clearer about that (and in that case we should probably list the disambiguation we're worried about). > > Right. But testing that properly involves including all contributors in the list, no?
Probably, although I might think about it more, just to see if I've missed something else. If I break things up into, say, one test per person, at least it'll parallelize nicely. We could leave the "algorithm is correct" tests as unit tests and make the "match this person" tests optional integration tests that could be skipped if you were in a hurry (but are run by default).
Ryosuke Niwa
Comment 8
2012-02-22 09:17:49 PST
(In reply to
comment #7
)
> Probably, although I might think about it more, just to see if I've missed something else. If I break things up into, say, one test per person, at least it'll parallelize nicely. We could leave the "algorithm is correct" tests as unit tests and make the "match this person" tests optional integration tests that could be skipped if you were in a hurry (but are run by default).
"algorithm is correct" is synonymous to "match this person". I purposefully avoided white-box testing. Anyway, I'm fine with skipping the entire test by default since there's no significant user of this function at the moment (changelog parser uses but it's still in the development).
Dirk Pranke
Comment 9
2012-02-24 13:29:13 PST
Created
attachment 128795
[details]
Patch
Eric Seidel (no email)
Comment 10
2012-02-24 13:30:47 PST
Comment on
attachment 128795
[details]
Patch So you're splitting these down so that we can more easily shard this test?
Dirk Pranke
Comment 11
2012-02-24 13:34:18 PST
rniwa - please take another look? I've left the original logic alone but chunked the tests up so that most either test a specific name or take less than ~1/3 of a second (which is good enough for me for now since I can run them all in parallel. Also, we had a few duplicate tests that were slowing things down ... single threaded, the file now takes ~4.5 seconds; in parallel I can run it in under 1.
Dirk Pranke
Comment 12
2012-02-24 13:36:22 PST
(In reply to
comment #10
)
> (From update of
attachment 128795
[details]
) > So you're splitting these down so that we can more easily shard this test?
From the ChangeLog: + Break the fuzzy matching tests into individual routines for + each contributor so that the intent is a little clearer, so + that it's easier to test individual names (and identify + duplicate tests), and so that we can eventually run + them in parallel. Yup :) We can't easily speed up each individual match as long as we want to run the match against the entire list of committers :( We could mock out lists of committers (or provide a shorter list), but that won't achieve the testing goal rniwa wants.
Eric Seidel (no email)
Comment 13
2012-02-24 13:46:49 PST
(In reply to
comment #12
)
> (In reply to
comment #10
) > > (From update of
attachment 128795
[details]
[details]) > > So you're splitting these down so that we can more easily shard this test? > > From the ChangeLog: > > + Break the fuzzy matching tests into individual routines for > + each contributor so that the intent is a little clearer, so > + that it's easier to test individual names (and identify > + duplicate tests), and so that we can eventually run > + them in parallel. > > Yup :) We can't easily speed up each individual match as long as we want to run the match against the entire list of committers :( We could mock out lists of committers (or provide a shorter list), but that won't achieve the testing goal rniwa wants.
I can definitely see how these should be viewed as integration tests then. :)
Dirk Pranke
Comment 14
2012-02-24 16:19:27 PST
Committed
r108864
: <
http://trac.webkit.org/changeset/108864
>
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug