79179 – webkitpy: committers_unittest fuzzy matching is really slow

RESOLVED FIXED 79179

webkitpy: committers_unittest fuzzy matching is really slow

https://bugs.webkit.org/show_bug.cgi?id=79179

Summary webkitpy: committers_unittest fuzzy matching is really slow

Dirk Pranke

Reported 2012-02-21 18:43:42 PST

webkitpy: committers_unittest fuzzy matching is really slow

Attachments
Patch (5.28 KB, patch) 2012-02-21 18:53 PST, Dirk Pranke	no flags	Details Formatted Diff Diff
Patch (22.07 KB, patch) 2012-02-24 13:29 PST, Dirk Pranke	rniwa: review+	Details Formatted Diff Diff
Show Obsolete (1) View All Add attachment proposed patch, testcase, etc.

Dirk Pranke

Comment 1 2012-02-21 18:53:44 PST

Created attachment 128105 [details] Patch

Dirk Pranke

Comment 2 2012-02-21 18:55:52 PST

rniwa ... let me know what you think of this approach. I think I've convinced myself it's safe. Assuming you like it, I will rework the actual code so that instead of calling assert_fuzzy_match() over and over again, we just pass in a list of test cases (much like I did in https://bugs.webkit.org/show_bug.cgi?id=79159 ); I left things this way to minimize the diff so you could better understand that change.

Dirk Pranke

Comment 3 2012-02-21 18:56:32 PST

We can probably do something similar for webkitpy.tool.steps.preparechangelog_unittest.PrepareChangeLogTest.test_fuzzy_reviewer_match as well (haven't really looked yet, but that test is also quite slow).

Ryosuke Niwa

Comment 4 2012-02-21 19:17:18 PST

Comment on attachment 128105 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128105&action=review > Tools/Scripts/webkitpy/common/config/committers_unittest.py:117 > + contributors_used_in_tests = [all_committers.contributor_by_name(c) for c in self._contributors if c] I don't think this change is safe. A big part of this test is so that we don't mismatch new contributors to alias of old contributors. e.g. darin shouldn't match Darin Fisher.

Dirk Pranke

Comment 5 2012-02-21 19:46:05 PST

Comment on attachment 128105 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=128105&action=review >> Tools/Scripts/webkitpy/common/config/committers_unittest.py:117 >> + contributors_used_in_tests = [all_committers.contributor_by_name(c) for c in self._contributors if c] > > I don't think this change is safe. A big part of this test is so that we don't mismatch new contributors to alias of old contributors. e.g. darin shouldn't match Darin Fisher. If I understand you correctly, you want to ensure that 'Darin' will always match 'Darin Adler' and not 'Darin Fisher' as long as 'Darin Adler' as they are both still in the contributor list, right? And since 'Darin Fisher' is not actually listed as one of the tests, below, this case is not adequately covered by my optimization? Presumably we could address that by adding a self._fuzz_match('Darin Fisher', 'Darin Fisher', 0) test case, or have some other mechanism to include "interesting" names in the list to ensure we were covering them. That said, it seems then that we're actually testing two different things here (which is what I feared). The first, is correctness of the fuzzy-matching algorithm (regardless of the contents of committers.py). This could be done with presumably far fewer test cases than we have. The second is testing for specific resolutions based on the actual (current) contents of the list of contributors, right? It seems like, if that was the case, it would be a lot clearer to break those out into individual (and different) test cases to be clearer about that (and in that case we should probably list the disambiguation we're worried about). Does that make sense?

Ryosuke Niwa

Comment 6 2012-02-21 22:33:30 PST

(In reply to comment #5) > (From update of attachment 128105 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=128105&action=review > > >> Tools/Scripts/webkitpy/common/config/committers_unittest.py:117 > >> + contributors_used_in_tests = [all_committers.contributor_by_name(c) for c in self._contributors if c] > > > > I don't think this change is safe. A big part of this test is so that we don't mismatch new contributors to alias of old contributors. e.g. darin shouldn't match Darin Fisher. > > If I understand you correctly, you want to ensure that 'Darin' will always match 'Darin Adler' and not 'Darin Fisher' as long as 'Darin Adler' as they are both still in the contributor list, right? And since 'Darin Fisher' is not actually listed as one of the tests, below, this case is not adequately covered by my optimization? Presumably we could address that by adding a self._fuzz_match('Darin Fisher', 'Darin Fisher', 0) test case, or have some other mechanism to include "interesting" names in the list to ensure we were covering them. Right. That sounds like a reasonable solution to me. > That said, it seems then that we're actually testing two different things here (which is what I feared). The first, is correctness of the fuzzy-matching algorithm (regardless of the contents of committers.py). This could be done with presumably far fewer test cases than we have. > > The second is testing for specific resolutions based on the actual (current) contents of the list of contributors, right? > > It seems like, if that was the case, it would be a lot clearer to break those out into individual (and different) test cases to be clearer about that (and in that case we should probably list the disambiguation we're worried about). Right. But testing that properly involves including all contributors in the list, no?

Dirk Pranke

Comment 7 2012-02-22 09:05:00 PST

(In reply to comment #6) > > The second is testing for specific resolutions based on the actual (current) contents of the list of contributors, right? > > > > It seems like, if that was the case, it would be a lot clearer to break those out into individual (and different) test cases to be clearer about that (and in that case we should probably list the disambiguation we're worried about). > > Right. But testing that properly involves including all contributors in the list, no? Probably, although I might think about it more, just to see if I've missed something else. If I break things up into, say, one test per person, at least it'll parallelize nicely. We could leave the "algorithm is correct" tests as unit tests and make the "match this person" tests optional integration tests that could be skipped if you were in a hurry (but are run by default).

Ryosuke Niwa

Comment 8 2012-02-22 09:17:49 PST

(In reply to comment #7) > Probably, although I might think about it more, just to see if I've missed something else. If I break things up into, say, one test per person, at least it'll parallelize nicely. We could leave the "algorithm is correct" tests as unit tests and make the "match this person" tests optional integration tests that could be skipped if you were in a hurry (but are run by default). "algorithm is correct" is synonymous to "match this person". I purposefully avoided white-box testing. Anyway, I'm fine with skipping the entire test by default since there's no significant user of this function at the moment (changelog parser uses but it's still in the development).

Dirk Pranke

Comment 9 2012-02-24 13:29:13 PST

Created attachment 128795 [details] Patch

Eric Seidel (no email)

Comment 10 2012-02-24 13:30:47 PST

Comment on attachment 128795 [details] Patch So you're splitting these down so that we can more easily shard this test?

Dirk Pranke

Comment 11 2012-02-24 13:34:18 PST

rniwa - please take another look? I've left the original logic alone but chunked the tests up so that most either test a specific name or take less than ~1/3 of a second (which is good enough for me for now since I can run them all in parallel. Also, we had a few duplicate tests that were slowing things down ... single threaded, the file now takes ~4.5 seconds; in parallel I can run it in under 1.

Dirk Pranke

Comment 12 2012-02-24 13:36:22 PST

(In reply to comment #10) > (From update of attachment 128795 [details]) > So you're splitting these down so that we can more easily shard this test? From the ChangeLog: + Break the fuzzy matching tests into individual routines for + each contributor so that the intent is a little clearer, so + that it's easier to test individual names (and identify + duplicate tests), and so that we can eventually run + them in parallel. Yup :) We can't easily speed up each individual match as long as we want to run the match against the entire list of committers :( We could mock out lists of committers (or provide a shorter list), but that won't achieve the testing goal rniwa wants.

Eric Seidel (no email)

Comment 13 2012-02-24 13:46:49 PST

(In reply to comment #12) > (In reply to comment #10) > > (From update of attachment 128795 [details] [details]) > > So you're splitting these down so that we can more easily shard this test? > > From the ChangeLog: > > + Break the fuzzy matching tests into individual routines for > + each contributor so that the intent is a little clearer, so > + that it's easier to test individual names (and identify > + duplicate tests), and so that we can eventually run > + them in parallel. > > Yup :) We can't easily speed up each individual match as long as we want to run the match against the entire list of committers :( We could mock out lists of committers (or provide a shorter list), but that won't achieve the testing goal rniwa wants. I can definitely see how these should be viewed as integration tests then. :)

Dirk Pranke

Comment 14 2012-02-24 16:19:27 PST

Committed r108864: <http://trac.webkit.org/changeset/108864>

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component New Bugs

Assignee

Dirk Pranke

Reported

2012-02-21 18:43 PST

Modified

2012-02-24 16:19 PST History

CC List

5 users Show

URL

Keywords

Depends on

Blocks