WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
72351
Implement edit-distance based reviewer recognition algorithm
https://bugs.webkit.org/show_bug.cgi?id=72351
Summary
Implement edit-distance based reviewer recognition algorithm
Ryosuke Niwa
Reported
2011-11-14 21:10:57 PST
Now that we have fixed the
bug 72340
, we're in the place to dramatically improve the recognizability of reviewer names. Be hold!, this algorithm can understand names like "Adam Barth.:w" and Sammy Weinig.
Attachments
Patch
(27.81 KB, patch)
2011-11-14 21:26 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
Fixed style
(27.81 KB, patch)
2011-11-14 21:39 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
Removed unsed argument in edit_distance
(27.79 KB, patch)
2011-11-14 21:46 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
Added _assert_fuzz_match
(26.73 KB, patch)
2011-11-15 10:44 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
Fixed per Eric's comments
(21.48 KB, patch)
2011-11-15 16:00 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
Brought back forgotten editdistance.py
(26.31 KB, patch)
2011-11-15 16:12 PST
,
Ryosuke Niwa
eric
: review+
Details
Formatted Diff
Diff
Show Obsolete
(5)
View All
Add attachment
proposed patch, testcase, etc.
Ryosuke Niwa
Comment 1
2011-11-14 21:26:29 PST
Created
attachment 115093
[details]
Patch
WebKit Review Bot
Comment 2
2011-11-14 21:30:09 PST
Attachment 115093
[details]
did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Tools/ChangeLog', u'Tools/Scripts/webkitpy..." exit_code: 1 Tools/Scripts/webkitpy/common/edit_distance_unittest.py:33: expected 2 blank lines, found 1 [pep8/E302] [5] Total errors found: 1 in 8 files If any of these errors are false positives, please file a bug against check-webkit-style.
Ryosuke Niwa
Comment 3
2011-11-14 21:39:06 PST
Created
attachment 115094
[details]
Fixed style
Ryosuke Niwa
Comment 4
2011-11-14 21:46:54 PST
Created
attachment 115095
[details]
Removed unsed argument in edit_distance
Eric Seidel (no email)
Comment 5
2011-11-15 01:27:24 PST
Comment on
attachment 115095
[details]
Removed unsed argument in edit_distance View in context:
https://bugs.webkit.org/attachment.cgi?id=115095&action=review
> Tools/Scripts/webkitpy/common/config/committers_unittest.py:97 > + self.assertEqual(committers.contributors_by_fuzzy_match('Geoff Garen'), ([committers.contributor_by_name('Geoffrey Garen')], 3))
I would have used an _assert_fuzzy_match helper here.
Ryosuke Niwa
Comment 6
2011-11-15 01:30:28 PST
Comment on
attachment 115095
[details]
Removed unsed argument in edit_distance View in context:
https://bugs.webkit.org/attachment.cgi?id=115095&action=review
>> Tools/Scripts/webkitpy/common/config/committers_unittest.py:97 >> + self.assertEqual(committers.contributors_by_fuzzy_match('Geoff Garen'), ([committers.contributor_by_name('Geoffrey Garen')], 3)) > > I would have used an _assert_fuzzy_match helper here.
So the problem I had with that approach is that when the assertion fails, the first thing I see is: self.assertEqual(<whatever>, expected). This approach has an advantage that committers.contributors_by_fuzzy_match('Geoff Garen') and ([committers.contributor_by_name('Geoffrey Garen')], 3) both show up on the last line when the assertion fails.
Eric Seidel (no email)
Comment 7
2011-11-15 02:01:58 PST
(In reply to
comment #6
)
> (From update of
attachment 115095
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=115095&action=review
> > >> Tools/Scripts/webkitpy/common/config/committers_unittest.py:97 > >> + self.assertEqual(committers.contributors_by_fuzzy_match('Geoff Garen'), ([committers.contributor_by_name('Geoffrey Garen')], 3)) > > > > I would have used an _assert_fuzzy_match helper here. > > So the problem I had with that approach is that when the assertion fails, the first thing I see is: > self.assertEqual(<whatever>, expected).
Sure... but hte second to-last frame would have the arguments you want. :)
Eric Seidel (no email)
Comment 8
2011-11-15 03:18:26 PST
Comment on
attachment 115095
[details]
Removed unsed argument in edit_distance View in context:
https://bugs.webkit.org/attachment.cgi?id=115095&action=review
> Tools/Scripts/webkitpy/common/edit_distance.py:38 > + distances = [array('h', (0,) * (len(str2) + 1)) for i in range(0, len(str1) + 1)] > + for i in range(1, len(str1) + 1): > + distances[i][0] = i > + > + for j in range(1, len(str2) + 1): > + distances[0][j] = j
I'm really not a big fan of one letter variable names.
Ryosuke Niwa
Comment 9
2011-11-15 10:40:42 PST
(In reply to
comment #8
)
> (From update of
attachment 115095
[details]
) > View in context:
https://bugs.webkit.org/attachment.cgi?id=115095&action=review
> > > Tools/Scripts/webkitpy/common/edit_distance.py:38 > > + distances = [array('h', (0,) * (len(str2) + 1)) for i in range(0, len(str1) + 1)] > > + for i in range(1, len(str1) + 1): > > + distances[i][0] = i > > + > > + for j in range(1, len(str2) + 1): > > + distances[0][j] = j > > I'm really not a big fan of one letter variable names.
Do you have a suggestion on what to call them? I mean they're indexes in str1 and str2 so I could call it index1 and index2 but I don't think that clarify anything here.
Ryosuke Niwa
Comment 10
2011-11-15 10:44:40 PST
Created
attachment 115198
[details]
Added _assert_fuzz_match
Eric Seidel (no email)
Comment 11
2011-11-15 12:03:25 PST
Comment on
attachment 115198
[details]
Added _assert_fuzz_match View in context:
https://bugs.webkit.org/attachment.cgi?id=115198&action=review
I think this needs another round. Mostly for aesthetics.
> Tools/Scripts/webkitpy/common/edit_distance.py:32 > +def edit_distance(str1, str2):
Did you write this yourself? It doesn't qutie feel like your style.
> Tools/Scripts/webkitpy/common/edit_distance.py:33 > + distances = [array('h', (0,) * (len(str2) + 1)) for i in range(0, len(str1) + 1)]
I'm confused what the "h" is for?
> Tools/Scripts/webkitpy/common/edit_distance.py:38 > + for i in range(1, len(str1) + 1): > + distances[i][0] = i > + > + for j in range(1, len(str2) + 1): > + distances[0][j] = j
Why is this all 1-based?
> Tools/Scripts/webkitpy/common/edit_distance.py:44 > + distances[i][j] = min(distances[i][j - 1] + 1, distances[i - 1][j] + 1, distances[i - 1][j - 1] + diff)
It seems like we have more -1 than we do plain usage of i/j. I assuem that's because the above array's are one-based?
> Tools/Scripts/webkitpy/common/edit_distance_unittest.py:36 > + self.assertEqual(edit_distance('', 'aa'), 2)
I even write things like self._assert_edit_distance('', 'aa', 2) here. But this is also OK.
> Tools/Scripts/webkitpy/common/checkout/changelog.py:163 > + if self._reviewers_text_list: > + list_of_reviewers = [self._committer_list.contributors_by_fuzzy_match(reviewer)[0] for reviewer in self._reviewers_text_list] > + self._reviewers = [reviewers[0] for reviewers in list_of_reviewers if len(reviewers) == 1] > + else: > + self._reviewers = []
I think this if empty-string check makes more sense inside contributors_by_fuzzy_match? I'm also not sure why you're stripping the list container off of _reviewers in the case where there is only one reviewer?
> Tools/Scripts/webkitpy/common/checkout/changelog_unittest.py:420 > + def _contributors(self, names): > + return [CommitterList().contributor_by_name(name) for name in names]
I'm not sure we want to make this test depend on our ContributorList directly. If we don't have to test that part using real names, we should avoid it. But it's also OK, since 1. the contributor list is very static data, and 2 lots of other tests aready depend on it, sadly.
> Tools/Scripts/webkitpy/common/checkout/changelog_unittest.py:429 > + self.assertEquals(ChangeLogEntry._parse_reviewer_text('Reviewed by Mac build fix')[1], ['Mac build fix'])
Likewise you could use an _assert_reviewer_text("Reviewed by Mac build fix", "Mac build fix") here, although obviously that wouldn't work for all the cases you're testing, but a few such helpers could make this wall of text more readable.
> Tools/Scripts/webkitpy/common/checkout/changelog_unittest.py:454 > + self.assertEqual(self._entry_with_reviewer("Reviewed by Eric Seidel.").has_valid_reviewer(), True)
Wow, this would be much clearer with a helper. self._assert_has_valid_reviewer("Reviewed by Eric Seidel", True) I'm not sure why you're opposed to helper functions in test cases? :p
> Tools/Scripts/webkitpy/common/config/committers.py:32 > +from webkitpy.common.edit_distance import edit_distance
Historically we've avoided having _ in module names to match python style (outputcapture.py, filesystem.py, etc.)
> Tools/Scripts/webkitpy/common/config/committers.py:536 > + # First path, optimitically match for fullname, email and irc_nicknames > + for contributor in self.contributors(): > + if string in [email.lower() for email in contributor.emails] or string == contributor.full_name.lower():
I would consider breaking these blocks into helper functions. 1. to make the management of hte return tuple-easier. (The first helpr doesn't have to care about the tuple-ness). 2. to make it easier to test (or pref-test) halves of it.
> Tools/Scripts/webkitpy/common/config/committers.py:553 > + if len(tokens):
You can use "if tokens" to accomplish the same with 5 fewer chars if you like.
Ryosuke Niwa
Comment 12
2011-11-15 13:33:57 PST
Comment on
attachment 115198
[details]
Added _assert_fuzz_match View in context:
https://bugs.webkit.org/attachment.cgi?id=115198&action=review
>> Tools/Scripts/webkitpy/common/edit_distance.py:32 >> +def edit_distance(str1, str2): > > Did you write this yourself? It doesn't qutie feel like your style.
Yup.
>> Tools/Scripts/webkitpy/common/edit_distance.py:33 >> + distances = [array('h', (0,) * (len(str2) + 1)) for i in range(0, len(str1) + 1)] > > I'm confused what the "h" is for?
Stands for unsigned short.
http://docs.python.org/library/array.html
>> Tools/Scripts/webkitpy/common/edit_distance.py:38 >> + distances[0][j] = j > > Why is this all 1-based?
The reason I'm starting from 1 is to avoid doing distances[0][0]=0 twice (and thrice if we include the initialization). But we need to initialize distances[len(str1)][0] and distances[0][len(str2)].
>> Tools/Scripts/webkitpy/common/edit_distance.py:44
> > It seems like we have more -1 than we do plain usage of i/j. I assuem that's because the above array's are one-based?
The matrix is 0-based. I need len(str1) + 1 columns and len(str2) + 1 rows.
>> Tools/Scripts/webkitpy/common/checkout/changelog.py:163 >> + self._reviewers = [] > > I think this if empty-string check makes more sense inside contributors_by_fuzzy_match? > > I'm also not sure why you're stripping the list container off of _reviewers in the case where there is only one reviewer?
When there are multiple reviewers that are same distance away, it's inconclusive as to who should be the best match.
Eric Seidel (no email)
Comment 13
2011-11-15 13:40:38 PST
Comment on
attachment 115198
[details]
Added _assert_fuzz_match View in context:
https://bugs.webkit.org/attachment.cgi?id=115198&action=review
>>> Tools/Scripts/webkitpy/common/edit_distance.py:33 >>> + distances = [array('h', (0,) * (len(str2) + 1)) for i in range(0, len(str1) + 1)] >> >> I'm confused what the "h" is for? > > Stands for unsigned short.
http://docs.python.org/library/array.html
Wow. Amazing that they don't have constants on Array for that. :)
>>> Tools/Scripts/webkitpy/common/checkout/changelog.py:163
>> >> I think this if empty-string check makes more sense inside contributors_by_fuzzy_match? >> >> I'm also not sure why you're stripping the list container off of _reviewers in the case where there is only one reviewer? > > When there are multiple reviewers that are same distance away, it's inconclusive as to who should be the best match.
I see. You're only flattening the array of arrays by one in the case where there is one reviewer in any one of the reviewer lists?
Ryosuke Niwa
Comment 14
2011-11-15 14:11:03 PST
(In reply to
comment #13
)
> I see. You're only flattening the array of arrays by one in the case where there is one reviewer in any one of the reviewer lists?
Right. Perhaps I should have a helper function for this...
Ryosuke Niwa
Comment 15
2011-11-15 16:00:38 PST
Created
attachment 115269
[details]
Fixed per Eric's comments
WebKit Review Bot
Comment 16
2011-11-15 16:03:27 PST
Attachment 115269
[details]
did not pass style-queue: Failed to run "['Tools/Scripts/check-webkit-style', '--diff-files', u'Tools/ChangeLog', u'Tools/Scripts/webkitpy..." exit_code: 1 Traceback (most recent call last): File "Tools/Scripts/check-webkit-style", line 51, in <module> from webkitpy.common.checkout.scm import detect_scm_system File "/mnt/git/webkit-style-queue/Tools/Scripts/webkitpy/common/checkout/__init__.py", line 3, in <module> from .checkout import Checkout File "/mnt/git/webkit-style-queue/Tools/Scripts/webkitpy/common/checkout/checkout.py", line 32, in <module> from webkitpy.common.checkout.changelog import ChangeLog, parse_bug_id_from_changelog File "/mnt/git/webkit-style-queue/Tools/Scripts/webkitpy/common/checkout/changelog.py", line 37, in <module> from webkitpy.common.config.committers import CommitterList File "/mnt/git/webkit-style-queue/Tools/Scripts/webkitpy/common/config/committers.py", line 32, in <module> from webkitpy.common.editdistance import edit_distance ImportError: No module named editdistance If any of these errors are false positives, please file a bug against check-webkit-style.
Ryosuke Niwa
Comment 17
2011-11-15 16:12:14 PST
Created
attachment 115270
[details]
Brought back forgotten editdistance.py
Eric Seidel (no email)
Comment 18
2011-11-15 16:31:07 PST
Comment on
attachment 115270
[details]
Brought back forgotten editdistance.py View in context:
https://bugs.webkit.org/attachment.cgi?id=115270&action=review
Looks great, as long as the reviewer() comment below isn't a bug.
> Tools/Scripts/webkitpy/common/editdistance.py:36 > +def edit_distance(str1, str2): > + unsignedShort = 'h' > + distances = [array(unsignedShort, (0,) * (len(str2) + 1)) for i in range(0, len(str1) + 1)] > + for i in range(1, len(str1) + 1): > + distances[i][0] = i
You might want to add a comment here to explain what you're doing. In my 3rd reading it's still not immediately clear what this does. A link to a wiki page describing the algorithm might also be helpful.
> Tools/Scripts/webkitpy/common/checkout/changelog.py:150 > + def _fuzz_match_reviewers(self, reviewers_text_list): > + if not reviewers_text_list:
Much nicer!
> Tools/Scripts/webkitpy/common/checkout/changelog.py:186 > - return self._reviewer # Might be None, might also not be a Reviewer! > + return self._reviewers[0] if len(self._reviewers) > 0 else None
So won't this sometimes return an array? If your flattening fails because there are some lists of reviewers with more than one reviewer?
> Tools/Scripts/webkitpy/common/checkout/changelog.py:199 > + if re.search("unreviewed", self._contents, re.IGNORECASE): > + return True > + return False
Isn't this just bool(re.search...?
> Tools/Scripts/webkitpy/common/checkout/changelog_unittest.py:430 > + def test_fuzzy_reviewer_match(self): > + self._assert_fuzzy_reviewer_match('Reviewed by Dimitri Glazkov, build fix', ['Dimitri Glazkov', 'build fix'], ['Dimitri Glazkov']) > + self._assert_fuzzy_reviewer_match('Reviewed by BUILD FIX', ['BUILD FIX'], []) > + self._assert_fuzzy_reviewer_match('Reviewed by Mac build fix', ['Mac build fix'], [])
Thank you. :)
> Tools/Scripts/webkitpy/common/config/committers.py:524 > + #FIXME: This should do case-insensitive comparison or assert that all IRC nicknames are in lowercase
I think there is normaly a space between # and FIXME
> Tools/Scripts/webkitpy/common/config/committers.py:539 > + def contributors_by_fuzzy_match(self, string):
MIND. BLOWN. sheriff-bot whois is drooling. :)
Ryosuke Niwa
Comment 19
2011-11-15 17:11:35 PST
Comment on
attachment 115270
[details]
Brought back forgotten editdistance.py View in context:
https://bugs.webkit.org/attachment.cgi?id=115270&action=review
Thanks for the review. I'm very excited to land this patch!
>> Tools/Scripts/webkitpy/common/editdistance.py:36 >> + distances[i][0] = i > > You might want to add a comment here to explain what you're doing. In my 3rd reading it's still not immediately clear what this does. A link to a wiki page describing the algorithm might also be helpful.
I've added some comments here but I didn't include a link to wiki page because the pseudocode listed on wikipedia has a bug if I understand it correctly.
> Tools/Scripts/webkitpy/common/checkout/changelog.py:153 > + return [reviewers[0] for reviewers in list_of_reviewers if len(reviewers) == 1]
I added a comment: # Flatten lists and get rid of any reviewers with more than one candidate.
>> Tools/Scripts/webkitpy/common/checkout/changelog.py:186 >> + return self._reviewers[0] if len(self._reviewers) > 0 else None > > So won't this sometimes return an array? If your flattening fails because there are some lists of reviewers with more than one reviewer?
No. reviewers is always flattened. flatten always succeeds. I just trim reviewers with more than one candidate. Clarified in _fuzz_match_reviewers by adding a comment.
>> Tools/Scripts/webkitpy/common/checkout/changelog.py:199 >> + return False > > Isn't this just bool(re.search...?
Yes. Fixed.
>> Tools/Scripts/webkitpy/common/config/committers.py:524 >> + #FIXME: This should do case-insensitive comparison or assert that all IRC nicknames are in lowercase > > I think there is normaly a space between # and FIXME
Fixed.
Ryosuke Niwa
Comment 20
2011-11-15 22:41:22 PST
Committed
r100407
: <
http://trac.webkit.org/changeset/100407
>
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug