84582 – webkitpy: changelog crashes when parsing authors containing + or ,

Carlos Garcia Campos

Reported 2012-04-23 03:11:49 PDT

E...... ====================================================================== ERROR: test_parse_authors (webkitpy.common.checkout.changelog_unittest.ChangeLogTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "webkitpy/common/checkout/changelog_unittest.py", line 433, in test_parse_authors [('James Simonsen', 'simonjam+webkit@google.com')]) File "webkitpy/common/checkout/changelog_unittest.py", line 417, in _assert_parse_authors parsed_authors = [(author['name'], author['email']) for author in self._entry_with_author(author_text).authors()] File "webkitpy/common/checkout/changelog_unittest.py", line 376, in _entry_with_author '''.replace("AUTHOR_TEXT", author_text)) File "webkitpy/common/checkout/changelog.py", line 104, in __init__ self._parse_entry() File "webkitpy/common/checkout/changelog.py", line 165, in _parse_entry self._authors = ChangeLogEntry._parse_author_text(self._author_text) File "webkitpy/common/checkout/changelog.py", line 156, in _parse_author_text return [ChangeLogEntry._parse_author_name_and_email(author) for author in authors] File "webkitpy/common/checkout/changelog.py", line 148, in _parse_author_name_and_email return {'name': match.group("name"), 'email': match.group("email")} AttributeError: 'NoneType' object has no attribute 'group' ---------------------------------------------------------------------- Ran 16 tests in 0.566s FAILED (errors=1) The problem is that _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when tey are found at any position of the text. Author names can contains ',' and email can contain +.

Carlos Garcia Campos

Comment 1 2012-04-23 03:16:08 PDT

Created attachment 138318 [details] Patch

Martin Robinson

Comment 2 2012-04-23 07:57:22 PDT

Comment on attachment 138318 [details] Patch I feel like the ChangeLog might be missing the "why." From your patch it looks like (unless I have misread) that you combined the regular expression to split things containing "and" or "&" into multiple parts with the regular expression that parses the name and email combination. I don't understand exactly why that fixes the problem.

Carlos Garcia Campos

Comment 3 2012-04-23 08:10:38 PDT

(In reply to comment #2) > (From update of attachment 138318 [details]) > I feel like the ChangeLog might be missing the "why." From your patch it looks like (unless I have misread) that you combined the regular expression to split things containing "and" or "&" into multiple parts with the regular expression that parses the name and email combination. I don't understand exactly why that fixes the problem. As I said in the description _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when they are found *at any position of the text*. That means that a , or + in the inside an author name or email would split the name or email. Merging both, and using findall, it looks for name <email> patterns separated by the set of separators.

Martin Robinson

Comment 4 2012-04-23 08:14:14 PDT

(In reply to comment #3) > As I said in the description _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when they are found *at any position of the text*. That means that a , or + in the inside an author name or email would split the name or email. Merging both, and using findall, it looks for name <email> patterns separated by the set of separators. Sorry, I missed that. Would it be possible to modify the regular expression to simply enforce spaces around '+'? That will allow us to keep the nice abstraction between splitting the name and email addresses and splitting things with "and", "&&", etc.

Carlos Garcia Campos

Comment 5 2012-04-23 08:19:08 PDT

(In reply to comment #4) > (In reply to comment #3) > > > As I said in the description _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when they are found *at any position of the text*. That means that a , or + in the inside an author name or email would split the name or email. Merging both, and using findall, it looks for name <email> patterns separated by the set of separators. > > Sorry, I missed that. > > Would it be possible to modify the regular expression to simply enforce spaces around '+'? That will allow us to keep the nice abstraction between splitting the name and email addresses and splitting things with "and", "&&", etc. That would fix the case of +, but list of authors separated by , don't usually have spaces around. Foo Bar <foo@bar.com>, Baz <bazmail@webkit.org>

Martin Robinson

Comment 6 2012-04-23 08:42:44 PDT

(In reply to comment #5) > That would fix the case of +, but list of authors separated by , don't usually have spaces around. > > Foo Bar <foo@bar.com>, Baz <bazmail@webkit.org> It would probably be enough only to make this a requirement for characters that can also exist in names and email addresses. Specifically, the words "and" and "+".

Carlos Garcia Campos

Comment 7 2012-04-23 08:53:40 PDT

(In reply to comment #6) > (In reply to comment #5) > > > That would fix the case of +, but list of authors separated by , don't usually have spaces around. > > > > Foo Bar <foo@bar.com>, Baz <bazmail@webkit.org> > > It would probably be enough only to make this a requirement for characters that can also exist in names and email addresses. Specifically, the words "and" and "+". that wouldn't fix the real test case I added to the unit tests where there's a , in the author name. What's wrong with the proposed regexp? It fixes all test cases.

Martin Robinson

Comment 8 2012-04-23 11:25:11 PDT

(In reply to comment #7) > that wouldn't fix the real test case I added to the unit tests where there's a , in the author name. What's wrong with the proposed regexp? It fixes all test cases. The new regular expressions look like it works great, but it has the downside of duplicating a fairly complicated regular expression in two places. Later if someone wants to add to the regex or fix a bug, they may miss one copy of it. Is there a way to reuse it in both places?Perhaps even just storing it as a string that you can use in both places.

Dirk Pranke

Comment 9 2012-04-23 14:45:16 PDT

Comment on attachment 138318 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=138318&action=review > Tools/Scripts/webkitpy/common/checkout/changelog.py:147 > + return [{'name': name, 'email': email} for name, email in re.findall(r'(?P<name>.+?)\s+<(?P<email>[^>]+)>(?:\s*(?:,(?:\s+and\s+|&)?|(?:^|\s+)and\s+|&&|[/+&])\s*)?', text) if text] Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable. I'd be surprised if this code is particularly performance critical, so if it isn't, I would suggest splitting this (or at least _split_contributor_names() ) up into multiple regexps bracketed by if/elses in order to make the code easier to comprehend.

Martin Robinson

Comment 10 2012-04-23 15:01:05 PDT

(In reply to comment #9) > Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable. This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." You could make it work for the attribution line by only splitting names if the comma is prefixed by the end of an email address ala "Pawel Hadjan, Jr. <foo@foo.com>, Martin Robinson <mrobinson@webkit.org>, and Carlos Garcia Campos <cgarcia@igalia.com>." That means we'd need two copies of the regex though, I think. Perhaps it's easiest to just strip ", Jr." and the like from the string before processing. That would make things a lot simpler.

Dirk Pranke

Comment 11 2012-04-23 15:15:26 PDT

(In reply to comment #10) > (In reply to comment #9) > > > Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable. > > This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." Isn't that case is hard to deal with regardless of whether the regexps are merged?

Martin Robinson

Comment 12 2012-04-23 15:46:43 PDT

(In reply to comment #11) > > This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." > > Isn't that case is hard to deal with regardless of whether the regexps are merged? It seems so. :/

Carlos Garcia Campos

Comment 13 2012-04-23 23:40:30 PDT

(In reply to comment #8) > (In reply to comment #7) > > > that wouldn't fix the real test case I added to the unit tests where there's a , in the author name. What's wrong with the proposed regexp? It fixes all test cases. > > The new regular expressions look like it works great, but it has the downside of duplicating a fairly complicated regular expression in two places. Later if someone wants to add to the regex or fix a bug, they may miss one copy of it. Is there a way to reuse it in both places?Perhaps even just storing it as a string that you can use in both places. Yes, that makes sense

Carlos Garcia Campos

Comment 14 2012-04-23 23:45:27 PDT

(In reply to comment #11) > (In reply to comment #10) > > (In reply to comment #9) > > > > > Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable. > > > > This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." > > Isn't that case is hard to deal with regardless of whether the regexps are merged? Yes, this patch doesn't try to fix the reviewer text case, because it's impossible to know whether a , is part of a name or a separator. Fortunately multiple reviewers is not that common, so having a commit with multiple reviewers and one of them having a , in the name is very unlikely. Commits with multiple author is common though. I tried to fix the test cases I added to the unit tests using the existing functions, just playing with the regexp, but didn't manage to fix it until I merged both regexps, because we need the pattern name <email> in the regexp.

Eric Seidel (no email)

Comment 15 2012-08-22 15:26:45 PDT

Ryosuke was the last person in this code. :)

Ryosuke Niwa

Comment 16 2012-08-22 15:38:50 PDT

Comment on attachment 138318 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=138318&action=review > Tools/Scripts/webkitpy/common/checkout/changelog.py:-154 > - authors = ChangeLogEntry._split_contributor_names(text) I don't think we want to duplicate reg exps here.

Carlos Garcia Campos

Comment 17 2013-04-10 00:36:29 PDT

Tony fixed this already in r143504. *** This bug has been marked as a duplicate of bug 110356 ***

Attachments
Patch (4.63 KB, patch) 2012-04-23 03:16 PDT, Carlos Garcia Campos	rniwa: review-	Details Formatted Diff Diff
View All Add attachment proposed patch, testcase, etc.