RESOLVED DUPLICATE of bug 11035684582
webkitpy: changelog crashes when parsing authors containing + or ,
https://bugs.webkit.org/show_bug.cgi?id=84582
Summary webkitpy: changelog crashes when parsing authors containing + or ,
Carlos Garcia Campos
Reported 2012-04-23 03:11:49 PDT
E...... ====================================================================== ERROR: test_parse_authors (webkitpy.common.checkout.changelog_unittest.ChangeLogTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "webkitpy/common/checkout/changelog_unittest.py", line 433, in test_parse_authors [('James Simonsen', 'simonjam+webkit@google.com')]) File "webkitpy/common/checkout/changelog_unittest.py", line 417, in _assert_parse_authors parsed_authors = [(author['name'], author['email']) for author in self._entry_with_author(author_text).authors()] File "webkitpy/common/checkout/changelog_unittest.py", line 376, in _entry_with_author '''.replace("AUTHOR_TEXT", author_text)) File "webkitpy/common/checkout/changelog.py", line 104, in __init__ self._parse_entry() File "webkitpy/common/checkout/changelog.py", line 165, in _parse_entry self._authors = ChangeLogEntry._parse_author_text(self._author_text) File "webkitpy/common/checkout/changelog.py", line 156, in _parse_author_text return [ChangeLogEntry._parse_author_name_and_email(author) for author in authors] File "webkitpy/common/checkout/changelog.py", line 148, in _parse_author_name_and_email return {'name': match.group("name"), 'email': match.group("email")} AttributeError: 'NoneType' object has no attribute 'group' ---------------------------------------------------------------------- Ran 16 tests in 0.566s FAILED (errors=1) The problem is that _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when tey are found at any position of the text. Author names can contains ',' and email can contain +.
Attachments
Patch (4.63 KB, patch)
2012-04-23 03:16 PDT, Carlos Garcia Campos
rniwa: review-
Carlos Garcia Campos
Comment 1 2012-04-23 03:16:08 PDT
Martin Robinson
Comment 2 2012-04-23 07:57:22 PDT
Comment on attachment 138318 [details] Patch I feel like the ChangeLog might be missing the "why." From your patch it looks like (unless I have misread) that you combined the regular expression to split things containing "and" or "&" into multiple parts with the regular expression that parses the name and email combination. I don't understand exactly why that fixes the problem.
Carlos Garcia Campos
Comment 3 2012-04-23 08:10:38 PDT
(In reply to comment #2) > (From update of attachment 138318 [details]) > I feel like the ChangeLog might be missing the "why." From your patch it looks like (unless I have misread) that you combined the regular expression to split things containing "and" or "&" into multiple parts with the regular expression that parses the name and email combination. I don't understand exactly why that fixes the problem. As I said in the description _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when they are found *at any position of the text*. That means that a , or + in the inside an author name or email would split the name or email. Merging both, and using findall, it looks for name <email> patterns separated by the set of separators.
Martin Robinson
Comment 4 2012-04-23 08:14:14 PDT
(In reply to comment #3) > As I said in the description _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when they are found *at any position of the text*. That means that a , or + in the inside an author name or email would split the name or email. Merging both, and using findall, it looks for name <email> patterns separated by the set of separators. Sorry, I missed that. Would it be possible to modify the regular expression to simply enforce spaces around '+'? That will allow us to keep the nice abstraction between splitting the name and email addresses and splitting things with "and", "&&", etc.
Carlos Garcia Campos
Comment 5 2012-04-23 08:19:08 PDT
(In reply to comment #4) > (In reply to comment #3) > > > As I said in the description _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when they are found *at any position of the text*. That means that a , or + in the inside an author name or email would split the name or email. Merging both, and using findall, it looks for name <email> patterns separated by the set of separators. > > Sorry, I missed that. > > Would it be possible to modify the regular expression to simply enforce spaces around '+'? That will allow us to keep the nice abstraction between splitting the name and email addresses and splitting things with "and", "&&", etc. That would fix the case of +, but list of authors separated by , don't usually have spaces around. Foo Bar <foo@bar.com>, Baz <bazmail@webkit.org>
Martin Robinson
Comment 6 2012-04-23 08:42:44 PDT
(In reply to comment #5) > That would fix the case of +, but list of authors separated by , don't usually have spaces around. > > Foo Bar <foo@bar.com>, Baz <bazmail@webkit.org> It would probably be enough only to make this a requirement for characters that can also exist in names and email addresses. Specifically, the words "and" and "+".
Carlos Garcia Campos
Comment 7 2012-04-23 08:53:40 PDT
(In reply to comment #6) > (In reply to comment #5) > > > That would fix the case of +, but list of authors separated by , don't usually have spaces around. > > > > Foo Bar <foo@bar.com>, Baz <bazmail@webkit.org> > > It would probably be enough only to make this a requirement for characters that can also exist in names and email addresses. Specifically, the words "and" and "+". that wouldn't fix the real test case I added to the unit tests where there's a , in the author name. What's wrong with the proposed regexp? It fixes all test cases.
Martin Robinson
Comment 8 2012-04-23 11:25:11 PDT
(In reply to comment #7) > that wouldn't fix the real test case I added to the unit tests where there's a , in the author name. What's wrong with the proposed regexp? It fixes all test cases. The new regular expressions look like it works great, but it has the downside of duplicating a fairly complicated regular expression in two places. Later if someone wants to add to the regex or fix a bug, they may miss one copy of it. Is there a way to reuse it in both places?Perhaps even just storing it as a string that you can use in both places.
Dirk Pranke
Comment 9 2012-04-23 14:45:16 PDT
Comment on attachment 138318 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=138318&action=review > Tools/Scripts/webkitpy/common/checkout/changelog.py:147 > + return [{'name': name, 'email': email} for name, email in re.findall(r'(?P<name>.+?)\s+<(?P<email>[^>]+)>(?:\s*(?:,(?:\s+and\s+|&)?|(?:^|\s+)and\s+|&&|[/+&])\s*)?', text) if text] Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable. I'd be surprised if this code is particularly performance critical, so if it isn't, I would suggest splitting this (or at least _split_contributor_names() ) up into multiple regexps bracketed by if/elses in order to make the code easier to comprehend.
Martin Robinson
Comment 10 2012-04-23 15:01:05 PDT
(In reply to comment #9) > Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable. This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." You could make it work for the attribution line by only splitting names if the comma is prefixed by the end of an email address ala "Pawel Hadjan, Jr. <foo@foo.com>, Martin Robinson <mrobinson@webkit.org>, and Carlos Garcia Campos <cgarcia@igalia.com>." That means we'd need two copies of the regex though, I think. Perhaps it's easiest to just strip ", Jr." and the like from the string before processing. That would make things a lot simpler.
Dirk Pranke
Comment 11 2012-04-23 15:15:26 PDT
(In reply to comment #10) > (In reply to comment #9) > > > Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable. > > This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." Isn't that case is hard to deal with regardless of whether the regexps are merged?
Martin Robinson
Comment 12 2012-04-23 15:46:43 PDT
(In reply to comment #11) > > This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." > > Isn't that case is hard to deal with regardless of whether the regexps are merged? It seems so. :/
Carlos Garcia Campos
Comment 13 2012-04-23 23:40:30 PDT
(In reply to comment #8) > (In reply to comment #7) > > > that wouldn't fix the real test case I added to the unit tests where there's a , in the author name. What's wrong with the proposed regexp? It fixes all test cases. > > The new regular expressions look like it works great, but it has the downside of duplicating a fairly complicated regular expression in two places. Later if someone wants to add to the regex or fix a bug, they may miss one copy of it. Is there a way to reuse it in both places?Perhaps even just storing it as a string that you can use in both places. Yes, that makes sense
Carlos Garcia Campos
Comment 14 2012-04-23 23:45:27 PDT
(In reply to comment #11) > (In reply to comment #10) > > (In reply to comment #9) > > > > > Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable. > > > > This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." > > Isn't that case is hard to deal with regardless of whether the regexps are merged? Yes, this patch doesn't try to fix the reviewer text case, because it's impossible to know whether a , is part of a name or a separator. Fortunately multiple reviewers is not that common, so having a commit with multiple reviewers and one of them having a , in the name is very unlikely. Commits with multiple author is common though. I tried to fix the test cases I added to the unit tests using the existing functions, just playing with the regexp, but didn't manage to fix it until I merged both regexps, because we need the pattern name <email> in the regexp.
Eric Seidel (no email)
Comment 15 2012-08-22 15:26:45 PDT
Ryosuke was the last person in this code. :)
Ryosuke Niwa
Comment 16 2012-08-22 15:38:50 PDT
Comment on attachment 138318 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=138318&action=review > Tools/Scripts/webkitpy/common/checkout/changelog.py:-154 > - authors = ChangeLogEntry._split_contributor_names(text) I don't think we want to duplicate reg exps here.
Carlos Garcia Campos
Comment 17 2013-04-10 00:36:29 PDT
Tony fixed this already in r143504. *** This bug has been marked as a duplicate of bug 110356 ***
Note You need to log in before you can comment on or make changes to this bug.