Bug 84582 - webkitpy: changelog crashes when parsing authors containing + or ,
Summary: webkitpy: changelog crashes when parsing authors containing + or ,
Status: RESOLVED DUPLICATE of bug 110356
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: Unspecified Unspecified
: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-23 03:11 PDT by Carlos Garcia Campos
Modified: 2013-04-10 00:36 PDT (History)
7 users (show)

See Also:


Attachments
Patch (4.63 KB, patch)
2012-04-23 03:16 PDT, Carlos Garcia Campos
rniwa: review-
Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Carlos Garcia Campos 2012-04-23 03:11:49 PDT
E......
======================================================================
ERROR: test_parse_authors (webkitpy.common.checkout.changelog_unittest.ChangeLogTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "webkitpy/common/checkout/changelog_unittest.py", line 433, in test_parse_authors
    [('James Simonsen', 'simonjam+webkit@google.com')])
  File "webkitpy/common/checkout/changelog_unittest.py", line 417, in _assert_parse_authors
    parsed_authors = [(author['name'], author['email']) for author in self._entry_with_author(author_text).authors()]
  File "webkitpy/common/checkout/changelog_unittest.py", line 376, in _entry_with_author
    '''.replace("AUTHOR_TEXT", author_text))
  File "webkitpy/common/checkout/changelog.py", line 104, in __init__
    self._parse_entry()
  File "webkitpy/common/checkout/changelog.py", line 165, in _parse_entry
    self._authors = ChangeLogEntry._parse_author_text(self._author_text)
  File "webkitpy/common/checkout/changelog.py", line 156, in _parse_author_text
    return [ChangeLogEntry._parse_author_name_and_email(author) for author in authors]
  File "webkitpy/common/checkout/changelog.py", line 148, in _parse_author_name_and_email
    return {'name': match.group("name"), 'email': match.group("email")}
AttributeError: 'NoneType' object has no attribute 'group'

----------------------------------------------------------------------
Ran 16 tests in 0.566s

FAILED (errors=1)

The problem is that _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when tey are found at any position of the text. Author names can contains ',' and email can contain +.
Comment 1 Carlos Garcia Campos 2012-04-23 03:16:08 PDT
Created attachment 138318 [details]
Patch
Comment 2 Martin Robinson 2012-04-23 07:57:22 PDT
Comment on attachment 138318 [details]
Patch

I feel like the ChangeLog might be missing the "why." From your patch it looks like (unless I have misread) that you combined the regular expression to split things containing "and" or "&" into multiple parts with the regular expression that parses the name and email combination. I don't understand exactly why that fixes the problem.
Comment 3 Carlos Garcia Campos 2012-04-23 08:10:38 PDT
(In reply to comment #2)
> (From update of attachment 138318 [details])
> I feel like the ChangeLog might be missing the "why." From your patch it looks like (unless I have misread) that you combined the regular expression to split things containing "and" or "&" into multiple parts with the regular expression that parses the name and email combination. I don't understand exactly why that fixes the problem.

As I said in the description _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when they are found *at any position of the text*. That means that a , or + in the inside an author name or email would split the name or email. Merging both, and using findall, it looks for name <email> patterns separated by the set of separators.
Comment 4 Martin Robinson 2012-04-23 08:14:14 PDT
(In reply to comment #3)

> As I said in the description _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when they are found *at any position of the text*. That means that a , or + in the inside an author name or email would split the name or email. Merging both, and using findall, it looks for name <email> patterns separated by the set of separators.

Sorry, I missed that.

Would it be possible to modify the regular expression to simply enforce spaces around '+'?  That will allow us to keep the nice abstraction between splitting the name and email addresses and splitting things with "and", "&&", etc.
Comment 5 Carlos Garcia Campos 2012-04-23 08:19:08 PDT
(In reply to comment #4)
> (In reply to comment #3)
> 
> > As I said in the description _parse_author_name_and_email uses _split_contributor_names which splits the text using a set of separators when they are found *at any position of the text*. That means that a , or + in the inside an author name or email would split the name or email. Merging both, and using findall, it looks for name <email> patterns separated by the set of separators.
> 
> Sorry, I missed that.
> 
> Would it be possible to modify the regular expression to simply enforce spaces around '+'?  That will allow us to keep the nice abstraction between splitting the name and email addresses and splitting things with "and", "&&", etc.

That would fix the case of +, but list of authors separated by , don't usually have spaces around.

Foo Bar <foo@bar.com>, Baz <bazmail@webkit.org>
Comment 6 Martin Robinson 2012-04-23 08:42:44 PDT
(In reply to comment #5)

> That would fix the case of +, but list of authors separated by , don't usually have spaces around.
> 
> Foo Bar <foo@bar.com>, Baz <bazmail@webkit.org>

It would probably be enough only to make this a requirement for characters that can also exist in names and email addresses. Specifically, the words "and" and "+".
Comment 7 Carlos Garcia Campos 2012-04-23 08:53:40 PDT
(In reply to comment #6)
> (In reply to comment #5)
> 
> > That would fix the case of +, but list of authors separated by , don't usually have spaces around.
> > 
> > Foo Bar <foo@bar.com>, Baz <bazmail@webkit.org>
> 
> It would probably be enough only to make this a requirement for characters that can also exist in names and email addresses. Specifically, the words "and" and "+".

that wouldn't fix the real test case I added to the unit tests where there's a , in the author name. What's wrong with the proposed regexp? It fixes all test cases.
Comment 8 Martin Robinson 2012-04-23 11:25:11 PDT
(In reply to comment #7)

> that wouldn't fix the real test case I added to the unit tests where there's a , in the author name. What's wrong with the proposed regexp? It fixes all test cases.

The new regular expressions look like it works great, but it has the downside of duplicating a fairly complicated regular expression in two places. Later if someone wants to add to the regex or fix a bug, they may miss one copy of it. Is there a way to reuse it in both places?Perhaps even just storing it as a string that you can use in both places.
Comment 9 Dirk Pranke 2012-04-23 14:45:16 PDT
Comment on attachment 138318 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=138318&action=review

> Tools/Scripts/webkitpy/common/checkout/changelog.py:147
> +        return [{'name': name, 'email': email} for name, email in re.findall(r'(?P<name>.+?)\s+<(?P<email>[^>]+)>(?:\s*(?:,(?:\s+and\s+|&)?|(?:^|\s+)and\s+|&&|[/+&])\s*)?', text) if text]

Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable.

I'd be surprised if this code is particularly performance critical, so if it isn't, I would suggest splitting this (or at least _split_contributor_names() ) up into multiple regexps bracketed by if/elses in order to make the code easier to comprehend.
Comment 10 Martin Robinson 2012-04-23 15:01:05 PDT
(In reply to comment #9)

> Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable.

This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." You could make it work for the attribution line by only splitting names if the comma is prefixed by the end of an email address ala "Pawel Hadjan, Jr. <foo@foo.com>, Martin Robinson <mrobinson@webkit.org>, and Carlos Garcia Campos <cgarcia@igalia.com>." That means we'd need two copies of the regex though, I think. Perhaps it's easiest to just strip ", Jr." and the like from the string before processing. That would make things a lot simpler.
Comment 11 Dirk Pranke 2012-04-23 15:15:26 PDT
(In reply to comment #10)
> (In reply to comment #9)
> 
> > Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable.
> 
> This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." 

Isn't that case is hard to deal with regardless of whether the regexps are merged?
Comment 12 Martin Robinson 2012-04-23 15:46:43 PDT
(In reply to comment #11)

> > This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." 
> 
> Isn't that case is hard to deal with regardless of whether the regexps are merged?

It seems so. :/
Comment 13 Carlos Garcia Campos 2012-04-23 23:40:30 PDT
(In reply to comment #8)
> (In reply to comment #7)
> 
> > that wouldn't fix the real test case I added to the unit tests where there's a , in the author name. What's wrong with the proposed regexp? It fixes all test cases.
> 
> The new regular expressions look like it works great, but it has the downside of duplicating a fairly complicated regular expression in two places. Later if someone wants to add to the regex or fix a bug, they may miss one copy of it. Is there a way to reuse it in both places?Perhaps even just storing it as a string that you can use in both places.

Yes, that makes sense
Comment 14 Carlos Garcia Campos 2012-04-23 23:45:27 PDT
(In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #9)
> > 
> > > Is it possible to implement this correctly without merging the regexps? Frankly, both of these regexps were already verging on unmaintainable.
> > 
> > This might be tricky. It's unclear what to do with a string like this "Reviewed by Pawel Hadjan, Jr., Martin Robinson, and Carlos Garcia Campos." 
> 
> Isn't that case is hard to deal with regardless of whether the regexps are merged?

Yes, this patch doesn't try to fix the reviewer text case, because it's impossible to know whether a , is part of a name or a separator. Fortunately multiple reviewers is not that common, so having a commit with multiple reviewers and one of them having a , in the name is very unlikely. Commits with multiple author is common though. 

I tried to fix the test cases I added to the unit tests using the existing functions, just playing with the regexp, but didn't manage to fix it until I merged both regexps, because we need the pattern name <email> in the regexp.
Comment 15 Eric Seidel (no email) 2012-08-22 15:26:45 PDT
Ryosuke was the last person in this code. :)
Comment 16 Ryosuke Niwa 2012-08-22 15:38:50 PDT
Comment on attachment 138318 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=138318&action=review

> Tools/Scripts/webkitpy/common/checkout/changelog.py:-154
> -        authors = ChangeLogEntry._split_contributor_names(text)

I don't think we want to duplicate reg exps here.
Comment 17 Carlos Garcia Campos 2013-04-10 00:36:29 PDT
Tony fixed this already in r143504.

*** This bug has been marked as a duplicate of bug 110356 ***