Bug 216403

Summary: [webkitscmpy] Add Commit object
Product: WebKit Reporter: Jonathan Bedard <jbedard>
Component: AccessibilityAssignee: Jonathan Bedard <jbedard>
Status: RESOLVED FIXED    
Severity: Normal CC: aakash_jain, dean_johnson, dewei_zhu, kocsen_chung, slewis, webkit-bug-importer, zhifei_fang
Priority: P2 Keywords: InRadar
Version: WebKit Nightly Build   
Hardware: Unspecified   
OS: Unspecified   
See Also: https://bugs.webkit.org/show_bug.cgi?id=215862
https://bugs.webkit.org/show_bug.cgi?id=216404
Attachments:
Description Flags
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch
none
Patch none

Description Jonathan Bedard 2020-09-11 04:31:29 PDT
Add a sortable commit object that includes our the new monotonically increasing identifier.
Comment 1 Radar WebKit Bug Importer 2020-09-11 04:31:47 PDT
<rdar://problem/68702779>
Comment 2 Jonathan Bedard 2020-09-11 04:37:35 PDT
This change is being used to support https://bugs.webkit.org/show_bug.cgi?id=216404
Comment 3 Jonathan Bedard 2020-09-15 16:05:17 PDT
Created attachment 408874 [details]
Patch
Comment 4 Jonathan Bedard 2020-09-15 16:07:06 PDT
This change doesn't cover actually generating the identifiers, but does cover what those identifiers would look like as strings, how we parse them out, how we determine whether a particular string is an identifier/revision/git hash.
Comment 5 Jonathan Bedard 2020-09-18 18:10:39 PDT
Created attachment 409184 [details]
Patch
Comment 6 Jonathan Bedard 2020-09-23 11:12:16 PDT
Created attachment 409486 [details]
Patch
Comment 7 Dean Johnson 2020-09-23 11:40:16 PDT
Comment on attachment 409486 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=409486&action=review

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:33
> +    IDENTIFIER_RE = re.compile(r'^i?(?P<identifier>-?\d+)(?P<branch>@\S+)?$')

Should we be capturing '@' as part of the branch group? That seems like it will lead to a ton of branch.lstrip('@') in caller's code.

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:37
> +    def standardize_hash(cls, hash, do_assert=False):

Can we rename "standardize" to "parse" (for all standardize_* functions).

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:41
> +        if not isinstance(hash, str) and not isinstance(hash, unicode):

For Py2/3 compatibility it would be better to use six.string_types:
`if not isinstance(hash, six.string_types)`

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:62
> +            elif cls.NUMBER_RE.match(revision):

I think you can also call `<str_object>.isdigit()` instead of parsing via regex.

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:79
> +        return revision

I think providing a float causes this code to return what was provided.

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:94
> +            elif cls.NUMBER_RE.match(identifier):

I'm a bit confused -- how will we know if a number is an ID, a revision, or a hash?

Also, you can <str_object>.isdigit().

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:106
> +        return identifier, branch

The return types for `standardize_identifier` is different than `standardize_revision` and `standardize_hash`. That is prone to lead to confusion and bugs if someone uses the code.

Can we find a way to keep the return types the same? e.g. vending a new Commit object based on the parsed data.

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:133
> +        if branch and not isinstance(branch, str) and not isinstance(branch, unicode):

Ditto to six.string_types comment.

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:136
> +            raise ValueError("Caller passed both 'branch' and 'identifier', but specified different branches")

Can we also include the invalid values to assist with debugging?

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:139
> +            raise TypeError("Expected 'timestamp' to be of type {}".format(int))

Can we also include the invalid values to assist with debugging? Also, I don't think `int` needs to be formatted into the string.

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:143
> +            raise TypeError("Expected 'author' to be of type {}".format(Contributor))

Ditto. Also, it seems like author should be able to be an email (for example) that is turned into a Contributor object.

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:146
> +        if message and not isinstance(message, str) and not isinstance(message, unicode):

Ditto to six.string_types comment.

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:156
> +            result += '    SVN revision: r{}'.format(self.revision)

Do we need to prefix with spaces? Since this is across multiple lines it will be messy to reformat for any caller.

Can we rename this function to "pretty_print" after Python's prettyprinter? to_string (to me) seems like the output you return for __repr__.
Comment 8 Jonathan Bedard 2020-09-23 11:51:21 PDT
Comment on attachment 409486 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=409486&action=review

>> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:33
>> +    IDENTIFIER_RE = re.compile(r'^i?(?P<identifier>-?\d+)(?P<branch>@\S+)?$')
> 
> Should we be capturing '@' as part of the branch group? That seems like it will lead to a ton of branch.lstrip('@') in caller's code.

We don't really want to make the @ optional, though.

All the callers access this through the "standardize_identifier" functions anyways, so they already get a striped branch.

>> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:62
>> +            elif cls.NUMBER_RE.match(revision):
> 
> I think you can also call `<str_object>.isdigit()` instead of parsing via regex.

That doesn't support negative numbers....which is fine for svn revisions, I suppose.

>> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:79
>> +        return revision
> 
> I think providing a float causes this code to return what was provided.

Oops, yeah, it would

>> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:94
>> +            elif cls.NUMBER_RE.match(identifier):
> 
> I'm a bit confused -- how will we know if a number is an ID, a revision, or a hash?
> 
> Also, you can <str_object>.isdigit().

That's handled in "parse". If you provide a raw integer, we assume identifier first, then revision and lastly a hash.

Identifiers can be negative, isdigit() won't work here

>> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:106
>> +        return identifier, branch
> 
> The return types for `standardize_identifier` is different than `standardize_revision` and `standardize_hash`. That is prone to lead to confusion and bugs if someone uses the code.
> 
> Can we find a way to keep the return types the same? e.g. vending a new Commit object based on the parsed data.

We can't vend a new Commit object because these functions are really helper functions used to construct the commit. Maybe it's best to preface these functions with an _ to indicate that they really aren't meant to be API

>> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:156
>> +            result += '    SVN revision: r{}'.format(self.revision)
> 
> Do we need to prefix with spaces? Since this is across multiple lines it will be messy to reformat for any caller.
> 
> Can we rename this function to "pretty_print" after Python's prettyprinter? to_string (to me) seems like the output you return for __repr__.

If callers were going to reformat, I assume they would either split the lines or construct their own string. What would you envision a caller would do with this output? I figured most of the time it would be printed directly.

pretty_print is a better name, will rename.
Comment 9 Jonathan Bedard 2020-09-25 21:18:13 PDT
Created attachment 409768 [details]
Patch
Comment 10 Jonathan Bedard 2020-09-28 14:06:28 PDT
Created attachment 409920 [details]
Patch
Comment 11 Jonathan Bedard 2020-09-29 15:36:47 PDT
Created attachment 410060 [details]
Patch
Comment 12 Jonathan Bedard 2020-09-29 15:48:52 PDT
Ryosuke had an idea to embed the branch point from the default branch into the identifier. I think this improves the human readability of the identifiers, but the additional information is actually not required to define which a commit. The current patch will print the branch point, if it's provided, but an identifier without a branch point will still be parsed.
Comment 13 Jonathan Bedard 2020-09-29 16:57:02 PDT
Created attachment 410068 [details]
Patch
Comment 14 dewei_zhu 2020-09-30 12:10:07 PDT
Comment on attachment 410068 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=410068&action=review

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:33
> +    IDENTIFIER_RE = re.compile(r'^[Ii]?((?P<branch_point>\d+)\.)?(?P<identifier>-?\d+)(@(?P<branch>\S+))?$')

For the identifier, will it be a string prefixed '-'?
m = IDENTIFIER_RE.match('123.-456@main')
m.group('identifier') will be '-456', is this expected?
Do we intentionally keep a '.' after branch point?

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:82
> +    def _parse_identifier(cls, identifier, do_assert=False):

This function have inconsistent return value type,
it can be:
None,
int, string
(int, int), string.
This makes parsing the results more complicated as you did on line 134.

> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/test/commit_unittest.py:87
> +        self.assertEqual(None, Commit._parse_identifier(3.141592))

Do we have test case like `'i123.-124@main` if this is supported identifier? If not, we should programmatically disable it.
Comment 15 Jonathan Bedard 2020-09-30 12:31:55 PDT
Comment on attachment 410068 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=410068&action=review

>> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:33
>> +    IDENTIFIER_RE = re.compile(r'^[Ii]?((?P<branch_point>\d+)\.)?(?P<identifier>-?\d+)(@(?P<branch>\S+))?$')
> 
> For the identifier, will it be a string prefixed '-'?
> m = IDENTIFIER_RE.match('123.-456@main')
> m.group('identifier') will be '-456', is this expected?
> Do we intentionally keep a '.' after branch point?

The weirdness here is because the branch point is helpful for humans, but it's not actually required to define a commit.

The example you provided is an invalid identifier, but not because of the negative value, it's because of the magnitude of the negative value (it's larger than the branch point) and the fact that "main" is likely the default branch. If you provide a negative identifier with a branch point (like i123.-4@branch-a), what you're really saying is "I want the commit 4 before the branch point on the default branch", (i119@main, in this case). Negative identifiers are valid arguments to scripts, but they are never going to be canonical identifiers of commits. If you passed something like "i123.-4@branch-a" to any service, we would fetch the commit matching those arguments, recompute the identifiers and refer to the commit as "i119@main". I'm working on the specific rules for canonical identifiers in https://bugs.webkit.org/show_bug.cgi?id=216404, those are separate from this change because they rely on your specific repository (namely, what your default branch is)

The "branch_point" intentionally is grouped with the '.' because commits on the default branch don't have a branch point

>> Tools/Scripts/libraries/webkitscmpy/webkitscmpy/commit.py:82
>> +    def _parse_identifier(cls, identifier, do_assert=False):
> 
> This function have inconsistent return value type,
> it can be:
> None,
> int, string
> (int, int), string.
> This makes parsing the results more complicated as you did on line 134.

I'm open to changing this, I'm just not sure the right way to do so. Returning None is handy for quickly testing, we could make "int, string" into "(None, int), string" to standardize it. Although in that case, I suppose it would be better to just return "int, int, string" and "None, int, string"
Comment 16 Jonathan Bedard 2020-09-30 13:49:42 PDT
Created attachment 410152 [details]
Patch
Comment 17 dewei_zhu 2020-09-30 15:48:00 PDT
r=me

Had some offline discussion with Johnathan, there will be a follow-up change to perform more comprehensive sanity check regarding negative identifier and branching points.
Comment 18 Jonathan Bedard 2020-09-30 17:13:37 PDT
Created attachment 410175 [details]
Patch
Comment 19 EWS 2020-09-30 17:46:01 PDT
Committed r267815: <https://trac.webkit.org/changeset/267815>

All reviewed patches have been landed. Closing bug and clearing flags on attachment 410175 [details].