Bug 72243

Summary: Add a tool to analyze change logs
Product: WebKit Reporter: Ryosuke Niwa <rniwa>
Component: Tools / TestsAssignee: Ryosuke Niwa <rniwa>
Status: RESOLVED FIXED    
Severity: Normal CC: abarth, eric, evan, jryans, leandro, ojan, ossy, webkit.review.bot
Priority: P2    
Version: 528+ (Nightly build)   
Hardware: Unspecified   
OS: Unspecified   
Bug Depends on: 72351, 72540, 72566, 72636, 72690, 72703    
Bug Blocks: 68061    
Attachments:
Description Flags
work in progress
none
json files generated by the script + summary.html
none
work in progress 2
none
json files + summary.html version 2
none
screenshot
none
work in progress 3
none
Initial patch
none
Fixed per Eric's comment; also updated for ToT
none
Fixed per Eric's comment
none
Fixed per Eric's review eric: review+

Description Ryosuke Niwa 2011-11-13 23:29:32 PST
We need a tool that analyzes change logs in order to figure out areas of contributions of each contributor
Comment 1 Ryosuke Niwa 2011-11-13 23:30:26 PST
Created attachment 114885 [details]
work in progress
Comment 2 Ryosuke Niwa 2011-11-13 23:31:13 PST
Created attachment 114886 [details]
json files generated by the script + summary.html
Comment 3 Ryosuke Niwa 2011-11-15 22:55:32 PST
Created attachment 115327 [details]
work in progress 2
Comment 4 Ryosuke Niwa 2011-11-15 22:56:37 PST
Created attachment 115328 [details]
json files + summary.html version 2
Comment 5 Ryosuke Niwa 2011-11-16 01:39:39 PST
Created attachment 115348 [details]
screenshot
Comment 6 Ryosuke Niwa 2011-11-20 14:22:47 PST
Created attachment 115998 [details]
work in progress 3
Comment 7 Ryosuke Niwa 2011-11-20 23:05:47 PST
I'm stuck. Apparently, there's no way to get "git svn blame <file>" at a specific SVN revision. This is necessary because git svn blame for WebCore/ChangeLog-2011-10-19 consists entirely of r97937.

The only solution I can think of is to rely on each commit and retrieve change log entries added by each commit (by git/svn diff). But this will be substantially slower (we'll end up spending 1-2s per revision) so we'll need 27 hours to process 100,000 entries.
Comment 8 Eric Seidel (no email) 2011-11-20 23:08:49 PST
I'm not sure I understand.  Why not just go back to before r97937 and do the blame?
Comment 9 Ryosuke Niwa 2011-11-20 23:14:17 PST
(In reply to comment #8)
> I'm not sure I understand.  Why not just go back to before r97937 and do the blame?

Do you know how to do that on git? I've spent hours googling about that but couldn't figure it out.
Comment 10 Eric Seidel (no email) 2011-11-20 23:23:15 PST
Short of checking out to that revision, I'm not sure.  Evan is a third-order git ninja.  I suggest asking him.
Comment 11 Ryosuke Niwa 2011-11-20 23:40:36 PST
FYI, we can do this on a svn client by "svn blame -r<revision number> <paths>"
Comment 12 Eric Seidel (no email) 2011-11-21 00:37:33 PST
I believe git-svn has a secret hidden svn checkout somewhere... so maybe we could use that?
Comment 13 Ryosuke Niwa 2011-11-21 16:12:43 PST
Okay, I'm just going to add the current version that doesn't support parsing change logs from multiple directories for now so that other contributors can at least play with what I have so far.
Comment 14 Ryosuke Niwa 2011-11-22 13:48:47 PST
Created attachment 116271 [details]
Initial patch
Comment 15 Eric Seidel (no email) 2011-11-22 14:22:31 PST
Comment on attachment 116271 [details]
Initial patch

View in context: https://bugs.webkit.org/attachment.cgi?id=116271&action=review

It's a big patch, I'll have to read through this again.

> Tools/Scripts/webkitpy/common/checkout/changelog.py:123
> +        if is_rollover:
> +            self._author = None
> +            self._author_name = None
> +            self._author_email = None
> +            self._reviewer_text = None
> +            self._reviewers = []
> +            self._touched_files = []

I'm not a big fan of having a special entry for roll-overs...

> Tools/Scripts/webkitpy/common/system/filesystem_mock.py:386
> +            return ''

Should this be none?  unicode() is probably more accurate (and python 3 compatiable) than '' if we decide '' is right.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:31
> +from time import time

I tend to use import foo for system imports, instead of the from syntax (which I always use for webkitpy imports).

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:54
> +        changelogs = [filesystem.join(dirname, filename) for filename in filesystem.listdir(dirname) if re.match('^ChangeLog(-(\d{4}-\d{2}-\d{2}))?$', filename)]

self._tool.filesystem. :)  Why is this static?

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:61
> +    def _generate_jsons(filesystem, json_files, output_dir):

Why static?

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:71
> +        filesystem = FileSystem()

The tool has a filesystem. :)  tool.filesystem

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108
> +        self._filesystem = filesystem

Yeah, I think you want to pass this a Host or a Tool object.  Tool is a subclass of Host, which is a subclass of SystemHost.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:116
> +        self._areas_statistics = {}
> +        for area in self._contribution_areas.names():
> +            self._areas_statistics[area] = {'reviewed': 0, 'unreviewed': 0, 'contributors': {}}

This can be done as a list comprehension, it's just slightly gnarly:

self._areas_statistics = dict([(area, {'reviewed': 0, 'unreviewed': 0, 'contributors': {}}) for area in self._contribution_areas.names()])

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:191
> +        i = 0
> +        dirname = self._filesystem.dirname(changelog_path)
> +        for entry in entries:

You want for entry, index in enumerate(entries): :)
Comment 16 Ryosuke Niwa 2011-11-22 14:54:27 PST
Comment on attachment 116271 [details]
Initial patch

View in context: https://bugs.webkit.org/attachment.cgi?id=116271&action=review

>> Tools/Scripts/webkitpy/common/system/filesystem_mock.py:386
>> +            return ''
> 
> Should this be none?  unicode() is probably more accurate (and python 3 compatiable) than '' if we decide '' is right.

Fixed.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:31
>> +from time import time
> 
> I tend to use import foo for system imports, instead of the from syntax (which I always use for webkitpy imports).

It seemed silly to repeat time.time() but okay. Fixed.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:54
>> +        changelogs = [filesystem.join(dirname, filename) for filename in filesystem.listdir(dirname) if re.match('^ChangeLog(-(\d{4}-\d{2}-\d{2}))?$', filename)]
> 
> self._tool.filesystem. :)  Why is this static?

For testing.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:61
>> +    def _generate_jsons(filesystem, json_files, output_dir):
> 
> Why static?

Ditto.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:71
>> +        filesystem = FileSystem()
> 
> The tool has a filesystem. :)  tool.filesystem

Fixed.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108
>> +        self._filesystem = filesystem
> 
> Yeah, I think you want to pass this a Host or a Tool object.  Tool is a subclass of Host, which is a subclass of SystemHost.

Fixed.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:116
>> +            self._areas_statistics[area] = {'reviewed': 0, 'unreviewed': 0, 'contributors': {}}
> 
> This can be done as a list comprehension, it's just slightly gnarly:
> 
> self._areas_statistics = dict([(area, {'reviewed': 0, 'unreviewed': 0, 'contributors': {}}) for area in self._contribution_areas.names()])

Fixed.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:191
>> +        for entry in entries:
> 
> You want for entry, index in enumerate(entries): :)

Hm... I had to skip entires without author names in some changelogs but I guess it's okay to assert(entry.authors()) for now.
Comment 17 Ryosuke Niwa 2011-11-22 14:55:33 PST
Comment on attachment 116271 [details]
Initial patch

View in context: https://bugs.webkit.org/attachment.cgi?id=116271&action=review

>> Tools/Scripts/webkitpy/common/checkout/changelog.py:123
>> +            self._touched_files = []
> 
> I'm not a big fan of having a special entry for roll-overs...

I actually realized that this rollover business is not needed for this iteration so will revert the entire change.
Comment 18 Ryosuke Niwa 2011-11-22 15:16:51 PST
Created attachment 116281 [details]
Fixed per Eric's comment; also updated for ToT
Comment 19 Ryosuke Niwa 2011-11-23 11:56:31 PST
Ping eric.
Comment 20 Eric Seidel (no email) 2011-11-30 11:43:37 PST
Comment on attachment 116281 [details]
Fixed per Eric's comment; also updated for ToT

View in context: https://bugs.webkit.org/attachment.cgi?id=116281&action=review

> Tools/Scripts/webkitpy/common/checkout/changelog.py:279
> +            if rolled_over_regexp.match(line):
> +                break

There can never (or should never?) be anything after the rolled over line?

> Tools/Scripts/webkitpy/common/system/filesystem_mock.py:400
> +            raise StopIteration

You're blowing my mind.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:207
> +            i += 1

You dont' need this anymore. :)

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:209
> +        return i

Can this work?  Isn't i tied to teh context of the for loop?
Comment 21 Ryosuke Niwa 2011-11-30 16:15:26 PST
Created attachment 117290 [details]
Fixed per Eric's comment
Comment 22 Eric Seidel (no email) 2011-11-30 16:34:49 PST
Comment on attachment 117290 [details]
Fixed per Eric's comment

View in context: https://bugs.webkit.org/attachment.cgi?id=117290&action=review

I didn't review the html/js at all.  The python looks generally right, but probably needs one more round.

> Tools/Scripts/webkitpy/common/checkout/changelog.py:279
> +            if rolled_over_regexp.match(line):
> +                break

You should comment about why break is OK.  I'm sorry I didn't make that clear in my previous commetn.  Nothing big, but it's just not immediately obvious why we'd terminate the loop here.

> Tools/Scripts/webkitpy/common/checkout/changelog.py:288
> +        yield ChangeLogEntry(''.join(entry_lines[:-1]))
> +

Why do we add this one at the end?  Is this part of the chagne tested?  This could make a good separate patch w/ test case. :)

> Tools/Scripts/webkitpy/common/config/contributionareas.py:181
> +    def names(self):
> +        return [area.name() for area in self._contribution_areas]

I think map(operator.namegetter('name'), self._contribution_areas) is another way to write this.   I'm not sure if that's any more clear or concise though.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:32
> +import time
> +from webkitpy.common.checkout.scm.detection import SCMDetector

I normally put a blank line between system imports and webkitpy imports, but I don't think that's any sort of official style.  You don't need to chagne it, just letting you know.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:48
> +            steps.Options.changelog_count,

What does this do?  Is this how many changelogs back to search?  Feels a bit strangely named.  Feels like it's more interesting to think about searching back to a specific date instead of a number of Changelogs.  the options "git log" has make sense here. :)

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:57
> +    def _enumerate_changelogs(filesystem, dirname, changelog_count):
> +        changelogs = [filesystem.join(dirname, filename) for filename in filesystem.listdir(dirname) if re.match('^ChangeLog(-(\d{4}-\d{2}-\d{2}))?$', filename)]
> +        # Make sure ChangeLog shows up before ChangeLog-2011-01-01
> +        changelogs = sorted(changelogs, key=lambda filename: filename + 'X', reverse=True)
> +        return changelogs[:changelog_count]

Yeah, again I think we want to end up just sorting the entries instead of the logs themselves. :)  That doesn't have to be this change, but i'm interested in your feel for that eventual design, if it makes sense.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:60
> +    def _generate_jsons(filesystem, json_files, output_dir):

jsons?  json_files?

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:67
> +            output_file = filesystem.open_text_file_for_writing(filesystem.join(output_dir, filename))
> +            try:
> +                print '    Generating', filename
> +                output_file.write(json.dumps(json_files[filename], indent=2))
> +            finally:
> +                output_file.close()

I think you want  to use "with" instead.  That will autoclose at the end of the with block.  Also, why not just use write_text_file directly?  Or even json.dump? (which can take a filehandle)!   So many better ways...

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:96
> +        commands_dir = filesystem.dirname(filesystem.abspath(__file__))

You want filesystem.path_to_module(self.__module__) instead, I think.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108
> +        self._filesystem = host.filesystem
> +        self._contribution_areas = ContributionAreas(host.filesystem)
> +        self._scm = host.scm()

I'm curious why you store the filesystem and scm separately instead of just storing the host?

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:128
> +        status = new_status + (' ' * max(len(self._previous_status) - len(new_status) - 1, 0))

local var for the max result might be helpful to clarity here.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:129
> +        print "\b" * len(self._previous_status) + status,

\b?

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:146
> +            changelog = self._filesystem.open_text_file_for_reading(path)
> +            try:
> +                self._print_status('Parsing entries...')
> +                number_of_parsed_entries = self._analyze_entries(ChangeLog.parse_entries_from_file(changelog), path)
> +            finally:
> +                changelog.close()

Yup, again, "with" or "write_text_file" are better options.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:156
> +            area_contributors[contributor] = {'reviews': 0, 'reviewed': 0, 'unreviewed': 0}

These dictionaries start to feel like objects (with default values in their constructor).

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:173
> +        if contribution_type == 'reviews':
> +            statistics['total'] += 1
> +        elif reviewed:
> +            statistics['reviewed'] += 1
> +        else:
> +            statistics['unreviewed'] += 1

Yeah, you type these strings for these dictionaries several times here. :)

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:182
> +    def _increment_dictionary_value(self, dictionary, key):
> +        dictionary[key] = dictionary.get(key, 0) + 1

Does setdefault() get you this?

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:204
> +            for reviewer in reviewers_for_entry:
> +                self._collect_statistics_for_contributor(reviewer.full_name, 'reviews', areas_for_entry, touchedfiles_for_entry, reviewed=True)
> +
> +            for author in authors_for_entry:
> +                self._collect_statistics_for_contributor(author['name'], 'patches', areas_for_entry, touchedfiles_for_entry,
> +                    reviewed=bool(reviewers_for_entry))
> +
> +            for area in areas_for_entry:
> +                self._areas_statistics[area]['reviewed' if reviewers_for_entry else 'unreviewed'] += 1

Should you just be passing entry to these function calls?
Comment 23 Ryosuke Niwa 2011-12-02 22:00:58 PST
Comment on attachment 117290 [details]
Fixed per Eric's comment

View in context: https://bugs.webkit.org/attachment.cgi?id=117290&action=review

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:48
>> +            steps.Options.changelog_count,
> 
> What does this do?  Is this how many changelogs back to search?  Feels a bit strangely named.  Feels like it's more interesting to think about searching back to a specific date instead of a number of Changelogs.  the options "git log" has make sense here. :)

Going back to specific date is tricky because people put entries with wrong dates all the time. I need to develop some heuristics for that.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:57
>> +        return changelogs[:changelog_count]
> 
> Yeah, again I think we want to end up just sorting the entries instead of the logs themselves. :)  That doesn't have to be this change, but i'm interested in your feel for that eventual design, if it makes sense.

In the current design, the order in which entries appear don't matter at all. If we really think the order matter then we probably want to parse logs per revision. Even though that'll take forever (ETA 27h), it'll guarantee the ordering.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:67
>> +                output_file.close()
> 
> I think you want  to use "with" instead.  That will autoclose at the end of the with block.  Also, why not just use write_text_file directly?  Or even json.dump? (which can take a filehandle)!   So many better ways...

Rewritten using write_text_file.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:96
>> +        commands_dir = filesystem.dirname(filesystem.abspath(__file__))
> 
> You want filesystem.path_to_module(self.__module__) instead, I think.

Done.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108
>> +        self._scm = host.scm()
> 
> I'm curious why you store the filesystem and scm separately instead of just storing the host?

For ease of access and also because I don't really host.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:129
>> +        print "\b" * len(self._previous_status) + status,
> 
> \b?

Rewritten using \r.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:146
>> +                changelog.close()
> 
> Yup, again, "with" or "write_text_file" are better options.

Here, I'm reading a file and parse_entries_from_file requires a file object. Reading the entire file into memory and converting it to StringIO turns out to be slow. Do I need __future__ import still for with? Since nobody's going to use this tool on Leopard, if this won't cause an error in module initialization, then I'd rather avoid adding the ugly __future__ import.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:156
>> +            area_contributors[contributor] = {'reviews': 0, 'reviewed': 0, 'unreviewed': 0}
> 
> These dictionaries start to feel like objects (with default values in their constructor).

Yeah but then we'll end up having lots of them, and I need to define __str__ for them.

>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:204
>> +                self._areas_statistics[area]['reviewed' if reviewers_for_entry else 'unreviewed'] += 1
> 
> Should you just be passing entry to these function calls?

No. Calling _contribution_areas.areas_for_touched_files is very expensive. We can't afford to do that twice.
Comment 24 Ryosuke Niwa 2011-12-02 22:05:38 PST
Created attachment 117741 [details]
Fixed per Eric's review
Comment 25 Eric Seidel (no email) 2011-12-02 23:30:34 PST
(In reply to comment #23)
> (From update of attachment 117290 [details])
> Here, I'm reading a file and parse_entries_from_file requires a file object. Reading the entire file into memory and converting it to StringIO turns out to be slow. Do I need __future__ import still for with? Since nobody's going to use this tool on Leopard, if this won't cause an error in module initialization, then I'd rather avoid adding the ugly __future__ import.

I will remove all the __future__ imports when we drop support for python 2.5.  Until then, they're required, as there may be bots running test-webkitpy on 2.5.
Comment 26 Leandro Pereira 2011-12-06 07:59:52 PST
Comment on attachment 117741 [details]
Fixed per Eric's review

View in context: https://bugs.webkit.org/attachment.cgi?id=117741&action=review

I like this; cleaner than my solution.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:115
> +        self._longest_filename = max([len(path) - len(self._scm.checkout_root) for path in changelog_paths])

There's no need to create a temporary list here; just use a generator:

   self._longest_filename = max(len(path) ... for path in changelog_paths)

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:149
> +        self._summary['contributors_with_reviews'] = sum([1 for contributor in self._contributors_statistics.values() if contributor['reviews']['total']])

Instead of sum(), you could use len(). Also, there's no need to create a temporary list, just use a generator here:

   ... = len(1 for contributor in ... if ...)

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:206
> +            i += 1

No-op. You're iterating over a generator created by enumerate(), ``i'' will be overwritten in the next step. If you want to start from 1, you can use the start parameter of enumerate: enumerate(entries, start=1). I'm not sure if it's available in the minimum version of Python we're supposed to support, but since you're using ``i'' only to call _print_status(), printing (i + 1) would suffice here.
Comment 27 Ryosuke Niwa 2011-12-08 09:10:28 PST
Ping reviewers.
Comment 28 Eric Seidel (no email) 2011-12-08 11:39:26 PST
Comment on attachment 117741 [details]
Fixed per Eric's review

View in context: https://bugs.webkit.org/attachment.cgi?id=117741&action=review

I expect we could make this nicer with a few more rounds.  But this is OK to go.  Thanks for hte patch.  Please fix my and leandro's comments before landing.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:96
> +        print commands_dir

Do you wnat this printing?

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:137
> +        if self._filename:
> +            print

Hum?

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog_unittest.py:65
> +        test_json = {'array.json': [1, 2, 3, {'key': 'value'}], 'dictionary.json': {'somekey': 'somevalue', 'array': [4, 5]}}
> +        sys_stdout = sys.stdout
> +        try:
> +            sys.stdout = filesystem.open_text_file_for_writing('stdout')
> +            AnalyzeChangeLog._generate_jsons(filesystem, test_json, 'bar')
> +            self.assertEqual(set(filesystem.files.keys()), set(['bar/array.json', 'bar/dictionary.json', 'stdout']))
> +        finally:
> +            sys.stdout = sys_stdout

Huh?  do you want OutputCapture?  That's our testing way of overriding sys.stdout.

> Tools/Scripts/webkitpy/tool/commands/analyzechangelog_unittest.py:148
> +        sys_stdout = sys.stdout
> +        try:
> +            sys.stdout = host.filesystem.open_text_file_for_writing('stdout')
> +            analyzer = ChangeLogAnalyzer(host, ['mock-checkout/foo/ChangeLog'])
> +            analyzer.analyze()
> +        finally:
> +            sys.stdout = sys_stdout

Again, outputcapture.
Comment 29 Ryosuke Niwa 2012-01-03 13:25:23 PST
Committed r103959: <http://trac.webkit.org/changeset/103959>
Comment 30 Evan Martin 2012-01-03 16:45:25 PST
I just saw the git question.  In general git svn is a weak shell around underlying git commands, so if you need to do something fancy with an svn revision, the right way is to map it to a git hash first.  E.g.
  hash = git svn find-rev r1234
  git blame --complex-git-flags $hash