72243 – Add a tool to analyze change logs

RESOLVED FIXED 72243

Add a tool to analyze change logs

https://bugs.webkit.org/show_bug.cgi?id=72243

Summary Add a tool to analyze change logs

Ryosuke Niwa

Reported 2011-11-13 23:29:32 PST

We need a tool that analyzes change logs in order to figure out areas of contributions of each contributor

Attachments
work in progress (16.81 KB, patch) 2011-11-13 23:30 PST, Ryosuke Niwa	no flags	Details Formatted Diff Diff
json files generated by the script + summary.html (deleted) 2011-11-13 23:31 PST, Ryosuke Niwa	no flags	Details
work in progress 2 (18.38 KB, patch) 2011-11-15 22:55 PST, Ryosuke Niwa	no flags	Details Formatted Diff Diff
json files + summary.html version 2 (deleted) 2011-11-15 22:56 PST, Ryosuke Niwa	no flags	Details
screenshot (deleted) 2011-11-16 01:39 PST, Ryosuke Niwa	no flags	Details
work in progress 3 (23.83 KB, patch) 2011-11-20 14:22 PST, Ryosuke Niwa	no flags	Details Formatted Diff Diff
Initial patch (41.98 KB, patch) 2011-11-22 13:48 PST, Ryosuke Niwa	no flags	Details Formatted Diff Diff
Fixed per Eric's comment; also updated for ToT (40.27 KB, patch) 2011-11-22 15:16 PST, Ryosuke Niwa	no flags	Details Formatted Diff Diff
Fixed per Eric's comment (39.88 KB, patch) 2011-11-30 16:15 PST, Ryosuke Niwa	no flags	Details Formatted Diff Diff
Fixed per Eric's review (38.53 KB, patch) 2011-12-02 22:05 PST, Ryosuke Niwa	eric: review+	Details Formatted Diff Diff
Show Obsolete (7) View All Add attachment proposed patch, testcase, etc.

Ryosuke Niwa

Comment 1 2011-11-13 23:30:26 PST

Created attachment 114885 [details] work in progress

Ryosuke Niwa

Comment 2 2011-11-13 23:31:13 PST

Created attachment 114886 [details] json files generated by the script + summary.html

Ryosuke Niwa

Comment 3 2011-11-15 22:55:32 PST

Created attachment 115327 [details] work in progress 2

Ryosuke Niwa

Comment 4 2011-11-15 22:56:37 PST

Created attachment 115328 [details] json files + summary.html version 2

Ryosuke Niwa

Comment 5 2011-11-16 01:39:39 PST

Created attachment 115348 [details] screenshot

Ryosuke Niwa

Comment 6 2011-11-20 14:22:47 PST

Created attachment 115998 [details] work in progress 3

Ryosuke Niwa

Comment 7 2011-11-20 23:05:47 PST

I'm stuck. Apparently, there's no way to get "git svn blame <file>" at a specific SVN revision. This is necessary because git svn blame for WebCore/ChangeLog-2011-10-19 consists entirely of r97937. The only solution I can think of is to rely on each commit and retrieve change log entries added by each commit (by git/svn diff). But this will be substantially slower (we'll end up spending 1-2s per revision) so we'll need 27 hours to process 100,000 entries.

Eric Seidel (no email)

Comment 8 2011-11-20 23:08:49 PST

I'm not sure I understand. Why not just go back to before r97937 and do the blame?

Ryosuke Niwa

Comment 9 2011-11-20 23:14:17 PST

(In reply to comment #8) > I'm not sure I understand. Why not just go back to before r97937 and do the blame? Do you know how to do that on git? I've spent hours googling about that but couldn't figure it out.

Eric Seidel (no email)

Comment 10 2011-11-20 23:23:15 PST

Short of checking out to that revision, I'm not sure. Evan is a third-order git ninja. I suggest asking him.

Ryosuke Niwa

Comment 11 2011-11-20 23:40:36 PST

FYI, we can do this on a svn client by "svn blame -r<revision number> <paths>"

Eric Seidel (no email)

Comment 12 2011-11-21 00:37:33 PST

I believe git-svn has a secret hidden svn checkout somewhere... so maybe we could use that?

Ryosuke Niwa

Comment 13 2011-11-21 16:12:43 PST

Okay, I'm just going to add the current version that doesn't support parsing change logs from multiple directories for now so that other contributors can at least play with what I have so far.

Ryosuke Niwa

Comment 14 2011-11-22 13:48:47 PST

Created attachment 116271 [details] Initial patch

Eric Seidel (no email)

Comment 15 2011-11-22 14:22:31 PST

Comment on attachment 116271 [details] Initial patch View in context: https://bugs.webkit.org/attachment.cgi?id=116271&action=review It's a big patch, I'll have to read through this again. > Tools/Scripts/webkitpy/common/checkout/changelog.py:123 > + if is_rollover: > + self._author = None > + self._author_name = None > + self._author_email = None > + self._reviewer_text = None > + self._reviewers = [] > + self._touched_files = [] I'm not a big fan of having a special entry for roll-overs... > Tools/Scripts/webkitpy/common/system/filesystem_mock.py:386 > + return '' Should this be none? unicode() is probably more accurate (and python 3 compatiable) than '' if we decide '' is right. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:31 > +from time import time I tend to use import foo for system imports, instead of the from syntax (which I always use for webkitpy imports). > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:54 > + changelogs = [filesystem.join(dirname, filename) for filename in filesystem.listdir(dirname) if re.match('^ChangeLog(-(\d{4}-\d{2}-\d{2}))?$', filename)] self._tool.filesystem. :) Why is this static? > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:61 > + def _generate_jsons(filesystem, json_files, output_dir): Why static? > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:71 > + filesystem = FileSystem() The tool has a filesystem. :) tool.filesystem > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108 > + self._filesystem = filesystem Yeah, I think you want to pass this a Host or a Tool object. Tool is a subclass of Host, which is a subclass of SystemHost. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:116 > + self._areas_statistics = {} > + for area in self._contribution_areas.names(): > + self._areas_statistics[area] = {'reviewed': 0, 'unreviewed': 0, 'contributors': {}} This can be done as a list comprehension, it's just slightly gnarly: self._areas_statistics = dict([(area, {'reviewed': 0, 'unreviewed': 0, 'contributors': {}}) for area in self._contribution_areas.names()]) > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:191 > + i = 0 > + dirname = self._filesystem.dirname(changelog_path) > + for entry in entries: You want for entry, index in enumerate(entries): :)

Ryosuke Niwa

Comment 16 2011-11-22 14:54:27 PST

Comment on attachment 116271 [details] Initial patch View in context: https://bugs.webkit.org/attachment.cgi?id=116271&action=review >> Tools/Scripts/webkitpy/common/system/filesystem_mock.py:386 >> + return '' > > Should this be none? unicode() is probably more accurate (and python 3 compatiable) than '' if we decide '' is right. Fixed. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:31 >> +from time import time > > I tend to use import foo for system imports, instead of the from syntax (which I always use for webkitpy imports). It seemed silly to repeat time.time() but okay. Fixed. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:54 >> + changelogs = [filesystem.join(dirname, filename) for filename in filesystem.listdir(dirname) if re.match('^ChangeLog(-(\d{4}-\d{2}-\d{2}))?$', filename)] > > self._tool.filesystem. :) Why is this static? For testing. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:61 >> + def _generate_jsons(filesystem, json_files, output_dir): > > Why static? Ditto. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:71 >> + filesystem = FileSystem() > > The tool has a filesystem. :) tool.filesystem Fixed. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108 >> + self._filesystem = filesystem > > Yeah, I think you want to pass this a Host or a Tool object. Tool is a subclass of Host, which is a subclass of SystemHost. Fixed. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:116 >> + self._areas_statistics[area] = {'reviewed': 0, 'unreviewed': 0, 'contributors': {}} > > This can be done as a list comprehension, it's just slightly gnarly: > > self._areas_statistics = dict([(area, {'reviewed': 0, 'unreviewed': 0, 'contributors': {}}) for area in self._contribution_areas.names()]) Fixed. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:191 >> + for entry in entries: > > You want for entry, index in enumerate(entries): :) Hm... I had to skip entires without author names in some changelogs but I guess it's okay to assert(entry.authors()) for now.

Ryosuke Niwa

Comment 17 2011-11-22 14:55:33 PST

Comment on attachment 116271 [details] Initial patch View in context: https://bugs.webkit.org/attachment.cgi?id=116271&action=review >> Tools/Scripts/webkitpy/common/checkout/changelog.py:123 >> + self._touched_files = [] > > I'm not a big fan of having a special entry for roll-overs... I actually realized that this rollover business is not needed for this iteration so will revert the entire change.

Ryosuke Niwa

Comment 18 2011-11-22 15:16:51 PST

Created attachment 116281 [details] Fixed per Eric's comment; also updated for ToT

Ryosuke Niwa

Comment 19 2011-11-23 11:56:31 PST

Ping eric.

Eric Seidel (no email)

Comment 20 2011-11-30 11:43:37 PST

Comment on attachment 116281 [details] Fixed per Eric's comment; also updated for ToT View in context: https://bugs.webkit.org/attachment.cgi?id=116281&action=review > Tools/Scripts/webkitpy/common/checkout/changelog.py:279 > + if rolled_over_regexp.match(line): > + break There can never (or should never?) be anything after the rolled over line? > Tools/Scripts/webkitpy/common/system/filesystem_mock.py:400 > + raise StopIteration You're blowing my mind. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:207 > + i += 1 You dont' need this anymore. :) > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:209 > + return i Can this work? Isn't i tied to teh context of the for loop?

Ryosuke Niwa

Comment 21 2011-11-30 16:15:26 PST

Created attachment 117290 [details] Fixed per Eric's comment

Eric Seidel (no email)

Comment 22 2011-11-30 16:34:49 PST

Comment on attachment 117290 [details] Fixed per Eric's comment View in context: https://bugs.webkit.org/attachment.cgi?id=117290&action=review I didn't review the html/js at all. The python looks generally right, but probably needs one more round. > Tools/Scripts/webkitpy/common/checkout/changelog.py:279 > + if rolled_over_regexp.match(line): > + break You should comment about why break is OK. I'm sorry I didn't make that clear in my previous commetn. Nothing big, but it's just not immediately obvious why we'd terminate the loop here. > Tools/Scripts/webkitpy/common/checkout/changelog.py:288 > + yield ChangeLogEntry(''.join(entry_lines[:-1])) > + Why do we add this one at the end? Is this part of the chagne tested? This could make a good separate patch w/ test case. :) > Tools/Scripts/webkitpy/common/config/contributionareas.py:181 > + def names(self): > + return [area.name() for area in self._contribution_areas] I think map(operator.namegetter('name'), self._contribution_areas) is another way to write this. I'm not sure if that's any more clear or concise though. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:32 > +import time > +from webkitpy.common.checkout.scm.detection import SCMDetector I normally put a blank line between system imports and webkitpy imports, but I don't think that's any sort of official style. You don't need to chagne it, just letting you know. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:48 > + steps.Options.changelog_count, What does this do? Is this how many changelogs back to search? Feels a bit strangely named. Feels like it's more interesting to think about searching back to a specific date instead of a number of Changelogs. the options "git log" has make sense here. :) > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:57 > + def _enumerate_changelogs(filesystem, dirname, changelog_count): > + changelogs = [filesystem.join(dirname, filename) for filename in filesystem.listdir(dirname) if re.match('^ChangeLog(-(\d{4}-\d{2}-\d{2}))?$', filename)] > + # Make sure ChangeLog shows up before ChangeLog-2011-01-01 > + changelogs = sorted(changelogs, key=lambda filename: filename + 'X', reverse=True) > + return changelogs[:changelog_count] Yeah, again I think we want to end up just sorting the entries instead of the logs themselves. :) That doesn't have to be this change, but i'm interested in your feel for that eventual design, if it makes sense. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:60 > + def _generate_jsons(filesystem, json_files, output_dir): jsons? json_files? > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:67 > + output_file = filesystem.open_text_file_for_writing(filesystem.join(output_dir, filename)) > + try: > + print ' Generating', filename > + output_file.write(json.dumps(json_files[filename], indent=2)) > + finally: > + output_file.close() I think you want to use "with" instead. That will autoclose at the end of the with block. Also, why not just use write_text_file directly? Or even json.dump? (which can take a filehandle)! So many better ways... > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:96 > + commands_dir = filesystem.dirname(filesystem.abspath(__file__)) You want filesystem.path_to_module(self.__module__) instead, I think. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108 > + self._filesystem = host.filesystem > + self._contribution_areas = ContributionAreas(host.filesystem) > + self._scm = host.scm() I'm curious why you store the filesystem and scm separately instead of just storing the host? > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:128 > + status = new_status + (' ' * max(len(self._previous_status) - len(new_status) - 1, 0)) local var for the max result might be helpful to clarity here. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:129 > + print "\b" * len(self._previous_status) + status, \b? > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:146 > + changelog = self._filesystem.open_text_file_for_reading(path) > + try: > + self._print_status('Parsing entries...') > + number_of_parsed_entries = self._analyze_entries(ChangeLog.parse_entries_from_file(changelog), path) > + finally: > + changelog.close() Yup, again, "with" or "write_text_file" are better options. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:156 > + area_contributors[contributor] = {'reviews': 0, 'reviewed': 0, 'unreviewed': 0} These dictionaries start to feel like objects (with default values in their constructor). > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:173 > + if contribution_type == 'reviews': > + statistics['total'] += 1 > + elif reviewed: > + statistics['reviewed'] += 1 > + else: > + statistics['unreviewed'] += 1 Yeah, you type these strings for these dictionaries several times here. :) > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:182 > + def _increment_dictionary_value(self, dictionary, key): > + dictionary[key] = dictionary.get(key, 0) + 1 Does setdefault() get you this? > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:204 > + for reviewer in reviewers_for_entry: > + self._collect_statistics_for_contributor(reviewer.full_name, 'reviews', areas_for_entry, touchedfiles_for_entry, reviewed=True) > + > + for author in authors_for_entry: > + self._collect_statistics_for_contributor(author['name'], 'patches', areas_for_entry, touchedfiles_for_entry, > + reviewed=bool(reviewers_for_entry)) > + > + for area in areas_for_entry: > + self._areas_statistics[area]['reviewed' if reviewers_for_entry else 'unreviewed'] += 1 Should you just be passing entry to these function calls?

Ryosuke Niwa

Comment 23 2011-12-02 22:00:58 PST

Comment on attachment 117290 [details] Fixed per Eric's comment View in context: https://bugs.webkit.org/attachment.cgi?id=117290&action=review >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:48 >> + steps.Options.changelog_count, > > What does this do? Is this how many changelogs back to search? Feels a bit strangely named. Feels like it's more interesting to think about searching back to a specific date instead of a number of Changelogs. the options "git log" has make sense here. :) Going back to specific date is tricky because people put entries with wrong dates all the time. I need to develop some heuristics for that. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:57 >> + return changelogs[:changelog_count] > > Yeah, again I think we want to end up just sorting the entries instead of the logs themselves. :) That doesn't have to be this change, but i'm interested in your feel for that eventual design, if it makes sense. In the current design, the order in which entries appear don't matter at all. If we really think the order matter then we probably want to parse logs per revision. Even though that'll take forever (ETA 27h), it'll guarantee the ordering. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:67 >> + output_file.close() > > I think you want to use "with" instead. That will autoclose at the end of the with block. Also, why not just use write_text_file directly? Or even json.dump? (which can take a filehandle)! So many better ways... Rewritten using write_text_file. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:96 >> + commands_dir = filesystem.dirname(filesystem.abspath(__file__)) > > You want filesystem.path_to_module(self.__module__) instead, I think. Done. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108 >> + self._scm = host.scm() > > I'm curious why you store the filesystem and scm separately instead of just storing the host? For ease of access and also because I don't really host. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:129 >> + print "\b" * len(self._previous_status) + status, > > \b? Rewritten using \r. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:146 >> + changelog.close() > > Yup, again, "with" or "write_text_file" are better options. Here, I'm reading a file and parse_entries_from_file requires a file object. Reading the entire file into memory and converting it to StringIO turns out to be slow. Do I need __future__ import still for with? Since nobody's going to use this tool on Leopard, if this won't cause an error in module initialization, then I'd rather avoid adding the ugly __future__ import. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:156 >> + area_contributors[contributor] = {'reviews': 0, 'reviewed': 0, 'unreviewed': 0} > > These dictionaries start to feel like objects (with default values in their constructor). Yeah but then we'll end up having lots of them, and I need to define __str__ for them. >> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:204 >> + self._areas_statistics[area]['reviewed' if reviewers_for_entry else 'unreviewed'] += 1 > > Should you just be passing entry to these function calls? No. Calling _contribution_areas.areas_for_touched_files is very expensive. We can't afford to do that twice.

Ryosuke Niwa

Comment 24 2011-12-02 22:05:38 PST

Created attachment 117741 [details] Fixed per Eric's review

Eric Seidel (no email)

Comment 25 2011-12-02 23:30:34 PST

(In reply to comment #23) > (From update of attachment 117290 [details]) > Here, I'm reading a file and parse_entries_from_file requires a file object. Reading the entire file into memory and converting it to StringIO turns out to be slow. Do I need __future__ import still for with? Since nobody's going to use this tool on Leopard, if this won't cause an error in module initialization, then I'd rather avoid adding the ugly __future__ import. I will remove all the __future__ imports when we drop support for python 2.5. Until then, they're required, as there may be bots running test-webkitpy on 2.5.

Leandro Pereira

Comment 26 2011-12-06 07:59:52 PST

Comment on attachment 117741 [details] Fixed per Eric's review View in context: https://bugs.webkit.org/attachment.cgi?id=117741&action=review I like this; cleaner than my solution. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:115 > + self._longest_filename = max([len(path) - len(self._scm.checkout_root) for path in changelog_paths]) There's no need to create a temporary list here; just use a generator: self._longest_filename = max(len(path) ... for path in changelog_paths) > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:149 > + self._summary['contributors_with_reviews'] = sum([1 for contributor in self._contributors_statistics.values() if contributor['reviews']['total']]) Instead of sum(), you could use len(). Also, there's no need to create a temporary list, just use a generator here: ... = len(1 for contributor in ... if ...) > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:206 > + i += 1 No-op. You're iterating over a generator created by enumerate(), ``i'' will be overwritten in the next step. If you want to start from 1, you can use the start parameter of enumerate: enumerate(entries, start=1). I'm not sure if it's available in the minimum version of Python we're supposed to support, but since you're using ``i'' only to call _print_status(), printing (i + 1) would suffice here.

Ryosuke Niwa

Comment 27 2011-12-08 09:10:28 PST

Ping reviewers.

Eric Seidel (no email)

Comment 28 2011-12-08 11:39:26 PST

Comment on attachment 117741 [details] Fixed per Eric's review View in context: https://bugs.webkit.org/attachment.cgi?id=117741&action=review I expect we could make this nicer with a few more rounds. But this is OK to go. Thanks for hte patch. Please fix my and leandro's comments before landing. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:96 > + print commands_dir Do you wnat this printing? > Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:137 > + if self._filename: > + print Hum? > Tools/Scripts/webkitpy/tool/commands/analyzechangelog_unittest.py:65 > + test_json = {'array.json': [1, 2, 3, {'key': 'value'}], 'dictionary.json': {'somekey': 'somevalue', 'array': [4, 5]}} > + sys_stdout = sys.stdout > + try: > + sys.stdout = filesystem.open_text_file_for_writing('stdout') > + AnalyzeChangeLog._generate_jsons(filesystem, test_json, 'bar') > + self.assertEqual(set(filesystem.files.keys()), set(['bar/array.json', 'bar/dictionary.json', 'stdout'])) > + finally: > + sys.stdout = sys_stdout Huh? do you want OutputCapture? That's our testing way of overriding sys.stdout. > Tools/Scripts/webkitpy/tool/commands/analyzechangelog_unittest.py:148 > + sys_stdout = sys.stdout > + try: > + sys.stdout = host.filesystem.open_text_file_for_writing('stdout') > + analyzer = ChangeLogAnalyzer(host, ['mock-checkout/foo/ChangeLog']) > + analyzer.analyze() > + finally: > + sys.stdout = sys_stdout Again, outputcapture.

Ryosuke Niwa

Comment 29 2012-01-03 13:25:23 PST

Committed r103959: <http://trac.webkit.org/changeset/103959>

Evan Martin

Comment 30 2012-01-03 16:45:25 PST

I just saw the git question. In general git svn is a weak shell around underlying git commands, so if you need to do something fancy with an svn revision, the right way is to map it to a git hash first. E.g. hash = git svn find-rev r1234 git blame --complex-git-flags $hash

Note You need to log in before you can comment on or make changes to this bug.

Status RESOLVED

Resolution FIXED

Priority P2

Severity Normal

Classification Unclassified

Version 528+ (Nightly build)

Hardware Unspecified

OS Unspecified

Product WebKit

Component Tools / Tests

Assignee

Ryosuke Niwa

Reported

2011-11-13 23:29 PST

Modified

2012-01-03 16:45 PST History

CC List

8 users Show

URL

Keywords

Depends on

72351 72540 72566 72636 72690 72703

Blocks

68061

Dependencies

tree graph