WebKit Bugzilla
New
Browse
Log In
×
Sign in with GitHub
or
Remember my login
Create Account
·
Forgot Password
Forgotten password account recovery
RESOLVED FIXED
72243
Add a tool to analyze change logs
https://bugs.webkit.org/show_bug.cgi?id=72243
Summary
Add a tool to analyze change logs
Ryosuke Niwa
Reported
2011-11-13 23:29:32 PST
We need a tool that analyzes change logs in order to figure out areas of contributions of each contributor
Attachments
work in progress
(16.81 KB, patch)
2011-11-13 23:30 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
json files generated by the script + summary.html
(
deleted
)
2011-11-13 23:31 PST
,
Ryosuke Niwa
no flags
Details
work in progress 2
(18.38 KB, patch)
2011-11-15 22:55 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
json files + summary.html version 2
(
deleted
)
2011-11-15 22:56 PST
,
Ryosuke Niwa
no flags
Details
screenshot
(
deleted
)
2011-11-16 01:39 PST
,
Ryosuke Niwa
no flags
Details
work in progress 3
(23.83 KB, patch)
2011-11-20 14:22 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
Initial patch
(41.98 KB, patch)
2011-11-22 13:48 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
Fixed per Eric's comment; also updated for ToT
(40.27 KB, patch)
2011-11-22 15:16 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
Fixed per Eric's comment
(39.88 KB, patch)
2011-11-30 16:15 PST
,
Ryosuke Niwa
no flags
Details
Formatted Diff
Diff
Fixed per Eric's review
(38.53 KB, patch)
2011-12-02 22:05 PST
,
Ryosuke Niwa
eric
: review+
Details
Formatted Diff
Diff
Show Obsolete
(7)
View All
Add attachment
proposed patch, testcase, etc.
Ryosuke Niwa
Comment 1
2011-11-13 23:30:26 PST
Created
attachment 114885
[details]
work in progress
Ryosuke Niwa
Comment 2
2011-11-13 23:31:13 PST
Created
attachment 114886
[details]
json files generated by the script + summary.html
Ryosuke Niwa
Comment 3
2011-11-15 22:55:32 PST
Created
attachment 115327
[details]
work in progress 2
Ryosuke Niwa
Comment 4
2011-11-15 22:56:37 PST
Created
attachment 115328
[details]
json files + summary.html version 2
Ryosuke Niwa
Comment 5
2011-11-16 01:39:39 PST
Created
attachment 115348
[details]
screenshot
Ryosuke Niwa
Comment 6
2011-11-20 14:22:47 PST
Created
attachment 115998
[details]
work in progress 3
Ryosuke Niwa
Comment 7
2011-11-20 23:05:47 PST
I'm stuck. Apparently, there's no way to get "git svn blame <file>" at a specific SVN revision. This is necessary because git svn blame for WebCore/ChangeLog-2011-10-19 consists entirely of
r97937
. The only solution I can think of is to rely on each commit and retrieve change log entries added by each commit (by git/svn diff). But this will be substantially slower (we'll end up spending 1-2s per revision) so we'll need 27 hours to process 100,000 entries.
Eric Seidel (no email)
Comment 8
2011-11-20 23:08:49 PST
I'm not sure I understand. Why not just go back to before
r97937
and do the blame?
Ryosuke Niwa
Comment 9
2011-11-20 23:14:17 PST
(In reply to
comment #8
)
> I'm not sure I understand. Why not just go back to before
r97937
and do the blame?
Do you know how to do that on git? I've spent hours googling about that but couldn't figure it out.
Eric Seidel (no email)
Comment 10
2011-11-20 23:23:15 PST
Short of checking out to that revision, I'm not sure. Evan is a third-order git ninja. I suggest asking him.
Ryosuke Niwa
Comment 11
2011-11-20 23:40:36 PST
FYI, we can do this on a svn client by "svn blame -r<revision number> <paths>"
Eric Seidel (no email)
Comment 12
2011-11-21 00:37:33 PST
I believe git-svn has a secret hidden svn checkout somewhere... so maybe we could use that?
Ryosuke Niwa
Comment 13
2011-11-21 16:12:43 PST
Okay, I'm just going to add the current version that doesn't support parsing change logs from multiple directories for now so that other contributors can at least play with what I have so far.
Ryosuke Niwa
Comment 14
2011-11-22 13:48:47 PST
Created
attachment 116271
[details]
Initial patch
Eric Seidel (no email)
Comment 15
2011-11-22 14:22:31 PST
Comment on
attachment 116271
[details]
Initial patch View in context:
https://bugs.webkit.org/attachment.cgi?id=116271&action=review
It's a big patch, I'll have to read through this again.
> Tools/Scripts/webkitpy/common/checkout/changelog.py:123 > + if is_rollover: > + self._author = None > + self._author_name = None > + self._author_email = None > + self._reviewer_text = None > + self._reviewers = [] > + self._touched_files = []
I'm not a big fan of having a special entry for roll-overs...
> Tools/Scripts/webkitpy/common/system/filesystem_mock.py:386 > + return ''
Should this be none? unicode() is probably more accurate (and python 3 compatiable) than '' if we decide '' is right.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:31 > +from time import time
I tend to use import foo for system imports, instead of the from syntax (which I always use for webkitpy imports).
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:54 > + changelogs = [filesystem.join(dirname, filename) for filename in filesystem.listdir(dirname) if re.match('^ChangeLog(-(\d{4}-\d{2}-\d{2}))?$', filename)]
self._tool.filesystem. :) Why is this static?
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:61 > + def _generate_jsons(filesystem, json_files, output_dir):
Why static?
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:71 > + filesystem = FileSystem()
The tool has a filesystem. :) tool.filesystem
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108 > + self._filesystem = filesystem
Yeah, I think you want to pass this a Host or a Tool object. Tool is a subclass of Host, which is a subclass of SystemHost.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:116 > + self._areas_statistics = {} > + for area in self._contribution_areas.names(): > + self._areas_statistics[area] = {'reviewed': 0, 'unreviewed': 0, 'contributors': {}}
This can be done as a list comprehension, it's just slightly gnarly: self._areas_statistics = dict([(area, {'reviewed': 0, 'unreviewed': 0, 'contributors': {}}) for area in self._contribution_areas.names()])
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:191 > + i = 0 > + dirname = self._filesystem.dirname(changelog_path) > + for entry in entries:
You want for entry, index in enumerate(entries): :)
Ryosuke Niwa
Comment 16
2011-11-22 14:54:27 PST
Comment on
attachment 116271
[details]
Initial patch View in context:
https://bugs.webkit.org/attachment.cgi?id=116271&action=review
>> Tools/Scripts/webkitpy/common/system/filesystem_mock.py:386 >> + return '' > > Should this be none? unicode() is probably more accurate (and python 3 compatiable) than '' if we decide '' is right.
Fixed.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:31 >> +from time import time > > I tend to use import foo for system imports, instead of the from syntax (which I always use for webkitpy imports).
It seemed silly to repeat time.time() but okay. Fixed.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:54 >> + changelogs = [filesystem.join(dirname, filename) for filename in filesystem.listdir(dirname) if re.match('^ChangeLog(-(\d{4}-\d{2}-\d{2}))?$', filename)] > > self._tool.filesystem. :) Why is this static?
For testing.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:61 >> + def _generate_jsons(filesystem, json_files, output_dir): > > Why static?
Ditto.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:71 >> + filesystem = FileSystem() > > The tool has a filesystem. :) tool.filesystem
Fixed.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108 >> + self._filesystem = filesystem > > Yeah, I think you want to pass this a Host or a Tool object. Tool is a subclass of Host, which is a subclass of SystemHost.
Fixed.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:116 >> + self._areas_statistics[area] = {'reviewed': 0, 'unreviewed': 0, 'contributors': {}} > > This can be done as a list comprehension, it's just slightly gnarly: > > self._areas_statistics = dict([(area, {'reviewed': 0, 'unreviewed': 0, 'contributors': {}}) for area in self._contribution_areas.names()])
Fixed.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:191 >> + for entry in entries: > > You want for entry, index in enumerate(entries): :)
Hm... I had to skip entires without author names in some changelogs but I guess it's okay to assert(entry.authors()) for now.
Ryosuke Niwa
Comment 17
2011-11-22 14:55:33 PST
Comment on
attachment 116271
[details]
Initial patch View in context:
https://bugs.webkit.org/attachment.cgi?id=116271&action=review
>> Tools/Scripts/webkitpy/common/checkout/changelog.py:123 >> + self._touched_files = [] > > I'm not a big fan of having a special entry for roll-overs...
I actually realized that this rollover business is not needed for this iteration so will revert the entire change.
Ryosuke Niwa
Comment 18
2011-11-22 15:16:51 PST
Created
attachment 116281
[details]
Fixed per Eric's comment; also updated for ToT
Ryosuke Niwa
Comment 19
2011-11-23 11:56:31 PST
Ping eric.
Eric Seidel (no email)
Comment 20
2011-11-30 11:43:37 PST
Comment on
attachment 116281
[details]
Fixed per Eric's comment; also updated for ToT View in context:
https://bugs.webkit.org/attachment.cgi?id=116281&action=review
> Tools/Scripts/webkitpy/common/checkout/changelog.py:279 > + if rolled_over_regexp.match(line): > + break
There can never (or should never?) be anything after the rolled over line?
> Tools/Scripts/webkitpy/common/system/filesystem_mock.py:400 > + raise StopIteration
You're blowing my mind.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:207 > + i += 1
You dont' need this anymore. :)
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:209 > + return i
Can this work? Isn't i tied to teh context of the for loop?
Ryosuke Niwa
Comment 21
2011-11-30 16:15:26 PST
Created
attachment 117290
[details]
Fixed per Eric's comment
Eric Seidel (no email)
Comment 22
2011-11-30 16:34:49 PST
Comment on
attachment 117290
[details]
Fixed per Eric's comment View in context:
https://bugs.webkit.org/attachment.cgi?id=117290&action=review
I didn't review the html/js at all. The python looks generally right, but probably needs one more round.
> Tools/Scripts/webkitpy/common/checkout/changelog.py:279 > + if rolled_over_regexp.match(line): > + break
You should comment about why break is OK. I'm sorry I didn't make that clear in my previous commetn. Nothing big, but it's just not immediately obvious why we'd terminate the loop here.
> Tools/Scripts/webkitpy/common/checkout/changelog.py:288 > + yield ChangeLogEntry(''.join(entry_lines[:-1])) > +
Why do we add this one at the end? Is this part of the chagne tested? This could make a good separate patch w/ test case. :)
> Tools/Scripts/webkitpy/common/config/contributionareas.py:181 > + def names(self): > + return [area.name() for area in self._contribution_areas]
I think map(operator.namegetter('name'), self._contribution_areas) is another way to write this. I'm not sure if that's any more clear or concise though.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:32 > +import time > +from webkitpy.common.checkout.scm.detection import SCMDetector
I normally put a blank line between system imports and webkitpy imports, but I don't think that's any sort of official style. You don't need to chagne it, just letting you know.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:48 > + steps.Options.changelog_count,
What does this do? Is this how many changelogs back to search? Feels a bit strangely named. Feels like it's more interesting to think about searching back to a specific date instead of a number of Changelogs. the options "git log" has make sense here. :)
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:57 > + def _enumerate_changelogs(filesystem, dirname, changelog_count): > + changelogs = [filesystem.join(dirname, filename) for filename in filesystem.listdir(dirname) if re.match('^ChangeLog(-(\d{4}-\d{2}-\d{2}))?$', filename)] > + # Make sure ChangeLog shows up before ChangeLog-2011-01-01 > + changelogs = sorted(changelogs, key=lambda filename: filename + 'X', reverse=True) > + return changelogs[:changelog_count]
Yeah, again I think we want to end up just sorting the entries instead of the logs themselves. :) That doesn't have to be this change, but i'm interested in your feel for that eventual design, if it makes sense.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:60 > + def _generate_jsons(filesystem, json_files, output_dir):
jsons? json_files?
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:67 > + output_file = filesystem.open_text_file_for_writing(filesystem.join(output_dir, filename)) > + try: > + print ' Generating', filename > + output_file.write(json.dumps(json_files[filename], indent=2)) > + finally: > + output_file.close()
I think you want to use "with" instead. That will autoclose at the end of the with block. Also, why not just use write_text_file directly? Or even json.dump? (which can take a filehandle)! So many better ways...
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:96 > + commands_dir = filesystem.dirname(filesystem.abspath(__file__))
You want filesystem.path_to_module(self.__module__) instead, I think.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108 > + self._filesystem = host.filesystem > + self._contribution_areas = ContributionAreas(host.filesystem) > + self._scm = host.scm()
I'm curious why you store the filesystem and scm separately instead of just storing the host?
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:128 > + status = new_status + (' ' * max(len(self._previous_status) - len(new_status) - 1, 0))
local var for the max result might be helpful to clarity here.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:129 > + print "\b" * len(self._previous_status) + status,
\b?
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:146 > + changelog = self._filesystem.open_text_file_for_reading(path) > + try: > + self._print_status('Parsing entries...') > + number_of_parsed_entries = self._analyze_entries(ChangeLog.parse_entries_from_file(changelog), path) > + finally: > + changelog.close()
Yup, again, "with" or "write_text_file" are better options.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:156 > + area_contributors[contributor] = {'reviews': 0, 'reviewed': 0, 'unreviewed': 0}
These dictionaries start to feel like objects (with default values in their constructor).
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:173 > + if contribution_type == 'reviews': > + statistics['total'] += 1 > + elif reviewed: > + statistics['reviewed'] += 1 > + else: > + statistics['unreviewed'] += 1
Yeah, you type these strings for these dictionaries several times here. :)
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:182 > + def _increment_dictionary_value(self, dictionary, key): > + dictionary[key] = dictionary.get(key, 0) + 1
Does setdefault() get you this?
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:204 > + for reviewer in reviewers_for_entry: > + self._collect_statistics_for_contributor(reviewer.full_name, 'reviews', areas_for_entry, touchedfiles_for_entry, reviewed=True) > + > + for author in authors_for_entry: > + self._collect_statistics_for_contributor(author['name'], 'patches', areas_for_entry, touchedfiles_for_entry, > + reviewed=bool(reviewers_for_entry)) > + > + for area in areas_for_entry: > + self._areas_statistics[area]['reviewed' if reviewers_for_entry else 'unreviewed'] += 1
Should you just be passing entry to these function calls?
Ryosuke Niwa
Comment 23
2011-12-02 22:00:58 PST
Comment on
attachment 117290
[details]
Fixed per Eric's comment View in context:
https://bugs.webkit.org/attachment.cgi?id=117290&action=review
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:48 >> + steps.Options.changelog_count, > > What does this do? Is this how many changelogs back to search? Feels a bit strangely named. Feels like it's more interesting to think about searching back to a specific date instead of a number of Changelogs. the options "git log" has make sense here. :)
Going back to specific date is tricky because people put entries with wrong dates all the time. I need to develop some heuristics for that.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:57 >> + return changelogs[:changelog_count] > > Yeah, again I think we want to end up just sorting the entries instead of the logs themselves. :) That doesn't have to be this change, but i'm interested in your feel for that eventual design, if it makes sense.
In the current design, the order in which entries appear don't matter at all. If we really think the order matter then we probably want to parse logs per revision. Even though that'll take forever (ETA 27h), it'll guarantee the ordering.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:67 >> + output_file.close() > > I think you want to use "with" instead. That will autoclose at the end of the with block. Also, why not just use write_text_file directly? Or even json.dump? (which can take a filehandle)! So many better ways...
Rewritten using write_text_file.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:96 >> + commands_dir = filesystem.dirname(filesystem.abspath(__file__)) > > You want filesystem.path_to_module(self.__module__) instead, I think.
Done.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:108 >> + self._scm = host.scm() > > I'm curious why you store the filesystem and scm separately instead of just storing the host?
For ease of access and also because I don't really host.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:129 >> + print "\b" * len(self._previous_status) + status, > > \b?
Rewritten using \r.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:146 >> + changelog.close() > > Yup, again, "with" or "write_text_file" are better options.
Here, I'm reading a file and parse_entries_from_file requires a file object. Reading the entire file into memory and converting it to StringIO turns out to be slow. Do I need __future__ import still for with? Since nobody's going to use this tool on Leopard, if this won't cause an error in module initialization, then I'd rather avoid adding the ugly __future__ import.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:156 >> + area_contributors[contributor] = {'reviews': 0, 'reviewed': 0, 'unreviewed': 0} > > These dictionaries start to feel like objects (with default values in their constructor).
Yeah but then we'll end up having lots of them, and I need to define __str__ for them.
>> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:204 >> + self._areas_statistics[area]['reviewed' if reviewers_for_entry else 'unreviewed'] += 1 > > Should you just be passing entry to these function calls?
No. Calling _contribution_areas.areas_for_touched_files is very expensive. We can't afford to do that twice.
Ryosuke Niwa
Comment 24
2011-12-02 22:05:38 PST
Created
attachment 117741
[details]
Fixed per Eric's review
Eric Seidel (no email)
Comment 25
2011-12-02 23:30:34 PST
(In reply to
comment #23
)
> (From update of
attachment 117290
[details]
) > Here, I'm reading a file and parse_entries_from_file requires a file object. Reading the entire file into memory and converting it to StringIO turns out to be slow. Do I need __future__ import still for with? Since nobody's going to use this tool on Leopard, if this won't cause an error in module initialization, then I'd rather avoid adding the ugly __future__ import.
I will remove all the __future__ imports when we drop support for python 2.5. Until then, they're required, as there may be bots running test-webkitpy on 2.5.
Leandro Pereira
Comment 26
2011-12-06 07:59:52 PST
Comment on
attachment 117741
[details]
Fixed per Eric's review View in context:
https://bugs.webkit.org/attachment.cgi?id=117741&action=review
I like this; cleaner than my solution.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:115 > + self._longest_filename = max([len(path) - len(self._scm.checkout_root) for path in changelog_paths])
There's no need to create a temporary list here; just use a generator: self._longest_filename = max(len(path) ... for path in changelog_paths)
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:149 > + self._summary['contributors_with_reviews'] = sum([1 for contributor in self._contributors_statistics.values() if contributor['reviews']['total']])
Instead of sum(), you could use len(). Also, there's no need to create a temporary list, just use a generator here: ... = len(1 for contributor in ... if ...)
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:206 > + i += 1
No-op. You're iterating over a generator created by enumerate(), ``i'' will be overwritten in the next step. If you want to start from 1, you can use the start parameter of enumerate: enumerate(entries, start=1). I'm not sure if it's available in the minimum version of Python we're supposed to support, but since you're using ``i'' only to call _print_status(), printing (i + 1) would suffice here.
Ryosuke Niwa
Comment 27
2011-12-08 09:10:28 PST
Ping reviewers.
Eric Seidel (no email)
Comment 28
2011-12-08 11:39:26 PST
Comment on
attachment 117741
[details]
Fixed per Eric's review View in context:
https://bugs.webkit.org/attachment.cgi?id=117741&action=review
I expect we could make this nicer with a few more rounds. But this is OK to go. Thanks for hte patch. Please fix my and leandro's comments before landing.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:96 > + print commands_dir
Do you wnat this printing?
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog.py:137 > + if self._filename: > + print
Hum?
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog_unittest.py:65 > + test_json = {'array.json': [1, 2, 3, {'key': 'value'}], 'dictionary.json': {'somekey': 'somevalue', 'array': [4, 5]}} > + sys_stdout = sys.stdout > + try: > + sys.stdout = filesystem.open_text_file_for_writing('stdout') > + AnalyzeChangeLog._generate_jsons(filesystem, test_json, 'bar') > + self.assertEqual(set(filesystem.files.keys()), set(['bar/array.json', 'bar/dictionary.json', 'stdout'])) > + finally: > + sys.stdout = sys_stdout
Huh? do you want OutputCapture? That's our testing way of overriding sys.stdout.
> Tools/Scripts/webkitpy/tool/commands/analyzechangelog_unittest.py:148 > + sys_stdout = sys.stdout > + try: > + sys.stdout = host.filesystem.open_text_file_for_writing('stdout') > + analyzer = ChangeLogAnalyzer(host, ['mock-checkout/foo/ChangeLog']) > + analyzer.analyze() > + finally: > + sys.stdout = sys_stdout
Again, outputcapture.
Ryosuke Niwa
Comment 29
2012-01-03 13:25:23 PST
Committed
r103959
: <
http://trac.webkit.org/changeset/103959
>
Evan Martin
Comment 30
2012-01-03 16:45:25 PST
I just saw the git question. In general git svn is a weak shell around underlying git commands, so if you need to do something fancy with an svn revision, the right way is to map it to a git hash first. E.g. hash = git svn find-rev r1234 git blame --complex-git-flags $hash
Note
You need to
log in
before you can comment on or make changes to this bug.
Top of Page
Format For Printing
XML
Clone This Bug