WebKit Bugzilla
RESOLVED FIXED
Bug 37765
REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø
https://bugs.webkit.org/show_bug.cgi?id=37765
Summary
REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø
Eric Seidel (no email)
Reported
2010-04-18 00:10:27 PDT
REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø I fixed the commit-queue ignoring Tor as a reviewer, but now it throws an exception every time it tries to land a patch with his name in it. I created this bug by hand due to
bug 37764
. :(
Attachments
Patch (39.54 KB, patch), 2010-04-18 04:14 PDT, Eric Seidel (no email), no flags
Patch (53.77 KB, patch), 2010-04-18 04:59 PDT, Eric Seidel (no email), no flags
Fixed to be way more awesome (88.59 KB, patch), 2010-04-21 03:33 PDT, Eric Seidel (no email), no flags
Fix Chromium Canaries (3.67 KB, patch), 2010-04-21 15:47 PDT, Eric Seidel (no email), no flags
Fix for run-webkit-tests --chromium (2.00 KB, patch), 2010-04-21 20:10 PDT, Eric Seidel (no email), no flags
Patch (1.45 KB, patch), 2010-04-21 20:40 PDT, Eric Seidel (no email), no flags
Chris Jerdonek
Comment 1
2010-04-18 02:09:47 PDT
I'm not sure if code like this will solve it:
http://trac.webkit.org/browser/trunk/WebKitTools/Scripts/webkitpy/style/checker.py?rev=57467#L564
Whatever the answer is, we may want to expose the necessary helper methods in webkitpy/common/ since we should probably be supporting unicode characters across the board.
Eric Seidel (no email)
Comment 2
2010-04-18 02:14:34 PDT
I've now coded up two versions of this patch. Both have failed in different places. :( Unicode = trail of tears. The first version tried to deploy use of the unicode() class more widely. That ran into trouble because we will have to change a bunch of places to be more unicode() aware and correctly encode to utf-8 before calling things like file.write(). The second version tried to remove use of the unicode() class from our codebase and instead just encode Tor's name in utf8 in a str() object. Turns out that fails too, because if you call str("Tor Arne Vestb\xC3\xB8").encode("utf8") python notices that \xC3 is not a valid ascii character and throws an exception. So it seems like python is forcing us towards using unicode(). That's just gonna hurt. A lot.
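For illustration, a minimal Python 2.x sketch of the failure described above (the byte string is just Tor's name encoded as UTF-8; nothing here is actual webkitpy code):

    # Python 2.x: a str is a byte array; this one holds UTF-8 encoded bytes.
    tor_bytes = "Tor Arne Vestb\xc3\xb8"
    try:
        tor_bytes.encode("utf-8")   # Python implicitly decodes with the ascii codec first...
    except UnicodeDecodeError:
        pass                        # ...and \xc3 is not valid ASCII, so it raises.
    tor_unicode = tor_bytes.decode("utf-8")   # decoding (not encoding) is the direction that works
    assert tor_unicode == u"Tor Arne Vestb\u00f8"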
Eric Seidel (no email)
Comment 3
2010-04-18 02:20:44 PDT
http://farmdev.com/talks/unicode/
is really pushing unicode(). Not sure I'm ready for the koolaid.
Eric Seidel (no email)
Comment 4
2010-04-18 02:25:48 PDT
OK. I think I'm maybe convinced. We should take the unicode dive. I will need a few hours and a good stiff drink. Maybe monday.
Chris Jerdonek
Comment 5
2010-04-18 02:29:29 PDT
(In reply to comment #4)
> OK. I think I'm maybe convinced.
>
> We should take the unicode dive. I will need a few hours and a good stiff
> drink. Maybe monday.
Briefly, what's the pain point?
Eric Seidel (no email)
Comment 6
2010-04-18 02:47:51 PDT
(In reply to comment #5)
> > We should take the unicode dive. I will need a few hours and a good stiff
> > drink. Maybe monday.
>
> Briefly, what's the pain point?
What do you mean?
Chris Jerdonek
Comment 7
2010-04-18 02:55:08 PDT
(In reply to comment #6)
> > Briefly, what's the pain point?
>
> What do you mean?

I was just asking if you could explain briefly why it will hurt so much to start using it more widely (as you say below). And will that be true even if we use it just in the places we really need it?

> So it seems like python is forcing us towards using unicode(). That's just
> gonna hurt. A lot.
Eric Seidel (no email)
Comment 8
2010-04-18 03:00:23 PDT
Every place that we log, call print, or call write we will need to encode. Many of our unit tests will need unicode versions added (eventually). Every time we open a file or read off the network (when we're not using BeautifulSoup) we will need to document what encoding we expect. Totally doable. Just will be a large change. Looking at it now.
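A small sketch of what this implies at each write boundary (Python 2.x; the log path is made up for illustration):

    name = u"Tor Arne Vestb\u00f8"

    log = open("/tmp/commit-queue.log", "a")   # plain open() deals in bytes (str)
    log.write(name.encode("utf-8") + "\n")     # encode late, right at the I/O boundary
    log.close()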
Eric Seidel (no email)
Comment 9
2010-04-18 04:14:28 PDT
Created
attachment 53625
[details]
Patch
Eric Seidel (no email)
Comment 10
2010-04-18 04:15:14 PDT
Comment on
attachment 53625
[details]
Patch

I'm sure this change will break things. We'll fix them when we find where it broke stuff.
WebKit Review Bot
Comment 11
2010-04-18 04:20:48 PDT
Attachment 53625
[details]
did not pass style-queue:

Failed to run "WebKitTools/Scripts/check-webkit-style" exit_code: 1
WebKitTools/Scripts/webkitpy/common/net/buildbot.py:413: whitespace before ':' [pep8/E203] [5]
WebKitTools/Scripts/webkitpy/common/system/autoinstall.py:127: at least two spaces before inline comment [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/steps/postdiff.py:47: at least two spaces before inline comment [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/commands/upload.py:265: at least two spaces before inline comment [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/commands/upload.py:407: at least two spaces before inline comment [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/commands/upload.py:427: at least two spaces before inline comment [pep8/E261] [5]
Total errors found: 6 in 24 files

If any of these errors are false positives, please file a bug against check-webkit-style.
Eric Seidel (no email)
Comment 12
2010-04-18 04:38:02 PDT
Comment on
attachment 53625
[details]
Patch

Nope, not quite there. Will post a better patch in the morning.
Eric Seidel (no email)
Comment 13
2010-04-18 04:59:06 PDT
Created
attachment 53626
[details]
Patch
Eric Seidel (no email)
Comment 14
2010-04-18 05:00:10 PDT
I think this one finally works. I ended up having to rip out a bunch of StringIO plumbing. Basically every time we use StringIO, we have to be sure to encode the unicode string to a utf-8 byte array so that StringIO can vend back byte arrays via its read() method.
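Roughly what that looks like (a sketch of the approach described here, not the actual webkitpy plumbing):

    import StringIO

    name = u"Tor Arne Vestb\u00f8"
    stream = StringIO.StringIO(name.encode("utf-8"))  # hand StringIO a utf-8 byte array (str)
    data = stream.read()                              # read() then vends bytes back
    assert data.decode("utf-8") == name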
Chris Jerdonek
Comment 15
2010-04-18 10:41:18 PDT
Thanks for taking this on and looking into this, Eric. (Good slide show in comment 3, btw.) A couple random comments before a couple comments on the patch:

(1) As the slides suggest, we may want to consider decoding using "utf-8-sig" as a general practice instead of "utf-8": "To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig").... Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file." (from http://docs.python.org/library/codecs.html#encodings-and-unicode)

(2) I came across this which might be useful somewhere (intelligent encoding auto-detection): http://chardet.feedparser.org/
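For reference, a quick Python 2.x illustration of the difference between the two codecs (the data is made up):

    import codecs

    # A UTF-8 BOM followed by "Vestbø" encoded as UTF-8.
    data = codecs.BOM_UTF8 + u"Vestb\u00f8".encode("utf-8")
    print repr(data.decode("utf-8"))      # u'\ufeffVestb\xf8' -- the BOM survives as a character
    print repr(data.decode("utf-8-sig"))  # u'Vestb\xf8'       -- the BOM is stripped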
Adam Barth
Comment 16
2010-04-18 11:23:36 PDT
We should be careful about BOM proliferation. It might confuse tools that don't understand it.
Chris Jerdonek
Comment 17
2010-04-18 11:31:19 PDT
(In reply to comment #16)
> We should be careful about BOM proliferation. It might confuse tools that
> don't understand it.
Just to be clear, I was suggesting using "utf-8-sig" only when reading -- which would remove the BOM.
Chris Jerdonek
Comment 18
2010-04-18 11:36:29 PDT
+++ b/WebKitTools/ChangeLog
@@ -1,3 +1,93 @@
+        We do not have to use u"" instead of "" because u"a" == "a" as expected
+        in Python.  Python will generate a warning to the console in cases where
+        a unicode() == str() operation cannot be performed.

Can you clarify what you are trying to say here? It sounds like you might be saying more than you mean to say -- e.g. that the caller never needs to use unicode string literals, which is obviously not right.

+        All places which use StringIO need to be sure to pass StringIO a
+        pre-encoded byte-array (str object) instead of unicode so that
+        clients which read from the StringIO don't have encoding exceptions.

It seems like we shouldn't need to use StringIO as much as we do (except perhaps in unittest code to make certain tests easier). It looks like you're removing a lot of the places it's being used, which seems good.

@@ -117,7 +118,9 @@ class ChangeLog(object):
     def latest_entry(self):
         # ChangeLog files are always UTF-8, we read them in as such to support Reviewers with unicode in their names.
-        changelog_file = codecs.open(self.path, "r", "utf-8")
+        # We don't use codecs.open here to make the api for parse_latest_entry_from_file clearer.
+        # If we did, then it would be unclear as to whose responsibility decoding of the file should be.
+        changelog_file = open(self.path, "r")
         try:
             return self.parse_latest_entry_from_file(changelog_file)

I understand what you're saying, but maybe the conclusion that should be drawn is that we shouldn't have parse_latest_entry_from_file() as part of our API? In other words, the caller should be responsible. That would also be more consistent with the rule of thumb to decode as early as possible.

+++ b/WebKitTools/Scripts/webkitpy/common/system/deprecated_logging.py
@@ -45,9 +45,10 @@ class tee:
     def __init__(self, *files):
         self.files = files

-    def write(self, string):
+    # Callers should pass an already encoded string for writing.
+    def write(self, bytes):
         for file in self.files:
-            file.write(string)
+            file.write(bytes)

Doesn't this also go against "unicode everywhere/encode late"? Maybe this would be a good spot to do type-checking between unicode/str for backwards compatibility.

+++ b/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/metered_stream.py
@@ -56,6 +56,8 @@ class MeteredStream:
         self._stream = stream
         self._last_update = ""

+    # FIXME: Does this take a string (unicode) or an array of bytes (str)?
+    # If it takes a string, it needs to call txt.encode("utf-8")
     def write(self, txt):
         """Write text directly to the stream, overwriting and resetting the meter."""

I'm a bit unclear on the extent to which, for example, we'll need to be passing unicode strings even for every log message we create. You've probably thought about this more than I have. Out of curiosity, I started looking to see how Python's logging package handles this. I stopped when it started to look like the logging package has no preference between str and unicode. I think it might only be the particular logging handlers that care, which is what the caller has control over (when they configure logging).
Eric Seidel (no email)
Comment 19
2010-04-18 15:36:29 PDT
(In reply to comment #18)

Good questions!

> Can you clarify what you are trying to say here? It sounds like you might be
> saying more than you mean to say -- e.g. that the caller never needs to use
> unicode string literals, which is obviously not right.

I was attempting to excuse why I didn't add u"" to every string literal in the project. :) If you forget the u"" and it actually contains non-ascii code points, python will yell at you.

> It seems like we shouldn't need to use StringIO as much as we do (except
> perhaps in unittest code to make certain tests easier). It looks like you're
> removing a lot of the places it's being used, which seems good.

StringIO is just confusingly named. It deals with a stream of bytes, aka a "str" object in Python 2.x. In Python 3.0 "str" is actually a unicode() object, and there is an explicit bytes() type. StringIO probably takes a bytes() object in 3.0. :) There are few places we actually need StringIO, since we tend to follow the model of just reading everything into memory. The files we deal with tend to be small, and the machines these scripts run on rather beefy.

> I understand what you're saying, but maybe the conclusion that should be drawn
> is that we shouldn't have parse_latest_entry_from_file() as part of our API?
> In other words, the caller should be responsible. That would also be more
> consistent with the rule of thumb to decode as early as possible.

Any API which takes a file-like object is rather confusing, because the read() method is expected to return a byte stream. However you can use codecs.open() to create a file-like object which will return unicode() strings instead of str() byte streams from the various methods.

> Doesn't this also go against "unicode everywhere/encode late"? Maybe this
> would be a good spot to do type-checking between unicode/str for backwards
> compatibility.

Nope. .write() is expected to deal with byte streams. Think of file.write().

> I'm a bit unclear on the extent to which, for example, we'll need to be passing
> unicode strings even for every log message we create. You've probably thought
> about this more than I have.

Well, it depends on the logging infrastructure. As far as I can tell, Python 2.5 "print" seems to be smart enough to handle unicode() objects; file.write() not so much (unless you open the file with codecs.open and declare an encoding). So it depends on what sort of logging we're doing. If we're logging to the console, it seems python is smart enough to handle unicode for us. If we're logging to a file we may have to be careful to call .encode("utf-8") to create the appropriate byte stream first.

> Out of curiosity, I started looking to see how Python's logging package handles
> this. I stopped when it started to look like the logging package has no
> preference between str and unicode. I think it might only be the particular
> logging handlers that care, which is what the caller has control over (when
> they configure logging).

There is no difference between "str" and "unicode" as far as most of python is concerned. I wouldn't expect the logging module to care, except if it's writing to files or to the console (however I suspect that when writing to the console the unicode encoding is abstracted at a lower layer).
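A condensed Python 2.x sketch of the distinction drawn above (console print vs. raw file.write vs. codecs.open); the path is made up:

    import codecs

    name = u"Vestb\u00f8"
    print name                       # usually fine on a console with a known encoding: Python encodes for us

    f = open("/tmp/log.txt", "w")
    # f.write(name)                  # would raise UnicodeEncodeError (implicit ascii encode)
    f.write(name.encode("utf-8"))    # explicitly encoding to a byte array works
    f.close()

    f = codecs.open("/tmp/log.txt", "w", "utf-8")
    f.write(name)                    # codecs.open encodes transparently on write
    f.close()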
Eric Seidel (no email)
Comment 20
2010-04-18 15:38:01 PDT
(In reply to comment #17)
> (In reply to comment #16)
> > We should be careful about BOM proliferation. It might confuse tools that
> > don't understand it.
>
> Just to be clear, I was suggesting using "utf-8-sig" only when reading -- which
> would remove the BOM.
I don't think we need to use utf-8-sig anywhere. Our tools don't (and shouldn't) write out BOMs and we know the encodings of all our files. It's possible that editing ChangeLog files in MSVC would spit out a BOM in the file. If that's the case, we might have to use "utf-8-sig" when reading in ChangeLog files to be safe.
Tony Chang
Comment 21
2010-04-18 17:48:25 PDT
Comment on
attachment 53626
[details]
Patch

In response to comment #2, I think it would work if you use decode instead of encode (the string you passed in is already utf8 encoded, you want to convert it to a unicode object, right?). FWIW, I would probably have tried to keep everything as a utf8 encoded str, but using unicode everywhere seems fine too (you get some better type checking this way). In either case, thanks for taking this on. Some small style nits below.

> diff --git a/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py b/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py
> index e93896f..862dc48 100644
> --- a/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py
> +++ b/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py
> @@ -102,12 +102,13 @@ class ChangeLog(object):
>          date_line_regexp = re.compile(ChangeLogEntry.date_line_regexp)
>          entry_lines = []
>          # The first line should be a date line.
> -        first_line = changelog_file.readline()
> +        first_line = unicode(changelog_file.readline(), "utf-8")

Nit: I find using .decode('utf-8') easier to read than calling unicode(), but either is fine with me. There's another instance of this in the same file.

> diff --git a/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py b/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py
> index bb22d14..2dd4d20 100644
> --- a/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py
> +++ b/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py
>      def _parse_attachment_element(self, element, bug_id):
> +

Nit: Did you mean to add this blank line?

> diff --git a/WebKitTools/Scripts/webkitpy/common/system/executive.py b/WebKitTools/Scripts/webkitpy/common/system/executive.py
> index b6126e4..cbe0e1f 100644
> --- a/WebKitTools/Scripts/webkitpy/common/system/executive.py
> +++ b/WebKitTools/Scripts/webkitpy/common/system/executive.py
> @@ -145,7 +157,7 @@ class Executive(object):
>          # os.kill isn't available on Windows. However, when I tried it
>          # using Cygwin, it worked fine. We should investigate whether
>          # we need this platform specific code here.
> -        subprocess.call(('taskkill.exe', '/f', '/pid', str(pid)),
> +        subprocess.call(('taskkill.exe', '/f', '/pid', unicode(pid)),

Nit: This can probably stay as str(). It's a bit weird to have the other strings be str and the pid be unicode.
Chris Jerdonek
Comment 22
2010-04-18 19:59:38 PDT
(In reply to comment #19)

Thanks for your responses, Eric.

> > I understand what you're saying, but maybe the conclusion that should be drawn
> > is that we shouldn't have parse_latest_entry_from_file() as part of our API?
> > In other words, the caller should be responsible. That would also be more
> > consistent with the rule of thumb to decode as early as possible.
>
> Any API which takes a file-like object is rather confusing. Because the read()
> method is expected to return a byte stream. However you can use codecs.open()
> to create a file-like object which will return unicode() strings instead of
> str() byte streams from the various methods.
I guess what I was getting at here is that -- following the "decode early" mantra -- we might want to have parse_latest_entry_from_file() accept a file-like object from codecs.open() rather than a UTF-8 encoded stream. That way parse_latest_entry_from_file() doesn't have to be encoding aware and will see only unicode strings. I only bring this up because this issue will probably come up in more spots as we review more of our code. Any time we define a function that takes a file-like object representing text, we'll need to decide if that function should take a UTF-8 encoded stream, say, or an encoding-aware stream from codecs. (The codecs module blurs the recommendations in the slide show somewhat because you can "decode early" while still seeming to handle a stream of bytes.) Incidentally, the Python documentation notes that in Python 3, "there is no longer any need for using the encoding-aware streams in the codecs module." (from
http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
)
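A hypothetical sketch of the "decode early" shape described above; parse_latest_entry_from_file() here is drastically simplified and is not the real changelog.py code -- only the split of responsibility (caller decodes, parser sees unicode) is the point:

    import codecs

    def parse_latest_entry_from_file(changelog_file):
        # The caller already decoded; this function only ever sees unicode.
        first_line = changelog_file.readline()
        assert isinstance(first_line, unicode)
        return first_line

    changelog_file = codecs.open("ChangeLog", "r", "utf-8")
    try:
        entry = parse_latest_entry_from_file(changelog_file)
    finally:
        changelog_file.close()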
Eric Seidel (no email)
Comment 23
2010-04-18 22:20:31 PDT
(In reply to comment #21)
> (From update of attachment 53626 [details])
> In response to comment #2, I think it would work if you use decode instead of
> encode (the string you passed in is already utf8 encoded, you want to convert
> it to a unicode object, right?).
My example was slightly confusing, and I was myself slightly confused. Now that I see "str" as an array of bytes and not actually a string type the world makes a lot more sense. I was attempting to unconditionally encode any "string like object" to a utf-8 bytestream before passing off to the Popen call. This meant that if I was passing around utf8-encoded str objects, and tried to call "encode" on them, it would scream at me.
> FWIW, I would probably have tried to keep everything as a utf8 encoded str, but
> using unicode everywhere seems fine too (you get some better type checking this
> way). In either case, thanks for taking this on.
Yes, it's true, that could work. However there are already many APIs which feed us unicode() objects and so we'd have to be careful to be sure to convert those back to utf-8 byte stream (str) objects before passing them around or comparisons would fail. I decided to go with the unicode() solution, mostly because it seemed the most consistent in my head. Turns out it's also the recommended way to prepare for Python 3.0 (which we'll eventually get to in a year or more). :)
> > -        first_line = changelog_file.readline()
> > +        first_line = unicode(changelog_file.readline(), "utf-8")
>
> Nit: I find using .decode('utf-8') easier to read than calling unicode(), but
> either is fine with me. There's another instance of this in the same file.
Yes. The only difference is handling of None. However I agree in this case decode is definitely nicer.
> >      def _parse_attachment_element(self, element, bug_id):
> > +
>
> Nit: Did you mean to add this blank line?
Nope. Will fix.
> > -        subprocess.call(('taskkill.exe', '/f', '/pid', str(pid)),
> > +        subprocess.call(('taskkill.exe', '/f', '/pid', unicode(pid)),
>
> Nit: This can probably stay as str(). It's a bit weird to have the other
> strings be str and the pid be unicode.
It certainly could. I was trying to move away from ever using the str() constructor. str() is basically always bad news, even when we use it to mean "make this number a string".
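To make the hazard concrete (Python 2.x; the name is just an example):

    name = u"Vestb\u00f8"
    # str(name)      # raises UnicodeEncodeError: str() implicitly encodes with the ascii codec
    unicode(8)        # u'8' -- stays in unicode-land when stringifying a number
    u"pid %d" % 8     # formatting with a unicode literal also yields unicode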
Eric Seidel (no email)
Comment 24
2010-04-18 22:23:37 PDT
(In reply to comment #22)
> I guess what I was getting at here is that -- following the "decode early"
> mantra -- we might want to have parse_latest_entry_from_file() accept a
> file-like object from codecs.open() rather than a UTF-8 encoded stream. That
> way parse_latest_entry_from_file() doesn't have to be encoding aware and will
> see only unicode strings.
Hmm... Maybe. Seems a bit magical. We'd have to assert that .read() returned isinstance(unicode) values or something to be sure. You're right that it would make the code read more like Python 3.0 would. :)
> I only bring this up because this issue will probably come up in more spots as
> we review more of our code. Any time we define a function that takes a
> file-like object representing text, we'll need to decide if that function
> should take a UTF-8 encoded stream, say, or an encoding-aware stream from
> codecs.
Yes. And your point is valid. Not sure how to make that work with StringIO for unittesting though. If I remember correctly StringIO(unicode(...)) will throw exceptions when read() is called.
> (The codecs module blurs the recommendations in the slide show somewhat because
> you can "decode early" while still seeming to handle a stream of bytes.)
>
> Incidentally, the Python documentation notes that in Python 3, "there is no
> longer any need for using the encoding-aware streams in the codecs module."
>
> (from http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit)
Thank you. :) Very helpful comments as always.
Chris Jerdonek
Comment 25
2010-04-19 13:52:05 PDT
(In reply to comment #24)
> Yes. And your point is valid. Not sure how to make that work with StringIO
> for unittesting though. If I remember correctly StringIO(unicode(...)) will
> throw exceptions when read() is called.
I haven't tried this out myself to check, but the Python documentation seems to say this is okay as long as one doesn't mix the two:
http://docs.python.org/library/stringio.html#StringIO.StringIO
Eric Seidel (no email)
Comment 26
2010-04-19 15:02:16 PDT
You're right, StringIO(unicode()) seems to work fine:
>>> import StringIO
>>> tor = u"Vestb\u00f8"
>>> print tor
Vestbø
>>> tor_io = StringIO.StringIO(tor)
>>> read_tor = tor_io.read()
>>> print read_tor
Vestbø
>>> print read_tor.__class__
<type 'unicode'>
Eric Seidel (no email)
Comment 27
2010-04-19 18:33:55 PDT
I can haz r+ or r- plz?
Adam Barth
Comment 28
2010-04-20 11:58:36 PDT
Looking.
Adam Barth
Comment 29
2010-04-20 12:08:57 PDT
Comment on
attachment 53626
[details]
Patch

Oh man. This looks like a good approach. Now we get to see how good our test coverage is. Btw, we should write up a doc about reviewing/writing webkitpy code that explains these kinds of constraints we have going forward.
Chris Jerdonek
Comment 30
2010-04-20 12:19:30 PDT
(In reply to comment #29)
> Btw, we should write up a doc about reviewing/writing webkitpy code that
> explains these kinds of constraints we have going forward.
I'm not sure what kind of doc you're envisioning, but I've started to put general information about WebKit Python and webkitpy here:
http://trac.webkit.org/wiki/PythonGuidelines
(linked under the "Documentation" section on the main wiki page)
Tor Arne Vestbø
Comment 31
2010-04-20 12:25:56 PDT
Much love for looking into this Eric!
Adam Barth
Comment 32
2010-04-20 12:31:06 PDT
That wiki page looks great. Would you be willing to add some info about our Unicode strategy and good and bad patterns?
Chris Jerdonek
Comment 33
2010-04-20 12:44:29 PDT
(In reply to comment #32)
> That wiki page looks great. Would you be willing to add some info about our
> Unicode strategy and good and bad patterns?
Sure, I could do that. For now I stubbed out a section with a link to this comment thread. But don't let that stop anyone else from adding to that section if I don't get to it right away. :)
Eric Seidel (no email)
Comment 34
2010-04-20 12:55:00 PDT
Committed
r57907
: <
http://trac.webkit.org/changeset/57907
>
WebKit Review Bot
Comment 35
2010-04-20 13:54:49 PDT
http://trac.webkit.org/changeset/57907
might have broken Windows Debug (Tests) The following changes are on the blame list:
http://trac.webkit.org/changeset/57904
http://trac.webkit.org/changeset/57905
http://trac.webkit.org/changeset/57906
http://trac.webkit.org/changeset/57907
http://trac.webkit.org/changeset/57908
Eric Seidel (no email)
Comment 36
2010-04-20 14:30:40 PDT
Reverted
r57907
for reason: Appears to have broken MacEWS and possibly webkit-patch upload for svn users. Needs further investigation.

Committed
r57920
: <
http://trac.webkit.org/changeset/57920
>
Eric Seidel (no email)
Comment 37
2010-04-20 14:31:52 PDT
The chromium bots also saw a failure in new-run-webkit-tests:
http://build.chromium.org/buildbot/waterfall.fyi/builders/Webkit%20Linux%20(webkit.org)/builds/23994/steps/webkit_tests/logs/stdio
Dirk Pranke
Comment 38
2010-04-20 14:50:39 PDT
So, to summarize the thinking in this patch:

1) assume all text files are encoded on disk as UTF-8
2) read files using codecs.open(..., "r", encoding="utf-8")
3) write files using codecs.open(..., "w", encoding="utf-8"), write(s.encode("utf-8")) if s is a string that is not known to be encoded already

Correct?

It would be good if we provided wrappers for (2) and (3) that handled this automatically and centralized the open and close logic in one place. Perhaps the read_file() and write_file() utility routines in dryrun.py that are supposed to be split off into their own file would be a good start on standardizing this.
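A rough sketch of what such wrappers could look like (the read_file()/write_file() idea comes from the comment above; the helper names and signatures below are hypothetical, not existing webkitpy API):

    import codecs

    def read_text_file(path, encoding="utf-8"):
        f = codecs.open(path, "r", encoding)
        try:
            return f.read()        # returns unicode
        finally:
            f.close()

    def write_text_file(path, text, encoding="utf-8"):
        f = codecs.open(path, "w", encoding)
        try:
            f.write(text)          # codecs.open encodes on the way out
        finally:
            f.close()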
Eric Seidel (no email)
Comment 39
2010-04-20 14:55:36 PDT
(In reply to comment #38)
> So, to summarize the thinking in this patch:
>
> 1) assume all text files are encoded on disk as UTF-8

More precisely: be explicit about how things are encoded, whatever that encoding may be.

> 2) read files using codecs.open(..., "r", encoding="utf-8")

Possibly. This patch uses a combination. codecs.open makes the file object return unicode() strings instead of str byte arrays from read() and friends. We use that in some places. cjerdonek was suggesting we standardize, but I didn't do a good job of that in this patch.

> 3) write files using codecs.open(..., "w", encoding="utf-8"),
> write(s.encode("utf-8"))
> if s is a string that is not known to be encoded already

One or the other, not both. If you open with codecs.open it will do the magic for you. If you open with open() you have to encode to a byte array (str) before calling write.

> It would be good if we provided wrappers for (2) and (3) that handled this
> automatically and centralized the open and close logic in one place.
>
> Perhaps the read_file() and write_file() utility routines in dryrun.py that are
> supposed to be split off into their own file would be a good start on
> standardizing this.

Possibly. I think Chris is probably right. We should probably standardize on using codecs.open everywhere and then be able to assume that read/write handle unicode (like they do in Python 3.0).
Dirk Pranke
Comment 40
2010-04-20 15:10:13 PDT
(In reply to comment #39)
> (In reply to comment #38)
> > So, to summarize the thinking in this patch:
> >
> > 1) assume all text files are encoded on disk as UTF-8
>
> More precisely: be explicit about how things are encoded, whatever that
> encoding may be.

I was (perhaps badly) attempting to be stronger than that. What I meant was, previously we had probably assumed that all of the files are ASCII. We don't want to assume that, but since we don't want to try and do encoding detection, we should probably assume something, and that should probably be UTF-8.

> > 2) read files using codecs.open(..., "r", encoding="utf-8")
>
> Possibly. This patch uses a combination. codecs.open makes the file object return
> unicode() strings instead of str byte arrays from read() and friends. We use
> that in some places. cjerdonek was suggesting we standardize, but I didn't do
> a good job of that in this patch.

I agree with Chris we should standardize, although it doesn't have to happen in this patch. We should file a separate bug to clean up the rest of the open() calls.

> > 3) write files using codecs.open(..., "w", encoding="utf-8"),
> > write(s.encode("utf-8"))
> > if s is a string that is not known to be encoded already
>
> One or the other, not both. If you open with codecs.open it will do the magic
> for you. If you open with open() you have to encode to a byte array (str)
> before calling write.
Okay. I wasn't clear about that.
Eric Seidel (no email)
Comment 41
2010-04-20 15:14:52 PDT
(In reply to comment #40)
> (In reply to comment #39)
> > (In reply to comment #38)
> > > So, to summarize the thinking in this patch:
> > >
> > > 1) assume all text files are encoded on disk as UTF-8
> >
> > More precisely: be explicit about how things are encoded, whatever that
> > encoding may be.
>
> I was (perhaps badly) attempting to be stronger than that. What I meant was,
> previously we had probably assumed that all of the files are ASCII. We don't
> want to assume that, but since we don't want to try and do encoding detection,
> we should probably assume something, and that should probably be UTF-8.
Except they weren't ascii. :) The exception Chromium hit when this was landed was due to the diff-processing code trying to write out a unicode string. Previously we were blissfully unaware that diff/wdiff/prettypatch were returning us utf-8 encoded data and we were passing it around as str()s. After this patch, we convert data returned from commands back into unicode() strings. Making some things easier, but requiring us to be explicit about converting our unicode strings back to byte arrays before writing them out to files (or using codecs.open to make our write() calls transparently convert).
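In other words, roughly (a sketch of the round trip, not the actual diff-processing code; the path is made up):

    # Command output arrives as utf-8 encoded bytes (str)...
    diff_bytes = "Reviewed by Tor Arne Vestb\xc3\xb8.\n"
    # ...the patch now decodes it early into a unicode string...
    diff_text = diff_bytes.decode("utf-8")
    # ...so anything that writes it back out must encode again (or use codecs.open).
    out = open("/tmp/pretty-diff.html", "w")
    out.write(diff_text.encode("utf-8"))
    out.close()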
WebKit Review Bot
Comment 42
2010-04-20 15:17:03 PDT
http://trac.webkit.org/changeset/57920
might have broken Leopard Intel Release (Tests) The following changes are on the blame list:
http://trac.webkit.org/changeset/57920
http://trac.webkit.org/changeset/57921
Dirk Pranke
Comment 43
2010-04-20 15:21:36 PDT
(In reply to comment #41)
> (In reply to comment #40)
> > (In reply to comment #39)
> > > (In reply to comment #38)
> > > > So, to summarize the thinking in this patch:
> > > >
> > > > 1) assume all text files are encoded on disk as UTF-8
> > >
> > > More precisely: be explicit about how things are encoded, whatever that
> > > encoding may be.
> >
> > I was (perhaps badly) attempting to be stronger than that. What I meant was,
> > previously we had probably assumed that all of the files are ASCII. We don't
> > want to assume that, but since we don't want to try and do encoding detection,
> > we should probably assume something, and that should probably be UTF-8.
>
> Except they weren't ascii. :)

Agreed, obviously. I wasn't trying to suggest there weren't bugs we were running into.
Eric Seidel (no email)
Comment 44
2010-04-21 03:33:19 PDT
Created
attachment 53937
[details]
Fixed to be way more awesome
Eric Seidel (no email)
Comment 45
2010-04-21 03:35:57 PDT
I've updated the patch to deploy "codecs.open" everywhere, as per Chris's suggestion. While I was looking at every open() call, I converted a bunch of them to use "with" to avoid file descriptor leaks. The patch is huge now. However, it does fix the 3 regressions seen when landing the last patch:

1. mac-ews was failing due to having changed how --status-host was passed; our global argument parsing is a hack, at least I've documented that now.
2. webkit-patch upload was broken for svn users due to the svn-create-patch call not passing its arguments as an array.
3. new-run-webkit-tests was broken for chromium due to diff creation logic not specifying an encoding.
Adam Barth
Comment 46
2010-04-21 13:46:06 PDT
Comment on
attachment 53937
[details]
Fixed to be way more awesome

This change is too large to convince myself that it's correct, but the approach seems good and we want to do it all at once. One thought: how much of this can we wrap in a filesystem abstraction? I discussed that idea with Dirk on another bug. I don't think we need to do that in this patch, but it's something to keep in mind.
Eric Seidel (no email)
Comment 47
2010-04-21 13:50:05 PDT
The file system abstraction is really codecs.open(). I learned at the end of this change that I could have even deployed codecs.open wider.

We should *never* be using open() directly. codecs.open takes an encoding parameter (just like Python 3.x open() does). encoding=None results in open()-like, return-a-str-object behavior.

So in a followup patch we should remove all uses of raw open(). We should also avoid ever closing a file manually. Doing so will fail in the case of an exception, which is why we use with blocks. :)
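A short sketch of those conventions (assuming Python 2.6+, where the objects returned by codecs.open work with the with statement; the file names are placeholders):

    import codecs

    with codecs.open("ChangeLog", "r", encoding="utf-8") as f:
        text = f.read()       # unicode

    with codecs.open("image.png", "rb", encoding=None) as f:
        blob = f.read()       # encoding=None behaves like plain open(): bytes (str)

    # The with blocks close the files even if an exception is raised mid-read.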
Dirk Pranke
Comment 48
2010-04-21 14:06:25 PDT
(In reply to comment #47)
> The file system abstraction is really codecs.open(). I learned at the end of
> this change that I could have even deployed codecs.open wider.
>
> We should *never* be using open() directly. codecs.open takes an encoding
> parameter (just like Python 3.x open() does). encoding=None results in
> open()-like, return-a-str-object behavior.
>
> So in a followup patch we should remove all uses of raw open().
>
> We should also avoid ever closing a file manually. Doing so will fail in the
> case of an exception, which is why we use with blocks. :)
Note that with blocks only do try/finally; you should still be aware of needing to do try/catch if you don't want the exceptions to propagate.
Chris Jerdonek
Comment 49
2010-04-21 14:18:05 PDT
(In reply to comment #47)
> We should *never* be using open() directly. codecs.open takes an encoding
> parameter (just like Python 3.x open() does). encoding=None results in
> open()-like, return-a-str-object behavior.

I just wanted to bring attention to a couple other uses of the codecs module that I noticed in check-webkit-style -- in case they might be useful elsewhere in webkitpy. Mainly this code--

    # Change stderr to write with replacement characters so we don't die
    # if we try to print something containing non-ASCII characters.
    stderr = codecs.StreamReaderWriter(sys.stderr,
                                       codecs.getreader('utf8'),
                                       codecs.getwriter('utf8'),
                                       'replace')
    # Setting an "encoding" attribute on the stream is necessary to
    # prevent the logging module from raising an error.  See
    # the checker.configure_logging() function for more information.
    stderr.encoding = "UTF-8"

    # FIXME: Change webkitpy.style so that we do not need to overwrite
    #        the global sys.stderr.  This involves updating the code to
    #        accept a stream parameter where necessary, and not calling
    #        sys.stderr explicitly anywhere.
    sys.stderr = stderr

(This is from http://trac.webkit.org/browser/trunk/WebKitTools/Scripts/check-webkit-style?rev=57467#L61)

IIRC, this codecs.StreamReaderWriter code was present in its basic form from the beginning. I'm not sure when we might want to use this pattern over codecs.open(), but perhaps it might be useful elsewhere.

Note also the use of the errors='replace' parameter, which codecs.open() also accepts. The "errors" parameter in these methods controls how bad bytes should be handled when encountered while reading from or writing to a stream. There may be cases where we want to be more lenient than the default value of errors='strict' and continue processing a file if possible. It doesn't look like the errors parameter is being used in any of the codecs.open() calls in this patch (which could very well be exactly what we want).
Eric Seidel (no email)
Comment 50
2010-04-21 15:04:36 PDT
Committed
r58014
: <
http://trac.webkit.org/changeset/58014
>
Eric Seidel (no email)
Comment 51
2010-04-21 15:47:27 PDT
Created
attachment 53998
[details]
Fix Chromium Canaries
Eric Seidel (no email)
Comment 52
2010-04-21 15:52:39 PDT
Committed
r58016
: <
http://trac.webkit.org/changeset/58016
>
Eric Seidel (no email)
Comment 53
2010-04-21 16:31:39 PDT
Committed
r58023
: <
http://trac.webkit.org/changeset/58023
>
Eric Seidel (no email)
Comment 54
2010-04-21 18:23:39 PDT
Was rolled out a second time.
Eric Seidel (no email)
Comment 55
2010-04-21 18:34:40 PDT
Committed
r58036
: <
http://trac.webkit.org/changeset/58036
>
Eric Seidel (no email)
Comment 56
2010-04-21 20:10:29 PDT
Created
attachment 54018
[details]
Fix for run-webkit-tests --chromium
Eric Seidel (no email)
Comment 57
2010-04-21 20:17:23 PDT
Committed
r58044
: <
http://trac.webkit.org/changeset/58044
>
Eric Seidel (no email)
Comment 58
2010-04-21 20:40:51 PDT
Created
attachment 54019
[details]
Patch
Eric Seidel (no email)
Comment 59
2010-04-21 21:31:42 PDT
Committed
r58050
: <
http://trac.webkit.org/changeset/58050
>
Eric Seidel (no email)
Comment 60
2010-04-21 22:41:48 PDT
Committed
r58054
: <
http://trac.webkit.org/changeset/58054
>
Eric Seidel (no email)
Comment 61
2010-04-21 22:50:00 PDT
Committed
r58055
: <
http://trac.webkit.org/changeset/58055
>
Eric Seidel (no email)
Comment 62
2010-04-21 23:20:04 PDT
Committed
r58059
: <
http://trac.webkit.org/changeset/58059
>
Eric Seidel (no email)
Comment 63
2010-04-21 23:30:38 PDT
Committed
r58060
: <
http://trac.webkit.org/changeset/58060
>
Eric Seidel (no email)
Comment 64
2010-04-21 23:39:32 PDT
Committed
r58062
: <
http://trac.webkit.org/changeset/58062
>
Eric Seidel (no email)
Comment 65
2010-04-22 03:29:25 PDT
Chromium is reporting random hangs in new-run-webkit-tests --chromium after all these changes. I'm not sure what could be causing them yet. We really need a backtrace from the hang.
Eric Seidel (no email)
Comment 66
2010-04-22 03:32:43 PDT
http://trac.webkit.org/changeset/58059
looks the most suspicious for possibly causing hangs, but again we need a reproduction including backtrace to really know what's getting hung. Some print-based debugging might lead us in the right direction. I'm headed to bed soon, but I'm happy to look at this first thing in the morning (well, probably afternoon at this point).
Eric Seidel (no email)
Comment 67
2010-04-22 03:58:23 PDT
(In reply to comment #66)
> http://trac.webkit.org/changeset/58059 looks the most suspicious for possibly
> causing hangs, but again we need a reproduction including backtrace to really
> know what's getting hung. Some print-based debugging might lead us in the
> right direction.
>
> I'm headed to bed soon, but I'm happy to look at this first thing in the
> morning (well, probably afternoon at this point).

That change probably made us vulnerable to http://bugs.python.org/issue1236 again, which I had mistakenly thought was a 2.4-only bug. Perhaps one of the original new-run-webkit-tests authors can comment on how http://bugs.python.org/issue1236 manifests? Could it be the hangs Yuzo is seeing? Again, we just need a backtrace and then solving the hang should be simple. :)
Dirk Pranke
Comment 68
2010-04-22 18:53:22 PDT
(In reply to comment #62)
> Committed r58059: <http://trac.webkit.org/changeset/58059>
This change doesn't actually call format_wdiff_output_as_html() ... presumably this breaks wdiff output?
Fumitoshi Ukai
Comment 69
2010-04-22 22:26:27 PDT
When I run test-webkitpy on cygwin:

test-webkitpy: INFO Suppressing most webkitpy logging while running unit tests.
webkitpy.test.main: INFO Excluding: webkitpy.common.checkout.scm_unittest (use --all to include)
...
======================================================================
ERROR: Validate that it is safe to pass unicode() objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/common/system/executive_unittest.py", line 53, in test_run_command_with_unicode
    output = executive.run_command(["echo", "-n", unicode_tor])
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/common/system/executive.py", line 211, in run_command
    cwd=cwd)
  File "/usr/lib/python2.5/subprocess.py", line 594, in __init__
    errread, errwrite)
  File "/usr/lib/python2.5/subprocess.py", line 1091, in _execute_child
    raise child_exception
TypeError: execv() arg 2 must contain only strings
======================================================================
FAIL: test_basics (webkitpy.layout_tests.run_webkit_tests_unittest.DryrunTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py", line 63, in test_basics
    'fast/html']))
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py", line 41, in passing_run
    res = run_webkit_tests.main(options, args, False)
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 1529, in main
    options.configuration = port_obj.default_configuration()
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/port/webkit.py", line 304, in default_configuration
    assert(configuration == "Debug" or configuration == "Release")
AssertionError

$ python
Python 2.5.2 (r252:60911, Dec 2 2008, 09:26:14)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.124)] on cygwin

Should I use a more recent version of python, cygwin?
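The execv() failure above is the classic symptom of handing unicode argv entries to subprocess on some platforms. A hedged sketch of one possible workaround (encode arguments to byte strings just before spawning) -- this is illustrative only, not the actual executive.py fix:

    import subprocess

    unicode_tor = u"Tor Arne Vestb\u00f8"
    args = ["echo", "-n", unicode_tor]
    # Encode any unicode arguments to utf-8 byte strings before exec.
    encoded_args = [a.encode("utf-8") if isinstance(a, unicode) else a for a in args]
    proc = subprocess.Popen(encoded_args, stdout=subprocess.PIPE)
    output = proc.communicate()[0].decode("utf-8")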