Bug 37765 - REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø
Summary: REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: PC
OS: OS X 10.5
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 37965
Blocks: 35288 37044 37587 37588 37763
 
Reported: 2010-04-18 00:10 PDT by Eric Seidel (no email)
Modified: 2010-04-22 22:26 PDT
CC List: 11 users

See Also:


Attachments
Patch (39.54 KB, patch)
2010-04-18 04:14 PDT, Eric Seidel (no email)
Patch (53.77 KB, patch)
2010-04-18 04:59 PDT, Eric Seidel (no email)
Fixed to be way more awesome (88.59 KB, patch)
2010-04-21 03:33 PDT, Eric Seidel (no email)
Fix Chromium Canaries (3.67 KB, patch)
2010-04-21 15:47 PDT, Eric Seidel (no email)
Fix for run-webkit-tests --chromium (2.00 KB, patch)
2010-04-21 20:10 PDT, Eric Seidel (no email)
Patch (1.45 KB, patch)
2010-04-21 20:40 PDT, Eric Seidel (no email)

Description Eric Seidel (no email) 2010-04-18 00:10:27 PDT
REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø

I fixed the commit-queue ignoring Tor as a reviewer, but now it throws an exception every time it tries to land a patch with his name in it.

I created this bug by hand due to bug 37764. :(
Comment 1 Chris Jerdonek 2010-04-18 02:09:47 PDT
I'm not sure if code like this will solve it:

http://trac.webkit.org/browser/trunk/WebKitTools/Scripts/webkitpy/style/checker.py?rev=57467#L564

Whatever the answer is, we may want to expose the necessary helper methods in webkitpy/common/ since we should probably be supporting unicode characters across the board.
Comment 2 Eric Seidel (no email) 2010-04-18 02:14:34 PDT
I've now coded up two versions of this patch.

Both have failed in different places. :(

Unicode = trail of tears.

The first version tried to deploy use of the unicode() class more widely.  That ran into trouble because we would have to change a bunch of places to be more unicode()-aware and correctly encode to utf-8 before calling things like file.write().

The second version tried to remove use of the unicode() class from our codebase and instead just encode Tor's name as UTF-8 in a str() object.  Turns out that fails too: if you call str("Tor Arne Vestb\xC3\xB8").encode("utf8"), Python first tries to decode the bytes as ASCII, notices that \xC3 is not valid ASCII, and throws a UnicodeDecodeError.

So it seems like python is forcing us towards using unicode().  That's just gonna hurt.  A lot.
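
For illustration, a minimal Python 2 sketch of the two outcomes described above; the byte values are just Tor's name encoded as UTF-8:

    raw = "Tor Arne Vestb\xc3\xb8"        # str object: UTF-8 bytes, not text

    name = raw.decode("utf-8")            # works: u"Tor Arne Vestb\u00f8"

    try:
        raw.encode("utf-8")               # Python implicitly decodes the bytes as ASCII first...
    except UnicodeDecodeError:
        print "encode() on non-ASCII bytes blows up"   # ...and \xc3 is not ASCII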
Comment 3 Eric Seidel (no email) 2010-04-18 02:20:44 PDT
http://farmdev.com/talks/unicode/
is really pushing unicode()

Not sure I'm ready for the koolaid.
Comment 4 Eric Seidel (no email) 2010-04-18 02:25:48 PDT
OK.  I think I'm maybe convinced.

We should take the unicode dive.  I will need a few hours and a good stiff drink.  Maybe monday.
Comment 5 Chris Jerdonek 2010-04-18 02:29:29 PDT
(In reply to comment #4)
> OK.  I think I'm maybe convinced.
> 
> We should take the unicode dive.  I will need a few hours and a good stiff
> drink.  Maybe monday.

Briefly, what's the pain point?
Comment 6 Eric Seidel (no email) 2010-04-18 02:47:51 PDT
(In reply to comment #5)
> > We should take the unicode dive.  I will need a few hours and a good stiff
> > drink.  Maybe monday.
> 
> Briefly, what's the pain point?

What do you mean?
Comment 7 Chris Jerdonek 2010-04-18 02:55:08 PDT
(In reply to comment #6)
> > Briefly, what's the pain point?
> 
> What do you mean?

I was just asking if you could explain briefly why it will hurt so much to start using it more widely (as you say below).  And will that be true even if we use it just in the places we really need it?

> So it seems like python is forcing us towards using unicode().  That's just
> gonna hurt.  A lot.
Comment 8 Eric Seidel (no email) 2010-04-18 03:00:23 PDT
Every place where we log, call print, or call write, we will need to encode.  Many of our unit tests will need unicode versions added (eventually).  Every time we open a file or read off the network (when we're not using BeautifulSoup), we will need to document what encoding we expect.

Totally doable.  Just will be a large change.  Looking at it now.
Comment 9 Eric Seidel (no email) 2010-04-18 04:14:28 PDT
Created attachment 53625 [details]
Patch
Comment 10 Eric Seidel (no email) 2010-04-18 04:15:14 PDT
Comment on attachment 53625 [details]
Patch

I'm sure this change will break things.  We'll fix them when we find where it broke stuff.
Comment 11 WebKit Review Bot 2010-04-18 04:20:48 PDT
Attachment 53625 [details] did not pass style-queue:

Failed to run "WebKitTools/Scripts/check-webkit-style" exit_code: 1
WebKitTools/Scripts/webkitpy/common/net/buildbot.py:413:  whitespace before ':'  [pep8/E203] [5]
WebKitTools/Scripts/webkitpy/common/system/autoinstall.py:127:  at least two spaces before inline comment  [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/steps/postdiff.py:47:  at least two spaces before inline comment  [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/commands/upload.py:265:  at least two spaces before inline comment  [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/commands/upload.py:407:  at least two spaces before inline comment  [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/commands/upload.py:427:  at least two spaces before inline comment  [pep8/E261] [5]
Total errors found: 6 in 24 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 12 Eric Seidel (no email) 2010-04-18 04:38:02 PDT
Comment on attachment 53625 [details]
Patch

Nope, not quite there.  Will post a better patch in the morning.
Comment 13 Eric Seidel (no email) 2010-04-18 04:59:06 PDT
Created attachment 53626 [details]
Patch
Comment 14 Eric Seidel (no email) 2010-04-18 05:00:10 PDT
I think this one finally works.  I ended up having to rip out a bunch of StringIO plumbing.  Basically every time we use StringIO, we have to be sure to encode the unicode string to a utf-8 byte array so that StringIO can vend back byte arrays via its read() method.
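
A tiny Python 2 sketch of the StringIO convention described above (the name is just an example):

    import StringIO  # Python 2 module name

    name = u"Vestb\u00f8"
    stream = StringIO.StringIO(name.encode("utf-8"))  # hand StringIO pre-encoded UTF-8 bytes
    data = stream.read()                              # read() vends a str (byte array)
    assert data.decode("utf-8") == name               # decode at the boundary to get unicode back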
Comment 15 Chris Jerdonek 2010-04-18 10:41:18 PDT
Thanks for taking this on and looking into this, Eric.  (Good slide show in comment 3, btw.)

A couple of random comments before a couple of comments on the patch:

(1) As the slides suggest, we may want to consider decoding using "utf-8-sig" as a general practice instead of "utf-8":

"To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig").... Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.  On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file."

(from http://docs.python.org/library/codecs.html#encodings-and-unicode )

(2) I came across this which might be useful somewhere (intelligent encoding auto-detection):

http://chardet.feedparser.org/
Comment 16 Adam Barth 2010-04-18 11:23:36 PDT
We should be careful about BOM proliferation.  It might confuse tools that don't understand it.
Comment 17 Chris Jerdonek 2010-04-18 11:31:19 PDT
(In reply to comment #16)
> We should be careful about BOM proliferation.  It might confuse tools that
> don't understand it.

Just to be clear, I was suggesting using "utf-8-sig" only when reading -- which would remove the BOM.
Comment 18 Chris Jerdonek 2010-04-18 11:36:29 PDT
+++ b/WebKitTools/ChangeLog
@@ -1,3 +1,93 @@
+        We do not have to use u"" instead of "" because u"a" == "a" as expected
+        in Python.  Python will generate a warning to the console in cases where
+        a unicode() == str() operation cannot be performed.

Can you clarify what you are trying to say here?  It sounds like you might be saying more than you mean to say -- e.g. that the caller never needs to use unicode string literals, which is obviously not right.

+        All places which use StringIO need to be sure to pass StringIO a
+        pre-encoded byte-array (str object) instead of unicode so that
+        clients which read from the StringIO don't have encoding exceptions.

It seems like we shouldn't need to use StringIO as much as we do (except perhaps in unittest code to make certain tests easier).  It looks like you're removing a lot of the places it's being used which seems good.

@@ -117,7 +118,9 @@ class ChangeLog(object):
 
     def latest_entry(self):
         # ChangeLog files are always UTF-8, we read them in as such to support Reviewers with unicode in their names.
-        changelog_file = codecs.open(self.path, "r", "utf-8")
+        # We don't use codecs.open here to make the api for parse_latest_entry_from_file clearer.
+        # If we did, then it would be unclear as to whos reponsibility decoding of the file should be.
+        changelog_file = open(self.path, "r")
         try:
             return self.parse_latest_entry_from_file(changelog_file)

I understand what you're saying, but maybe the conclusion that should be drawn is that we shouldn't have parse_latest_entry_from_file() as part of our API?  In other words, the caller should be responsible.  That would also be more consistent with the rule of thumb to decode as early as possible.

+++ b/WebKitTools/Scripts/webkitpy/common/system/deprecated_logging.py
@@ -45,9 +45,10 @@ class tee:
     def __init__(self, *files):
         self.files = files
 
-    def write(self, string):
+    # Callers should pass an already encoded string for writing.
+    def write(self, bytes):
         for file in self.files:
-            file.write(string)
+            file.write(bytes)

Doesn't this also go against "unicode everywhere/encode late"?  Maybe this would be a good spot to do type-checking between unicode/str for backwards compatibility.

+++ b/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/metered_stream.py
@@ -56,6 +56,8 @@ class MeteredStream:
         self._stream = stream
         self._last_update = ""
 
+    # FIXME: Does this take a string (unicode) or an array of bytes (str)?
+    # If it takes a string, it needs to call txt.encode("utf-8")
     def write(self, txt):
         """Write text directly to the stream, overwriting and resetting the
         meter."""

I'm a bit unclear on the extent to which, for example, we'll need to be passing unicode strings even for every log message we create.  You've probably thought about this more than I have.

Out of curiosity, I started looking to see how Python's logging package handles this.  I stopped when it started to look like the logging package has no preference between str and unicode.  I think it might only be the particular logging handlers that care, which is what the caller has control over (when they configure logging).
Comment 19 Eric Seidel (no email) 2010-04-18 15:36:29 PDT
(In reply to comment #18)

Good questions!

> Can you clarify what you are trying to say here?  It sounds like you might be
> saying more than you mean to say -- e.g. that the caller never needs to use
> unicode string literals, which is obviously not right.

I was attempting to excuse why I didn't add u"" to every string literal in the project. :)  If you forget the u"" and it actually contains non-ascii code points, python will yell at you.

> It seems like we shouldn't need to use StringIO as much as we do (except
> perhaps in unittest code to make certain tests easier).  It looks like you're
> removing a lot of the places it's being used which seems good.

StringIO is just confusingly named.  It deals with a stream of bytes.  aka, a "str" object in Python 2.x.  In Python 3.0 "str" is actually a unicode() object, and there is an explicit bytes() type.  StringIO probably takes a bytes() object in 3.0. :)

There are few places where we actually need StringIO, since we tend to follow the model of just reading everything into memory.  The files we deal with tend to be small, and the machines these scripts run on are rather beefy.

> I understand what you're saying, but maybe the conclusion that should be drawn
> is that we shouldn't have parse_latest_entry_from_file() as part of our API? 
> In other words, the caller should be responsible.  That would also be more
> consistent with the rule of thumb to decode as early as possible.

Any API which takes a file-like object is rather confusing, because the read() method is expected to return a byte stream.  However, you can use codecs.open() to create a file-like object which will return unicode() strings instead of str() byte streams from the various methods.

> Doesn't this also go against "unicode everywhere/encode late"?  Maybe this
> would be a good spot to do type-checking between unicode/str for backwards
> compatibility.

Nope.  .write() is expected to deal with byte streams.  Think of file.write()

> I'm a bit unclear on the extent to which, for example, we'll need to be passing
> unicode strings even for every log message we create.  You've probably thought
> about this more than I have.

Well, it depends on the logging infrastructure.  As far as I can tell, Python 2.5's "print" seems to be smart enough to handle unicode() objects; file.write() not so much (unless you open the file with codecs.open and declare an encoding).  So it depends on what sort of logging we're doing.  If we're logging to the console, it seems Python is smart enough to handle unicode for us.  If we're logging to a file, we may have to be careful to call .encode("utf-8") to create the appropriate byte stream first (see the sketch at the end of this comment).

> Out of curiosity, I started looking to see how Python's logging package handles
> this.  I stopped when it started to look like the logging package has no
> preference between str and unicode.  I think it might only be the particular
> logging handlers that care, which is what the caller has control over (when
> they configure logging).

There is no difference between "str" and "unicode" as far as most of Python is concerned.  I wouldn't expect the logging module to care, except if it's writing to files or to the console (though I suspect that when writing to the console the unicode encoding is abstracted at a lower layer).
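
To illustrate the console vs. file.write() distinction above, a small Python 2 sketch; the log file name is hypothetical:

    import codecs

    name = u"Tor Arne Vestb\u00f8"

    print name                                      # print can usually cope with unicode on a console

    plain = open("queue.log", "w")                  # plain open(): write() wants a byte string
    plain.write(name.encode("utf-8") + "\n")
    plain.close()

    aware = codecs.open("queue.log", "a", "utf-8")  # codecs.open(): write() accepts unicode
    aware.write(name + u"\n")
    aware.close()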
Comment 20 Eric Seidel (no email) 2010-04-18 15:38:01 PDT
(In reply to comment #17)
> (In reply to comment #16)
> > We should be careful about BOM proliferation.  It might confuse tools that
> > don't understand it.
> 
> Just to be clear, I was suggesting using "utf-8-sig" only when reading -- which
> would remove the BOM.

I don't think we need to use utf-8-sig anywhere.

Our tools don't (and shouldn't) write out BOMs and we know the encodings of all our files.  It's possible that editing ChangeLog files in MSVC would spit out a BOM in the file.  If that's the case, we might have to use "utf-8-sig" when reading in ChangeLog files to be safe.
Comment 21 Tony Chang 2010-04-18 17:48:25 PDT
Comment on attachment 53626 [details]
Patch

In response to comment #2, I think it would work if you use decode instead of encode (the string you passed in is already utf8 encoded, you want to convert it to a unicode object, right?).

FWIW, I would probably have tried to keep everything as a utf8 encoded str, but using unicode everywhere seems fine too (you get some better type checking this way).  In either case, thanks for taking this on.

Some small style nits below.

> diff --git a/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py b/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py
> index e93896f..862dc48 100644
> --- a/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py
> +++ b/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py
> @@ -102,12 +102,13 @@ class ChangeLog(object):
>          date_line_regexp = re.compile(ChangeLogEntry.date_line_regexp)
>          entry_lines = []
>          # The first line should be a date line.
> -        first_line = changelog_file.readline()
> +        first_line = unicode(changelog_file.readline(), "utf-8")

Nit: I find using .decode('utf-8') easier to read than calling unicode(), but either is fine with me.  There's another instance of this in the same file.

> diff --git a/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py b/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py
> index bb22d14..2dd4d20 100644
> --- a/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py
> +++ b/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py
>      def _parse_attachment_element(self, element, bug_id):
> +

Nit: Did you mean to add this blank line?

> diff --git a/WebKitTools/Scripts/webkitpy/common/system/executive.py b/WebKitTools/Scripts/webkitpy/common/system/executive.py
> index b6126e4..cbe0e1f 100644
> --- a/WebKitTools/Scripts/webkitpy/common/system/executive.py
> +++ b/WebKitTools/Scripts/webkitpy/common/system/executive.py
> @@ -145,7 +157,7 @@ class Executive(object):
>              # os.kill isn't available on Windows.  However, when I tried it
>              # using Cygwin, it worked fine.  We should investigate whether
>              # we need this platform specific code here.
> -            subprocess.call(('taskkill.exe', '/f', '/pid', str(pid)),
> +            subprocess.call(('taskkill.exe', '/f', '/pid', unicode(pid)),

Nit: This can probably stay as str().  It's a bit weird to have the other strings be str and the pid be unicode.
Comment 22 Chris Jerdonek 2010-04-18 19:59:38 PDT
(In reply to comment #19)

Thanks for your responses, Eric.

> > I understand what you're saying, but maybe the conclusion that should be drawn
> > is that we shouldn't have parse_latest_entry_from_file() as part of our API? 
> > In other words, the caller should be responsible.  That would also be more
> > consistent with the rule of thumb to decode as early as possible.
> 
> Any API which takes a file-like object is rather confusing.  Because the read()
> method is expected to return a byte stream.  However you can use codecs.open()
> to create a file-like object which will return unicode() strings instead of
> str() byte streams from the various methods.

I guess what I was getting at here is that -- following the "decode early" mantra -- we might want to have parse_latest_entry_from_file() accept a file-like object from codecs.open() rather than a UTF-8 encoded stream.  That way parse_latest_entry_from_file() doesn't have to be encoding aware and will see only unicode strings.

I only bring this up because this issue will probably come up in more spots as we review more of our code.  Any time we define a function that takes a file-like object representing text, we'll need to decide if that function should take a UTF-8 encoded stream, say, or an encoding-aware stream from codecs.

(The codecs module blurs the recommendations in the slide show somewhat because you can "decode early" while still seeming to handle a stream of bytes.)

Incidentally, the Python documentation notes that in Python 3, "there is no longer any need for using the encoding-aware streams in the codecs module."

(from http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit )
Comment 23 Eric Seidel (no email) 2010-04-18 22:20:31 PDT
(In reply to comment #21)
> (From update of attachment 53626 [details])
> In response to comment #2, I think it would work if you use decode instead of
> encode (the string you passed in is already utf8 encoded, you want to convert
> it to a unicode object, right?).

My example was slightly confusing, and I was myself slightly confused.  Now that I see "str" as an array of bytes and not actually a string type, the world makes a lot more sense.  I was attempting to unconditionally encode any "string-like object" to a utf-8 byte stream before passing it off to the Popen call.  This meant that if I was passing around utf-8-encoded str objects and tried to call "encode" on them, it would scream at me.

> FWIW, I would probably have tried to keep everything as a utf8 encoded str, but
> using unicode everywhere seems fine too (you get some better type checking this
> way).  In either case, thanks for taking this on.

Yes, it's true, that could work.  However, there are already many APIs which feed us unicode() objects, so we'd have to be careful to convert those back to utf-8 byte stream (str) objects before passing them around, or comparisons would fail.

I decided to go with the unicode() solution, mostly because it seemed the most consistent in my head.  Turns out it's also the recommended way to prepare for Python 3.0 (which we'll eventually get to in a year or more). :)

> > -        first_line = changelog_file.readline()
> > +        first_line = unicode(changelog_file.readline(), "utf-8")
> 
> Nit: I find using .decode('utf-8') easier to read than calling unicode(), but
> either is fine with me.  There's another instance of this in the same file.

Yes.  The only difference is handling of None.  However I agree in this case decode is definitely nicer.

> >      def _parse_attachment_element(self, element, bug_id):
> > +
> 
> Nit: Did you mean to add this blank line?

Nope.  Will fix.

> > -            subprocess.call(('taskkill.exe', '/f', '/pid', str(pid)),
> > +            subprocess.call(('taskkill.exe', '/f', '/pid', unicode(pid)),
> 
> Nit: This can probably stay as str().  It's a bit weird to have the other
> strings be str and the pid be unicode.

It certainly could.  I was trying to move away from ever using the str() constructor.  str() is basically always bad news, even when we use it to mean "make this number a string".
Comment 24 Eric Seidel (no email) 2010-04-18 22:23:37 PDT
(In reply to comment #22)
> I guess what I was getting at here is that -- following the "decode early"
> mantra -- we might want to have parse_latest_entry_from_file() accept a
> file-like object from codecs.open() rather than a UTF-8 encoded stream.  That
> way parse_latest_entry_from_file() doesn't have to be encoding aware and will
> see only unicode strings.

Hmm... Maybe.  Seems a bit magical.  We'd have to assert that .read() returns unicode instances (e.g. isinstance(value, unicode)) or something to be sure.

You're right that it would make the code read more like Python 3.0 would. :)

> I only bring this up because this issue will probably come up in more spots as
> we review more of our code.  Any time we define a function that takes a
> file-like object representing text, we'll need to decide if that function
> should take a UTF-8 encoded stream, say, or an encoding-aware stream from
> codecs.

Yes.  And your point is valid.  Not sure how to make that work with StringIO for unittesting though.  If I remember correctly StringIO(unicode(...)) will throw exceptions when read() is called.

> (The codecs module blurs the recommendations in the slide show somewhat because
> you can "decode early" while still seeming to handle a stream of bytes.)
> 
> Incidentally, the Python documentation notes that in Python 3, "there is no
> longer any need for using the encoding-aware streams in the codecs module."
> 
> (from
> http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
> )

Thank you. :)  Very helpful comments as always.
Comment 25 Chris Jerdonek 2010-04-19 13:52:05 PDT
(In reply to comment #24)
> Yes.  And your point is valid.  Not sure how to make that work with StringIO
> for unittesting though.  If I remember correctly StringIO(unicode(...)) will
> throw exceptions when read() is called.

I haven't tried this out myself to check, but the Python documentation seems to say this is okay as long as one doesn't mix the two:

http://docs.python.org/library/stringio.html#StringIO.StringIO
Comment 26 Eric Seidel (no email) 2010-04-19 15:02:16 PDT
You're right,  StringIO(unicode()) seems to work fine:

>>> import StringIO
>>> tor = u"Vestb\u00f8"
>>> print tor
Vestbø
>>> tor_io = StringIO.StringIO(tor)
>>> read_tor = tor_io.read()
>>> print read_tor
Vestbø
>>> print read_tor.__class__
<type 'unicode'>
Comment 27 Eric Seidel (no email) 2010-04-19 18:33:55 PDT
I can haz r+ or r- plz?
Comment 28 Adam Barth 2010-04-20 11:58:36 PDT
Looking.
Comment 29 Adam Barth 2010-04-20 12:08:57 PDT
Comment on attachment 53626 [details]
Patch

Oh man.  This looks like a good approach.  Now we get to see how good our test coverage is.

Btw, we should write up a doc about reviewing/writing webkitpy code that explains these kinds of constraints we have going forward.
Comment 30 Chris Jerdonek 2010-04-20 12:19:30 PDT
(In reply to comment #29)
> Btw, we should write up a doc about reviewing/writing webkitpy code that
> explains these kinds of constraints we have going forward.

I'm not sure what kind of doc you're envisioning, but I've started to put general information about WebKit Python and webkitpy here:

http://trac.webkit.org/wiki/PythonGuidelines

(linked under the "Documentation" section on the main wiki page)
Comment 31 Tor Arne Vestbø 2010-04-20 12:25:56 PDT
Much love for looking into this Eric!
Comment 32 Adam Barth 2010-04-20 12:31:06 PDT
That wiki page looks great.  Would you be willing to add some info about our Unicode strategy and good and bad patterns?
Comment 33 Chris Jerdonek 2010-04-20 12:44:29 PDT
(In reply to comment #32)
> That wiki page looks great.  Would you be willing to add some info about our
> Unicode strategy and good and bad patterns?

Sure, I could do that.  For now I stubbed out a section with a link to this comment thread.  But don't let that stop anyone else from adding to that section if I don't get to it right away. :)
Comment 34 Eric Seidel (no email) 2010-04-20 12:55:00 PDT
Committed r57907: <http://trac.webkit.org/changeset/57907>
Comment 36 Eric Seidel (no email) 2010-04-20 14:30:40 PDT
Reverted r57907 for reason:

Appears to have broken MacEWS and possibly webkit-patch upload for svn users.  Needs further investigation.

Committed r57920: <http://trac.webkit.org/changeset/57920>
Comment 37 Eric Seidel (no email) 2010-04-20 14:31:52 PDT
The chromium bots also saw a failure in new-run-webkit-tests:
http://build.chromium.org/buildbot/waterfall.fyi/builders/Webkit%20Linux%20(webkit.org)/builds/23994/steps/webkit_tests/logs/stdio
Comment 38 Dirk Pranke 2010-04-20 14:50:39 PDT
So, to summarize the thinking in this patch:

1) assume all text files are encoded on disk as UTF-8
2) read files using codecs.open(..., "r", encoding="utf-8")
3) write files using codecs.open(..., "w", encoding="utf-8"), write(s.encode("utf-8")) 
    if s is a string that is not known to be encoded already

Correct?

It would be good if we provided wrappers for (2) and (3) that handled this automatically and centralized the open and close logic in one place. 

Perhaps the read_file() and write_file() utility routines in dryrun.py that are supposed to be split off into their own file would be a good start on standardizing this.
Comment 39 Eric Seidel (no email) 2010-04-20 14:55:36 PDT
(In reply to comment #38)
> So, to summarize the thinking in this patch:
> 
> 1) assume all text files are encoded on disk as UTF-8

More precisely: be explicit about how things are encoded, whatever that encoding may be.

> 2) read files using codecs.open(..., "r", encoding="utf-8")

Possibly.  This patch uses a combination.  codecs.open makes the file object return unicode() strings instead of str byte arrays from read() and friends.  We use that in some places.  cjerdonek was suggesting we standardize, but I didn't do a good job of that in this patch.

> 3) write files using codecs.open(..., "w", encoding="utf-8"),
> write(s.encode("utf-8")) 
>     if s is a string that is not known to be encoded already

One or the other, not both.  If you open with codecs.open, it will do the magic for you.  If you open with open(), you have to encode to a byte array (str) before calling write.

> It would be good if we provided wrappers for (2) and (3) that handled this
> automatically and centralized the open and close logic in one place. 
> 
> Perhaps the read_file() and write_file() utility routines in dryrun.py that are
> supposed to be split off into their own file would be a good start on
> standardizing this.

Possibly.  I think Chris is probably right.  We should probably standardize on using codecs.open everywhere and then be able to assume that read/write handle unicode (like they do in Python 3.0).
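
A sketch of the kind of wrapper Dirk suggests, built on codecs.open as described above; the names and signatures are hypothetical, not what eventually landed in webkitpy:

    import codecs

    def read_text_file(path, encoding="utf-8"):
        stream = codecs.open(path, "r", encoding)
        try:
            return stream.read()            # unicode; codecs decodes for us
        finally:
            stream.close()

    def write_text_file(path, text, encoding="utf-8"):
        stream = codecs.open(path, "w", encoding)
        try:
            stream.write(text)              # expects unicode; codecs encodes on the way out
        finally:
            stream.close()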
Comment 40 Dirk Pranke 2010-04-20 15:10:13 PDT
(In reply to comment #39)
> (In reply to comment #38)
> > So, to summarize the thinking in this patch:
> > 
> > 1) assume all text files are encoded on disk as UTF-8
> 
> More precisely: be explicit about how things are encoded, whatever that
> encoding may be.
> 

I was (perhaps badly) attempting to be stronger than that.  What I meant was: previously we had probably assumed that all of the files were ASCII.  We don't want to assume that, but since we don't want to try to do encoding detection, we should probably assume something, and that should probably be UTF-8.

> > 2) read files using codecs.open(..., "r", encoding="utf-8")
> 
> Possibly.  This patch uses a combination.  codecs.open makes file object return
> unicode() strings instead of str byte arrays from read() and friends.  We use
> that in some palces.  cjerdonek was suggesting we standardize, but I didn't do
> a good job of that in this patch.

I agree with Chris we should standardize, although it doesn't have to happen in this patch. We should file a separate bug to clean up the rest of the open() calls.

> 
> > 3) write files using codecs.open(..., "w", encoding="utf-8"),
> > write(s.encode("utf-8")) 
> >     if s is a string that is not known to be encoded already
> 
> One or the other, not both.  if you open with codecs.open it will do the magic
> for you.  if you open with open() you have to encode to a byte array (str)
> before calling write.

Okay. I wasn't clear about that.
Comment 41 Eric Seidel (no email) 2010-04-20 15:14:52 PDT
(In reply to comment #40)
> (In reply to comment #39)
> > (In reply to comment #38)
> > > So, to summarize the thinking in this patch:
> > > 
> > > 1) assume all text files are encoded on disk as UTF-8
> > 
> > More precisely: be explicit about how things are encoded, whatever that
> > encoding may be.
> > 
> 
> I was (perhaps badly) attempting to be stronger than that. What I meant was,
> previously we had probably assumed that all of the files are ASCII. We don't
> want to assume that, but since we don't want to try and do encoding detection,
> we should probably assume something, and that should probably be UTF-8.

Except they weren't ascii. :)  The exception Chromium hit when this was landed was due to the diff-processing code trying to write out a unicode string.  Previously we were blissfully unaware that diff/wdiff/prettypatch were returning us utf-8 encoded data and we were passing it around as str()s.

After this patch, we convert data returned from commands back into unicode() strings.  Making some things easier, but requiring us to be explicit about converting our unicode strings back to byte arrays before writing them out to files (or using codecs.open to make our write() calls transparently convert).
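
For illustration, a minimal Python 2 sketch of decoding command output back to unicode at the boundary, as described above; the command is arbitrary:

    import subprocess

    process = subprocess.Popen(["svn", "info"], stdout=subprocess.PIPE)
    raw_output = process.communicate()[0]   # str: the raw bytes the command produced
    output = raw_output.decode("utf-8")     # unicode from here on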
Comment 42 WebKit Review Bot 2010-04-20 15:17:03 PDT
http://trac.webkit.org/changeset/57920 might have broken Leopard Intel Release (Tests)
The following changes are on the blame list:
http://trac.webkit.org/changeset/57920
http://trac.webkit.org/changeset/57921
Comment 43 Dirk Pranke 2010-04-20 15:21:36 PDT
(In reply to comment #41)
> (In reply to comment #40)
> > (In reply to comment #39)
> > > (In reply to comment #38)
> > > > So, to summarize the thinking in this patch:
> > > > 
> > > > 1) assume all text files are encoded on disk as UTF-8
> > > 
> > > More precisely: be explicit about how things are encoded, whatever that
> > > encoding may be.
> > > 
> > 
> > I was (perhaps badly) attempting to be stronger than that. What I meant was,
> > previously we had probably assumed that all of the files are ASCII. We don't
> > want to assume that, but since we don't want to try and do encoding detection,
> > we should probably assume something, and that should probably be UTF-8.
> 
> Except they weren't ascii. :)  

Agreed, obviously. I wasn't trying to suggest there weren't bugs we were running into.
Comment 44 Eric Seidel (no email) 2010-04-21 03:33:19 PDT
Created attachment 53937 [details]
Fixed to be way more awesome
Comment 45 Eric Seidel (no email) 2010-04-21 03:35:57 PDT
I've updated the patch to deploy "codecs.open" everywhere, as per Chris's suggestion.  While I was looking at every open() call, I converted a bunch of them to use "with" to avoid file descriptor leaks.

The patch is huge now.  However, it does fix the 3 regressions seen when landing the last patch:
1. mac-ews was failing due to a change in how --status-host was passed; our global argument parsing is a hack, but at least I've documented that now.
2. webkit-patch upload was broken for svn users due to the svn-create-patch call not passing its arguments as an array.
3. new-run-webkit-tests was broken for chromium due to the diff creation logic not specifying an encoding.
Comment 46 Adam Barth 2010-04-21 13:46:06 PDT
Comment on attachment 53937 [details]
Fixed to be way more awesome

This change is too large for me to convince myself that it's correct, but the approach seems good and we want to do it all at once.

One thought: how much of this can we wrap in a filesystem abstraction?  I discussed that idea with Dirk on another bug.  I don't think we need to do that in this patch, but it's something to keep in mind.
Comment 47 Eric Seidel (no email) 2010-04-21 13:50:05 PDT
The file system abstraction is really codecs.open().  I learned at the end of this change that I could have deployed codecs.open even more widely.

We should *never* be using open() directly.  codecs.open takes an encoding parameter (just like Python 3.x open() does).  encoding=None results in open()-like, return-a-str-object behavior.

So in a followup patch we should remove all uses of raw open().

We should also avoid ever closing a file manually.  A manual close gets skipped in the case of an exception, which is why we use with blocks. :)
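
A short sketch of the two codecs.open() behaviors described above (assuming a ChangeLog file in the current directory):

    import codecs

    text_file = codecs.open("ChangeLog", "r", encoding="utf-8")
    assert isinstance(text_file.read(), unicode)   # decoded for us
    text_file.close()

    byte_file = codecs.open("ChangeLog", "r", encoding=None)
    assert isinstance(byte_file.read(), str)       # plain open()-style bytes
    byte_file.close()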
Comment 48 Dirk Pranke 2010-04-21 14:06:25 PDT
(In reply to comment #47)
> The file system abstraction is really codecs.open()  I learned at the end of
> this change that I could have even deployed codecs.open wider.
> 
> We should *never* be using open() directly.  codecs.open takes an encoding
> parameter (just like Python 3.x open() does).  encoding=None results in
> open()-like, return-a-str-object behavior.
> 
> So in a followup patch we should remove all uses of raw open().
> 
> We should also avoid ever closing a file manually.  Doing so will fail in teh
> case of an exception, which is why we use with blocks. :)

Note that with blocks only do try/finally; you should still be aware that you need a try/except if you don't want the exceptions to propagate.
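
A quick sketch of that point (Python 2.5+; the file name is illustrative):

    from __future__ import with_statement  # needed on Python 2.5

    try:
        with open("ChangeLog") as changelog:   # closed even if read() raises
            data = changelog.read()
    except IOError:                            # still needed if we must not propagate
        data = ""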
Comment 49 Chris Jerdonek 2010-04-21 14:18:05 PDT
(In reply to comment #47)

> We should *never* be using open() directly.  codecs.open takes an encoding
> parameter (just like Python 3.x open() does).  encoding=None results in
> open()-like, return-a-str-object behavior.

I just wanted to bring attention to a couple other uses of the codecs module that I noticed in check-webkit-style -- in case they might be useful elsewhere in webkitpy.  Mainly this code--

    # Change stderr to write with replacement characters so we don't die
    # if we try to print something containing non-ASCII characters.
    stderr = codecs.StreamReaderWriter(sys.stderr,
                                       codecs.getreader('utf8'),
                                       codecs.getwriter('utf8'),
                                       'replace')
    # Setting an "encoding" attribute on the stream is necessary to
    # prevent the logging module from raising an error.  See
    # the checker.configure_logging() function for more information.
    stderr.encoding = "UTF-8"

    # FIXME: Change webkitpy.style so that we do not need to overwrite
    #        the global sys.stderr.  This involves updating the code to
    #        accept a stream parameter where necessary, and not calling
    #        sys.stderr explicitly anywhere.
    sys.stderr = stderr

(This is from http://trac.webkit.org/browser/trunk/WebKitTools/Scripts/check-webkit-style?rev=57467#L61 )

IIRC, this codecs.StreamReaderWriter code was present in its basic form from the beginning.  I'm not sure when we might want to use this pattern over codecs.open(), but perhaps it might be useful elsewhere.

Note also the use of the errors='replace' parameter, which codecs.open() also accepts.  The "errors" parameter in these methods controls how bad bytes should be handled when encountered while reading from or writing to a stream.  There may be cases where we want to be more lenient than the default value of errors='strict' and continue processing a file if possible.  It doesn't look like the errors parameter is being used in any of the codecs.open() calls in this patch (which could very well be exactly what we want).
Comment 50 Eric Seidel (no email) 2010-04-21 15:04:36 PDT
Committed r58014: <http://trac.webkit.org/changeset/58014>
Comment 51 Eric Seidel (no email) 2010-04-21 15:47:27 PDT
Created attachment 53998 [details]
Fix Chromium Canaries
Comment 52 Eric Seidel (no email) 2010-04-21 15:52:39 PDT
Committed r58016: <http://trac.webkit.org/changeset/58016>
Comment 53 Eric Seidel (no email) 2010-04-21 16:31:39 PDT
Committed r58023: <http://trac.webkit.org/changeset/58023>
Comment 54 Eric Seidel (no email) 2010-04-21 18:23:39 PDT
The patch was rolled out a second time.
Comment 55 Eric Seidel (no email) 2010-04-21 18:34:40 PDT
Committed r58036: <http://trac.webkit.org/changeset/58036>
Comment 56 Eric Seidel (no email) 2010-04-21 20:10:29 PDT
Created attachment 54018 [details]
Fix for run-webkit-tests --chromium
Comment 57 Eric Seidel (no email) 2010-04-21 20:17:23 PDT
Committed r58044: <http://trac.webkit.org/changeset/58044>
Comment 58 Eric Seidel (no email) 2010-04-21 20:40:51 PDT
Created attachment 54019 [details]
Patch
Comment 59 Eric Seidel (no email) 2010-04-21 21:31:42 PDT
Committed r58050: <http://trac.webkit.org/changeset/58050>
Comment 60 Eric Seidel (no email) 2010-04-21 22:41:48 PDT
Committed r58054: <http://trac.webkit.org/changeset/58054>
Comment 61 Eric Seidel (no email) 2010-04-21 22:50:00 PDT
Committed r58055: <http://trac.webkit.org/changeset/58055>
Comment 62 Eric Seidel (no email) 2010-04-21 23:20:04 PDT
Committed r58059: <http://trac.webkit.org/changeset/58059>
Comment 63 Eric Seidel (no email) 2010-04-21 23:30:38 PDT
Committed r58060: <http://trac.webkit.org/changeset/58060>
Comment 64 Eric Seidel (no email) 2010-04-21 23:39:32 PDT
Committed r58062: <http://trac.webkit.org/changeset/58062>
Comment 65 Eric Seidel (no email) 2010-04-22 03:29:25 PDT
Chromium is reporting random hangs in new-run-webkit-tests --chromium after all these changes.

I'm not sure what could be causing them yet.  We really need a backtrace from the hang.
Comment 66 Eric Seidel (no email) 2010-04-22 03:32:43 PDT
http://trac.webkit.org/changeset/58059 looks the most suspicious for possibly causing hangs, but again we need a reproduction, including a backtrace, to really know what's getting hung.  Some print-based debugging might lead us in the right direction.

I'm headed to bed soon, but I'm happy to look at this first thing in the morning (well, probably afternoon at this point).
Comment 67 Eric Seidel (no email) 2010-04-22 03:58:23 PDT
(In reply to comment #66)
> http://trac.webkit.org/changeset/58059 looks the most suspicious for possibly
> causing hangs, but again we need a reproduction including backtrace to really
> know what's getting hung.  Some print-based debugging might lead us in teh
> right direciton.
> 
> I'm headed to bed soon, but I'm happy to look at this first thing in the
> morning (well, probably afternoon at this point).

That change probably made us vulnerable to http://bugs.python.org/issue1236 again, which I had mistakenly thought was a 2.4-only bug.  Perhaps one of the original new-run-webkit-tests authors can comment on how http://bugs.python.org/issue1236 manifests?  Could it be the hangs Yuzo is seeing?

Again, we just need a backtrace, and then solving the hang should be simple. :)
Comment 68 Dirk Pranke 2010-04-22 18:53:22 PDT
(In reply to comment #62)
> Committed r58059: <http://trac.webkit.org/changeset/58059>

This change doesn't actually call format_wdiff_output_as_html() ... presumably this breaks wdiff output?
Comment 69 Fumitoshi Ukai 2010-04-22 22:26:27 PDT
When I run test-webkitpy on Cygwin, I get:


test-webkitpy: INFO     Suppressing most webkitpy logging while running unit tests.
webkitpy.test.main: INFO     Excluding: webkitpy.common.checkout.scm_unittest (use --all to include)
...

======================================================================
ERROR: Validate that it is safe to pass unicode() objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/common/system/executive_unittest.py", line 53, in test_run_command_with_unicode
    output = executive.run_command(["echo", "-n", unicode_tor])
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/common/system/executive.py", line 211, in run_command
    cwd=cwd)
  File "/usr/lib/python2.5/subprocess.py", line 594, in __init__
    errread, errwrite)
  File "/usr/lib/python2.5/subprocess.py", line 1091, in _execute_child
    raise child_exception
TypeError: execv() arg 2 must contain only strings

======================================================================
FAIL: test_basics (webkitpy.layout_tests.run_webkit_tests_unittest.DryrunTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py", line 63, in test_basics
    'fast/html']))
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py", line 41, in passing_run
    res = run_webkit_tests.main(options, args, False)
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 1529, in main
    options.configuration = port_obj.default_configuration()
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/layout_tests/port/webkit.py", line 304, in default_configuration
    assert(configuration == "Debug" or configuration == "Release")
AssertionError

$ python
Python 2.5.2 (r252:60911, Dec  2 2008, 09:26:14)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.124)] on cygwin

Should I use a more recent version of Python or Cygwin?