Bug 37765 - REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø
Summary: REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests
Version: 528+ (Nightly build)
Hardware: PC
OS: OS X 10.5
Importance: P2 Normal
Assignee: Nobody
URL:
Keywords:
Depends on: 37965
Blocks: 35288 37044 37587 37588 37763
 
Reported: 2010-04-18 00:10 PDT by Eric Seidel (no email)
Modified: 2010-04-22 22:26 PDT
CC List: 11 users

See Also:


Attachments
Patch (39.54 KB, patch)
2010-04-18 04:14 PDT, Eric Seidel (no email)
Patch (53.77 KB, patch)
2010-04-18 04:59 PDT, Eric Seidel (no email)
Fixed to be way more awesome (88.59 KB, patch)
2010-04-21 03:33 PDT, Eric Seidel (no email)
Fix Chromium Canaries (3.67 KB, patch)
2010-04-21 15:47 PDT, Eric Seidel (no email)
Fix for run-webkit-tests --chromium (2.00 KB, patch)
2010-04-21 20:10 PDT, Eric Seidel (no email)
Patch (1.45 KB, patch)
2010-04-21 20:40 PDT, Eric Seidel (no email)

Description Eric Seidel (no email) 2010-04-18 00:10:27 PDT
REGRESSION(57531): the commit-queue still hates Tor Arne Vestbø

I fixed the commit-queue ignoring Tor as a reviewer, but now it throws an exception every time it tries to land a patch with his name in it.

I created this bug by hand due to bug 37764. :(
Comment 1 Chris Jerdonek 2010-04-18 02:09:47 PDT
I'm not sure if code like this will solve it:

http://trac.webkit.org/browser/trunk/WebKitTools/Scripts/webkitpy/style/checker.py?rev=57467#L564

Whatever the answer is, we may want to expose the necessary helper methods in webkitpy/common/ since we should probably be supporting unicode characters across the board.
Comment 2 Eric Seidel (no email) 2010-04-18 02:14:34 PDT
I've now coded up two versions of this patch.

Both have failed in different places. :(

Unicode = trail of tears.

The first version tried to deploy use of the unicode() class more widely.  That ran into trouble because we would have to change a bunch of places to be more unicode()-aware and correctly encode to utf-8 before calling things like file.write().

The second version tried to remove use of the unicode() class from our codebase and instead just encode Tor's name as UTF-8 in a str() object.  Turns out that fails too: if you call str("Tor Arne Vestb\xC3\xB8").encode("utf8"), Python first tries to decode the bytes as ASCII, notices that \xC3 is not valid ASCII, and throws a UnicodeDecodeError.

So it seems like python is forcing us towards using unicode().  That's just gonna hurt.  A lot.
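
For illustration, a minimal Python 2 sketch of the two outcomes described above; the byte values are just Tor's name encoded as UTF-8:

    raw = "Tor Arne Vestb\xc3\xb8"        # str object: UTF-8 bytes, not text

    name = raw.decode("utf-8")            # works: u"Tor Arne Vestb\u00f8"

    try:
        raw.encode("utf-8")               # Python implicitly decodes the bytes as ASCII first...
    except UnicodeDecodeError:
        print "encode() on non-ASCII bytes blows up"   # ...and \xc3 is not ASCII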
Comment 3 Eric Seidel (no email) 2010-04-18 02:20:44 PDT
http://farmdev.com/talks/unicode/
is really pushing unicode()

Not sure I'm ready for the koolaid.
Comment 4 Eric Seidel (no email) 2010-04-18 02:25:48 PDT
OK.  I think I'm maybe convinced.

We should take the unicode dive.  I will need a few hours and a good stiff drink.  Maybe monday.
Comment 5 Chris Jerdonek 2010-04-18 02:29:29 PDT
(In reply to comment #4)
> OK.  I think I'm maybe convinced.
> 
> We should take the unicode dive.  I will need a few hours and a good stiff
> drink.  Maybe monday.

Briefly, what's the pain point?
Comment 6 Eric Seidel (no email) 2010-04-18 02:47:51 PDT
(In reply to comment #5)
> > We should take the unicode dive.  I will need a few hours and a good stiff
> > drink.  Maybe monday.
> 
> Briefly, what's the pain point?

What do you mean?
Comment 7 Chris Jerdonek 2010-04-18 02:55:08 PDT
(In reply to comment #6)
> > Briefly, what's the pain point?
> 
> What do you mean?

I was just asking if you could explain briefly why it will hurt so much to start using it more widely (as you say below).  And will that be true even if we use it just in the places we really need it?

> So it seems like python is forcing us towards using unicode().  That's just
> gonna hurt.  A lot.
Comment 8 Eric Seidel (no email) 2010-04-18 03:00:23 PDT
Every place where we log, call print, or call write, we will need to encode.  Many of our unit tests will need unicode versions added (eventually).  Every time we open a file or read off the network (when we're not using BeautifulSoup), we will need to document what encoding we expect.

Totally doable.  Just will be a large change.  Looking at it now.
Comment 9 Eric Seidel (no email) 2010-04-18 04:14:28 PDT
Created attachment 53625 [details]
Patch
Comment 10 Eric Seidel (no email) 2010-04-18 04:15:14 PDT
Comment on attachment 53625 [details]
Patch

I'm sure this change will break things.  We'll fix them when we find where it broke stuff.
Comment 11 WebKit Review Bot 2010-04-18 04:20:48 PDT
Attachment 53625 [details] did not pass style-queue:

Failed to run "WebKitTools/Scripts/check-webkit-style" exit_code: 1
WebKitTools/Scripts/webkitpy/common/net/buildbot.py:413:  whitespace before ':'  [pep8/E203] [5]
WebKitTools/Scripts/webkitpy/common/system/autoinstall.py:127:  at least two spaces before inline comment  [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/steps/postdiff.py:47:  at least two spaces before inline comment  [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/commands/upload.py:265:  at least two spaces before inline comment  [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/commands/upload.py:407:  at least two spaces before inline comment  [pep8/E261] [5]
WebKitTools/Scripts/webkitpy/tool/commands/upload.py:427:  at least two spaces before inline comment  [pep8/E261] [5]
Total errors found: 6 in 24 files


If any of these errors are false positives, please file a bug against check-webkit-style.
Comment 12 Eric Seidel (no email) 2010-04-18 04:38:02 PDT
Comment on attachment 53625 [details]
Patch

Nope, not quite there.  Will post a better patch in the morning.
Comment 13 Eric Seidel (no email) 2010-04-18 04:59:06 PDT
Created attachment 53626 [details]
Patch
Comment 14 Eric Seidel (no email) 2010-04-18 05:00:10 PDT
I think this one finally works.  I ended up having to rip out a bunch of StringIO plumbing.  Basically every time we use StringIO, we have to be sure to encode the unicode string to a utf-8 byte array so that StringIO can vend back byte arrays via its read() method.
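
A tiny Python 2 sketch of the StringIO convention described above (the name is just an example):

    import StringIO  # Python 2 module name

    name = u"Vestb\u00f8"
    stream = StringIO.StringIO(name.encode("utf-8"))  # hand StringIO pre-encoded UTF-8 bytes
    data = stream.read()                              # read() vends a str (byte array)
    assert data.decode("utf-8") == name               # decode at the boundary to get unicode back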
Comment 15 Chris Jerdonek 2010-04-18 10:41:18 PDT
Thanks for taking this on and looking into this, Eric.  (Good slide show in comment 3, btw.)

A couple of random comments before a couple of comments on the patch:

(1) As the slides suggest, we may want to consider decoding using "utf-8-sig" as a general practice instead of "utf-8":

"To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig").... Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.  On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file."

(from http://docs.python.org/library/codecs.html#encodings-and-unicode )

(2) I came across this which might be useful somewhere (intelligent encoding auto-detection):

http://chardet.feedparser.org/
Comment 16 Adam Barth 2010-04-18 11:23:36 PDT
We should be careful about BOM proliferation.  It might confuse tools that don't understand it.
Comment 17 Chris Jerdonek 2010-04-18 11:31:19 PDT
(In reply to comment #16)
> We should be careful about BOM proliferation.  It might confuse tools that
> don't understand it.

Just to be clear, I was suggesting using "utf-8-sig" only when reading -- which would remove the BOM.
Comment 18 Chris Jerdonek 2010-04-18 11:36:29 PDT
+++ b/WebKitTools/ChangeLog
@@ -1,3 +1,93 @@
+        We do not have to use u"" instead of "" because u"a" == "a" as expected
+        in Python.  Python will generate a warning to the console in cases where
+        a unicode() == str() operation cannot be performed.

Can you clarify what you are trying to say here?  It sounds like you might be saying more than you mean to say -- e.g. that the caller never needs to use unicode string literals, which is obviously not right.

+        All places which use StringIO need to be sure to pass StringIO a
+        pre-encoded byte-array (str object) instead of unicode so that
+        clients which read from the StringIO don't have encoding exceptions.

It seems like we shouldn't need to use StringIO as much as we do (except perhaps in unittest code to make certain tests easier).  It looks like you're removing a lot of the places it's being used which seems good.

@@ -117,7 +118,9 @@ class ChangeLog(object):
 
     def latest_entry(self):
         # ChangeLog files are always UTF-8, we read them in as such to support Reviewers with unicode in their names.
-        changelog_file = codecs.open(self.path, "r", "utf-8")
+        # We don't use codecs.open here to make the api for parse_latest_entry_from_file clearer.
+        # If we did, then it would be unclear as to whos reponsibility decoding of the file should be.
+        changelog_file = open(self.path, "r")
         try:
             return self.parse_latest_entry_from_file(changelog_file)

I understand what you're saying, but maybe the conclusion that should be drawn is that we shouldn't have parse_latest_entry_from_file() as part of our API?  In other words, the caller should be responsible.  That would also be more consistent with the rule of thumb to decode as early as possible.

+++ b/WebKitTools/Scripts/webkitpy/common/system/deprecated_logging.py
@@ -45,9 +45,10 @@ class tee:
     def __init__(self, *files):
         self.files = files
 
-    def write(self, string):
+    # Callers should pass an already encoded string for writing.
+    def write(self, bytes):
         for file in self.files:
-            file.write(string)
+            file.write(bytes)

Doesn't this also go against "unicode everywhere/encode late"?  Maybe this would be a good spot to do type-checking between unicode/str for backwards compatibility.

+++ b/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/metered_stream.py
@@ -56,6 +56,8 @@ class MeteredStream:
         self._stream = stream
         self._last_update = ""
 
+    # FIXME: Does this take a string (unicode) or an array of bytes (str)?
+    # If it takes a string, it needs to call txt.encode("utf-8")
     def write(self, txt):
         """Write text directly to the stream, overwriting and resetting the
         meter."""

I'm a bit unclear on the extent to which, for example, we'll need to be passing unicode strings even for every log message we create.  You've probably thought about this more than I have.

Out of curiosity, I started looking to see how Python's logging package handles this.  I stopped when it started to look like the logging package has no preference between str and unicode.  I think it might only be the particular logging handlers that care, which is what the caller has control over (when they configure logging).
Comment 19 Eric Seidel (no email) 2010-04-18 15:36:29 PDT
(In reply to comment #18)

Good questions!

> Can you clarify what you are trying to say here?  It sounds like you might be
> saying more than you mean to say -- e.g. that the caller never needs to use
> unicode string literals, which is obviously not right.

I was attempting to excuse why I didn't add u"" to every string literal in the project. :)  If you forget the u"" and it actually contains non-ascii code points, python will yell at you.

> It seems like we shouldn't need to use StringIO as much as we do (except
> perhaps in unittest code to make certain tests easier).  It looks like you're
> removing a lot of the places it's being used which seems good.

StringIO is just confusingly named.  It deals with a stream of bytes.  aka, a "str" object in Python 2.x.  In Python 3.0 "str" is actually a unicode() object, and there is an explicit bytes() type.  StringIO probably takes a bytes() object in 3.0. :)

There are few places where we actually need StringIO, since we tend to follow the model of just reading everything into memory.  The files we deal with tend to be small, and the machines these scripts run on are rather beefy.

> I understand what you're saying, but maybe the conclusion that should be drawn
> is that we shouldn't have parse_latest_entry_from_file() as part of our API? 
> In other words, the caller should be responsible.  That would also be more
> consistent with the rule of thumb to decode as early as possible.

Any API which takes a file-like object is rather confusing, because the read() method is expected to return a byte stream.  However, you can use codecs.open() to create a file-like object which will return unicode() strings instead of str() byte streams from the various methods.

> Doesn't this also go against "unicode everywhere/encode late"?  Maybe this
> would be a good spot to do type-checking between unicode/str for backwards
> compatibility.

Nope.  .write() is expected to deal with byte streams.  Think of file.write()

> I'm a bit unclear on the extent to which, for example, we'll need to be passing
> unicode strings even for every log message we create.  You've probably thought
> about this more than I have.

Well, it depends on the logging infrastructure.  As far as I can tell, Python 2.5's "print" seems to be smart enough to handle unicode() objects; file.write() not so much (unless you open the file with codecs.open and declare an encoding).  So it depends on what sort of logging we're doing.  If we're logging to the console, it seems Python is smart enough to handle unicode for us.  If we're logging to a file, we may have to be careful to call .encode("utf-8") to create the appropriate byte stream first (see the sketch at the end of this comment).

> Out of curiosity, I started looking to see how Python's logging package handles
> this.  I stopped when it started to look like the logging package has no
> preference between str and unicode.  I think it might only be the particular
> logging handlers that care, which is what the caller has control over (when
> they configure logging).

There is no difference between "str" and "unicode" as far as most of Python is concerned.  I wouldn't expect the logging module to care, except if it's writing to files or to the console (though I suspect that when writing to the console the unicode encoding is abstracted at a lower layer).
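
To illustrate the console vs. file.write() distinction above, a small Python 2 sketch; the log file name is hypothetical:

    import codecs

    name = u"Tor Arne Vestb\u00f8"

    print name                                      # print can usually cope with unicode on a console

    plain = open("queue.log", "w")                  # plain open(): write() wants a byte string
    plain.write(name.encode("utf-8") + "\n")
    plain.close()

    aware = codecs.open("queue.log", "a", "utf-8")  # codecs.open(): write() accepts unicode
    aware.write(name + u"\n")
    aware.close()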
Comment 20 Eric Seidel (no email) 2010-04-18 15:38:01 PDT
(In reply to comment #17)
> (In reply to comment #16)
> > We should be careful about BOM proliferation.  It might confuse tools that
> > don't understand it.
> 
> Just to be clear, I was suggesting using "utf-8-sig" only when reading -- which
> would remove the BOM.

I don't think we need to use utf-8-sig anywhere.

Our tools don't (and shouldn't) write out BOMs and we know the encodings of all our files.  It's possible that editing ChangeLog files in MSVC would spit out a BOM in the file.  If that's the case, we might have to use "utf-8-sig" when reading in ChangeLog files to be safe.
Comment 21 Tony Chang 2010-04-18 17:48:25 PDT
Comment on attachment 53626 [details]
Patch

In response to comment #2, I think it would work if you use decode instead of encode (the string you passed in is already utf8 encoded, you want to convert it to a unicode object, right?).

FWIW, I would probably have tried to keep everything as a utf8 encoded str, but using unicode everywhere seems fine too (you get some better type checking this way).  In either case, thanks for taking this on.

Some small style nits below.

> diff --git a/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py b/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py
> index e93896f..862dc48 100644
> --- a/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py
> +++ b/WebKitTools/Scripts/webkitpy/common/checkout/changelog.py
> @@ -102,12 +102,13 @@ class ChangeLog(object):
>          date_line_regexp = re.compile(ChangeLogEntry.date_line_regexp)
>          entry_lines = []
>          # The first line should be a date line.
> -        first_line = changelog_file.readline()
> +        first_line = unicode(changelog_file.readline(), "utf-8")

Nit: I find using .decode('utf-8') easier to read than calling unicode(), but either is fine with me.  There's another instance of this in the same file.

> diff --git a/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py b/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py
> index bb22d14..2dd4d20 100644
> --- a/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py
> +++ b/WebKitTools/Scripts/webkitpy/common/net/bugzilla.py
>      def _parse_attachment_element(self, element, bug_id):
> +

Nit: Did you mean to add this blank line?

> diff --git a/WebKitTools/Scripts/webkitpy/common/system/executive.py b/WebKitTools/Scripts/webkitpy/common/system/executive.py
> index b6126e4..cbe0e1f 100644
> --- a/WebKitTools/Scripts/webkitpy/common/system/executive.py
> +++ b/WebKitTools/Scripts/webkitpy/common/system/executive.py
> @@ -145,7 +157,7 @@ class Executive(object):
>              # os.kill isn't available on Windows.  However, when I tried it
>              # using Cygwin, it worked fine.  We should investigate whether
>              # we need this platform specific code here.
> -            subprocess.call(('taskkill.exe', '/f', '/pid', str(pid)),
> +            subprocess.call(('taskkill.exe', '/f', '/pid', unicode(pid)),

Nit: This can probably stay as str().  It's a bit weird to have the other strings be str and the pid be unicode.
Comment 22 Chris Jerdonek 2010-04-18 19:59:38 PDT
(In reply to comment #19)

Thanks for your responses, Eric.

> > I understand what you're saying, but maybe the conclusion that should be drawn
> > is that we shouldn't have parse_latest_entry_from_file() as part of our API? 
> > In other words, the caller should be responsible.  That would also be more
> > consistent with the rule of thumb to decode as early as possible.
> 
> Any API which takes a file-like object is rather confusing.  Because the read()
> method is expected to return a byte stream.  However you can use codecs.open()
> to create a file-like object which will return unicode() strings instead of
> str() byte streams from the various methods.

I guess what I was getting at here is that -- following the "decode early" mantra -- we might want to have parse_latest_entry_from_file() accept a file-like object from codecs.open() rather than a UTF-8 encoded stream.  That way parse_latest_entry_from_file() doesn't have to be encoding aware and will see only unicode strings.

I only bring this up because this issue will probably come up in more spots as we review more of our code.  Any time we define a function that takes a file-like object representing text, we'll need to decide if that function should take a UTF-8 encoded stream, say, or an encoding-aware stream from codecs.

(The codecs module blurs the recommendations in the slide show somewhat because you can "decode early" while still seeming to handle a stream of bytes.)

Incidentally, the Python documentation notes that in Python 3, "there is no longer any need for using the encoding-aware streams in the codecs module."

(from http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit )
Comment 23 Eric Seidel (no email) 2010-04-18 22:20:31 PDT
(In reply to comment #21)
> (From update of attachment 53626 [details])
> In response to comment #2, I think it would work if you use decode instead of
> encode (the string you passed in is already utf8 encoded, you want to convert
> it to a unicode object, right?).

My example was slightly confusing, and I was myself slightly confused.  Now that I see "str" as an array of bytes and not actually a string type, the world makes a lot more sense.  I was attempting to unconditionally encode any "string-like object" to a utf-8 byte stream before passing it off to the Popen call.  This meant that if I was passing around utf-8-encoded str objects and tried to call "encode" on them, it would scream at me.

> FWIW, I would probably have tried to keep everything as a utf8 encoded str, but
> using unicode everywhere seems fine too (you get some better type checking this
> way).  In either case, thanks for taking this on.

Yes, it's true, that could work.  However, there are already many APIs which feed us unicode() objects, so we'd have to be careful to convert those back to utf-8 byte stream (str) objects before passing them around, or comparisons would fail.

I decided to go with the unicode() solution, mostly because it seemed the most consistent in my head.  Turns out it's also the recommended way to prepare for Python 3.0 (which we'll eventually get to in a year or more). :)

> > -        first_line = changelog_file.readline()
> > +        first_line = unicode(changelog_file.readline(), "utf-8")
> 
> Nit: I find using .decode('utf-8') easier to read than calling unicode(), but
> either is fine with me.  There's another instance of this in the same file.

Yes.  The only difference is handling of None.  However I agree in this case decode is definitely nicer.

> >      def _parse_attachment_element(self, element, bug_id):
> > +
> 
> Nit: Did you mean to add this blank line?

Nope.  Will fix.

> > -            subprocess.call(('taskkill.exe', '/f', '/pid', str(pid)),
> > +            subprocess.call(('taskkill.exe', '/f', '/pid', unicode(pid)),
> 
> Nit: This can probably stay as str().  It's a bit weird to have the other
> strings be str and the pid be unicode.

It certainly could.  I was trying to move away from ever using the str() constructor.  str() is basically always bad news, even when we use it to mean "make this number a string".
Comment 24 Eric Seidel (no email) 2010-04-18 22:23:37 PDT
(In reply to comment #22)
> I guess what I was getting at here is that -- following the "decode early"
> mantra -- we might want to have parse_latest_entry_from_file() accept a
> file-like object from codecs.open() rather than a UTF-8 encoded stream.  That
> way parse_latest_entry_from_file() doesn't have to be encoding aware and will
> see only unicode strings.

Hmm... Maybe.  Seems a bit magical.  We'd have to assert that .read() returns unicode instances (e.g. isinstance(value, unicode)) or something to be sure.

You're right that it would make the code read more like Python 3.0 would. :)

> I only bring this up because this issue will probably come up in more spots as
> we review more of our code.  Any time we define a function that takes a
> file-like object representing text, we'll need to decide if that function
> should take a UTF-8 encoded stream, say, or an encoding-aware stream from
> codecs.

Yes.  And your point is valid.  Not sure how to make that work with StringIO for unittesting though.  If I remember correctly StringIO(unicode(...)) will throw exceptions when read() is called.

> (The codecs module blurs the recommendations in the slide show somewhat because
> you can "decode early" while still seeming to handle a stream of bytes.)
> 
> Incidentally, the Python documentation notes that in Python 3, "there is no
> longer any need for using the encoding-aware streams in the codecs module."
> 
> (from
> http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
> )

Thank you. :)  Very helpful comments as always.
Comment 25 Chris Jerdonek 2010-04-19 13:52:05 PDT
(In reply to comment #24)
> Yes.  And your point is valid.  Not sure how to make that work with StringIO
> for unittesting though.  If I remember correctly StringIO(unicode(...)) will
> throw exceptions when read() is called.

I haven't tried this out myself to check, but the Python documentation seems to say this is okay as long as one doesn't mix the two:

http://docs.python.org/library/stringio.html#StringIO.StringIO
Comment 26 Eric Seidel (no email) 2010-04-19 15:02:16 PDT
You're right,  StringIO(unicode()) seems to work fine:

>>> import StringIO
>>> tor = u"Vestb\u00f8"
>>> print tor
Vestbø
>>> tor_io = StringIO.StringIO(tor)
>>> read_tor = tor_io.read()
>>> print read_tor
Vestbø
>>> print read_tor.__class__
<type 'unicode'>
Comment 27 Eric Seidel (no email) 2010-04-19 18:33:55 PDT
I can haz r+ or r- plz?
Comment 28 Adam Barth 2010-04-20 11:58:36 PDT
Looking.
Comment 29 Adam Barth 2010-04-20 12:08:57 PDT
Comment on attachment 53626 [details]
Patch

Oh man.  This looks like a good approach.  Now we get to see how good our test coverage is.

Btw, we should write up a doc about reviewing/writing webkitpy code that explains these kinds of constraints we have going forward.
Comment 30 Chris Jerdonek 2010-04-20 12:19:30 PDT
(In reply to comment #29)
> Btw, we should write up a doc about reviewing/writing webkitpy code that
> explains these kinds of constraints we have going forward.

I'm not sure what kind of doc you're envisioning, but I've started to put general information about WebKit Python and webkitpy here:

http://trac.webkit.org/wiki/PythonGuidelines

(linked under the "Documentation" section on the main wiki page)
Comment 31 Tor Arne Vestbø 2010-04-20 12:25:56 PDT
Much love for looking into this Eric!
Comment 32 Adam Barth 2010-04-20 12:31:06 PDT
That wiki page looks great.  Would you be willing to add some info about our Unicode strategy and good and bad patterns?
Comment 33 Chris Jerdonek 2010-04-20 12:44:29 PDT
(In reply to comment #32)
> That wiki page looks great.  Would you be willing to add some info about our
> Unicode strategy and good and bad patterns?

Sure, I could do that.  For now I stubbed out a section with a link to this comment thread.  But don't let that stop anyone else from adding to that section if I don't get to it right away. :)
Comment 34 Eric Seidel (no email) 2010-04-20 12:55:00 PDT
Committed r57907: <http://trac.webkit.org/changeset/57907>
Comment 36 Eric Seidel (no email) 2010-04-20 14:30:40 PDT
Reverted r57907 for reason:

Appears to have broken MacEWS and possibly webkit-patch upload for svn users.  Needs further investigation.

Committed r57920: <http://trac.webkit.org/changeset/57920>
Comment 37 Eric Seidel (no email) 2010-04-20 14:31:52 PDT
The chromium bots also saw a failure in new-run-webkit-tests:
http://build.chromium.org/buildbot/waterfall.fyi/builders/Webkit%20Linux%20(webkit.org)/builds/23994/steps/webkit_tests/logs/stdio
Comment 38 Dirk Pranke 2010-04-20 14:50:39 PDT
So, to summarize the thinking in this patch:

1) assume all text files are encoded on disk as UTF-8
2) read files using codecs.open(..., "r", encoding="utf-8")
3) write files using codecs.open(..., "w", encoding="utf-8"), write(s.encode("utf-8")) 
    if s is a string that is not known to be encoded already

Correct?

It would be good if we provided wrappers for (2) and (3) that handled this automatically and centralized the open and close logic in one place. 

Perhaps the read_file() and write_file() utility routines in dryrun.py that are supposed to be split off into their own file would be a good start on standardizing this.
Comment 39 Eric Seidel (no email) 2010-04-20 14:55:36 PDT
(In reply to comment #38)
> So, to summarize the thinking in this patch:
> 
> 1) assume all text files are encoded on disk as UTF-8

More precisely: be explicit about how things are encoded, whatever that encoding may be.

> 2) read files using codecs.open(..., "r", encoding="utf-8")

Possibly.  This patch uses a combination.  codecs.open makes the file object return unicode() strings instead of str byte arrays from read() and friends.  We use that in some places.  cjerdonek was suggesting we standardize, but I didn't do a good job of that in this patch.

> 3) write files using codecs.open(..., "w", encoding="utf-8"),
> write(s.encode("utf-8")) 
>     if s is a string that is not known to be encoded already

One or the other, not both.  If you open with codecs.open, it will do the magic for you.  If you open with open(), you have to encode to a byte array (str) before calling write.

> It would be good if we provided wrappers for (2) and (3) that handled this
> automatically and centralized the open and close logic in one place. 
> 
> Perhaps the read_file() and write_file() utility routines in dryrun.py that are
> supposed to be split off into their own file would be a good start on
> standardizing this.

Possibly.  I think Chris is probably right.  We should probably standardize on using codecs.open everywhere and then be able to assume that read/write handle unicode (like they do in Python 3.0).
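
A sketch of the kind of wrapper Dirk suggests, built on codecs.open as described above; the names and signatures are hypothetical, not what eventually landed in webkitpy:

    import codecs

    def read_text_file(path, encoding="utf-8"):
        stream = codecs.open(path, "r", encoding)
        try:
            return stream.read()            # unicode; codecs decodes for us
        finally:
            stream.close()

    def write_text_file(path, text, encoding="utf-8"):
        stream = codecs.open(path, "w", encoding)
        try:
            stream.write(text)              # expects unicode; codecs encodes on the way out
        finally:
            stream.close()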
Comment 40 Dirk Pranke 2010-04-20 15:10:13 PDT
(In reply to comment #39)
> (In reply to comment #38)
> > So, to summarize the thinking in this patch:
> > 
> > 1) assume all text files are encoded on disk as UTF-8
> 
> More precisely: be explicit about how things are encoded, whatever that
> encoding may be.
> 

I was (perhaps badly) attempting to be stronger than that.  What I meant was: previously we had probably assumed that all of the files were ASCII.  We don't want to assume that, but since we don't want to try to do encoding detection, we should probably assume something, and that should probably be UTF-8.

> > 2) read files using codecs.open(..., "r", encoding="utf-8")
> 
> Possibly.  This patch uses a combination.  codecs.open makes file object return
> unicode() strings instead of str byte arrays from read() and friends.  We use
> that in some palces.  cjerdonek was suggesting we standardize, but I didn't do
> a good job of that in this patch.

I agree with Chris we should standardize, although it doesn't have to happen in this patch. We should file a separate bug to clean up the rest of the open() calls.

> 
> > 3) write files using codecs.open(..., "w", encoding="utf-8"),
> > write(s.encode("utf-8")) 
> >     if s is a string that is not known to be encoded already
> 
> One or the other, not both.  if you open with codecs.open it will do the magic
> for you.  if you open with open() you have to encode to a byte array (str)
> before calling write.

Okay. I wasn't clear about that.
Comment 41 Eric Seidel (no email) 2010-04-20 15:14:52 PDT
(In reply to comment #40)
> (In reply to comment #39)
> > (In reply to comment #38)
> > > So, to summarize the thinking in this patch:
> > > 
> > > 1) assume all text files are encoded on disk as UTF-8
> > 
> > More precisely: be explicit about how things are encoded, whatever that
> > encoding may be.
> > 
> 
> I was (perhaps badly) attempting to be stronger than that. What I meant was,
> previously we had probably assumed that all of the files are ASCII. We don't
> want to assume that, but since we don't want to try and do encoding detection,
> we should probably assume something, and that should probably be UTF-8.

Except they weren't ascii. :)  The exception Chromium hit when this was landed was due to the diff-processing code trying to write out a unicode string.  Previously we were blissfully unaware that diff/wdiff/prettypatch were returning us utf-8 encoded data and we were passing it around as str()s.

After this patch, we convert data returned from commands back into unicode() strings.  Making some things easier, but requiring us to be explicit about converting our unicode strings back to byte arrays before writing them out to files (or using codecs.open to make our write() calls transparently convert).
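
For illustration, a minimal Python 2 sketch of decoding command output back to unicode at the boundary, as described above; the command is arbitrary:

    import subprocess

    process = subprocess.Popen(["svn", "info"], stdout=subprocess.PIPE)
    raw_output = process.communicate()[0]   # str: the raw bytes the command produced
    output = raw_output.decode("utf-8")     # unicode from here on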
Comment 42 WebKit Review Bot 2010-04-20 15:17:03 PDT
http://trac.webkit.org/changeset/57920 might have broken Leopard Intel Release (Tests)
The following changes are on the blame list:
http://trac.webkit.org/changeset/57920
http://trac.webkit.org/changeset/57921
Comment 43 Dirk Pranke 2010-04-20 15:21:36 PDT
(In reply to comment #41)
> (In reply to comment #40)
> > (In reply to comment #39)
> > > (In reply to comment #38)
> > > > So, to summarize the thinking in this patch:
> > > > 
> > > > 1) assume all text files are encoded on disk as UTF-8
> > > 
> > > More precisely: be explicit about how things are encoded, whatever that
> > > encoding may be.
> > > 
> > 
> > I was (perhaps badly) attempting to be stronger than that. What I meant was,
> > previously we had probably assumed that all of the files are ASCII. We don't
> > want to assume that, but since we don't want to try and do encoding detection,
> > we should probably assume something, and that should probably be UTF-8.
> 
> Except they weren't ascii. :)  

Agreed, obviously. I wasn't trying to suggest there weren't bugs we were running into.
Comment 44 Eric Seidel (no email) 2010-04-21 03:33:19 PDT
Created attachment 53937 [details]
Fixed to be way more awesome
Comment 45 Eric Seidel (no email) 2010-04-21 03:35:57 PDT
I've updated the patch to deploy "codecs.open" everywhere, as per Chris's suggestion.  While I was looking at every open() call, I converted a bunch of them to use "with" to avoid file descriptor leaks.

The patch is huge now.  However, it does fix the 3 regressions seen when landing the last patch:
1. mac-ews was failing due to a change in how --status-host was passed; our global argument parsing is a hack, but at least I've documented that now.
2. webkit-patch upload was broken for svn users due to the svn-create-patch call not passing its arguments as an array.
3. new-run-webkit-tests was broken for chromium due to the diff creation logic not specifying an encoding.
Comment 46 Adam Barth 2010-04-21 13:46:06 PDT
Comment on attachment 53937 [details]
Fixed to be way more awesome

This change is too large for me to convince myself that it's correct, but the approach seems good and we want to do it all at once.

One thought: how much of this can we wrap in a filesystem abstraction?  I discussed that idea with Dirk on another bug.  I don't think we need to do that in this patch, but it's something to keep in mind.
Comment 47 Eric Seidel (no email) 2010-04-21 13:50:05 PDT
The file system abstraction is really codecs.open().  I learned at the end of this change that I could have deployed codecs.open even more widely.

We should *never* be using open() directly.  codecs.open takes an encoding parameter (just like Python 3.x open() does).  encoding=None results in open()-like, return-a-str-object behavior.

So in a followup patch we should remove all uses of raw open().

We should also avoid ever closing a file manually.  A manual close gets skipped in the case of an exception, which is why we use with blocks. :)
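
A short sketch of the two codecs.open() behaviors described above (assuming a ChangeLog file in the current directory):

    import codecs

    text_file = codecs.open("ChangeLog", "r", encoding="utf-8")
    assert isinstance(text_file.read(), unicode)   # decoded for us
    text_file.close()

    byte_file = codecs.open("ChangeLog", "r", encoding=None)
    assert isinstance(byte_file.read(), str)       # plain open()-style bytes
    byte_file.close()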
Comment 48 Dirk Pranke 2010-04-21 14:06:25 PDT
(In reply to comment #47)
> The file system abstraction is really codecs.open()  I learned at the end of
> this change that I could have even deployed codecs.open wider.
> 
> We should *never* be using open() directly.  codecs.open takes an encoding
> parameter (just like Python 3.x open() does).  encoding=None results in
> open()-like, return-a-str-object behavior.
> 
> So in a followup patch we should remove all uses of raw open().
> 
> We should also avoid ever closing a file manually.  Doing so will fail in teh
> case of an exception, which is why we use with blocks. :)

Note that with blocks only do try/finally; you should still be aware that you need a try/except if you don't want the exceptions to propagate.
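
A quick sketch of that point (Python 2.5+; the file name is illustrative):

    from __future__ import with_statement  # needed on Python 2.5

    try:
        with open("ChangeLog") as changelog:   # closed even if read() raises
            data = changelog.read()
    except IOError:                            # still needed if we must not propagate
        data = ""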
Comment 49 Chris Jerdonek 2010-04-21 14:18:05 PDT
(In reply to comment #47)

> We should *never* be using open() directly.  codecs.open takes an encoding
> parameter (just like Python 3.x open() does).  encoding=None results in
> open()-like, return-a-str-object behavior.

I just wanted to bring attention to a couple other uses of the codecs module that I noticed in check-webkit-style -- in case they might be useful elsewhere in webkitpy.  Mainly this code--

    # Change stderr to write with replacement characters so we don't die
    # if we try to print something containing non-ASCII characters.
    stderr = codecs.StreamReaderWriter(sys.stderr,
                                       codecs.getreader('utf8'),
                                       codecs.getwriter('utf8'),
                                       'replace')
    # Setting an "encoding" attribute on the stream is necessary to
    # prevent the logging module from raising an error.  See
    # the checker.configure_logging() function for more information.
    stderr.encoding = "UTF-8"

    # FIXME: Change webkitpy.style so that we do not need to overwrite
    #        the global sys.stderr.  This involves updating the code to
    #        accept a stream parameter where necessary, and not calling
    #        sys.stderr explicitly anywhere.
    sys.stderr = stderr

(This is from http://trac.webkit.org/browser/trunk/WebKitTools/Scripts/check-webkit-style?rev=57467#L61 )

IIRC, this codecs.StreamReaderWriter code was present in its basic form from the beginning.  I'm not sure when we might want to use this pattern over codecs.open(), but perhaps it might be useful elsewhere.

Note also the use of the errors='replace' parameter, which codecs.open() also accepts.  The "errors" parameter in these methods controls how bad bytes should be handled when encountered while reading from or writing to a stream.  There may be cases where we want to be more lenient than the default value of errors='strict' and continue processing a file if possible.  It doesn't look like the errors parameter is being used in any of the codecs.open() calls in this patch (which could very well be exactly what we want).
Comment 50 Eric Seidel (no email) 2010-04-21 15:04:36 PDT
Committed r58014: <http://trac.webkit.org/changeset/58014>
Comment 51 Eric Seidel (no email) 2010-04-21 15:47:27 PDT
Created attachment 53998 [details]
Fix Chromium Canaries
Comment 52 Eric Seidel (no email) 2010-04-21 15:52:39 PDT
Committed r58016: <http://trac.webkit.org/changeset/58016>
Comment 53 Eric Seidel (no email) 2010-04-21 16:31:39 PDT
Committed r58023: <http://trac.webkit.org/changeset/58023>
Comment 54 Eric Seidel (no email) 2010-04-21 18:23:39 PDT
The patch was rolled out a second time.
Comment 55 Eric Seidel (no email) 2010-04-21 18:34:40 PDT
Committed r58036: <http://trac.webkit.org/changeset/58036>
Comment 56 Eric Seidel (no email) 2010-04-21 20:10:29 PDT
Created attachment 54018 [details]
Fix for run-webkit-tests --chromium
Comment 57 Eric Seidel (no email) 2010-04-21 20:17:23 PDT
Committed r58044: <http://trac.webkit.org/changeset/58044>
Comment 58 Eric Seidel (no email) 2010-04-21 20:40:51 PDT
Created attachment 54019 [details]
Patch
Comment 59 Eric Seidel (no email) 2010-04-21 21:31:42 PDT
Committed r58050: <http://trac.webkit.org/changeset/58050>
Comment 60 Eric Seidel (no email) 2010-04-21 22:41:48 PDT
Committed r58054: <http://trac.webkit.org/changeset/58054>
Comment 61 Eric Seidel (no email) 2010-04-21 22:50:00 PDT
Committed r58055: <http://trac.webkit.org/changeset/58055>
Comment 62 Eric Seidel (no email) 2010-04-21 23:20:04 PDT
Committed r58059: <http://trac.webkit.org/changeset/58059>
Comment 63 Eric Seidel (no email) 2010-04-21 23:30:38 PDT
Committed r58060: <http://trac.webkit.org/changeset/58060>
Comment 64 Eric Seidel (no email) 2010-04-21 23:39:32 PDT
Committed r58062: <http://trac.webkit.org/changeset/58062>
Comment 65 Eric Seidel (no email) 2010-04-22 03:29:25 PDT
Chromium is reporting random hangs in new-run-webkit-tests --chromium after all these changes.

I'm not sure what could be causing them yet.  We really need a backtrace from the hang.
Comment 66 Eric Seidel (no email) 2010-04-22 03:32:43 PDT
http://trac.webkit.org/changeset/58059 looks the most suspicious for possibly causing hangs, but again we need a reproduction, including a backtrace, to really know what's getting hung.  Some print-based debugging might lead us in the right direction.

I'm headed to bed soon, but I'm happy to look at this first thing in the morning (well, probably afternoon at this point).
Comment 67 Eric Seidel (no email) 2010-04-22 03:58:23 PDT
(In reply to comment #66)
> http://trac.webkit.org/changeset/58059 looks the most suspicious for possibly
> causing hangs, but again we need a reproduction including backtrace to really
> know what's getting hung.  Some print-based debugging might lead us in teh
> right direciton.
> 
> I'm headed to bed soon, but I'm happy to look at this first thing in the
> morning (well, probably afternoon at this point).

That change probably made us vulnerable to http://bugs.python.org/issue1236 again, which I had mistakenly thought was a 2.4-only bug.  Perhaps one of the original new-run-webkit-tests authors can comment on how http://bugs.python.org/issue1236 manifests?  Could it be the hangs Yuzo is seeing?

Again, we just need a backtrace, and then solving the hang should be simple. :)
Comment 68 Dirk Pranke 2010-04-22 18:53:22 PDT
(In reply to comment #62)
> Committed r58059: <http://trac.webkit.org/changeset/58059>

This change doesn't actually call format_wdiff_output_as_html() ... presumably this breaks wdiff output?
Comment 69 Fumitoshi Ukai 2010-04-22 22:26:27 PDT
When I run test-webkitpy on Cygwin, I get:


test-webkitpy: INFO     Suppressing most webkitpy logging while running unit tests.
webkitpy.test.main: INFO     Excluding: webkitpy.common.checkout.scm_unittest (use --all to include)
...

======================================================================
ERROR: Validate that it is safe to pass unicode() objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/common/system/executive_unittest.py", line 53, in test_run_command_with_unicode
    output = executive.run_command(["echo", "-n", unicode_tor])
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/common/system/executive.py", line 211, in run_command
    cwd=cwd)
  File "/usr/lib/python2.5/subprocess.py", line 594, in __init__
    errread, errwrite)
  File "/usr/lib/python2.5/subprocess.py", line 1091, in _execute_child
    raise child_exception
TypeError: execv() arg 2 must contain only strings

======================================================================
FAIL: test_basics (webkitpy.layout_tests.run_webkit_tests_unittest.DryrunTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py", line 63, in test_basics
    'fast/html']))
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/layout_tests/run_webkit_tests_unittest.py", line 41, in passing_run
    res = run_webkit_tests.main(options, args, False)
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 1529, in main
    options.configuration = port_obj.default_configuration()
  File "/cygdrive/c/Users/ukai/src/chromium-webkit/src/third_party/WebKit/WebKit
Tools/Scripts/webkitpy/layout_tests/port/webkit.py", line 304, in default_configuration
    assert(configuration == "Debug" or configuration == "Release")
AssertionError

$ python
Python 2.5.2 (r252:60911, Dec  2 2008, 09:26:14)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.124)] on cygwin

Should I use a more recent version of Python or Cygwin?