Bug 45791 - NRWT fails with UnicodeDecodeError on editing/selection/mixed-editability-10.html
Summary: NRWT fails with UnicodeDecodeError on editing/selection/mixed-editability-10....
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: Tools / Tests (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC OS X 10.5
: P2 Normal
Assignee: Dirk Pranke
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-14 16:23 PDT by Mihai Parparita
Modified: 2010-09-15 10:47 PDT (History)
7 users (show)

See Also:


Attachments
Patch (2.90 KB, patch)
2010-09-14 17:47 PDT, Mihai Parparita
no flags Details | Formatted Diff | Diff
Patch (3.51 KB, patch)
2010-09-14 19:43 PDT, Mihai Parparita
mihaip: review-
mihaip: commit-queue-
Details | Formatted Diff | Diff
Patch (3.86 KB, patch)
2010-09-14 20:27 PDT, Mihai Parparita
no flags Details | Formatted Diff | Diff
Patch (3.81 KB, patch)
2010-09-15 08:58 PDT, Mihai Parparita
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Mihai Parparita 2010-09-14 16:23:44 PDT
Stack trace:
  File "./WebKitTools/Scripts/new-run-webkit-tests", line 38, in <module>
    sys.exit(run_webkit_tests.main())
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 1722, in main
    return run(port_obj, options, args)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 1425, in run
    num_unexpected_results = test_runner.run(result_summary)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 780, in run
    self._run_tests(self._test_files_list, result_summary))
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 651, in _run_tests
    result_summary)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 619, in _instantiate_dump_render_tree_threads
    thread.run_in_main_thread(self, result_summary)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/dump_render_tree_thread.py", line 344, in run_in_main_thread
    self._run(test_runner, result_summary)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/dump_render_tree_thread.py", line 392, in _run
    result = self._run_test(test_info)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/dump_render_tree_thread.py", line 510, in _run_test
    output, error)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/dump_render_tree_thread.py", line 125, in _process_output
    configuration)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/test_types/text_diff.py", line 104, in compare_output
    print_text_diffs=True)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py", line 204, in write_output_files
    diff = port.diff_text(expected, output, expected_filename, actual_filename)
  File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/port/base.py", line 159, in diff_text
    return ''.join(diff)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)
Comment 1 Dirk Pranke 2010-09-14 17:41:22 PDT
The problem seems to be that we're passing in unicode strings for the filenames and raw strings for the content to diff. That results in the unified output being a mixture of the two datatypes, which causes ''.join() to do encoding conversions using the default 'ascii' converter, which keels over.

The easiest fix is probably to convert the filenames to raw strings.
Comment 2 Mihai Parparita 2010-09-14 17:47:48 PDT
Created attachment 67626 [details]
Patch
Comment 3 Tony Chang 2010-09-14 17:52:13 PDT
Comment on attachment 67626 [details]
Patch

Should test_expectations.txt be updated as well or does the test still fail?
Comment 4 Mihai Parparita 2010-09-14 17:55:29 PDT
Test still fails, NRWT was just barfing when outputting the diff after the test had run.
Comment 5 Ojan Vafai 2010-09-14 17:58:37 PDT
I would slightly prefer something like the following. I feel like it's more explicit. But this is fine too.

if isinstance(value, unicode):
    value = value.encode('utf-8')
Comment 6 Dirk Pranke 2010-09-14 18:05:59 PDT
I'm fine w/ the patch as is. I'm not sure that Ojan's mod is worth the extra verbosity.
Comment 7 Tony Chang 2010-09-14 18:12:31 PDT
(In reply to comment #5)
> I would slightly prefer something like the following. I feel like it's more explicit. But this is fine too.
> 
> if isinstance(value, unicode):
>     value = value.encode('utf-8')

|value| will always be a unicode string, won't it?  Explicitly calling encode seems a bit better.
Comment 8 Dirk Pranke 2010-09-14 18:16:46 PDT
(In reply to comment #7)
> (In reply to comment #5)
> > I would slightly prefer something like the following. I feel like it's more explicit. But this is fine too.
> > 
> > if isinstance(value, unicode):
> >     value = value.encode('utf-8')
> 
> |value| will always be a unicode string, won't it?  Explicitly calling encode seems a bit better.

I assumed Ojan's comment was pseudocode out of context. There isn't actually a 'value' variable anywhere in the code AFAIK.
Comment 9 Eric Seidel (no email) 2010-09-14 19:05:46 PDT
Comment on attachment 67626 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch

> WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160
> +        expected_filename = str(expected_filename)
> +        actual_filename = str(actual_filename)
This will take expected_filename and try to decode it as ascii.  Is that what you intended?  If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want?  (At least that's my understanding.)

> WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:175
> +        port.diff_text('exp', 'act', 'exp.txt', 'act.txt')
> +        port.diff_text('exp', 'act', u'exp.txt', 'act.txt')
> +
> +        port.diff_text('exp' + chr(255), 'act', 'exp.txt', 'act.txt')
> +        port.diff_text('exp' + chr(255), 'act', u'exp.txt', 'act.txt')
Seems we need a test with unicode characters here.

I don't think this is 100% right, but please correct me if I'm wrong. :)
Comment 10 Dirk Pranke 2010-09-14 19:24:20 PDT
(In reply to comment #9)
> (From update of attachment 67626 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch
> 
> > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160
> > +        expected_filename = str(expected_filename)
> > +        actual_filename = str(actual_filename)
> This will take expected_filename and try to decode it as ascii.  Is that what you intended?  If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want?  (At least that's my understanding.)
> 
> > WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:175
> > +        port.diff_text('exp', 'act', 'exp.txt', 'act.txt')
> > +        port.diff_text('exp', 'act', u'exp.txt', 'act.txt')
> > +
> > +        port.diff_text('exp' + chr(255), 'act', 'exp.txt', 'act.txt')
> > +        port.diff_text('exp' + chr(255), 'act', u'exp.txt', 'act.txt')
> Seems we need a test with unicode characters here.
> 
> I don't think this is 100% right, but please correct me if I'm wrong. :)

The second line gives us the test with unicode in it. Really, we only needed those two. The third and fourth test to make sure that we're *not* doing an ascii conversion, because those would choke on a non-ascii character (the original test contained non-ascii characters in the expected and actual strings).
Comment 11 Dirk Pranke 2010-09-14 19:25:40 PDT
(In reply to comment #9)
> (From update of attachment 67626 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch
> 
> > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160
> > +        expected_filename = str(expected_filename)
> > +        actual_filename = str(actual_filename)
> This will take expected_filename and try to decode it as ascii.  Is that what you intended?  If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want?  (At least that's my understanding.)
> 

Yup, that's what we wanted, except that we actually want "turn this into a raw string so that I can safely join it to other raw strings". We don't want to force every line to ascii, or we'd still run into the problem we have now (expected strings containing non-ascii characters).
Comment 12 Mihai Parparita 2010-09-14 19:27:32 PDT
(In reply to comment #11)
> (In reply to comment #9)
> > (From update of attachment 67626 [details] [details])
> > View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch
> > 
> > > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160
> > > +        expected_filename = str(expected_filename)
> > > +        actual_filename = str(actual_filename)
> > This will take expected_filename and try to decode it as ascii.  Is that what you intended?  If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want?  (At least that's my understanding.)
> > 
> 
> Yup, that's what we wanted, except that we actually want "turn this into a raw string so that I can safely join it to other raw strings". We don't want to force every line to ascii, or we'd still run into the problem we have now (expected strings containing non-ascii characters).

However, Eric is right that str() on a unicode string with non-ASCII characters will thrown an exception. Using filename.encode('utf-8') (as Ojan suggested) is safer. Will upload a new patch.
Comment 13 Dirk Pranke 2010-09-14 19:35:24 PDT
(In reply to comment #12)
> (In reply to comment #11)
> > (In reply to comment #9)
> > > (From update of attachment 67626 [details] [details] [details])
> > > View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch
> > > 
> > > > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160
> > > > +        expected_filename = str(expected_filename)
> > > > +        actual_filename = str(actual_filename)
> > > This will take expected_filename and try to decode it as ascii.  Is that what you intended?  If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want?  (At least that's my understanding.)
> > > 
> > 
> > Yup, that's what we wanted, except that we actually want "turn this into a raw string so that I can safely join it to other raw strings". We don't want to force every line to ascii, or we'd still run into the problem we have now (expected strings containing non-ascii characters).
> 
> However, Eric is right that str() on a unicode string with non-ASCII characters will thrown an exception. Using filename.encode('utf-8') (as Ojan suggested) is safer. Will upload a new patch.

I'm not sure that we actually want to allow non-ASCII filenames, so I'm not sure that this is a bad thing. Lord knows what else would break if we did allow them and one gets checked in.
Comment 14 Mihai Parparita 2010-09-14 19:43:51 PDT
Created attachment 67635 [details]
Patch
Comment 15 Eric Seidel (no email) 2010-09-14 20:16:40 PDT
(In reply to comment #13)
> I'm not sure that we actually want to allow non-ASCII filenames, so I'm not sure that this is a bad thing. Lord knows what else would break if we did allow them and one gets checked in.

We have in the past had non-ascii file names.  I believe they had to be removed when the windows port came online.  I remember adding some since, but then again having to remove them since they broke on windows.  But the general goal of supporting unicode urls and filenames is a good one.

We should assume unicode whenever we're dealing with "strings".  The str() type in Python was renamed ByteArray() in Python 3. and unicode() became the default for "foo".  The u"foo" syntax no longer exists in Python 3.

This is part of how we always want to use codecs.open for dealing with files and always pass an encoding.  codecs.open became open() in Python 3, again as part of moving to real unicode support throughout the interpreter.

(I've not used Python 3, I've just read the unicode powerpoints/docs online as part of the grand "make webkitpy understand unicode, dammit!" project I undertook 9 months ago to make Tor Arne Vestbø not break the commit-queue every time he was involved with a patch. :)
Comment 16 Mihai Parparita 2010-09-14 20:23:35 PDT
Comment on attachment 67635 [details]
Patch

Missed a return statement in to_raw_bytes
Comment 17 Mihai Parparita 2010-09-14 20:27:29 PDT
Created attachment 67639 [details]
Patch
Comment 18 Tony Chang 2010-09-14 20:58:56 PDT
(In reply to comment #15)
> (In reply to comment #13)
> > I'm not sure that we actually want to allow non-ASCII filenames, so I'm not sure that this is a bad thing. Lord knows what else would break if we did allow them and one gets checked in.
> 
> We have in the past had non-ascii file names.  I believe they had to be removed when the windows port came online.  I remember adding some since, but then again having to remove them since they broke on windows.  But the general goal of supporting unicode urls and filenames is a good one.

Where do these filenames come from?  On Linux, filenames are just a stream of bytes so you can't always assume unicode and calling encode('utf8') is going to produce "\xef\xbf\xbd"s.

Ultimately, I don't think it matters what we do here because for lots of reasons, we're only going to have ascii filenames.  If you want to try to future proof the code, I'm not opposed, but I don't think it actually matters.
Comment 19 Mihai Parparita 2010-09-14 21:44:44 PDT
(In reply to comment #18)
> Where do these filenames come from?  On Linux, filenames are just a stream of bytes so you can't always assume unicode and calling encode('utf8') is going to produce "\xef\xbf\xbd"s.
> 
> Ultimately, I don't think it matters what we do here because for lots of reasons, we're only going to have ascii filenames.  If you want to try to future proof the code, I'm not opposed, but I don't think it actually matters.

I just wanted a string with not-valid-in-ASCII characters in them, I got it from http://docs.python.org/howto/unicode.html#unicode-literals-in-python-source-code. If the filenames are not unicode (and just a stream of bytes, i.e. str), then this will degrade fine.
Comment 20 Eric Seidel (no email) 2010-09-14 22:36:33 PDT
(In reply to comment #18)
> Where do these filenames come from?  On Linux, filenames are just a stream of bytes so you can't always assume unicode and calling encode('utf8') is going to produce "\xef\xbf\xbd"s.

Mac HFS+ is always utf8 for all paths.  But I realize that is not true for all platforms.

We've had the non-ascii file names at various times because we need to be able to test handling of non-ascii URLs both locally and remotely for WebKit.  There are also handling unicode-aware parsing algorithms for element attributes, etc.  In general it's not a huge issue, but we ended up removing these tests due to Windows troubles iirc.

> Ultimately, I don't think it matters what we do here because for lots of reasons, we're only going to have ascii filenames.  If you want to try to future proof the code, I'm not opposed, but I don't think it actually matters.

Agreed.  Still is nice to attempt to be sane about unicode.

I think the confusion comes here because I read the patch too quickly and didn't realize Mihai's char(255) were how he was getting unicode code points.  I'm used to using u"foo\u1235bar" to get unicode codepoints in python.
Comment 21 Eric Seidel (no email) 2010-09-14 22:39:29 PDT
Comment on attachment 67639 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=67639&action=prettypatch

I really need to write some documentation for dealing with unicode vs. byte arrays in webkitpy.

> WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:162
> +        def to_raw_bytes(str):
> +            if isinstance(str, unicode):
> +                return str.encode('utf-8')
> +            return str
I suspect other areas of the code I've assumed that file names were unicode strings.  If that's not a valid assumption across platforms we're going to have more of these bugs.  We may need to make a module-wide fix to this.

> WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:170
> +        # actual input is always a raw string, not unicode).
I would have said "raw bytes" instead of "raw string", but that's a total nit.

> WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:180
> +        # be safe
safe.
Comment 22 Mihai Parparita 2010-09-15 08:58:00 PDT
Created attachment 67681 [details]
Patch
Comment 23 Mihai Parparita 2010-09-15 08:59:08 PDT
(In reply to comment #21)
> I suspect other areas of the code I've assumed that file names were unicode strings.  If that's not a valid assumption across platforms we're going to have more of these bugs.  We may need to make a module-wide fix to this.

It's a valid assumption, until we try to combine filenames with their content. Diff output is most likely the only place where we try to do that.
 
> > WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:170
> > +        # actual input is always a raw string, not unicode).
> I would have said "raw bytes" instead of "raw string", but that's a total nit.

Fixed (in the ChangeLog too)
 
> > WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:180
> > +        # be safe
> safe.

Fixed.
Comment 24 WebKit Commit Bot 2010-09-15 10:06:22 PDT
Comment on attachment 67681 [details]
Patch

Clearing flags on attachment: 67681

Committed r67560: <http://trac.webkit.org/changeset/67560>
Comment 25 WebKit Commit Bot 2010-09-15 10:06:28 PDT
All reviewed patches have been landed.  Closing bug.
Comment 26 WebKit Review Bot 2010-09-15 10:41:34 PDT
http://trac.webkit.org/changeset/67560 might have broken GTK Linux 32-bit Release
The following changes are on the blame list:
http://trac.webkit.org/changeset/67560
http://trac.webkit.org/changeset/67561
Comment 27 Mihai Parparita 2010-09-15 10:47:48 PDT
(In reply to comment #26)
> http://trac.webkit.org/changeset/67560 might have broken GTK Linux 32-bit Release
> The following changes are on the blame list:
> http://trac.webkit.org/changeset/67560
> http://trac.webkit.org/changeset/67561

Since accessibility/ellipsis-text.html failed, I assume it's r67561's fault.