45791 – NRWT fails with UnicodeDecodeError on editing/selection/mixed-editability-10.html

Mihai Parparita

Reported 2010-09-14 16:23:44 PDT

Stack trace: File "./WebKitTools/Scripts/new-run-webkit-tests", line 38, in <module> sys.exit(run_webkit_tests.main()) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 1722, in main return run(port_obj, options, args) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 1425, in run num_unexpected_results = test_runner.run(result_summary) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 780, in run self._run_tests(self._test_files_list, result_summary)) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 651, in _run_tests result_summary) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/run_webkit_tests.py", line 619, in _instantiate_dump_render_tree_threads thread.run_in_main_thread(self, result_summary) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/dump_render_tree_thread.py", line 344, in run_in_main_thread self._run(test_runner, result_summary) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/dump_render_tree_thread.py", line 392, in _run result = self._run_test(test_info) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/dump_render_tree_thread.py", line 510, in _run_test output, error) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/layout_package/dump_render_tree_thread.py", line 125, in _process_output configuration) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/test_types/text_diff.py", line 104, in compare_output print_text_diffs=True) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/test_types/test_type_base.py", line 204, in write_output_files diff = port.diff_text(expected, output, expected_filename, actual_filename) File "/Users/mihaip/Developer/source/chromium1/src/third_party/WebKit/WebKitTools/Scripts/webkitpy/layout_tests/port/base.py", line 159, in diff_text return ''.join(diff) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)

Dirk Pranke

Comment 1 2010-09-14 17:41:22 PDT

The problem seems to be that we're passing in unicode strings for the filenames and raw strings for the content to diff. That results in the unified output being a mixture of the two datatypes, which causes ''.join() to do encoding conversions using the default 'ascii' converter, which keels over. The easiest fix is probably to convert the filenames to raw strings.

Mihai Parparita

Comment 2 2010-09-14 17:47:48 PDT

Created attachment 67626 [details] Patch

Tony Chang

Comment 3 2010-09-14 17:52:13 PDT

Comment on attachment 67626 [details] Patch Should test_expectations.txt be updated as well or does the test still fail?

Mihai Parparita

Comment 4 2010-09-14 17:55:29 PDT

Test still fails, NRWT was just barfing when outputting the diff after the test had run.

Ojan Vafai

Comment 5 2010-09-14 17:58:37 PDT

I would slightly prefer something like the following. I feel like it's more explicit. But this is fine too. if isinstance(value, unicode): value = value.encode('utf-8')

Dirk Pranke

Comment 6 2010-09-14 18:05:59 PDT

I'm fine w/ the patch as is. I'm not sure that Ojan's mod is worth the extra verbosity.

Tony Chang

Comment 7 2010-09-14 18:12:31 PDT

(In reply to comment #5) > I would slightly prefer something like the following. I feel like it's more explicit. But this is fine too. > > if isinstance(value, unicode): > value = value.encode('utf-8') |value| will always be a unicode string, won't it? Explicitly calling encode seems a bit better.

Dirk Pranke

Comment 8 2010-09-14 18:16:46 PDT

(In reply to comment #7) > (In reply to comment #5) > > I would slightly prefer something like the following. I feel like it's more explicit. But this is fine too. > > > > if isinstance(value, unicode): > > value = value.encode('utf-8') > > |value| will always be a unicode string, won't it? Explicitly calling encode seems a bit better. I assumed Ojan's comment was pseudocode out of context. There isn't actually a 'value' variable anywhere in the code AFAIK.

Eric Seidel (no email)

Comment 9 2010-09-14 19:05:46 PDT

Comment on attachment 67626 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160 > + expected_filename = str(expected_filename) > + actual_filename = str(actual_filename) This will take expected_filename and try to decode it as ascii. Is that what you intended? If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want? (At least that's my understanding.) > WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:175 > + port.diff_text('exp', 'act', 'exp.txt', 'act.txt') > + port.diff_text('exp', 'act', u'exp.txt', 'act.txt') > + > + port.diff_text('exp' + chr(255), 'act', 'exp.txt', 'act.txt') > + port.diff_text('exp' + chr(255), 'act', u'exp.txt', 'act.txt') Seems we need a test with unicode characters here. I don't think this is 100% right, but please correct me if I'm wrong. :)

Dirk Pranke

Comment 10 2010-09-14 19:24:20 PDT

(In reply to comment #9) > (From update of attachment 67626 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch > > > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160 > > + expected_filename = str(expected_filename) > > + actual_filename = str(actual_filename) > This will take expected_filename and try to decode it as ascii. Is that what you intended? If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want? (At least that's my understanding.) > > > WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:175 > > + port.diff_text('exp', 'act', 'exp.txt', 'act.txt') > > + port.diff_text('exp', 'act', u'exp.txt', 'act.txt') > > + > > + port.diff_text('exp' + chr(255), 'act', 'exp.txt', 'act.txt') > > + port.diff_text('exp' + chr(255), 'act', u'exp.txt', 'act.txt') > Seems we need a test with unicode characters here. > > I don't think this is 100% right, but please correct me if I'm wrong. :) The second line gives us the test with unicode in it. Really, we only needed those two. The third and fourth test to make sure that we're *not* doing an ascii conversion, because those would choke on a non-ascii character (the original test contained non-ascii characters in the expected and actual strings).

Dirk Pranke

Comment 11 2010-09-14 19:25:40 PDT

(In reply to comment #9) > (From update of attachment 67626 [details]) > View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch > > > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160 > > + expected_filename = str(expected_filename) > > + actual_filename = str(actual_filename) > This will take expected_filename and try to decode it as ascii. Is that what you intended? If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want? (At least that's my understanding.) > Yup, that's what we wanted, except that we actually want "turn this into a raw string so that I can safely join it to other raw strings". We don't want to force every line to ascii, or we'd still run into the problem we have now (expected strings containing non-ascii characters).

Mihai Parparita

Comment 12 2010-09-14 19:27:32 PDT

(In reply to comment #11) > (In reply to comment #9) > > (From update of attachment 67626 [details] [details]) > > View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch > > > > > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160 > > > + expected_filename = str(expected_filename) > > > + actual_filename = str(actual_filename) > > This will take expected_filename and try to decode it as ascii. Is that what you intended? If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want? (At least that's my understanding.) > > > > Yup, that's what we wanted, except that we actually want "turn this into a raw string so that I can safely join it to other raw strings". We don't want to force every line to ascii, or we'd still run into the problem we have now (expected strings containing non-ascii characters). However, Eric is right that str() on a unicode string with non-ASCII characters will thrown an exception. Using filename.encode('utf-8') (as Ojan suggested) is safer. Will upload a new patch.

Dirk Pranke

Comment 13 2010-09-14 19:35:24 PDT

(In reply to comment #12) > (In reply to comment #11) > > (In reply to comment #9) > > > (From update of attachment 67626 [details] [details] [details]) > > > View in context: https://bugs.webkit.org/attachment.cgi?id=67626&action=prettypatch > > > > > > > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:160 > > > > + expected_filename = str(expected_filename) > > > > + actual_filename = str(actual_filename) > > > This will take expected_filename and try to decode it as ascii. Is that what you intended? If expected_filename is already a str() object it has no effect, but if it's unicode() that's effectively calling filename.encode('ascii') which may not be what you want? (At least that's my understanding.) > > > > > > > Yup, that's what we wanted, except that we actually want "turn this into a raw string so that I can safely join it to other raw strings". We don't want to force every line to ascii, or we'd still run into the problem we have now (expected strings containing non-ascii characters). > > However, Eric is right that str() on a unicode string with non-ASCII characters will thrown an exception. Using filename.encode('utf-8') (as Ojan suggested) is safer. Will upload a new patch. I'm not sure that we actually want to allow non-ASCII filenames, so I'm not sure that this is a bad thing. Lord knows what else would break if we did allow them and one gets checked in.

Mihai Parparita

Comment 14 2010-09-14 19:43:51 PDT

Created attachment 67635 [details] Patch

Eric Seidel (no email)

Comment 15 2010-09-14 20:16:40 PDT

(In reply to comment #13) > I'm not sure that we actually want to allow non-ASCII filenames, so I'm not sure that this is a bad thing. Lord knows what else would break if we did allow them and one gets checked in. We have in the past had non-ascii file names. I believe they had to be removed when the windows port came online. I remember adding some since, but then again having to remove them since they broke on windows. But the general goal of supporting unicode urls and filenames is a good one. We should assume unicode whenever we're dealing with "strings". The str() type in Python was renamed ByteArray() in Python 3. and unicode() became the default for "foo". The u"foo" syntax no longer exists in Python 3. This is part of how we always want to use codecs.open for dealing with files and always pass an encoding. codecs.open became open() in Python 3, again as part of moving to real unicode support throughout the interpreter. (I've not used Python 3, I've just read the unicode powerpoints/docs online as part of the grand "make webkitpy understand unicode, dammit!" project I undertook 9 months ago to make Tor Arne Vestbø not break the commit-queue every time he was involved with a patch. :)

Mihai Parparita

Comment 16 2010-09-14 20:23:35 PDT

Comment on attachment 67635 [details] Patch Missed a return statement in to_raw_bytes

Mihai Parparita

Comment 17 2010-09-14 20:27:29 PDT

Created attachment 67639 [details] Patch

Tony Chang

Comment 18 2010-09-14 20:58:56 PDT

(In reply to comment #15) > (In reply to comment #13) > > I'm not sure that we actually want to allow non-ASCII filenames, so I'm not sure that this is a bad thing. Lord knows what else would break if we did allow them and one gets checked in. > > We have in the past had non-ascii file names. I believe they had to be removed when the windows port came online. I remember adding some since, but then again having to remove them since they broke on windows. But the general goal of supporting unicode urls and filenames is a good one. Where do these filenames come from? On Linux, filenames are just a stream of bytes so you can't always assume unicode and calling encode('utf8') is going to produce "\xef\xbf\xbd"s. Ultimately, I don't think it matters what we do here because for lots of reasons, we're only going to have ascii filenames. If you want to try to future proof the code, I'm not opposed, but I don't think it actually matters.

Mihai Parparita

Comment 19 2010-09-14 21:44:44 PDT

(In reply to comment #18) > Where do these filenames come from? On Linux, filenames are just a stream of bytes so you can't always assume unicode and calling encode('utf8') is going to produce "\xef\xbf\xbd"s. > > Ultimately, I don't think it matters what we do here because for lots of reasons, we're only going to have ascii filenames. If you want to try to future proof the code, I'm not opposed, but I don't think it actually matters. I just wanted a string with not-valid-in-ASCII characters in them, I got it from http://docs.python.org/howto/unicode.html#unicode-literals-in-python-source-code. If the filenames are not unicode (and just a stream of bytes, i.e. str), then this will degrade fine.

Eric Seidel (no email)

Comment 20 2010-09-14 22:36:33 PDT

(In reply to comment #18) > Where do these filenames come from? On Linux, filenames are just a stream of bytes so you can't always assume unicode and calling encode('utf8') is going to produce "\xef\xbf\xbd"s. Mac HFS+ is always utf8 for all paths. But I realize that is not true for all platforms. We've had the non-ascii file names at various times because we need to be able to test handling of non-ascii URLs both locally and remotely for WebKit. There are also handling unicode-aware parsing algorithms for element attributes, etc. In general it's not a huge issue, but we ended up removing these tests due to Windows troubles iirc. > Ultimately, I don't think it matters what we do here because for lots of reasons, we're only going to have ascii filenames. If you want to try to future proof the code, I'm not opposed, but I don't think it actually matters. Agreed. Still is nice to attempt to be sane about unicode. I think the confusion comes here because I read the patch too quickly and didn't realize Mihai's char(255) were how he was getting unicode code points. I'm used to using u"foo\u1235bar" to get unicode codepoints in python.

Eric Seidel (no email)

Comment 21 2010-09-14 22:39:29 PDT

Comment on attachment 67639 [details] Patch View in context: https://bugs.webkit.org/attachment.cgi?id=67639&action=prettypatch I really need to write some documentation for dealing with unicode vs. byte arrays in webkitpy. > WebKitTools/Scripts/webkitpy/layout_tests/port/base.py:162 > + def to_raw_bytes(str): > + if isinstance(str, unicode): > + return str.encode('utf-8') > + return str I suspect other areas of the code I've assumed that file names were unicode strings. If that's not a valid assumption across platforms we're going to have more of these bugs. We may need to make a module-wide fix to this. > WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:170 > + # actual input is always a raw string, not unicode). I would have said "raw bytes" instead of "raw string", but that's a total nit. > WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:180 > + # be safe safe.

Mihai Parparita

Comment 22 2010-09-15 08:58:00 PDT

Created attachment 67681 [details] Patch

Mihai Parparita

Comment 23 2010-09-15 08:59:08 PDT

(In reply to comment #21) > I suspect other areas of the code I've assumed that file names were unicode strings. If that's not a valid assumption across platforms we're going to have more of these bugs. We may need to make a module-wide fix to this. It's a valid assumption, until we try to combine filenames with their content. Diff output is most likely the only place where we try to do that. > > WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:170 > > + # actual input is always a raw string, not unicode). > I would have said "raw bytes" instead of "raw string", but that's a total nit. Fixed (in the ChangeLog too) > > WebKitTools/Scripts/webkitpy/layout_tests/port/base_unittest.py:180 > > + # be safe > safe. Fixed.

WebKit Commit Bot

Comment 24 2010-09-15 10:06:22 PDT

Comment on attachment 67681 [details] Patch Clearing flags on attachment: 67681 Committed r67560: <http://trac.webkit.org/changeset/67560>

WebKit Commit Bot

Comment 25 2010-09-15 10:06:28 PDT

All reviewed patches have been landed. Closing bug.

WebKit Review Bot

Comment 26 2010-09-15 10:41:34 PDT

http://trac.webkit.org/changeset/67560 might have broken GTK Linux 32-bit Release The following changes are on the blame list: http://trac.webkit.org/changeset/67560 http://trac.webkit.org/changeset/67561

Mihai Parparita

Comment 27 2010-09-15 10:47:48 PDT

(In reply to comment #26) > http://trac.webkit.org/changeset/67560 might have broken GTK Linux 32-bit Release > The following changes are on the blame list: > http://trac.webkit.org/changeset/67560 > http://trac.webkit.org/changeset/67561 Since accessibility/ellipsis-text.html failed, I assume it's r67561's fault.

Attachments
Patch (2.90 KB, patch) 2010-09-14 17:47 PDT, Mihai Parparita	no flags	Details Formatted Diff Diff
Patch (3.51 KB, patch) 2010-09-14 19:43 PDT, Mihai Parparita	mihaip: review- mihaip: commit-queue-	Details Formatted Diff Diff
Patch (3.86 KB, patch) 2010-09-14 20:27 PDT, Mihai Parparita	no flags	Details Formatted Diff Diff
Patch (3.81 KB, patch) 2010-09-15 08:58 PDT, Mihai Parparita	no flags	Details Formatted Diff Diff
Show Obsolete (3) View All Add attachment proposed patch, testcase, etc.